Pandas: Visualization Using Matplotlib and Pandas

Python has many plotting libraries, such as Matplotlib, pandas visualization, seaborn, and Plotly. Of these, Matplotlib and pandas visualization are the foundations and the most common way to plot basic graphs. In this article, I summarise the syntax of both and try to explain it clearly and plainly.

Preparation

import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
# Load the Chipotle order data (a tab-separated file)
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"
data = pd.read_csv(url, sep='\t')
data.head()

   order_id  quantity  item_name                              choice_description                                 item_price
0  1         1         Chips and Fresh Tomato Salsa           NaN                                                $2.39
1  1         1         Izze                                   [Clementine]                                       $3.39
2  1         1         Nantucket Nectar                       [Apple]                                            $3.39
3  1         1         Chips and Tomatillo-Green Chili Salsa  NaN                                                $2.39
4  2         2         Chicken Bowl                           [Tomatillo-Red Chili Salsa (Hot), [Black Beans...  $16.98

Plotting with matplotlib

Matplotlib is the most popular Python plotting library, and pandas visualization is built on top of it. It offers more freedom, at the cost of more code and a more complex syntax.

Scatter Plot

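# Strip whitespace and the leading "$", then convert the price to a float
data['item_price'] = data['item_price'].str.strip().str.lstrip('$').astype(float)
# Sum only the numeric columns per order
scatter_data = data.groupby('order_id')[['quantity', 'item_price']].sum()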
scatter_data.head()

order_id  quantity  item_price
1         4         11.56
2         2         16.98
3         2         12.67
4         2         21.00
5         2         13.70
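The two preparation steps above (turning the price strings into numbers, then summing per order) can be tried on a tiny hand-made frame without downloading anything. This is a minimal sketch: the toy frame and its values are invented for illustration, and the trailing space in the price strings is an assumption about how the raw file stores them.

```python
import pandas as pd

# Hypothetical miniature of the Chipotle data (values invented for illustration).
toy = pd.DataFrame({
    "order_id": [1, 1, 2],
    "quantity": [1, 2, 2],
    "item_name": ["Izze", "Chips", "Chicken Bowl"],
    "item_price": ["$3.39 ", "$2.39 ", "$16.98 "],  # assumed trailing space
})

# Strip whitespace and the leading dollar sign, then convert to float.
toy["item_price"] = toy["item_price"].str.strip().str.lstrip("$").astype(float)

# Sum only the numeric columns for each order.
per_order = toy.groupby("order_id")[["quantity", "item_price"]].sum()
print(per_order)
```

Selecting the numeric columns before calling sum() keeps text columns such as item_name out of the aggregation.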
# Create a figure with one axes and scatter quantity against price
fg, ax = plt.subplots()
ax.scatter(x=scatter_data.quantity, y=scatter_data.item_price)
ax.set_title("the number of items ordered per order price")
ax.set_xlabel("order quantity")
ax.set_ylabel("order price")
Text(0, 0.5, 'order price')

[Figure: scatter plot of order quantity against order price]

fg, ax = plt.subplots(): create a figure and an axes

fg is a Figure object and ax is an Axes object (older Matplotlib versions display it as AxesSubplot). So what are they?

Figure is the top level container for all the plot elements.

fg is a container; you can think of it as a picture frame. Remember that your actual graph is not fg (the Figure object) but ax (the Axes object). The figure is the part around your graph, and a figure can contain several subplots. Here we only have one, which is ax. So don't be confused by the everyday meaning of "axes": in Matplotlib, an Axes represents one graph in a figure.
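To make the frame-versus-graph distinction concrete, here is a small sketch of one Figure holding two Axes. The data points and titles are made up, and the Agg backend is selected only as an assumption so that the code runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

# One Figure (the frame) holding two Axes (the actual graphs), side by side.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

axes[0].plot([1, 2, 3], [1, 4, 9])
axes[0].set_title("left graph")

axes[1].scatter([1, 2, 3], [3, 2, 1])
axes[1].set_title("right graph")

# A title that belongs to the figure, not to either graph.
fig.suptitle("one figure, two axes")

print(type(fig).__name__, len(fig.axes))
```

With more than one subplot, plt.subplots() returns an array of Axes, so each graph is addressed by its position.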

ax.scatter(x=scatter_data.quantity, y=scatter_data.item_price): scatter the quantity against the price

Since ax is the “actual graph”, we draw the scatter plot on it. Pass x=scatter_data.quantity and y=scatter_data.item_price to give it the data to plot.

ax.set_title(), ax.set_xlabel(), ax.set_ylabel(): polish your graph

Their functions are quite self-explanatory: they set the title and the axis labels.
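Each of these setters returns a Matplotlib Text object, which is most likely why the notebook printed Text(0, 0.5, 'order price') after the cell above: it is the return value of the last call. The sketch below shows that, plus ax.set(), which sets several properties in one call. The label strings are only examples, and the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# The setter returns the Text object it created.
label = ax.set_ylabel("order price")

# ax.set() accepts several properties at once.
ax.set(title="order price vs. quantity", xlabel="order quantity")

print(type(label).__name__, ax.get_title(), ax.get_xlabel(), ax.get_ylabel())
```

Ending the last line of a notebook cell with a semicolon suppresses that printed return value.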

Line Chart

line_data = scatter_data

fg, ax = plt.subplots()
# Draw one line per column, labelled with the column name
for column in line_data.columns:
    ax.plot(line_data.index, line_data[column], label=column)
ax.set_title("order data")
ax.set_xlabel("order id")
ax.legend()
# this graph may not be so meaningful ...
<matplotlib.legend.Legend at 0x7f68d3225290>

[Figure: line chart of quantity and item_price by order id]

ax.plot(line_data.index, line_data[column], label = column): plot a line chart

We can plot multiple lines in one chart with different labels, which are “quantity” and “item_price” in this example. You just have to use a loop.

ax.legend():

ax.legend() shows the label that corresponds to each data series (i.e. each line in this example), so that readers can tell the lines apart.
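A short sketch of how the label argument and ax.legend() work together. The numbers are invented; the third line is deliberately left without a label to show that it stays out of the legend. The Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
order_ids = [1, 2, 3]

ax.plot(order_ids, [4, 2, 2], label="quantity")
ax.plot(order_ids, [11.56, 16.98, 12.67], label="item_price")
ax.plot(order_ids, [0, 0, 0])  # no label, so it gets no legend entry

# loc is optional; without it Matplotlib picks a position itself.
legend = ax.legend(loc="upper left")
labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```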

Histogram

A histogram is a chart used to represent a frequency distribution; the heights of the bars represent observed frequencies. Its purpose is to roughly assess the probability distribution of a variable by depicting how often observations fall within certain ranges of values.
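The idea of "frequencies within ranges of values" can be sketched in plain Python. The helper bin_counts and the sample prices below are hypothetical and only illustrate equal-width binning; plotting libraries do this counting for you.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Invented sample prices.
prices = [2.0, 3.0, 3.5, 8.0, 9.0, 10.0, 11.0, 16.0]
counts = bin_counts(prices, 4)
print(counts)
```

Each entry of counts would be the height of one bar in the histogram.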

item_list = ['Chicken Bowl','Chips','Steak Bowl','Canned Soda','Veggie Bowl']
hist_data = data[data['item_name'].isin(item_list)]
hist_data.head()

    order_id  quantity  item_name     choice_description                                 item_price
4   2         2         Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...  16.98
5   3         1         Chicken Bowl  [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...  10.98
13  7         1         Chicken Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Rice,...  11.25
18  9         2         Canned Soda   [Sprite]                                           2.18
19  10        1         Chicken Bowl  [Tomatillo Red Chili Salsa, [Fajita Vegetables...  8.75
# Draw the histogram; ax.hist() counts each item name itself
fg, ax = plt.subplots()
ax.hist(hist_data['item_name']);
ax.set_ylabel("frequency");
ax.set_xlabel("item");

[Figure: histogram of item frequencies]

ax.hist(hist_data['item_name']):

Note that ax.hist() automatically calculates how often each value occurs (i.e. its frequency) and uses the results as the y-axis values.
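For categorical values such as item names, the counting that produces the bar heights amounts to a frequency count. A minimal sketch of that idea using collections.Counter (imported in the preparation step); the item list is invented for illustration.

```python
from collections import Counter

# Invented sample of item names.
items = ["Chicken Bowl", "Chips", "Chicken Bowl",
         "Canned Soda", "Chicken Bowl", "Chips"]

# Count how often each name occurs.
freq = Counter(items)

for name, n in freq.most_common():
    print(name, n)
```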

Bar Chart

The Bar Chart is useful for representing categorical data.

bar_data = hist_data
# a bar chart won't calculate the frequencies automatically, so we use value_counts() to do this
counts = bar_data['item_name'].value_counts()
x_data = counts.index
y_data = counts.values

fg, ax = plt.subplots()
# just like ax.plot() and ax.scatter() above, we need to pass x-data and y-data to the function
ax.bar(x_data, y_data);
ax.set_xlabel("item");
ax.set_ylabel("frequency");

[Figure: bar chart of item frequencies]
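The same bar-chart recipe on a small invented Series, as a sketch: value_counts() produces the frequencies, and its index and values become the x and y data for ax.bar(). The Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample of item names.
items = pd.Series(
    ["Chicken Bowl", "Chips", "Chicken Bowl",
     "Canned Soda", "Chicken Bowl", "Chips"],
    name="item_name",
)

# value_counts() returns the frequencies sorted from most to least common.
counts = items.value_counts()

fig, ax = plt.subplots()
bars = ax.bar(counts.index, counts.values)
ax.set_xlabel("item")
ax.set_ylabel("frequency")

heights = [bar.get_height() for bar in bars]
print(list(counts.index), heights)
```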

Plotting with pandas

Pandas visualization is a handy tool for creating plots directly from a pandas DataFrame or Series. Instead of calling Matplotlib functions yourself, you call plotting methods that live on the pandas objects (i.e. DataFrame or Series), and pandas calls Matplotlib internally.
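Because pandas draws with Matplotlib under the hood, its plot methods return a Matplotlib Axes and accept an existing one through the ax argument, so the two styles can be mixed. A minimal sketch with invented data; the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.axes import Axes

# Invented per-order data.
df = pd.DataFrame(
    {"quantity": [4, 2, 2], "item_price": [11.56, 16.98, 12.67]},
    index=pd.Index([1, 2, 3], name="order_id"),
)

fig, ax = plt.subplots()

# Let pandas draw onto the Axes we created ourselves.
returned = df.plot.scatter(x="quantity", y="item_price", ax=ax)

# The returned object is that same Axes, so Matplotlib methods still apply.
returned.set_title("drawn by pandas, polished with matplotlib")
print(returned is ax, isinstance(returned, Axes))
```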

Scatter Plot

scatter_data.plot.scatter(x='quantity', y='item_price', title='the number of items ordered per order price');

[Figure: scatter plot of quantity against item_price]

To draw a scatter plot, we just use the <data_set>.plot.scatter() function and pass x='quantity', y='item_price' to specify which columns of the data set to use. Optionally, we can also pass a title.

Line Chart

# specify the columns by passing x and y, as with <data_set>.plot.scatter()
line_data.plot.line(x='quantity', y='item_price', title="order data");
# well, this might look a bit scary... like what? The strike of a missile?
# (the rows are not sorted by quantity, so the line zigzags back and forth)

[Figure: line chart of item_price against quantity]
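A line chart connects the points in row order, so an unsorted x column makes the line jump back and forth. One possible remedy, sketched here on invented data, is to sort by the x column before plotting. The Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import pandas as pd

# Invented data whose x column (quantity) is not in ascending order.
df = pd.DataFrame({
    "quantity": [4, 2, 3, 1],
    "item_price": [11.56, 16.98, 12.67, 5.0],
})

# Sort by the x column first so the line runs from left to right.
ax = df.sort_values("quantity").plot.line(
    x="quantity", y="item_price", title="order data"
)

xs = list(ax.lines[0].get_xdata())
print(xs)
```

Whether a line chart is the right choice for this pair of columns is a separate question; a scatter plot avoids the ordering issue altogether.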

# plot the whole dataframe if we don't specify x, y columns
# unlike matplotlib, pandas automatically plots all available numeric columns against the index
line_data.plot.line(title='order data');

[Figure: line chart of quantity and item_price by order_id]
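To see which columns end up in the chart, the sketch below plots a small invented frame that also has a text column (note, a hypothetical name) and then reads the labels of the lines that were drawn. The Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import pandas as pd

# Invented data: two numeric columns and one text column.
df = pd.DataFrame(
    {
        "quantity": [4, 2, 2],
        "item_price": [11.56, 16.98, 12.67],
        "note": ["a", "b", "c"],  # hypothetical non-numeric column
    },
    index=pd.Index([1, 2, 3], name="order_id"),
)

# With no x and y given, pandas plots each numeric column against the index.
ax = df.plot.line(title="order data")

labels = [line.get_label() for line in ax.lines]
print(labels)
```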

Histogram

hist_data['item_price'].plot.hist();

[Figure: histogram of item_price]

To plot a histogram, you just need to use <Series>.plot.hist().

Note that we can’t use hist_data['item_name'] anymore because here numerical values are required. This makes more sense, as the columns of a histogram are normally positioned over a label that represents a quantitative variable (i.e. a range of numbers), while the columns of a bar chart are usually positioned over a label that represents a categorical variable.
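For a numeric Series, the ranges on the x-axis can be controlled with the bins argument. A minimal sketch with invented prices; the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import pandas as pd

# Invented prices.
prices = pd.Series([2.18, 8.75, 10.98, 11.25, 16.98], name="item_price")

# bins sets how many equal-width value ranges the prices are grouped into.
ax = prices.plot.hist(bins=4)
ax.set_xlabel("item price")

n_bars = len(ax.patches)
total = sum(patch.get_height() for patch in ax.patches)
print(n_bars, total)
```

The bar heights always add up to the number of observations, whatever the number of bins.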

Bar Chart

bar_data['item_name'].value_counts().sort_values(ascending=False).plot.bar();

[Figure: bar chart of item counts]

To plot a bar chart, you just need to use <Series>.plot.bar() or <dataframe>.plot.bar(x='',y=''). To plot a horizontal bar chart, replace bar() with barh().

bar_data['item_name'].value_counts().sort_values(ascending=False).plot.barh();

[Figure: horizontal bar chart of item counts]
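The DataFrame form with x and y works the same way as the Series form. A minimal sketch on an invented frame of precomputed counts; the column name orders is hypothetical, and the Agg backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in a notebook
import pandas as pd

# Invented, precomputed counts ("orders" is a hypothetical column name).
counts = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Chips", "Canned Soda"],
    "orders": [3, 2, 1],
})

# Vertical bars: x gives the categories, y gives the bar heights.
ax_v = counts.plot.bar(x="item_name", y="orders", legend=False)

# Horizontal bars: same arguments, but the values become bar widths.
ax_h = counts.plot.barh(x="item_name", y="orders", legend=False)

bar_heights = [patch.get_height() for patch in ax_v.patches]
barh_widths = [patch.get_width() for patch in ax_h.patches]
print(bar_heights, barh_widths)
```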

For more parameter options for each plot function, look them up in the official documentation. There is no point memorising all of them because there are too many…

matplotlib

pandas visualization