Notebook 2, Module 1, Data Acquisition and Data Management, CAS Applied Data Science, 2019-08-22, S. Haug, University of Bern.

1. Visualisation of Data - Examples

Learning outcomes:

Participants will be able to make good data science plots, with practice in

- plotting line charts from series and dataframes
- plotting histograms
  - understanding the effect of binning
- plotting scatter plots
- plotting box plots
- plotting error bars
- formatting plots

Introduction Slides

https://docs.google.com/presentation/d/1HhRIIVq46DyVNm68WeTqr_vZvOgSMWBZa2XDwWNH8H4/edit?usp=sharing

Further sources

Here you have examples of plotting possibilities with pandas. They make data science plotting very easy and fast. However, you may have special needs that are not supported. Then you can use the underlying plotting module matplotlib.

Plotting is an art and you can spend enormous amounts of time on it. There are many types of plots, and you may even invent your own. We only show some examples and point out some important things. If you need to go further, you have to work independently.

Some vocabulary and plots can only be understood with the corresponding statistics background, which is part of module 2.

0. Load the modules

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

1. Plot line charts (time series)

First we use the data structure Series (one dimensional).

# Generate 1000 random numbers for 1000 days from the normal distribution
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.head()
#ts.plot()
#plt.show()

2000-01-01   -0.211086
2000-01-02   -0.315067
2000-01-03   -2.195546
2000-01-04   -1.692554
2000-01-05   -0.174438
Freq: D, dtype: float64
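The numbers above change on every run because no seed is set. A minimal sketch of a reproducible variant follows. It assumes NumPy's `default_rng` generator and an arbitrary seed of 42; both are choices made here for illustration and are not part of the original notebook. The axis labels are also illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: seed 42 is arbitrary; any fixed seed makes the series reproducible.
rng = np.random.default_rng(42)
steps = pd.Series(rng.standard_normal(1000),
                  index=pd.date_range('2000-01-01', periods=1000))
walk = steps.cumsum()  # running sum of the daily steps (a random walk)

ax = walk.plot(label='random walk')
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative sum (arbitrary units)')
ax.legend()
plt.show()
```

Running the cell twice should now give the same curve, which helps when a plot has to be regenerated for a report.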

We can generate four time series, keep them in a dataframe and plot all four.

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['All','Bin','C','D'])
df_cumsum = df.cumsum()
df_cumsum.plot() # DataFrame.plot() creates its own figure, so a preceding plt.figure() would only add an empty one
plt.show()

2. Plot histograms (frequency plots)

For this we use our Iris dataset.

iris_file = "../CAS-Applied-Data-Science-master/Module-1/iris.csv"
df = pd.read_csv(iris_file,names=['slength','swidth','plength','pwidth','species']) # the four measurements are parsed as floats; 'species' remains a string (str)
df.head() # print first five lines of data

Plot two histograms with a legend in the same graph.

df['slength'].plot(kind="hist",fill=True,histtype='step',label='slength')
df['swidth'].plot(kind="hist",fill=False,histtype='step',label='swidth')
plt.legend()
plt.show()

The effect of binning

When data is binned (or sampled), the bin size affects the number of counts in each bin. The count in a bin fluctuates statistically; for counts above about 20 these fluctuations are approximately normally distributed. So depending on your bin size, the same data may look different.

Fine binning (small bin size) may show structures that are pure statistical fluctuations without any other meaning. Reading meaning into them is overfitting. Bins that are too wide may wipe out real structures in the data (underfitting). If known, a bin size guided by the physical resolution of the sensor is close to optimal.

Plot the same histogram (sepal length) with a different binning.

df['slength'].plot(bins=10,range=(2,8), kind="hist",fill=False,histtype='step')
plt.show()
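The effect of the bin size can also be checked numerically. The sketch below uses `np.histogram` on synthetic data (an assumption: 150 normally distributed values standing in for a length in cm, not the Iris file) and prints, for three binnings, the bin width, the highest bin count and its rough relative fluctuation, taken as 1/sqrt(N) for a bin with N entries.

```python
import numpy as np

# Assumption: synthetic stand-in for a measured length in cm (not the Iris data).
rng = np.random.default_rng(1)
values = rng.normal(loc=5.8, scale=0.8, size=150)

summary = {}
for n_bins in (5, 10, 50):
    counts, edges = np.histogram(values, bins=n_bins, range=(2, 8))
    summary[n_bins] = counts
    peak = counts.max()
    # A bin with N entries fluctuates by roughly sqrt(N),
    # so its relative fluctuation is about 1/sqrt(N).
    print(f"{n_bins:2d} bins | width {edges[1] - edges[0]:.2f} cm | "
          f"peak count {peak} | relative fluctuation ~{1 / np.sqrt(peak):.0%}")
```

With many narrow bins each bin holds few entries, so the relative fluctuation per bin grows and the histogram looks noisier, even though the data are the same.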

Always label the axes (also with units)

ax = df['slength'].plot(kind="hist",bins=20,range=(2,8),fill=False,histtype='step',label='slength')
df['swidth'].plot(kind="hist",bins=20,range=(2,8),fill=False,histtype='step',label='swidth',ax=ax)
ax.set_xlabel('x / cm')
ax.set_ylabel('Count / 0.3 cm') # bin width = (8 - 2) cm / 20 bins = 0.3 cm
ax.set_title('Sepal Length and Width')
plt.legend()
plt.show()

A figure with several plots
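
One way to place several plots in a single figure is `plt.subplots`, which returns a figure and an array of axes that can be passed to pandas via the `ax` keyword. The following is a minimal sketch under assumptions: the dataframe `demo` and its column names are made up for illustration and stand in for any two measured columns.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: small synthetic frame standing in for two measured columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'length': rng.normal(5.8, 0.8, 150),
                     'width': rng.normal(3.0, 0.4, 150)})

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Left panel: histogram of one column
demo['length'].plot(kind='hist', bins=15, histtype='step', ax=axes[0])
axes[0].set_xlabel('Length / cm')
axes[0].set_ylabel('Count')

# Right panel: scatter plot of the two columns
demo.plot(x='length', y='width', kind='scatter', ax=axes[1])
axes[1].set_xlabel('Length / cm')
axes[1].set_ylabel('Width / cm')

fig.tight_layout()  # avoid overlapping labels between the panels
plt.show()
```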

3. Scatter plots

Scatter plots show how the data is distributed in two dimensions. They are good for finding correlations or anti-correlations between two variables. Several scatter plots can also be combined in one figure.

df.plot(x='slength',y='swidth',kind="scatter",c='c')
plt.show()
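
What a scatter plot shows by eye can be quantified with the Pearson correlation coefficient. The sketch below uses `np.corrcoef` on synthetic data (an assumption made for illustration; the variables are not taken from the Iris file) to show a correlated, an anti-correlated and an uncorrelated pair.

```python
import numpy as np

# Assumption: synthetic data, chosen so that the expected correlations are obvious.
rng = np.random.default_rng(3)
x = rng.normal(size=500)
noise = rng.normal(size=500)

pairs = {
    'correlated': 2.0 * x + 0.5 * noise,        # rises with x
    'anti-correlated': -2.0 * x + 0.5 * noise,  # falls with x
    'uncorrelated': noise,                      # independent of x
}

r_values = {}
for name, y in pairs.items():
    # np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r(x, y)
    r_values[name] = np.corrcoef(x, y)[0, 1]
    print(f"{name:16s} r = {r_values[name]:+.2f}")
```

For a dataframe, `df.corr()` gives the same coefficient for all pairs of numeric columns at once.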

The pandas.plotting module offers some nice tools, for example automatic plotting of all pairwise scatter plots.

from pandas.plotting import scatter_matrix
scatter_matrix(df[df['species']=='Iris-setosa'], alpha=0.2, figsize=(8, 8), diagonal='hist')
plt.show()

Or plotting of Andrews curves. https://en.wikipedia.org/wiki/Andrews_plot

from pandas.plotting import andrews_curves
andrews_curves(df, 'species')
plt.show()
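
To see what such a curve represents, the definition from the Wikipedia page linked above can be written out by hand: each observation (x1, x2, x3, ...) becomes the function f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... The helper `andrews_curve` below is a hypothetical name introduced for illustration; it is not the pandas implementation.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one observation `row` at the angles `t`."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    curve = np.full_like(t, row[0] / np.sqrt(2.0))  # constant term x1/sqrt(2)
    for i, value in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2                     # 1, 1, 2, 2, 3, ...
        trig = np.sin if i % 2 == 1 else np.cos     # sin, cos, sin, cos, ...
        curve += value * trig(harmonic * t)
    return curve

# First Iris row (sepal length, sepal width, petal length, petal width)
t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_curve([5.1, 3.5, 1.4, 0.2], t)
```

Observations with similar values give similar curves, which is why the species tend to form separate bands in the plot.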

There are several other tools too. See https://pandas.pydata.org/pandas-docs/stable/visualization.html

4. Box plots

Box plots can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column.

color = dict(boxes='DarkGreen', whiskers='DarkOrange',
             medians='DarkBlue', caps='Gray')
df.plot.box(color=color)
plt.show()

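Box plots are non-parametric. The box shows the first, second (median) and third quartiles. The whiskers may mark standard deviations or other percentiles; by default, pandas (via matplotlib) draws them to the furthest data points within 1.5 times the interquartile range of the box.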
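
The numbers behind a box plot can be computed directly. The sketch below uses a small made-up sample (an assumption for illustration) and the common convention of whiskers reaching to the furthest data points within 1.5 times the interquartile range; points beyond the whiskers are drawn as outliers.

```python
import numpy as np

# Assumption: small made-up sample with one obvious outlier (30).
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range = height of the box

# Whiskers: furthest data points still within 1.5 * IQR of the box edges
lower_whisker = data[data >= q1 - 1.5 * iqr].min()
upper_whisker = data[data <= q3 + 1.5 * iqr].max()
outliers = data[(data < lower_whisker) | (data > upper_whisker)]

print(f"box: {q1} to {q3}, median {median}")
print(f"whiskers: {lower_whisker} to {upper_whisker}, outliers: {outliers}")
```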

5. Plotting with error bars

There is no science without error bars, or better, uncertainties. The meaning of uncertainties is discussed in module 2. Here we only show by example how to plot uncertainties.

Plotting with error bars is supported in DataFrame.plot() and Series.plot().

Horizontal and vertical error bars can be supplied to the xerr and yerr keyword arguments to plot(). The error values can be specified using a variety of formats:

- As a DataFrame or dict of errors with column names matching the columns attribute of the plotting DataFrame or matching the name attribute of the Series.
- As a str indicating which of the columns of plotting DataFrame contain the error values.
- As raw values (list, tuple, or np.ndarray). Must be the same length as the plotting DataFrame/Series.

Asymmetrical error bars are also supported; however, raw error values must be provided in this case. For a Series of length N, a 2xN array should be provided, indicating lower and upper (or left and right) errors. For a DataFrame with M columns and N rows, asymmetrical errors should be in an Mx2xN array.

Here is an example using an error dataframe (symmetric uncertainties).

my_df   = pd.DataFrame([6,15,4,20,16,13]) # Some random data
my_df_e = (my_df)**0.5 # The error dataframe
my_df.plot(yerr=my_df_e)
plt.show()
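
Asymmetric uncertainties can be sketched in the same way with raw values. The example below assumes a recent pandas version, which expects an array of shape (2, N) for a Series of length N, with the lower errors in the first row and the upper errors in the second. The error values themselves are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.Series([6, 15, 4, 20, 16, 13], name='counts')

# Assumption: made-up asymmetric uncertainties, for illustration only.
lower = 0.5 * np.sqrt(counts.to_numpy())
upper = 1.5 * np.sqrt(counts.to_numpy())
yerr = np.array([lower, upper])  # shape (2, N): row 0 = lower, row 1 = upper

ax = counts.plot(yerr=yerr, marker='o', linestyle='none', capsize=3)
ax.set_xlabel('Measurement index')
ax.set_ylabel('Counts')
plt.show()
```

Drawing markers without a connecting line avoids suggesting a trend between independent measurements.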

6. Formatting plots

Plots can easily be formatted with keywords. One can adjust colors, types of shading, lines, axes, legends, titles, etc. Some formatting has been exemplified above. More examples are in the documentation. https://pandas.pydata.org/pandas-docs/stable/visualization.html

With the matplotlib module you are even more flexible. See https://matplotlib.org/gallery/index.html for inspiration.
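
As a minimal sketch of keyword formatting, the example below combines pandas keywords (`style`, `color`, `linewidth`, `grid`, `figsize`, `title`) with matplotlib calls on the returned axes. The data and the column names `sensor 1` and `sensor 2` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumption: synthetic data with made-up column names.
rng = np.random.default_rng(7)
demo = pd.DataFrame(rng.standard_normal((200, 2)).cumsum(axis=0),
                    columns=['sensor 1', 'sensor 2'])

ax = demo.plot(
    style=['-', '--'],              # one line style per column
    color=['tab:blue', 'tab:red'],  # one colour per column
    linewidth=1.5,
    grid=True,
    figsize=(8, 4),
    title='Formatted line chart',
)
# Anything pandas does not expose can be set on the matplotlib axes
ax.set_xlabel('Sample number')
ax.set_ylabel('Signal (arbitrary units)')
ax.legend(loc='upper left', frameon=False)
plt.show()
```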

7. Summary

Do you remember three important plot types?
What can the binning of a histogram do to the interpretation of it?
What are the three parts of the general communication process?
Can you name three important points to include in plots and their figure legends?

End of the prepared data visualisation examples.

Feel free to play on your own. You should make some good plots of your dataset for the project report.

Output of df.head() for the Iris dataframe loaded in section 2:

	slength	swidth	plength	pwidth	species
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa