Notebook 2, Module 1, Data Acquisition and Data Management, CAS Applied Data Science, 2019-08-22, S. Haug, University of Bern.
Learning outcomes:
Participants will be able to make good data science plots, with practice on
Introduction Slides
Further sources
Here are examples of the plotting possibilities with pandas. They make data science plotting very easy and fast. However, you may have special needs that are not supported. In that case you can use the underlying plotting module, matplotlib.
Plotting is an art, and you can spend enormous amounts of time on it. There are many types of plots, and you may even invent your own. We only show some examples and point out some important things. If you need to go further, you have to work independently.
Some vocabulary and plots are only understandable with corresponding statistics background. This is part of module 2.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
First we use the one-dimensional data structure Series.
# Generate 1000 random numbers for 1000 days from the normal distribution
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.head()
#ts.plot()
#plt.show()
We can generate four time series, keep them in a dataframe and plot all four.
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['All','Bin','C','D'])
df_cumsum = df.cumsum()
df_cumsum.plot() # DataFrame.plot() creates its own figure, so no plt.figure() call is needed
plt.show()
For the following plots we use our Iris dataset.
iris_file = "../CAS-Applied-Data-Science-master/Module-1/iris.csv"
df = pd.read_csv(iris_file,names=['slength','swidth','plength','pwidth','species']) # column names are supplied explicitly via names=
df.head() # print first five lines of data
Plot two histograms with a legend in the same graph.
df['slength'].plot(kind="hist",fill=True,histtype='step',label='slength')
df['swidth'].plot(kind="hist",fill=False,histtype='step',label='swidth')
plt.legend()
plt.show()
When data is binned (or sampled), the bin size affects the number of counts in each bin. The count in a bin fluctuates statistically; for counts above about 20 the fluctuations are approximately normally distributed. So depending on your bin size, the same data may look different.
Very fine binning (small bin size) may introduce purely statistical structures without any other meaning; this is a form of overfitting. Bins that are too wide may wipe out structures in the data (underfitting). If known, a bin size guided by the physical resolution of the sensor is close to optimal.
Plot the same histograms with a different binning.
df['slength'].plot(bins=10,range=(2,8), kind="hist",fill=False,histtype='step')
plt.show()
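The effect of the bin size described above can also be checked numerically. The following is a minimal sketch that uses synthetic, normally distributed numbers as a stand-in (an assumption, these are not the Iris measurements) and counts how the same sample fills 5, 20 and 80 bins.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the output is reproducible
# Synthetic stand-in data (an assumption, not the Iris measurements)
data = rng.normal(loc=5.8, scale=0.8, size=150)

summary = {}
for n_bins in (5, 20, 80):
    # np.histogram returns the counts per bin and the bin edges
    counts, edges = np.histogram(data, bins=n_bins)
    bin_width = edges[1] - edges[0]
    summary[n_bins] = counts
    print(f"{n_bins:2d} bins, width {bin_width:.2f}: "
          f"largest count {counts.max()}, empty bins {np.sum(counts == 0)}")
```

With many narrow bins the counts per bin become small and their relative fluctuations large, which is where purely statistical structures start to show up.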
ax = df['slength'].plot(kind="hist",fill=False,histtype='step',label='slength')
df['swidth'].plot(kind="hist",fill=False,histtype='step',label='swidth')
ax.set_xlabel('x / cm')
ax.set_ylabel('Count / 0.3 cm')
ax.set_title('Sepal Length and Width')
plt.legend()
plt.show()
A figure with several plots
Scatter plots show how the data is distributed in two dimensions. They are good for finding (anti-)correlations between two variables. We plot several plots in one figure.
df.plot(x='slength',y='swidth',kind="scatter",c='c')
plt.show()
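To place several scatter plots in one figure, one option is to create the axes with matplotlib and hand them to pandas through the ax keyword. This is a sketch under stated assumptions: the dataframe demo is filled with random numbers and only borrows the Iris column names, so that the example runs without the CSV file.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
# Hypothetical stand-in dataframe with the same numeric column names as above
demo = pd.DataFrame(rng.normal(size=(50, 4)),
                    columns=['slength', 'swidth', 'plength', 'pwidth'])

# One figure with two axes side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
demo.plot(x='slength', y='swidth', kind='scatter', ax=axes[0], title='Sepal')
demo.plot(x='plength', y='pwidth', kind='scatter', ax=axes[1], title='Petal')
fig.tight_layout()  # avoid overlapping labels
```

Replacing demo with the Iris dataframe df gives the corresponding plots for the real data; plt.show() displays the figure as in the other cells.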
The pandas.plotting module provides some nice tools, for example the automatic plotting of all pairwise scatter plots.
from pandas.plotting import scatter_matrix
scatter_matrix(df[df['species']=='Iris-setosa'], alpha=0.2, figsize=(8, 8), diagonal='hist')
plt.show()
Or plotting of Andrews curves. https://en.wikipedia.org/wiki/Andrews_plot
from pandas.plotting import andrews_curves
andrews_curves(df, 'species')
plt.show()
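To get a feeling for what an Andrews curve is, the following sketch evaluates the finite Fourier series from the Wikipedia page linked above for a single observation. The function name andrews_curve and the four feature values are illustrative assumptions; the pandas implementation may differ in details.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ..."""
    x = np.asarray(x, dtype=float)
    result = np.full_like(t, x[0] / np.sqrt(2.0), dtype=float)
    for i, coeff in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2          # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
# Illustrative observation with four features (hypothetical values)
curve = andrews_curve([5.1, 3.5, 1.4, 0.2], t)
```

Each row of the dataframe becomes one such curve, and rows with similar feature values give curves that lie close together.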
There are several other tools too. See https://pandas.pydata.org/pandas-docs/stable/visualization.html
Box plots can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column.
color = dict(boxes='DarkGreen', whiskers='DarkOrange',
             medians='DarkBlue', caps='Gray')
df.plot.box(color=color)
plt.show()
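Box plots are non-parametric. The box shows the first, second (median) and third quartiles. The whiskers may represent standard deviations or other percentiles, depending on the convention; by default, matplotlib draws them out to the most extreme data points within 1.5 times the interquartile range from the box.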
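The numbers behind a box plot can be computed directly. This small sketch uses made-up values (an assumption, chosen so that one point lies far from the rest) and the 1.5 times interquartile range convention that matplotlib uses by default for the whiskers.

```python
import pandas as pd

# Made-up example values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30])

# First, second (median) and third quartile
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range, the height of the box

# Whisker limits under the 1.5 * IQR convention
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, q2, q3, iqr)      # 3.0 5.0 7.0 4.0
print(outliers.tolist())    # [30]
```

Points outside the whisker limits are drawn as individual markers in the box plot.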
There is no science without error bars, or better, uncertainties. The meaning of uncertainties is discussed in module 2. Here we only show by example how to plot uncertainties.
Plotting with error bars is supported in DataFrame.plot() and Series.plot().
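Horizontal and vertical error bars can be supplied to the xerr and yerr keyword arguments of plot(). The error values can be specified in a variety of formats: as a DataFrame or dict of errors with matching column names, as a string naming the column that holds the error values, or as raw values (list, tuple or np.ndarray) of the same length as the plotted data.
Asymmetrical error bars are also supported; however, raw error values must be provided in this case. For a Series of length N, a 2xN array should be provided, giving the lower and upper (or left and right) errors. For a DataFrame, asymmetrical errors should be given as an array of shape (number of columns, 2, number of rows).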
Here is an example using an error dataframe (symmetric uncertainties).
my_df = pd.DataFrame([6,15,4,20,16,13]) # Some random data
my_df_e = (my_df)**0.5 # The error dataframe: square root of each value, a common choice for counts
my_df.plot(yerr=my_df_e)
plt.show()
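Asymmetric uncertainties, as described above, can be passed as a raw array with one row of lower and one row of upper errors. This is a minimal sketch; the error values are invented for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series([6, 15, 4, 20, 16, 13])      # same example data as above

# Invented lower and upper errors, one value per data point
lower = np.array([1, 2, 1, 3, 2, 2])
upper = np.array([2, 4, 2, 5, 4, 3])
asym_err = np.vstack([lower, upper])       # shape (2, 6): row 0 lower, row 1 upper

ax = s.plot(yerr=asym_err, capsize=3)      # capsize is passed on to matplotlib
ax.set_ylabel('Value')
```

As in the other cells, plt.show() displays the figure.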
Plots can easily be formatted with keywords. One can adjust colors, types of shading, lines, axes, legends, titles, etc. Some formatting has been exemplified above. More examples are in the documentation. https://pandas.pydata.org/pandas-docs/stable/visualization.html
With the matplotlib module you are even more flexible. See https://matplotlib.org/gallery/index.html for inspiration.
Feel free to play on your own. You should make some good plots of your dataset for the project report.