Notebook 1, Module 2, Statistical Inference for Data Science, CAS Applied Data Science, 2019-08-27, G. Conti, S. Haug, University of Bern.
First load the libraries / modules.
# Load the required Python libraries by executing this cell (press Ctrl+Enter)
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import pandas as pd
Load the dataset into a dataframe.
df = pd.read_csv('iris.csv',names=['slength','swidth','plength','pwidth','species'])
df.head() # Print the first five rows
Browse through the rows. With display.max_rows set to 10, pandas shows only the first and last few rows of the dataframe.
pd.set_option('display.max_rows', 10)
df
Print some descriptive statistics.
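# numeric_only=True skips the text column 'species'; recent pandas versions raise an error without it
df[df['species']=='Iris-setosa'].mean(numeric_only=True)
df[df['species']=='Iris-setosa'].median(numeric_only=True)
df[df['species']=='Iris-setosa'].std(numeric_only=True)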
Or get the summary.
df[df['species']=='Iris-setosa'].describe()
Calculate the sample variance.
df[df['species']=='Iris-setosa'].var(numeric_only=True) # numeric_only=True skips the 'species' column
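The sample variance divides the summed squared deviations by n - 1, not by n. The following minimal sketch checks this by hand and compares it with the NumPy default, which divides by n. The five values are made up for illustration and are not taken from iris.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements in cm, made up for illustration (not from iris.csv)
x = pd.Series([4.9, 5.1, 5.0, 5.4, 4.6])
n = len(x)

# Sample variance by hand: squared deviations from the mean, divided by n - 1
var_manual = ((x - x.mean()) ** 2).sum() / (n - 1)

var_pandas = x.var()              # pandas default is ddof=1 (sample variance)
var_numpy = np.var(x.to_numpy())  # NumPy default is ddof=0 (divides by n)

print(var_manual, var_pandas, var_numpy)
```

Passing ddof=1 to np.var gives the same number as pandas.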
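Calculate the kurtosis and the skewness. Note that pandas returns the excess kurtosis, which is 0 for a normal distribution.
df[df['species']=='Iris-setosa'].kurt(numeric_only=True)
df[df['species']=='Iris-setosa'].skew(numeric_only=True)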
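To see which definitions pandas uses, one can compare with scipy.stats on the same sample. This is a sketch on a synthetic normal sample (the seed, mean and width are arbitrary choices, not the iris data). With bias=False and fisher=True, the scipy functions should reproduce the pandas values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic normal sample for illustration only (not the iris data)
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=5.0, scale=0.35, size=50))

skew_pd = sample.skew()  # bias-corrected sample skewness
kurt_pd = sample.kurt()  # bias-corrected excess kurtosis (normal -> 0)

# Same quantities with scipy: bias=False applies the small-sample correction,
# fisher=True subtracts 3 so that a normal distribution gives 0
skew_sp = stats.skew(sample.to_numpy(), bias=False)
kurt_sp = stats.kurtosis(sample.to_numpy(), fisher=True, bias=False)

print(skew_pd, skew_sp)
print(kurt_pd, kurt_sp)
```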
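Calculate and print the covariance and correlation matrices.
# numeric_only=True (available in pandas 1.5 and later) skips the text column 'species'
df[df['species']=='Iris-setosa'].cov(numeric_only=True)
df[df['species']=='Iris-setosa'].corr(numeric_only=True)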
Study and comment on these numbers.
We have now summarised the data with numbers and tables. Let us do the same with plots, starting with histograms.
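When commenting on the two matrices, it helps to remember that the correlation is the covariance divided by the product of the two standard deviations. A minimal sketch on a made-up two-column dataframe (the column names a and b and the random values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: column b depends partly on column a (illustration only)
rng = np.random.default_rng(1)
a = rng.normal(size=50)
toy = pd.DataFrame({'a': a, 'b': 0.5 * a + rng.normal(scale=0.5, size=50)})

cov = toy.cov()
std = toy.std()

# corr_ij = cov_ij / (std_i * std_j)
corr_manual = cov / np.outer(std, std)

print(corr_manual)
print(toy.corr())
```

The covariance carries the units of the two variables (here it would be cm squared), while the correlation is dimensionless and lies between -1 and 1.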
df_setosa = df[df['species']=='Iris-setosa']
plt.subplot(221)
df_setosa['slength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa Sepal', label="length")
ax_s = df_setosa['swidth'].plot(kind="hist",fill=False,histtype='step', label="width")
ax_s.set_xlabel('x [cm]')
ax_s.set_ylabel('A.U.')
plt.legend()
plt.subplot(222)
df_setosa['plength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa Petal', label="length")
ax_s = df_setosa['pwidth'].plot(kind="hist",fill=False,histtype='step', label="width")
ax_s.set_xlabel('x [cm]')
ax_s.set_ylabel('A.U.')
plt.legend()
plt.show()
Scatter plots.
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.scatter_matrix.html
from pandas.plotting import scatter_matrix
scatter_matrix(df[df['species']=='Iris-setosa'], alpha=0.2, figsize=(6, 6), diagonal='hist')
plt.show()
We have now studied our data with descriptive statistics. Before we can do statistical inference, we need a model for the data.
Let us start with the sepal length of the Setosa species. Its histogram looks roughly like a normal distribution, so we choose that as our model. When we come to Hypothesis Testing, we will see how to test this choice mathematically.
Our model is a normal distribution with the mean and the width (standard deviation) taken from the dataset: norm.pdf(x,mean,width).
from scipy.stats import norm
mean = df_setosa['slength'].mean()
width = df_setosa['slength'].std()
print(mean,width)
# Create figure and axis
fig, ax = plt.subplots(1,1)
# Create 80 x values between 3 and 7 cm and plot the normal pdf for these values
#x = np.linspace(norm.ppf(0.01),norm.ppf(0.99), 100)
x = np.linspace(3,7,80)
ax.plot(x, norm.pdf(x,mean,width),'b-', lw=2, label='Normal pdf')
# density=True normalises the histogram to unit area so that it is comparable with the pdf
df_setosa['slength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa Sepal', label="length", density=True, ax=ax)
ax.legend(loc='best', frameon=False)
plt.show()
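The mean and width above are computed directly from the sample. As an alternative sketch, scipy can estimate both parameters with norm.fit, which returns the maximum likelihood estimates. The data below are synthetic (generated with arbitrary parameters, not the iris sample). Note that the fitted width corresponds to the standard deviation with ddof=0, whereas the pandas std() default is ddof=1, so the two differ slightly for small samples.

```python
import numpy as np
from scipy.stats import norm

# Synthetic sample for illustration only (not the iris data)
rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=0.35, size=50)

# Maximum likelihood estimates of the mean (loc) and the width (scale)
mu_hat, sigma_hat = norm.fit(data)

print(mu_hat, sigma_hat)
print(data.mean(), data.std(ddof=0), data.std(ddof=1))
```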
With our model we could now do a lot of inference. Taking a random flower, one could for example test how likely it is to be Iris Setosa.
More for you to practise in the afternoon notebook.
#norm.pdf(5,mean,width)
# Probability under the model that a Setosa sepal length is below 4 cm
norm.cdf(4,mean,width)
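A few more probabilities that can be read off a normal model are sketched below. The values of mu and sigma are hypothetical placeholders chosen for illustration; in the notebook one would use the mean and width computed from the data.

```python
from scipy.stats import norm

# Hypothetical model parameters in cm (placeholders, not computed from iris.csv)
mu, sigma = 5.0, 0.35

# P(X < 4.0): cumulative distribution function
p_below = norm.cdf(4.0, mu, sigma)

# P(X > 5.5): survival function, equal to 1 - cdf
p_above = norm.sf(5.5, mu, sigma)

# P(4.5 < X < 5.5): difference of two cdf values
p_between = norm.cdf(5.5, mu, sigma) - norm.cdf(4.5, mu, sigma)

# Central interval that contains 95 % of the probability
lo, hi = norm.interval(0.95, loc=mu, scale=sigma)

print(p_below, p_above, p_between)
print(lo, hi)
```

A measured length far outside the central interval would be unlikely under this model, which is the idea behind the hypothesis tests that follow later in the course.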