Notebook 1, Module 2, Statistical Inference for Data Science, CAS Applied Data Science, 2020-08-25, G. Conti, S. Haug, University of Bern.
First load the libraries / modules.
# Load the needed python libraries by executing this python code (press ctrl enter)
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import pandas as pd
Load the dataset into a dataframe.
df = pd.read_csv('iris.csv',names=['slength','swidth','plength','pwidth','species'])
df.head() # Print the first five rows
Browse through all rows.
#pd.set_option('display.max_rows', 200)
#df
Print some descriptive statistics.
df[df['species']=='Iris-setosa'].mean(numeric_only=True) # numeric_only=True skips the text column 'species'
# to format the output
mean_pwidth = df[df['species'] == 'Iris-setosa']['pwidth'].mean() # select the column by name, not by position
print("Mean of petal width = %3.2f"%mean_pwidth) # minimum field width 3, 2 digits after the dot, formatted as a float
# other way to format the output
round(df[df['species']=='Iris-setosa'].mean(numeric_only=True),2)
df[df['species']=='Iris-setosa'].median(numeric_only=True)
What is the difference between the median and the mean?
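One way to explore this question is a small experiment with an outlier. The sketch below uses made-up petal widths (not values from iris.csv) and shows that a single extreme value shifts the mean strongly while the median barely moves.

```python
import numpy as np

# Hypothetical petal widths in cm (made-up values, not from iris.csv)
widths = np.array([0.2, 0.2, 0.3, 0.2, 0.4])
widths_with_outlier = np.append(widths, 5.0)  # add one implausibly large value

mean_clean = widths.mean()
median_clean = np.median(widths)
mean_outlier = widths_with_outlier.mean()
median_outlier = np.median(widths_with_outlier)

print(f"without outlier: mean = {mean_clean:.2f}, median = {median_clean:.2f}")
print(f"with outlier:    mean = {mean_outlier:.2f}, median = {median_outlier:.2f}")
```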
df[df['species']=='Iris-setosa'].std(numeric_only=True)
What is the definition of the standard deviation?
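As a hint, the standard deviation can be computed by hand and compared with the library results. This is a minimal sketch on made-up numbers; it shows the two common conventions, dividing by n (the numpy default, `ddof=0`) and dividing by n-1 (the pandas default, `ddof=1`).

```python
import numpy as np
import pandas as pd

# Made-up measurements in cm, for illustration only
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
n = len(x)
squared_dev = (x - x.mean()) ** 2  # squared deviations from the sample mean

std_population = np.sqrt(squared_dev.sum() / n)    # divide by n
std_sample = np.sqrt(squared_dev.sum() / (n - 1))  # divide by n - 1

print(std_population, np.std(x))       # numpy default: ddof=0
print(std_sample, pd.Series(x).std())  # pandas default: ddof=1
```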
Or get the summary.
df[df['species']=='Iris-setosa'].describe()
Do all these digits after the dot make sense?
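One possible way to judge this is to compare the printed digits with the standard error of the mean, std/sqrt(n). The sketch below uses made-up sepal lengths recorded to 0.1 cm; digits far below the standard error carry little information.

```python
import numpy as np
from scipy.stats import sem

# Made-up sepal lengths in cm, recorded to 0.1 cm (illustration only)
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0])

mean = x.mean()
std_error = x.std(ddof=1) / np.sqrt(len(x))  # standard error of the mean

print(f"raw mean:       {mean}")
print(f"standard error: {std_error:.3f}")
print(f"reported mean:  {mean:.2f} +/- {std_error:.2f} cm")
```

`scipy.stats.sem(x)` returns the same standard error and can be used as a cross-check.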
Calculate the sample variance.
df[df['species']=='Iris-setosa'].var(numeric_only=True)
What is the definition of the variance?
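As a reminder, and assuming the usual sample definition, the variance is the square of the standard deviation:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$$

Here $\bar{x}$ is the sample mean and $n$ the number of measurements. The pandas method `var()` uses this $n-1$ divisor by default (`ddof=1`).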
Calculate the skewness and the kurtosis. How are they defined?
df[df['species']=='Iris-setosa'].kurt(numeric_only=True)
df[df['species']=='Iris-setosa'].skew(numeric_only=True)
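As a hint for the definitions, the sketch below computes the moment-based skewness and excess kurtosis by hand on a made-up sample and compares them with `scipy.stats`. It also shows that pandas applies a small-sample bias correction, which corresponds to `bias=False` in scipy.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

# Made-up, right-skewed sample for illustration only
x = np.array([1.0, 1.2, 1.1, 1.3, 1.0, 1.2, 3.5])

# Moment-based definitions: standardise with the ddof=0 standard deviation
z = (x - x.mean()) / x.std()
skew_moment = np.mean(z ** 3)               # third standardised moment
excess_kurt_moment = np.mean(z ** 4) - 3.0  # fourth standardised moment minus 3

print(skew_moment, skew(x))             # scipy default: bias=True
print(excess_kurt_moment, kurtosis(x))  # scipy default: fisher=True, bias=True

# pandas returns bias-corrected estimates
s = pd.Series(x)
print(s.skew(), skew(x, bias=False))
print(s.kurt(), kurtosis(x, bias=False))
```

Subtracting 3 gives the excess kurtosis, which is 0 for a normal distribution.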
Calculate and print the covariance and correlation matrices.
df[df['species']=='Iris-setosa'].cov(numeric_only=True)
df[df['species']=='Iris-setosa'].corr(numeric_only=True)
What is the definition of the correlation?
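As a hint, the Pearson correlation is the covariance divided by the product of the two standard deviations, which makes it dimensionless and bounded between -1 and 1. A minimal sketch on made-up paired measurements, cross-checked against numpy:

```python
import numpy as np

# Made-up paired measurements in cm, for illustration only
length = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4])
width = np.array([3.5, 3.0, 3.2, 3.1, 3.6, 3.9])

# Sample covariance (divide by n - 1)
cov_xy = np.sum((length - length.mean()) * (width - width.mean())) / (len(length) - 1)

# Pearson correlation: covariance scaled by both sample standard deviations
corr_xy = cov_xy / (length.std(ddof=1) * width.std(ddof=1))

print(cov_xy, np.cov(length, width)[0, 1])
print(corr_xy, np.corrcoef(length, width)[0, 1])
```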
Study and comment on the numbers in the correlation matrix.
So far we have done our descriptive statistics with numbers and tables. Now let us do it with plots, starting with the histograms.
df_setosa = df[df['species']=='Iris-setosa']
plt.subplot(221)
df_setosa['slength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa Sepal', label="length")
ax_s = df_setosa['swidth'].plot(kind="hist",fill=False,histtype='step', label="width")
ax_s.set_xlabel('x [cm]')
ax_s.set_ylabel('A.U.')
plt.legend()
plt.subplot(222)
df_setosa['plength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa Petal', label="length")
ax_s = df_setosa['pwidth'].plot(kind="hist",fill=False,histtype='step', label="width")
ax_s.set_xlabel('x [cm]')
ax_s.set_ylabel('A.U.')
plt.legend()
plt.show()
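# Population standard deviation (divide by n):
# $\sigma=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^2}$
# Sample standard deviation: divide by n-1 instead of n if you do not know the true mean and estimate it from the sample.
# The slightly smaller divisor corrects for using a sample instead of the population. pandas std() divides by n-1 by default.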
Scatter plots.
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.scatter_matrix.html
from pandas.plotting import scatter_matrix
scatter_matrix(df[df['species']=='Iris-setosa'], alpha=0.2, figsize=(6, 6), diagonal='hist')
plt.show()
Now we have studied our data with descriptive statistics. Before we can do the statistical inference, we want to make a model for our data.
Let us start with the sepal length of the Setosa species. It looks like a normal distribution, so we choose that as a model. When we come to Hypothesis Testing, we will see how to test this choice mathematically.
Our model will be a normal distribution with the mean and width (standard deviation) taken from the dataset: norm.pdf(x,mean,width).
from scipy.stats import norm
mean = df_setosa['slength'].mean()
width = df_setosa['slength'].std()
print(mean,width)
# Create figure and axis
fig, ax = plt.subplots(1,1)
# Create 100 x values and plot the normal pdf for these values
#x = np.linspace(norm.ppf(0.01),norm.ppf(0.99), 100)
x = np.linspace(3,7,80)
ax.plot(x, norm.pdf(x,mean,width),'b-', lw=2, label='Normal pdf')
df_setosa['slength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa Sepal', label="length", density=True, ax=ax)
ax.legend(loc='best', frameon=False)
plt.show()
With our model we could now do a lot of inference. Taking a random flower, one could for example test how likely it is to be Iris Setosa.
More for you to practise in the afternoon notebook. Please also look at the descriptive statistics methods implemented in the pandas and scipy.stats modules.
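A minimal sketch of such a check, assuming placeholder model parameters rather than the values computed from the dataset. It gives the two-sided tail probability of a hypothetical measurement under the Setosa model. Note that this is the probability of a measurement at least this far from the mean if the flower is Setosa, not the probability that the flower is Setosa.

```python
from scipy.stats import norm

# Placeholder model parameters in cm (assumed for illustration;
# in the notebook, use the mean and width computed from df_setosa)
mean, width = 5.0, 0.35

x_obs = 5.8  # hypothetical sepal length of a new flower, in cm

# Distance from the model mean in units of the width
z = (x_obs - mean) / width

# Probability of a value at least this far from the mean, on either side
p_two_sided = 2 * norm.sf(abs(z))

print(f"z = {z:.2f}, two-sided tail probability = {p_two_sided:.3f}")
```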
Descriptive statistics:
Why descriptive statistics?