Notebook 1, Module 2, Statistical Inference for Data Science, CAS Applied Data Science, 2019-08-27, G. Conti, S. Haug, University of Bern.

2. Descriptive statistics

DEMONSTRATION

  • Do descriptive statistics with the Iris dataset
  • Make a model

First load the libraries / modules.

In [4]:
# Load the required Python libraries by executing this cell (press Ctrl+Enter)
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import pandas as pd

Load the dataset into a dataframe.

In [5]:
df = pd.read_csv('iris.csv',names=['slength','swidth','plength','pwidth','species'])
df.head() # Print the first five rows
Out[5]:
slength swidth plength pwidth species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Browse through the rows (the display is limited to 10 rows, so the middle of the table is abbreviated).

In [6]:
pd.set_option('display.max_rows', 10)
df
Out[6]:
slength swidth plength pwidth species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 5 columns

Print some descriptive statistics.

In [7]:
df[df['species']=='Iris-setosa'].mean(numeric_only=True) # numeric_only skips the text column 'species'
Out[7]:
slength    5.006
swidth     3.418
plength    1.464
pwidth     0.244
dtype: float64
In [8]:
df[df['species']=='Iris-setosa'].median(numeric_only=True)
Out[8]:
slength    5.0
swidth     3.4
plength    1.5
pwidth     0.2
dtype: float64
In [9]:
df[df['species']=='Iris-setosa'].std(numeric_only=True)
Out[9]:
slength    0.352490
swidth     0.381024
plength    0.173511
pwidth     0.107210
dtype: float64

Or get the summary.

In [10]:
df[df['species']=='Iris-setosa'].describe()
Out[10]:
slength swidth plength pwidth
count 50.00000 50.000000 50.000000 50.00000
mean 5.00600 3.418000 1.464000 0.24400
std 0.35249 0.381024 0.173511 0.10721
min 4.30000 2.300000 1.000000 0.10000
25% 4.80000 3.125000 1.400000 0.20000
50% 5.00000 3.400000 1.500000 0.20000
75% 5.20000 3.675000 1.575000 0.30000
max 5.80000 4.400000 1.900000 0.60000
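
The same statistics can be computed for several species at once with groupby. The following is only a minimal sketch: it assumes a small dataframe, here called df_small (a hypothetical name), built from the ten rows printed above instead of the full iris.csv, so the numbers differ from the 50-row results.

```python
import pandas as pd

# Hypothetical mini dataset: the ten rows printed above (not the full iris.csv)
df_small = pd.DataFrame({
    'slength': [5.1, 4.9, 4.7, 4.6, 5.0, 6.7, 6.3, 6.5, 6.2, 5.9],
    'swidth':  [3.5, 3.0, 3.2, 3.1, 3.6, 3.0, 2.5, 3.0, 3.4, 3.0],
    'plength': [1.4, 1.4, 1.3, 1.5, 1.4, 5.2, 5.0, 5.2, 5.4, 5.1],
    'pwidth':  [0.2, 0.2, 0.2, 0.2, 0.2, 2.3, 1.9, 2.0, 2.3, 1.8],
    'species': ['Iris-setosa'] * 5 + ['Iris-virginica'] * 5,
})

# One row per species; the grouping column is excluded from the statistics
group_means = df_small.groupby('species').mean()
group_stds = df_small.groupby('species').std()
print(group_means)
print(group_stds)
```

With the full dataframe, df.groupby('species').describe() should give the complete summary per species in one call.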

Calculate the sample variance.

In [11]:
df[df['species']=='Iris-setosa'].var(numeric_only=True)
Out[11]:
slength    0.124249
swidth     0.145180
plength    0.030106
pwidth     0.011494
dtype: float64
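
Note that pandas divides by n-1 by default (ddof=1), whereas NumPy divides by n (ddof=0). The sketch below illustrates the difference; as an assumption for illustration it uses only the first five sepal lengths printed above, not the full sample.

```python
import numpy as np
import pandas as pd

# First five sepal lengths from the table above (illustration only)
s = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])

var_pandas = s.var()                            # pandas default: ddof=1, divides by n-1
var_numpy = np.var(s.to_numpy())                # NumPy default: ddof=0, divides by n
var_numpy_ddof1 = np.var(s.to_numpy(), ddof=1)  # same convention as pandas

print(var_pandas, var_numpy, var_numpy_ddof1)
```

The standard deviation follows the same convention, which is why the square of std() above reproduces var().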

Calculate the kurtosis and the skewness. Note that pandas returns the excess kurtosis, which is 0 for a normal distribution.

In [12]:
df[df['species']=='Iris-setosa'].kurt(numeric_only=True)
Out[12]:
slength   -0.252689
swidth     0.889251
plength    1.031626
pwidth     1.566442
dtype: float64
In [13]:
df[df['species']=='Iris-setosa'].skew(numeric_only=True)
Out[13]:
slength    0.120087
swidth     0.107053
plength    0.071846
pwidth     1.197243
dtype: float64
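
The pandas methods skew() and kurt() return bias-corrected sample estimates. As a cross-check, the sketch below compares them with scipy.stats, where the correction has to be requested with bias=False. The ten values are the sepal lengths printed above (both species mixed), used here purely as illustrative input.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Ten sepal lengths from the rows printed above (illustration only)
x = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0, 6.7, 6.3, 6.5, 6.2, 5.9])

skew_pd = x.skew()   # bias-corrected sample skewness
kurt_pd = x.kurt()   # bias-corrected excess kurtosis (Fisher definition)

# scipy defaults to the biased estimators, so bias=False is set explicitly
skew_sp = stats.skew(x.to_numpy(), bias=False)
kurt_sp = stats.kurtosis(x.to_numpy(), fisher=True, bias=False)

print(skew_pd, skew_sp)
print(kurt_pd, kurt_sp)
```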

Calculate and print the covariance and correlation matrices.

In [14]:
df[df['species']=='Iris-setosa'].cov(numeric_only=True)
Out[14]:
slength swidth plength pwidth
slength 0.124249 0.100298 0.016139 0.010547
swidth 0.100298 0.145180 0.011682 0.011437
plength 0.016139 0.011682 0.030106 0.005698
pwidth 0.010547 0.011437 0.005698 0.011494
In [15]:
df[df['species']=='Iris-setosa'].corr(numeric_only=True)
Out[15]:
slength swidth plength pwidth
slength 1.000000 0.746780 0.263874 0.279092
swidth 0.746780 1.000000 0.176695 0.279973
plength 0.263874 0.176695 1.000000 0.306308
pwidth 0.279092 0.279973 0.306308 1.000000
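
The two matrices are related: each correlation coefficient is the covariance divided by the product of the two standard deviations. A minimal sketch of this relation, using the covariance matrix copied from Out[14] (rounded to six decimals, so small rounding differences are expected):

```python
import numpy as np

# Covariance matrix of the Setosa sample, copied from Out[14]
# Order of rows and columns: slength, swidth, plength, pwidth
cov = np.array([
    [0.124249, 0.100298, 0.016139, 0.010547],
    [0.100298, 0.145180, 0.011682, 0.011437],
    [0.016139, 0.011682, 0.030106, 0.005698],
    [0.010547, 0.011437, 0.005698, 0.011494],
])

# Standard deviations are the square roots of the diagonal (the variances)
sd = np.sqrt(np.diag(cov))

# corr_ij = cov_ij / (sd_i * sd_j)
corr = cov / np.outer(sd, sd)
print(np.round(corr, 4))
```

The result should agree with Out[15] up to rounding.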

Study and comment on these numbers.

So far we have described the data with numbers and tables. Now let us do it with plots, starting with histograms.

In [16]:
df_setosa = df[df['species']=='Iris-setosa']

plt.subplot(221)
df_setosa['slength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa Sepal', label="length")
ax_s = df_setosa['swidth'].plot(kind="hist",fill=False,histtype='step', label="width")
ax_s.set_xlabel('x [cm]')
ax_s.set_ylabel('A.U.')
plt.legend()

plt.subplot(222)
df_setosa['plength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa Petal', label="length")
ax_s = df_setosa['pwidth'].plot(kind="hist",fill=False,histtype='step', label="width")
ax_s.set_xlabel('x [cm]')
ax_s.set_ylabel('A.U.')
plt.legend()

plt.show()

Scatter plots.

In [17]:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.scatter_matrix.html
from pandas.plotting import scatter_matrix
scatter_matrix(df[df['species']=='Iris-setosa'], alpha=0.2, figsize=(6, 6), diagonal='hist')
plt.show()

Now we have studied our data with descriptive statistics. Before we can do statistical inference, we need a model for our data.

Let us start with the sepal length of the Setosa species. Its histogram looks like a normal distribution, so we choose that as our model. When we come to Hypothesis Testing, we will see how to test this choice mathematically.

Our model will be a normal distribution with the mean and the width (standard deviation) taken from the dataset: norm.pdf(x,mean,width).

In [18]:
from scipy.stats import norm
mean  = df_setosa['slength'].mean()
width = df_setosa['slength'].std()
print(mean,width)
# Create figure and axis
fig, ax = plt.subplots(1,1)
# Create 80 x values between 3 and 7 cm and plot the normal pdf for these values
#x = np.linspace(norm.ppf(0.01),norm.ppf(0.99), 100)
x = np.linspace(3,7,80)
ax.plot(x, norm.pdf(x,mean,width),'b-', lw=2, label='Normal pdf')
df_setosa['slength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa Sepal', label="length", density=True)
ax.legend(loc='best', frameon=False)
plt.show()
5.006 0.35248968721345136
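
To see what norm.pdf(x,mean,width) evaluates, the sketch below writes out the normal density by hand and compares it with scipy. The helper normal_pdf is a hypothetical name introduced only for this illustration; the mean and width are the values printed above.

```python
import numpy as np
from scipy.stats import norm

# Sample mean and standard deviation printed above
mean, width = 5.006, 0.35248968721345136

def normal_pdf(x, mu, sigma):
    """Normal density written out explicitly (hypothetical helper for illustration)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(3, 7, 80)
by_hand = normal_pdf(x, mean, width)
by_scipy = norm.pdf(x, mean, width)  # second argument is loc (mean), third is scale (std)

print(np.allclose(by_hand, by_scipy))
```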

With our model we could now do a lot of inference. Taking a random flower, one could, for example, test how likely it is to be Iris Setosa.

More for you to practise in the afternoon notebook.

In [19]:
#norm.pdf(5,mean,width)
# Probability under the model that the sepal length is below 4 cm
norm.cdf(4,mean,width)
Out[19]:
0.002158733887404777
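
The cumulative distribution function gives probabilities for any interval. The following is a small sketch under the same model, with the mean and width hard-coded from the output above; the thresholds of 4 cm and 6 cm are chosen only for illustration.

```python
from scipy.stats import norm

# Sample mean and standard deviation of the Setosa sepal length, printed above
mean, width = 5.006, 0.35248968721345136

# P(length < 4 cm), as in the cell above
p_below_4 = norm.cdf(4, mean, width)

# P(length > 6 cm), using the survival function sf = 1 - cdf
p_above_6 = norm.sf(6, mean, width)

# P(mean - width < length < mean + width): difference of two cdf values
p_within_1sd = norm.cdf(mean + width, mean, width) - norm.cdf(mean - width, mean, width)

print(p_below_4, p_above_6, p_within_1sd)
```

The last number is about 0.68 for any normal distribution, which is a quick sanity check of the model code.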