Notebook 2, Module 2, Statistical Inference for Data Science, CAS Applied Data Science, 2020-08-24, G. Conti, S. Haug, University of Bern.
Average expected study time: 3x45 min (depending on your background). You are supposed to play with the examples: change them, maybe test them on another dataset. From just executing them, you will not learn much.
Learning outcomes:
For mathematicians or very interested people
# Load the needed python libraries by executing this python code (press ctrl enter)
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
print('Congrats, you just loaded numpy, scipy.stats and matplotlib.pyplot!')
In practice the measurement process, the data taking, is a random, or stochastic, process. The outcome varies from measurement to measurement. There are (at least) three reasons:
Our Iris dataset is a sample of 50 flowers from each class. So in each class there are 50 varying measurements for each of the four observables: sepal and petal length and width. This is due to the first and maybe the second reason (quantum mechanics can be neglected at scales larger than molecules). In descriptive statistics the observables are therefore called random variables. Let us call one of them x as an example.
If x can take on any value from a continuous range, we write $f(x;\theta)dx$ as the probability that the measurement’s outcome lies between x and $x + dx$. The function $f (x; \theta)$ is called the probability density function (p.d.f.), which may depend on one or more parameters $\theta$ (for example the Iris class).
A random variable can be discrete or continuous. If discrete, then we use $f(x;\theta)$ to denote the probability to find the value x (in python the term probability mass function, pmf, is then used). In the following the term p.d.f. is often taken to cover both the continuous and discrete cases, although technically the term density should only be used in the continuous case.
The p.d.f. is always normalized to unity (the number 1), i.e. the integral or the surface under the curve equals one. Both x and $\theta$ may have multiple components and are then often written as vectors. If $\theta$ is unknown, we may wish to estimate its value from a given set of measurements of x; this is a central topic of statistics (see next notebook on parameter estimation and regression).
The p.d.f. should be chosen to describe the fluctuation of the random variable in the best possible way. In other words, we should always choose an appropriate p.d.f. to describe our data. Some very useful and widely used p.d.f.s follow.
The normal (or Gaussian) probability density function is probably the most used one (informally the "bell curve"). It derives its importance in large part from the central limit theorem: "In most situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed." https://en.wikipedia.org/wiki/Central_limit_theorem
Example: If one flips a coin many times the probability of getting a given number of heads in a series of flips will approach a normal curve, with mean equal to half the total number of flips in each series. (In the limit of an infinite number of flips, it will equal a normal curve.)
This means that in many or most cases it is sufficient to know the characteristics of the normal p.d.f. Others can be looked up if needed. Also, unqualified statements about "the error", or better, "the uncertainty", often refer to their meaning for the normal p.d.f.
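To illustrate the coin example (a small sketch added here, not an original exercise): simulate many series of coin flips and histogram the number of heads per series; the histogram approaches a normal curve.
# Sketch: the number of heads in many series of coin flips is approximately normal
n_flips = 100                                        # flips per series
n_series = 10000                                     # number of series
heads = np.random.binomial(n_flips, 0.5, n_series)   # number of heads in each series
plt.hist(heads, bins=30, density=True, histtype='step', label='simulated head counts')
x = np.linspace(30, 70, 200)
plt.plot(x, scipy.stats.norm.pdf(x, 0.5*n_flips, (0.25*n_flips)**0.5), label='normal approximation')
plt.legend()
plt.show()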
As a formula the normal distribution function looks like (in one dimension)
$$ f(x;\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right) $$It reads: given the distribution parameters, the mean $\mu$ and the standard deviation $\sigma$, x follows this function.
Plot the normal distribution with mean 0 and standard deviation 5 for 400 x values between -20 and 20. Repeat this for two other means and standard deviations. How big is the surface under the curves?
(See also scipy.stat.norm https://docs.scipy.org/doc/scipy-0.16.1/reference/generated/scipy.stats.norm.html)
# Part of the solution:
x = np.linspace(-20,20,400) # 400 x values from -20 to 20
plt.plot(x, scipy.stats.norm.pdf(x,0,5))
plt.plot(x, scipy.stats.norm.pdf(x,0,2))
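To answer the question about the surface, one can integrate the curves numerically, for example with scipy.integrate.quad (a quick check added here):
# Numerical check: the area under each curve should be (very close to) 1
from scipy.integrate import quad
print(quad(scipy.stats.norm.pdf, -20, 20, args=(0, 5))[0])
print(quad(scipy.stats.norm.pdf, -20, 20, args=(0, 2))[0])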
Since data taking means measuring random variables, we need to define and talk about probability. In mathematics probability is defined in a rather abstract manner (see Annex below). For our purposes we go directly to the interpretation as either relative frequency or subjective probability. If A is a possible outcome of an experiment repeated n times, then the probability of A is the relative frequency
$$P(A) = \lim_{n\rightarrow \infty} \frac{\text{number of times the outcome is } A}{n}$$The subjective probability is
$$ P(A) = \text{degree of belief that } A \text{ is true}$$Both concepts are consistent with the abstract mathematical definition.
From this definition and using the fact that $A ∩ B$ and $B ∩ A$ (intersection) are the same, one obtains Bayes’ theorem
$$P (A|B) = \frac{P(B|A)P(A)}{P(B)}$$first published by the Reverend Thomas Bayes (1702-1761). Statistics based on the relative frequency interpretation of probability is called frequentist statistics; statistics based on Bayes' theorem is called Bayesian statistics.
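As a small numerical illustration of Bayes' theorem (the numbers below are invented for this sketch): let A be "person has a condition" and B be "test is positive", with assumed P(B|A) = 0.99, P(B|not A) = 0.05 and P(A) = 0.01.
# Hypothetical numbers, only to illustrate Bayes' theorem
p_B_given_A = 0.99                                # P(B|A)
p_B_given_notA = 0.05                             # P(B|not A)
p_A = 0.01                                        # P(A), the prior
p_B = p_B_given_A*p_A + p_B_given_notA*(1 - p_A)  # total probability P(B)
p_A_given_B = p_B_given_A*p_A / p_B               # Bayes' theorem
print('P(A|B) = %.3f' % p_A_given_B)              # about 0.167 despite the accurate test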
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation (see moments), and inferential statistics, which draw conclusions from data that are subject to random variation (e.g. observational errors, sampling variation).
The $n^{th}$ moment of a random variable x with p.d.f. $f(x)$ is $$ \alpha_n \equiv E[x^n] = \int_{-\infty}^{\infty} x^nf(x)dx $$ For a sample of n measurements, the first moment becomes the sum known as the arithmetic mean: $$ \mu = \frac{1}{n} \sum_{i=1}^n x_i$$
The most commonly used moments are the mean $\mu$ (or expectation value) and variance $\sigma^2$:
$$\mu \equiv \alpha_1 $$$$\sigma^2 \equiv V[x] \equiv \int_{-\infty}^{\infty} (x-\mu)^2 f(x)dx = ... = \alpha_2 - \mu^2 $$The mean is the location of the “center of mass” of the p.d.f., and the variance is a measure of the square of its width. It is often convenient to use the standard deviation (SD) of $x$, $\sigma$, defined as the square root of the variance. In the discrete case the variance becomes $$ \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2$$
For the normal p.d.f. the standard deviation $\sigma$ is its width parameter.
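As a quick sketch (added for illustration) of the discrete formulas above on a small made-up sample (note that np.var uses the 1/n definition by default, matching the formula):
# Discrete mean and variance of a small example sample
sample = np.array([4.9, 5.1, 5.0, 4.8, 5.2])
mu = sample.mean()                     # arithmetic mean, (1/n) * sum(x_i)
var = ((sample - mu)**2).mean()        # (1/n) * sum((x_i - mu)^2)
print(mu, var, np.var(sample))         # np.var (with default ddof=0) gives the same number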
Based on higher moments, other distribution descriptors are formed; you may encounter skewness and kurtosis. The skewness is a number indicating the deviation from a symmetric form. The kurtosis is a number indicating whether the tails of the distribution are heavier or lighter than the tails of the normal distribution.
The quantile $x_{\alpha}$ is the value of the random variable x below which $\alpha$% of the area lies. An important special case is the median, $x_{med} \equiv x_{50}$: at the median half the area lies above and half below. For the normal p.d.f. the median equals the mean. The most probable value of a distribution is called the mode.
Special quantiles are the quartiles and percentiles. The first quartile is $x_{25}$, the second is $x_{50}$, and so on. Percentiles are, for example, $x_{13}$.
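Quantiles can be computed with the percent point function (ppf, the inverse of the c.d.f.) in scipy.stats, here sketched for the standard normal:
# Quantiles of the standard normal: median, first quartile and the 13th percentile
print(scipy.stats.norm.ppf(0.50))   # median x_50 (equals the mean, 0)
print(scipy.stats.norm.ppf(0.25))   # first quartile x_25
print(scipy.stats.norm.ppf(0.13))   # percentile x_13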
Get some descriptive statistics from a normal (continuous) p.d.f. with mean 0 and SD 4.
mean, variance, skewness, kurtosis = scipy.stats.norm.stats(0,4,moments='mvsk')
print(mean, variance, skewness, kurtosis)
Plot the p.d.f. and some moments.
x = np.linspace(-20,20,400)
sigma=variance**0.5
plt.plot(x,scipy.stats.norm.pdf(x,mean,sigma))
plt.axvline(x=mean, linewidth=2, color = 'k',label="Mean") # Plot the mean as a vertical line
plt.axvline(x=mean-sigma, linewidth=2, color = 'r', label="Sigma (standard deviation)")
plt.axvline(x=mean+sigma, linewidth=2, color = 'r')
plt.legend(loc='upper right')
Do the same with the discrete and skewed Poisson p.m.f. See also https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html
m, v, s, k = scipy.stats.poisson.stats(1.7,moments='mvsk')
print('Mean = %1.2f Var = %1.2f Std = %1.2f Skewness = %1.2f kurtosis = %1.2f' % (m,v,v**0.5,s,k))
Plot a Poisson p.m.f. together with its mean, standard deviation and median (the mode is the most probable value, visible as the peak).
dist = scipy.stats.poisson(1.7)
x = np.arange(-1, 10)
sigma = v**0.5                                  # standard deviation of the Poisson distribution (v from the cell above)
plt.plot(x, dist.pmf(x), drawstyle='steps-mid')
print('Median =', dist.median())
plt.axvline(x=m, linewidth=2, color='k', label="Mean")
plt.axvline(x=m+sigma, linewidth=2, color='r', label="Sigma (standard deviation)")
plt.axvline(x=dist.median(), linewidth=2, color='b', label="Median")
plt.legend(loc='upper right')
Some words on the Poisson distribution
The Poisson distribution is popular for modelling the number of times an event occurs in an interval of time or space. For example:
The probability mass function is $$f(k;\lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$ For large mean $\lambda$ the normal distribution is an excellent approximation of the Poisson p.m.f. For counts below about 20 one should be careful with statements based on the normal distribution, e.g. the $\pm 1\sigma$ interval is no longer symmetric.
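As a quick check (added sketch), the formula can be evaluated by hand and compared to scipy.stats.poisson.pmf:
# Compare the explicit Poisson formula to scipy.stats.poisson.pmf
from math import exp, factorial
lam, k = 1.7, 3
print(lam**k * exp(-lam) / factorial(k))   # the formula above
print(scipy.stats.poisson.pmf(k, lam))     # scipy gives the same value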
x = np.arange(-1, 40)
plt.plot(x, scipy.stats.poisson.pmf(x,2), drawstyle='steps-mid')
plt.plot(x, scipy.stats.poisson.pmf(x,15), drawstyle='steps-mid')
We see that a Poisson p.m.f. with mean 15 looks very much like the normal distribution.
The binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while the product np remains fixed or at least p tends to zero. Therefore, the Poisson distribution with parameter λ = np can be used as an approximation to B(n, p) of the binomial distribution if n is sufficiently large and p is sufficiently small. According to two rules of thumb, this approximation is good if n ≥ 20 and p ≤ 0.05, or if n ≥ 100 and np ≤ 10.
For n > 20 and p not too close to 0 or 1, the normal distribution is also a good approximation here.
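A small comparison (sketch added here) for n = 100 and p = 0.05, i.e. np = 5, where the rules of thumb say the Poisson approximation should be good:
# Binomial vs Poisson (and normal) approximation for n=100, p=0.05
n, p = 100, 0.05
k = np.arange(0, 20)
plt.plot(k, scipy.stats.binom.pmf(k, n, p), 'o', label='binomial')
plt.plot(k, scipy.stats.poisson.pmf(k, n*p), '.', label='Poisson, lambda = np')
plt.plot(k, scipy.stats.norm.pdf(k, n*p, (n*p*(1-p))**0.5), label='normal approximation')
plt.legend()
plt.show()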
x = np.arange(-1, 40)
plt.plot(x, scipy.stats.binom.pmf(x,40,0.05), drawstyle='steps-mid')
plt.plot(x, scipy.stats.binom.pmf(x,40,0.2), drawstyle='steps-mid')
plt.plot(x, scipy.stats.binom.pmf(x,40,0.5), drawstyle='steps-mid')
plt.plot(x, scipy.stats.binom.pmf(x,40,0.8), drawstyle='steps-mid')
Get yourself a cup of something and think about these questions.
Produce the descriptive statistics for the Iris Virginica data. Then plot the histograms and scatter plots.
import pandas as pd
dataframe = pd.read_csv('iris.csv',names=['slength','swidth','plength','pwidth','name'])
dataframe.head()
df_setosa = dataframe[dataframe['name']=='Iris-setosa']
df_setosa.mean()
df_setosa.median()
df_setosa.describe() # Or get a summary
All the descriptive statistics methods for pandas DataFrames are listed here: https://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats
Now we have looked at the numbers. Let us plot the distributions.
df_setosa['slength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa', label="length")
ax = df_setosa['swidth'].plot(kind="hist",fill=False,histtype='step', label="width")
ax.set_xlabel('Sepal length/width [cm]')
ax.set_ylabel('Frequency')
plt.legend()
df_setosa['plength'].plot(kind="hist",fill=False,histtype='step',title='Iris Setosa', label="length")
ax = df_setosa['pwidth'].plot(kind="hist",fill=False,histtype='step', label="width")
ax.set_xlabel('Petal length/width [cm]')
ax.set_ylabel('Frequency')
plt.legend()
It is hard to say by eye whether the distributions are normally distributed. Later this week we'll use statistical tests to find that out. The binnings are also not equal; to define equal binnings, see the example from yesterday. The petal length distribution looks odd.
We can complete our descriptive studies by looking at the scatter plots (correlations).
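For example with the pandas scatter matrix (a sketch added here; it draws all pairwise scatter plots with the histograms on the diagonal):
# Pairwise scatter plots (correlations) of the four observables for Iris setosa
pd.plotting.scatter_matrix(df_setosa[['slength','swidth','plength','pwidth']], figsize=(8,8))
plt.show()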
It is important to gain some routine in computing probabilities and quantiles.
Let X be binomially distributed with n = 60 and p = 0.4. Compute the following.
#write your code here
import scipy.stats as sts
x = 24
n = 60
p = 0.4
prop_bin = sts.binom.pmf(x , n, p)
print("P(X = 24) is equal to %.2f" % prop_bin)
cum_bin = sts.binom.cdf(x , n, p)
print("P(X ≤ 24) is equal to %.2f" % cum_bin)
One can simulate data sets by generating them from probability density functions. The computer does this with a so-called Monte Carlo (MC) algorithm: it draws x values (pseudo-)randomly from the given distribution. The actual draws of the random variable are called random variates. Simulations can be very useful when planning an experiment and developing the analysis method. Instead of real data one can then use the simulated data.
Let us simulate the Iris setosa sepal width data (with the mean and standard deviation we got from the real data).
n = scipy.stats.norm.rvs(3.418,0.318,50) # 50 random values from a normal distribution with mean 3.418 and SD 0.318
print(n[0:10]) # Print first 10
# Put the simulated data into a dataframe
import pandas as pd
df_setosa_sim = pd.DataFrame(n)
df_setosa_sim.head()
This is just for your reference, no need to understand all these distributions. A quite good free and comprehensive book is here: http://staff.fysik.su.se/~walck/suf9601.pdf.
Table 1.4.1 Some common probability density functions, with corresponding characteristic functions and means and variances. In the Table, $\Gamma(k)$ is the gamma function, equal to $(k − 1)!$ when $k$ is an integer; $_1F_1$ is the confluent hypergeometric function of the 1st kind [11].
Distributions are either discrete or continuous. To study the distributions we use the statistical functions of scipy.stats (https://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html). There you also find more distributions than listed here.
All data have uncertainties. These should always be communicated when showing scientific numbers or plots. We distinguish between two types: statistical and systematic uncertainties.
The statistics tools can mostly handle the statistical uncertainties. There is no mathematical recipe for dealing with systematic uncertainties. You have to think through your experiment and try to estimate the influence of everything that can go wrong.
When uncertainties are stated on numbers or in graphs as error bars or error bands they generally show one standard deviation. If the data are well described by a normal p.d.f, the interpretation of one standard deviation is clear: if the measurement is repeated many times, 32% (or about 1/3 of the measurements) should be outside the error bars.
If the p.d.f. is not normal, the interpretation of one standard deviation is not as clear. For example, for a Poisson p.d.f. the interval is not symmetric. So again we see: for low counts, say below 20, where the normal assumption is not a good approximation, the interpretation is not obvious anymore and care is needed.
A very convenient fact is that the standard deviation of a Poisson distribution is $\sqrt{\mu}$. So if you have a count that is larger than around 20 (where the normal interpretation is a good approximation), you get the standard deviation by taking the square root of the count.
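A quick check with scipy (added sketch):
# The Poisson standard deviation equals the square root of its mean
mu = 25
print(scipy.stats.poisson.std(mu), np.sqrt(mu))   # both give 5.0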
The following code may be difficult to understand, ask on the chat or tomorrow.
# Draw a histogram which is not normalised
entries1, edges, patches = plt.hist(n, bins=10, histtype='step')
# Close plt so that this histogram is not shown
plt.close()
# Draw a histogram which IS normalised (density=True)
entries2, edges, patches = plt.hist(n, bins=10, histtype='step',density=True)
# Calculate the poisson standard deviation and scale down to second histogram
errors = np.sqrt(entries1) * entries2/entries1
# calculate bin centers
bin_centers = 0.5 * (edges[:-1] + edges[1:])
# Draw error bars, using the scaled Poisson (square root) uncertainties
plt.errorbar(bin_centers, entries2, yerr=errors, fmt='r.')
# Draw a normal distribution
x = np.linspace(2.4,4.2,100)
plt.plot(x,scipy.stats.norm.pdf(x,df_setosa_sim.mean(),df_setosa_sim.std()))
plt.show()
We see that 3 out of 10 data points are more than one standard deviation off the "theory" curve. This is how it should be.
Submit one question to this notebook for tomorrow's discussion session: https://forms.gle/k1xmRZaKANP1zARZ6
An abstract definition of probability can be given by considering a set $S$, called the sample space, and possible subsets $A,B,...$, the interpretation of which is left open. The probability $P$ is a real-valued function defined by the following axioms due to Kolmogorov (1933) [9]:
1. For every subset $A$ in $S$, $P(A) \geq 0$;
2. For disjoint subsets (i.e. $A \cap B = \emptyset$), $P(A \cup B) = P(A) + P(B)$;
3. $P(S) = 1$.
From this further properties can be derived, e.g. $P(\bar{A}) = 1 - P(A)$, where $\bar{A}$ is the complement of $A$, $P(A \cup \bar{A}) = 1$, and $0 \leq P(A) \leq 1$.
In addition, one defines the conditional probability $P(A|B)$ (read as $P$ of $A$ given $B$) as $$P(A|B) = \frac{P(A ∩ B)}{P(B)}$$
As an example, when throwing a die, consider obtaining more than 3 eyes given only throws with an even number of eyes. We calculate the (conditional) probability:
$$P(n>3|n\; even) = \frac{P(n>3 \cap n\; even)}{P(even)} = \frac{2/6}{3/6} = \frac{2}{3}$$If A and B are independent, then
$$P(A|B) = \frac{P(A ∩ B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A)$$
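This can be checked by brute-force enumeration of the six equally likely outcomes (a small added sketch):
# Brute-force check of P(n > 3 | n even) for a fair die
outcomes = np.arange(1, 7)
even = outcomes[outcomes % 2 == 0]   # the even outcomes: 2, 4, 6
print(np.mean(even > 3))             # fraction of even outcomes above 3: 2/3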