Notebook 4, Module 2, Statistical Inference for Data Science, CAS Applied Data Science, 2019-08-30, G. Conti, S. Haug, University of Bern.
This is my project work for Module 2. The plots, numbers and tables for the poster are produced with this notebook.
The data used is the Iris data set. Available here: https://archive.ics.uci.edu/ml/datasets/iris
Outline of the data analysis 3x45 min (depending on your background)
I am in court as a data science expert. My task is to give advice to the judges. I will have 15 minutes and plan to bring a poster for the presentation. This way court participants can also look at it during the breaks.
Situation (fictional):
A new company, v4Setosa, recently sequenced the genes of the Iris species Setosa and patented them, apparently in order to preserve this species because it is so beautiful. Due to this patent it is not permitted to alter the plant.
A big farmer, who hates Iris and has a field where Iris grows as a disturbing weed, has been using a new product from Sonte Manto for a couple of years. The product is supposed to kill Iris plants effectively.
A big Iris lover collected a sample of Iris plants from the farmer's field and thinks the Iris Setosa sepals are bigger than normal. She sent the sample to v4Setosa, which in turn came to the conclusion that Setosa must have mutated due to the product from Sonte Manto.
So v4Setosa sued Sonte Manto, claiming that their product has changed the plant. Sonte Manto risks paying a billion dollars. The court has asked me to give neutral and scientific advice.
Data Analysis
Definition of significant
We consider p-values below 0.001 as unlikely enough to reject the fluctuation hypothesis. For such values we will advise the court to consider it a mutation. (This is a weak point that the lawyers of Sonte Manto will of course attack.)
We artificially create the datasets. The first 30 entries from each Iris species make up the reference sets (annotated/labeled). From the rest, 20 rows per species, we create a mixed sample representing the farmer's sample (not annotated).
Import modules we may need
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
Create the datasets
df_org = pd.read_csv('iris.csv',names=['slength','swidth','plength','pwidth','species']) # measurement columns are parsed as floats; 'species' stays a string
df_ref_set = df_org[df_org['species']=='Iris-setosa'][:30]
df_ref_vir = df_org[df_org['species']=='Iris-virginica'][:30]
df_ref_ver = df_org[df_org['species']=='Iris-versicolor'][:30]
df_farm_1 = df_org[df_org['species']=='Iris-setosa'][30:]
df_farm_2 = df_org[df_org['species']=='Iris-virginica'][30:]
df_farm_3 = df_org[df_org['species']=='Iris-versicolor'][30:]
df_farm = pd.concat([df_farm_1, df_farm_2, df_farm_3])  # DataFrame.append is deprecated in newer pandas
The labels in the farmer dataset are unknown, so we remove them to 'simulate' the real situation.
df_farm['species']='Unknown'
df_farm.head()
Describe the reference samples
df_ref_set.describe()
df_ref_vir.describe()
df_ref_ver.describe()
I will make a table suited for the poster:
#Python code for that table here.
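One way to build such a poster-ready table is to aggregate the mean and standard deviation per species and format them as "mean ± std" cells. This is a sketch: in the notebook it would run on `df_org`, but here a small synthetic frame with the same column names stands in so the example is self-contained (the numbers are illustrative, not the real Iris values).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_org (same columns; illustrative values only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'slength': rng.normal(5.0, 0.4, 90),
    'swidth':  rng.normal(3.4, 0.4, 90),
    'plength': rng.normal(1.5, 0.2, 90),
    'pwidth':  rng.normal(0.2, 0.1, 90),
    'species': ['Iris-setosa'] * 30 + ['Iris-virginica'] * 30
               + ['Iris-versicolor'] * 30,
})

# Mean and standard deviation per species, rounded for the poster
summary = df.groupby('species').agg(['mean', 'std']).round(2)

# One "mean ± std" cell per measurement, one row per species
table = pd.DataFrame({
    col: summary[col]['mean'].astype(str) + ' ± ' + summary[col]['std'].astype(str)
    for col in ['slength', 'swidth', 'plength', 'pwidth']
})
print(table.to_string())
```

Replacing `df` with `df_org` gives the table for the real reference data.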
Check normality
print("p-values from the normality tests on the sepal width (D'Agostino-Pearson):")
datasets = [df_ref_set['swidth'],df_ref_vir['swidth'],df_ref_ver['swidth'],df_farm['swidth'] ]
dataset_labels = ['Setosa','Virginica','Versicolor','Farmer']
i = 0
for dataset in datasets:
    k2, p = stats.normaltest(dataset)  # D'Agostino-Pearson test
    print('%10s %1.2f ' % (dataset_labels[i], p))
    i += 1
The p-values are all well above our threshold, so we cannot reject normality of the sepal widths in any of the datasets.
Plot the histograms (looking for filtering possibilities)
plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
plt.title('Sepal Width')
i=0
for dataset in datasets[0:3]:
    dataset.plot(kind="hist",fill=True,alpha=0.2,histtype='step',label=dataset_labels[i])
    i+=1
plt.legend()
plt.subplot(2,2,2)
plt.title('Sepal Length')
datasets2 = [df_ref_set['slength'],df_ref_vir['slength'],df_ref_ver['slength'],df_farm['slength'] ]
i=0
for dataset in datasets2[0:3]:
    dataset.plot(kind="hist",fill=True,alpha=0.2,histtype='step',label=dataset_labels[i])
    i+=1
plt.legend()
plt.subplot(2,2,3)
plt.title('Petal Width')
datasets3 = [df_ref_set['pwidth'],df_ref_vir['pwidth'],df_ref_ver['pwidth'],df_farm['pwidth'] ]
i=0
for dataset in datasets3[0:3]:
    dataset.plot(kind="hist",fill=True,alpha=0.2,histtype='step',label=dataset_labels[i])
    i+=1
plt.legend()
plt.subplot(2,2,4)
plt.title('Petal Length')
datasets4 = [df_ref_set['plength'],df_ref_vir['plength'],df_ref_ver['plength'],df_farm['plength'] ]
i=0
for dataset in datasets4[0:3]:
    dataset.plot(kind="hist",fill=True,alpha=0.2,histtype='step',label=dataset_labels[i])
    i+=1
plt.legend()
plt.show()
It is clear that we can select Setosa more or less completely with pwidth < 0.7 cm and plength < 2.5 cm. Let's look at the populations in this space as a scatter plot.
plt.subplot(1,1,1)
plt.scatter(df_ref_set['plength'],df_ref_set['pwidth'],c='b',label='Setosa')
plt.scatter(df_ref_vir['plength'],df_ref_vir['pwidth'],c='g',label='Virginica')
plt.scatter(df_ref_ver['plength'],df_ref_ver['pwidth'],c='r',label='Versicolor')
plt.xlabel('plength')
plt.ylabel('pwidth')
plt.legend()
plt.show()
A discrimination should work very well. However, we don't see the full distributions and this is just by eye. I will take a look at the sepal version of the plot, too.
plt.subplot(1,1,1)
plt.scatter(df_ref_set['slength'],df_ref_set['swidth'],c='b',label='Setosa')
plt.scatter(df_ref_vir['slength'],df_ref_vir['swidth'],c='g',label='Virginica')
plt.scatter(df_ref_ver['slength'],df_ref_ver['swidth'],c='r',label='Versicolor')
plt.xlabel('slength')
plt.ylabel('swidth')
plt.legend()
plt.show()
Also here a good separation seems achievable with a line. Fitting the line that best separates two samples belongs to Module 3. Here we will choose two cuts in the petal plane by eye and estimate the efficiency of the filter using values generated from fitted models.
Let us fit normal distributions to the reference samples. Then we generate 100 000 rows with these distributions, plot the scatter plots of this simulated data, define the filtering cuts and calculate the efficiency of our filter.
import scipy
mean, sd = scipy.stats.norm.fit(df_ref_set['plength'])
df_gen_pl_set = scipy.stats.norm.rvs(mean,sd,100000)
mean, sd = scipy.stats.norm.fit(df_ref_set['pwidth'])
df_gen_pw_set = scipy.stats.norm.rvs(mean,sd,100000)
mean, sd = scipy.stats.norm.fit(df_ref_vir['plength'])
df_gen_pl_vir = scipy.stats.norm.rvs(mean,sd,100000)
mean, sd = scipy.stats.norm.fit(df_ref_vir['pwidth'])
df_gen_pw_vir = scipy.stats.norm.rvs(mean,sd,100000)
mean, sd = scipy.stats.norm.fit(df_ref_ver['plength'])
df_gen_pl_ver = scipy.stats.norm.rvs(mean,sd,100000)
mean, sd = scipy.stats.norm.fit(df_ref_ver['pwidth'])
df_gen_pw_ver = scipy.stats.norm.rvs(mean,sd,100000)
plt.subplot(1,1,1)
plt.scatter(df_gen_pl_set,df_gen_pw_set,c='b',label='Setosa')
plt.scatter(df_gen_pl_vir,df_gen_pw_vir,c='g',label='Virginica')
plt.scatter(df_gen_pl_ver,df_gen_pw_ver,c='r',label='Versicolor')
plt.xlabel('plength')
plt.ylabel('pwidth')
plt.legend()
plt.show()
Filter choice: plength<2.5 and pwidth<1.0
Filter efficiency:
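The efficiency can be estimated as the fraction of the generated Setosa sample passing both cuts, and the contamination as the fraction of the other species passing them. This sketch regenerates the samples so it is self-contained; the means and standard deviations are illustrative stand-ins, whereas in the notebook one would reuse `df_gen_pl_set`, `df_gen_pw_set` etc. from the cells above.

```python
import numpy as np

# Illustrative stand-ins for the fitted models (not the real fitted values)
rng = np.random.default_rng(42)
gen_pl_set = rng.normal(1.47, 0.17, 100000)  # Setosa petal length model
gen_pw_set = rng.normal(0.25, 0.10, 100000)  # Setosa petal width model
gen_pl_vir = rng.normal(5.60, 0.55, 100000)  # Virginica petal length model
gen_pw_vir = rng.normal(2.03, 0.27, 100000)  # Virginica petal width model

# Efficiency: fraction of generated Setosa passing both cuts
pass_set = (gen_pl_set < 2.5) & (gen_pw_set < 1.0)
eff = pass_set.mean()

# Contamination: fraction of generated Virginica passing both cuts
pass_vir = (gen_pl_vir < 2.5) & (gen_pw_vir < 1.0)
contamination = pass_vir.mean()

print('Setosa efficiency:       %.4f' % eff)
print('Virginica contamination: %.6f' % contamination)
```

With well-separated species like these, the efficiency is close to 1 and the contamination close to 0.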
df_farm = df_farm.sort_values('slength')
df_farm_set = df_farm[(df_farm['plength'] < 2.5) & (df_farm['pwidth'] < 1.0)]  # apply both cuts of the filter
df_farm_set
What is the probability that the farmer's Setosa sample is just a fluctuation of the fitted model?
First on the non-filtered farmer data:
stats.ttest_ind(df_farm['swidth'],df_ref_set['swidth'])
The p-value is tiny, so these are not from the same population, as expected since the unfiltered farmer sample mixes three species. What about the filtered data:
stats.ttest_ind(df_farm_set['swidth'],df_ref_set['swidth'])
The p-value is large, fully consistent with the same population. Some more tests:
stats.ttest_ind(df_farm_set['slength'],df_ref_set['slength'])
stats.ttest_ind(df_farm_set['plength'],df_ref_set['plength'])
Our data analysis cannot confirm the claim of v4Setosa. There is no evidence for the claim in our independent study. We advise the court to find Sonte Manto not guilty.
(And we sent a very large bill, which the loser v4Setosa had to pay. v4Setosa went bankrupt and we cannot sleep anymore...)
I used four hours for this small analysis. To work it out fully, doing more tests, thinking it through, studying here and there, and making nice tables, plots and the poster, I would probably need another 10 hours.
Considering that you may need longer due to less experience, this is roughly what is expected from you for the poster session (about 30 hours of work).