version November 2023
This is a tutorial on simple data analysis using scikit-learn, presented as a Jupyter notebook. It is organised as a practical with questions and corrections. You may want to try it yourself, or simply read the corrections, which contain all the needed information.
The course uses data on penguin species collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. Artwork by @allison_horst.
It goes (rapidly!) through the following concepts:
- basic usage of scikit-learn
- data repositories, DOI, missing values, etc.
- pandas, CSV format, scatter plots
- displaying n-dimensional objects
- supervised/unsupervised classification: PCA, LDA
- training set / test set
- other approaches: KNN, SVM, random forests
- quality control: confusion matrix, cross validation, $k$-fold validation, jack-knife
- standard scaling
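
To give a flavour of how several of these concepts fit together, here is a minimal sketch rather than the course's own code. It assumes scikit-learn and NumPy are installed, and it uses a synthetic dataset generated by `make_blobs` as a stand-in for the penguin measurements (three classes, four features). The choice of a KNN classifier, `n_neighbors=5`, the 25% test split and `cv=5` are illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the penguin data (assumption): 3 classes, 4 features.
X, y = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

# Training set / test set: hold out 25% of the samples, keeping class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Standard scaling followed by a KNN classifier, chained in one pipeline so
# the scaler is fitted on the training data only.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# Quality control 1: confusion matrix on the held-out test set.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)

# Quality control 2: 5-fold cross validation on the full dataset.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Wrapping the scaler and the classifier in a single pipeline means that, during cross validation, the scaling parameters are re-estimated on each training fold, so no information from the validation fold leaks into the preprocessing.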