Scikit-Learn Library


Scikit-Learn is the go-to library for machine learning in Python.

Scikit Intro

An Introduction to Scikit Learn
Introduction to Scikit-Learn from Python Data Science Handbook
Scikit-Learn online documentation is very extensive and thorough

Scikit Algorithms Overview and Cheatsheet

Scikit cheat sheet
Another good cheat sheet from DataCamp

Setting up Scikit

Setting up scikit-learn

Explore Built-in Datasets

Dataset documentation
How to use Scikit-Learn Datasets for Machine Learning

Creating synthetic data

How to Generate Test Datasets in Python with scikit-learn

Creating train/test splits

Train-Test Split for Evaluating Machine Learning Algorithms (https://www.bitdegree.org/learn/train-test-split)
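As a rough illustration of what a split looks like, here is a minimal sketch on made-up data. The array contents and the `random_state` value are assumptions chosen for the example; only `train_test_split` and its `test_size` parameter come from the material above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 samples, 2 features (values are arbitrary)
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Hold out 20% for testing; random_state fixes the shuffle so reruns match
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```

Leaving out `random_state` gives a different shuffle on every run, which is fine for exploration but makes results harder to reproduce.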

Checklist

You should be able to do the following:

Set up a Scikit development environment
Explore a built-in dataset
Generate a synthetic dataset
Create a train/test split

Exercises

A - Basics

A1 - Make sure you have scikit installed

import sklearn
print(sklearn.__version__)
# you should see the installed version printed

B - Builtin Datasets

B1 - Print out all datasets that are included in scikit
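One possible approach, offered as a sketch rather than the intended solution: the functions in `sklearn.datasets` follow a naming convention, so listing the module's names by prefix gives an overview. The `loaders` dictionary is a name made up for this example.

```python
import sklearn.datasets as datasets

# Loader names follow a prefix convention:
#   load_*  -> small datasets bundled with the package
#   fetch_* -> larger datasets downloaded on first use
#   make_*  -> synthetic data generators
loaders = {
    prefix: sorted(name for name in dir(datasets) if name.startswith(prefix))
    for prefix in ("load_", "fetch_", "make_")
}

for prefix, names in loaders.items():
    print(prefix, names)
```

The exact list depends on the installed scikit-learn version.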

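B2 - Regression dataset: Load boston dataset

Note: `load_boston` was removed in scikit-learn 1.2. On newer versions, use another regression dataset such as `fetch_california_housing` or `load_diabetes`.

Find out all the columns
Print out sample data
Use describe to understand the data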
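The same steps can be sketched on a regression dataset that ships with current scikit-learn releases. This example swaps in `load_diabetes` (an assumption made here, not the dataset the exercise names) because it is bundled with the package and needs no download.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects instead of NumPy arrays
diabetes = load_diabetes(as_frame=True)
df = diabetes.frame  # the 10 feature columns plus a 'target' column

print(df.columns.tolist())  # all the columns
print(df.head())            # sample data
print(df.describe())        # summary statistics per column
```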

B3 - Classification dataset: Load IRIS dataset

Find out all the columns
Print out sample data
Use describe to understand the data
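A minimal sketch of one way to approach this, assuming a scikit-learn version that supports `as_frame=True` (0.23 or later). The `species` column is an addition made for readability and is not part of the dataset itself.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # 4 measurement columns plus an integer 'target' column

# Map integer class labels (0, 1, 2) to readable species names
label_names = {i: str(name) for i, name in enumerate(iris.target_names)}
df["species"] = df["target"].map(label_names)

print(df.columns.tolist())
print(df.sample(5, random_state=0))
print(df.describe())
print(df["species"].value_counts())
```

For a classification dataset, the class counts are worth checking alongside `describe`, since they show whether the classes are balanced.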

C - Generate Data

C1 - make classification dataset

Use the `make_blobs` function to make a classification dataset
Visualize the dataset
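One way this could look, as a sketch: the sample count, number of centers, `random_state`, and output file name below are all arbitrary choices for illustration, and matplotlib is assumed to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# 300 points in 2-D around 3 cluster centers; y holds the cluster label (0, 1, 2)
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Color each point by its label so the clusters are visible
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", s=20)
ax.set_xlabel("feature 0")
ax.set_ylabel("feature 1")
ax.set_title("make_blobs: 3 clusters")
fig.savefig("blobs.png")  # hypothetical output file name
```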

C2 - Use `make_regression` to make a regression dataset

Use `make_regression` to make a regression dataset
Visualize the dataset
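A hedged sketch of the generation step: the parameter values are arbitrary, and a single feature is used so the data can be drawn as a simple x-y scatter. Fitting a line with `np.polyfit` is an extra sanity check added here, not something the exercise asks for.

```python
import numpy as np
from sklearn.datasets import make_regression

# One feature keeps the data easy to plot;
# coef=True also returns the true slope used to generate y
X, y, coef = make_regression(
    n_samples=100, n_features=1, noise=10.0, coef=True, random_state=42
)

print(X.shape, y.shape)  # (100, 1) (100,)
print("true coefficient:", coef)

# Fit a straight line to see how closely the slope is recovered
slope, intercept = np.polyfit(X[:, 0], y, deg=1)
print("fitted slope:", slope)
```

A scatter plot of `X[:, 0]` against `y` should show points spread around a straight line, with the spread controlled by `noise`.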

D - train/test split

D1 - Simple train/test split

Create data like this

import pandas as pd
from sklearn.model_selection import train_test_split

tip_data = pd.DataFrame({'bill' : [50.00, 30.00, 60.00, 40.00, 65.00, 20.00, 10.00, 15.00, 25.00, 35.00],
                        'tip' : [12.00, 7.00, 13.00, 8.00, 15.00, 5.00, 2.00, 2.00, 3.00, 4.00]})

x = tip_data[['bill']]
y = tip_data[['tip']]

# Use train_test_split function (20% test split)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

Print out x_train and x_test
Visually inspect the data. Are there any common elements between the train and test data?

D2 - test/train split on housing data

Read the `house-sales-simplified.csv` file.

import pandas as pd

house_sales = pd.read_csv(...)

# X: all columns except `SalePrice`
X = house_sales.drop(columns=['SalePrice'])
# y: the `SalePrice` column
y = house_sales['SalePrice']

Now split the X,y data into training and testing.

Print out the length of train and test datasets.

Since the data is too large to inspect visually, how can we programmatically ensure that there are no common elements between the train and test sets?
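One possible check, sketched on a small stand-in DataFrame because the CSV is not available here: `train_test_split` preserves the original row labels of a DataFrame, so an empty index intersection means no row landed in both sets. All column names other than `SalePrice`, and all values, are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the housing data; column names other than
# SalePrice are made up for illustration
house_sales = pd.DataFrame({
    "Bedrooms": [2, 3, 3, 4, 2, 5, 3, 4, 1, 2],
    "LivingSqFt": [900, 1400, 1500, 2100, 950, 3000, 1600, 2200, 600, 1000],
    "SalePrice": [200, 310, 330, 450, 210, 640, 350, 470, 150, 220],
})

X = house_sales.drop(columns=["SalePrice"])
y = house_sales["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Row labels survive the split, so shared labels would mean shared rows
common = X_train.index.intersection(X_test.index)
print(len(X_train), len(X_test), len(common))  # 8 2 0
```

This checks row identity only. Two distinct rows with identical values would not be flagged; comparing row contents (for example with a merge on all columns) would be needed for that.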

More exercises

Scikit exercises
