Scikit-Learn is the go-to library for machine learning in Python.
- An Introduction to Scikit Learn
- Introduction to Scikit-Learn from Python Data Science Handbook
- The Scikit-Learn online documentation is very extensive and thorough
- [Train-Test Split for Evaluating Machine Learning Algorithms](https://www.bitdegree.org/learn/train-test-split)
By the end of this lab you should be able to do the following:
- Have a Scikit-Learn development environment set up
- Explore a built-in dataset
- Generate a synthetic dataset
import sklearn
print(sklearn.__version__)
# you should see the installed version printed
- Find out all the columns
- Describe the dataframe
- Print out sample data
- Use `describe` to understand the data
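The exploration steps above can be sketched as follows. This is a minimal example that assumes the iris dataset as the built-in dataset (that choice is an assumption, not something the exercise specifies) and a Scikit-Learn version recent enough to support `as_frame=True`.

```python
from sklearn.datasets import load_iris

# Load a built-in dataset as a pandas DataFrame (iris is an assumed choice)
iris = load_iris(as_frame=True)
df = iris.frame  # feature columns plus a 'target' column

# Find out all the columns
print(df.columns.tolist())

# Print out sample data
print(df.head())

# Use describe to understand the data (count, mean, std, min, quartiles, max)
print(df.describe())
```

Other `load_*` functions in `sklearn.datasets` that accept `as_frame=True` can be explored the same way.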
- Use the `make_blobs` function to make a classification dataset, then visualize the dataset
- Use `make_regression` to make a regression dataset, then visualize the dataset
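One possible way to generate and plot both synthetic datasets is sketched below. The sample counts, number of centers, noise level, `random_state`, and output file name are all illustrative assumptions, and matplotlib is assumed to be installed for plotting.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_regression

# Classification data: 200 points in 2-D grouped around 3 cluster centers
X_blobs, y_blobs = make_blobs(n_samples=200, n_features=2, centers=3, random_state=42)

# Regression data: 200 points with a single feature and Gaussian noise on the target
X_reg, y_reg = make_regression(n_samples=200, n_features=1, noise=10.0, random_state=42)

fig, (ax_cls, ax_reg) = plt.subplots(1, 2, figsize=(10, 4))

# Color each blob point by its cluster label
ax_cls.scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_blobs)
ax_cls.set_title("make_blobs")

# Plot the single feature against the target
ax_reg.scatter(X_reg[:, 0], y_reg)
ax_reg.set_title("make_regression")

# Hypothetical output file name; use plt.show() in an interactive session instead
fig.savefig("synthetic_datasets.png")
```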
Create data like this:
import pandas as pd
from sklearn.model_selection import train_test_split
tip_data = pd.DataFrame({
    'bill': [50.00, 30.00, 60.00, 40.00, 65.00, 20.00, 10.00, 15.00, 25.00, 35.00],
    'tip': [12.00, 7.00, 13.00, 8.00, 15.00, 5.00, 2.00, 2.00, 3.00, 4.00],
})
x = tip_data[['bill']]
y = tip_data[['tip']]
# Use train_test_split function (20% test split)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
- Print out `x_train` and `x_test`
- Visually inspect the data. Are there any common elements between the train and test data?
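The inspection steps above might look like the following sketch. It rebuilds the tip data so it runs on its own, and it adds `random_state=42` (an assumption, not part of the exercise) so that the same rows land in each set on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

tip_data = pd.DataFrame({
    'bill': [50.00, 30.00, 60.00, 40.00, 65.00, 20.00, 10.00, 15.00, 25.00, 35.00],
    'tip': [12.00, 7.00, 13.00, 8.00, 15.00, 5.00, 2.00, 2.00, 3.00, 4.00],
})
x = tip_data[['bill']]
y = tip_data[['tip']]

# random_state is an assumption added here so the split is repeatable
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print(x_train)  # 8 rows: 80% of the 10 samples
print(x_test)   # 2 rows: 20% of the 10 samples

# The row labels (index) survive the split, so they can be compared by eye
print(sorted(x_train.index), sorted(x_test.index))
```

Because each row keeps its original index label, a label that appears in one printout should not appear in the other.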
Read the `house-sales-simplified.csv` file.
import pandas as pd
house_sales = pd.read_csv(...)
X = house_sales.drop(columns=['SalePrice'])  # all columns except `SalePrice`
y = house_sales['SalePrice']  # the `SalePrice` column
Now split the X, y data into training and testing sets.
Print out the length of the train and test datasets.
Since the data is too large to inspect visually, how can we programmatically ensure there are no common elements between the train and test sets?
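One way to approach this check is to compare row labels, as in the sketch below. The small DataFrame is a hypothetical stand-in for the house sales file: only the `SalePrice` column name comes from the exercise, and the other column names and all values are made up for illustration. The approach assumes the DataFrame index is unique, which is the default when reading a CSV with pandas.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the house sales data (values are invented)
house_sales = pd.DataFrame({
    'Bedrooms': [2, 3, 3, 4, 2, 5, 3, 4, 1, 3],
    'LivingArea': [900, 1500, 1400, 2200, 1000, 3000, 1600, 2100, 600, 1700],
    'SalePrice': [150000, 240000, 230000, 350000, 160000,
                  500000, 260000, 340000, 100000, 270000],
})
X = house_sales.drop(columns=['SalePrice'])
y = house_sales['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Print out the length of train and test datasets
print(len(X_train), len(X_test))

# train_test_split keeps the original row labels, so an empty index
# intersection means no row landed in both sets
common_rows = X_train.index.intersection(X_test.index)
print(len(common_rows))
assert common_rows.empty
```

Comparing index labels rather than cell values avoids false alarms when two different houses happen to share the same feature values.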