For this linear regression example, we use the heart disease dataset, a public health dataset available on Kaggle.
We use only two of its fields: trestbps (resting blood pressure in mm Hg) and thalach (maximum heart rate achieved). The two fields are not strongly correlated, but for demonstration purposes we use them to fit a linear regression, both with existing Python libraries and with manual calculations in Python.
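One way to put a number on how weakly two fields are related is Pearson's correlation coefficient r, which ranges from -1 to 1. The sketch below is a minimal pure-Python version; the function name `pearson_r` and the sample values are illustrative assumptions and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    # r = [n ΣXY - ΣX ΣY] / sqrt([n ΣX² - (ΣX)²] [n ΣY² - (ΣY)²])
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when either sequence is constant")
    return (n * sum_xy - sum_x * sum_y) / denominator


# Made-up values for illustration only (not from the heart disease dataset).
example_x = [1, 2, 3, 4]
example_y = [2, 4, 6, 8]
print(pearson_r(example_x, example_y))
```

A value of r close to 0 indicates little linear relationship, while values near -1 or 1 indicate a strong one.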
To calculate the intercept, the b in y = mx + b, we use the following formula:
- Intercept = [(ΣY)(ΣX²) – (ΣX)(ΣXY)] / [n(ΣX²) – (ΣX)²]
To calculate the slope, the m in y = mx + b, we use the following formula:
- Slope = [n(ΣXY) – (ΣX)(ΣY)] / [n(ΣX²) – (ΣX)²]
Here n is the number of data points, and each Σ sums over all of them.
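The two formulas translate directly into Python. The following is a minimal sketch of the manual calculation, not necessarily the exact code used for this example; the function name `manual_linear_regression` and the sample values are hypothetical.

```python
def manual_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = mx + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Both formulas share the denominator n(ΣX²) - (ΣX)².
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1 (not from the dataset).
slope, intercept = manual_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(f"y = {slope}x + {intercept}")  # prints "y = 2.0x + 1.0"
```

With the real data, the X and Y columns of the dataset would be passed in place of the made-up lists.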
We then compare our values with those calculated by SciPy's stats.linregress:
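import matplotlib.pyplot as plt
from scipy import stats
# linear_table is assumed to be a table (e.g. a pandas DataFrame) with the predictor in "X" and the response in "Y"
# linregress returns the slope, intercept, correlation coefficient r, p-value, and standard error of the slope
slope, intercept, r, p, std_err = stats.linregress(linear_table["X"], linear_table["Y"])
print(f"y = {slope}x + {intercept}")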