Simple Linear Regression in Python
Problem: Predicting sales based on advertising spend on TV, Radio, and Newspaper.
Importing Required Libraries
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Model-building libraries
import statsmodels
import statsmodels.api as sm
import sklearn
from sklearn.model_selection import train_test_split
Read Dataset
adv=pd.read_csv("https://raw.githubusercontent.com/rkmishracs/dataset/main/advertising.csv")
Check Dataset
adv.head()
The first three columns (TV, Radio, and Newspaper) are predictor variables, and the fourth column (Sales) is the target variable.
Checking Shape of Dataset
adv.shape
Output
(200, 4)
Dataset detail info
adv.info()
info() reports the data type of each column and whether any null values are present in the dataset.
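To make the null check explicit, you can also count missing values per column (a quick sketch using standard pandas calls):
# Count missing values in each column; all zeros means nothing to impute
adv.isnull().sum()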
Describe Data
adv.describe()
It gives a summary of the descriptive statistics (count, mean, standard deviation, min/max, and quartiles) for each numeric column.
Data Visualisation: TV vs Sales
sns.regplot(x='TV', y='Sales', data=adv)
TV is on the x-axis and Sales on the y-axis; the points follow a clear linear trend, so this relationship is well suited to linear regression.
Data Visualisation: Radio vs Sales
sns.regplot(x='Radio', y='Sales', data=adv)
The relationship between Radio and Sales is not as strong as the one with TV: the data points are much more scattered.
Data Visualisation: Newspaper vs Sales
#Newspaper Vs Sales
sns.regplot(x='Newspaper', y='Sales', data=adv)
The Newspaper-Sales relationship is the weakest, as the data points are scattered widely along the y-axis.
Visualisation: Pairplot -> X-axis (all predictors) and Y-axis (target variable)
sns.pairplot(data=adv, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales')
The pairplot gives a comparative view of each predictor against the target variable: the first subplot shows a strong positive correlation between TV and Sales, the relationship between Radio and Sales is noticeably weaker, and Newspaper shows the least convincing correlation.
Correlation
adv.corr()
The correlation between TV and Sales is 0.9, which is very high.
The correlation between Radio and Sales is 0.34, which is lower than that of TV and Sales.
The correlation between Newspaper and Sales is 0.15, which is the lowest.
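To pull out just the correlations with the target, sorted from strongest to weakest (a small sketch reusing the same corr() output):
# Correlation of every column with Sales, in descending order
adv.corr()['Sales'].sort_values(ascending=False)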
Visualization: Heatmap
sns.heatmap(adv.corr())
In this heatmap the correlation values are not printed, so you can only judge the strength of each relationship by colour.
Visualization: Heatmap(with Values)
# heatmap with values
sns.heatmap(adv.corr(), annot=True)
Lighter cells indicate a more positive correlation between the variables, and darker cells a more negative one.
Model Building
Steps:
1. Create X and y
2. Create train and test sets (70-30 or 80-20)
3. Train the model on the training set (i.e. learn the coefficients)
4. Evaluate the model (on the training set and the test set)
1. Create X and y
X=adv['TV']
y=adv['Sales']
2. Create train and test sets (70-30)
X_train, X_test, y_train, y_test= train_test_split(X,y, train_size=0.70, random_state=100)
3. Training the model on the training set (i.e. learn the coefficients), using statsmodels
X_train_sm=sm.add_constant(X_train)
X_train_sm.head()
# Training the model
#fitting the model
lr=sm.OLS(y_train, X_train_sm)
lr_model=lr.fit()
lr_model.params
const 6.948683
TV 0.054546
dtype: float64
y = mx + c
Sales = 0.054 × TV + 6.948
Model Summary
lr_model.summary()
OLS, or Ordinary Least Squares, is the standard method used to fit a linear regression model.
By default, the statsmodels library fits a line that passes through the origin. To include an intercept, you need to add a constant column manually using statsmodels' add_constant function. Once you've added the constant to your X_train dataset, you can fit a regression line using the OLS (Ordinary Least Squares) class of statsmodels.
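To make the effect of add_constant concrete, here is a minimal sketch on toy values (the numbers are illustrative, not from the dataset):
# add_constant prepends a column of ones, which acts as the intercept regressor
x = np.array([10.0, 20.0, 30.0])
print(sm.add_constant(x))
# [[ 1. 10.]
#  [ 1. 20.]
#  [ 1. 30.]]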
Looking at some key statistics from the summary
The values we are concerned with are -
The coefficients and significance (p-values)
R-squared
F statistic and its significance
1. The coefficient for TV is 0.054, with a very low p value
The coefficient is statistically significant. So the association is not purely by chance.
2. R - squared is 0.816
Meaning that 81.6% of the variance in Sales is explained by TV
This is a decent R-squared value.
3. The F-statistic has a very low p value (practically zero)
Meaning that the model fit is statistically significant, and the explained variance isn't purely by chance.
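These statistics can also be read directly off the fitted results object instead of the printed summary (attribute names as defined by statsmodels' OLSResults):
# Key statistics as attributes of the fitted model
print(lr_model.rsquared)   # R-squared
print(lr_model.fvalue)     # F-statistic
print(lr_model.f_pvalue)   # p-value of the F-statistic
print(lr_model.pvalues)    # p-values of the individual coefficients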
The fit is significant. Let's visualize how well the model fit the data.
From the parameters that we get, our linear regression equation becomes:
Sales = 6.948 + 0.054 × TV
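As a quick worked example (TV = 100 is an arbitrary illustrative value), the equation predicts:
# Predicted Sales for TV = 100, using the rounded coefficients above
6.948 + 0.054 * 100   # = 12.348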
Scatter Plot on X_train and y_train
plt.scatter(X_train, y_train)
Plotting Model Prediction
plt.scatter(X_train, y_train)
plt.plot(X_train, 6.948+0.054*X_train,'r')
plt.show()
Residual Analysis
To validate the assumptions of the model, and hence its reliability for inference.
Distribution of the error terms
We need to check whether the error terms are normally distributed (which is, in fact, one of the major assumptions of linear regression). Let us plot a histogram of the error terms and see what it looks like.
# error=f(y_train, y_train_pred)
y_train_pred=lr_model.predict(X_train_sm)
y_train_pred
Calculating Residual
residual=y_train-y_train_pred
residual
Plot the Residuals Histogram
plt.figure()
# histplot draws on the current figure (displot would open a separate one, leaving this figure blank)
sns.histplot(residual)
plt.title("Residual Plot")
plt.show()
The residuals appear to be normally distributed with mean 0. All good!
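A Q-Q plot is another standard normality check; a minimal sketch using statsmodels' qqplot:
# Residual quantiles vs theoretical normal quantiles; points close to the line indicate normality
sm.qqplot(residual, fit=True, line='45')
plt.show()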
Plot Residual scatter plot
#Plotting residuals
plt.scatter(X_train, residual)
plt.show()
The residuals are spread evenly around zero with no visible pattern, which is a good sign for the model (constant variance).
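Plotting the residuals against the fitted values (rather than against X_train) is an equivalent check for constant variance; a small sketch:
# Residuals vs fitted values; a patternless band around zero suggests homoscedasticity
plt.scatter(y_train_pred, residual)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()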
Prediction and Evaluation of Model on Test Dataset
Now that we have fitted a regression line on the train dataset, it's time to make predictions on the test data. First add a constant to X_test, just as we did for X_train, and then predict the y values corresponding to X_test_sm using the predict method of the fitted model.
# Add a constant to X_test
X_test_sm = sm.add_constant(X_test)
# Predict the y values corresponding to X_test_sm
y_pred = lr_model.predict(X_test_sm)
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
#RMSE
np.sqrt(mean_squared_error(y_test, y_pred))
Output
2.019296008966232
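Equivalently, RMSE can be recomputed by hand as a quick sanity check:
# Root mean squared error, computed directly with NumPy
np.sqrt(np.mean((y_test - y_pred) ** 2))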
R-squared on the test set
r_squared = r2_score(y_test, y_pred)
r_squared
Output
0.792103160124566
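The test R-squared (0.792) is close to the training R-squared (0.816), which suggests the model generalizes reasonably well to unseen data.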
Visualizing the fit on the test set
plt.scatter(X_test, y_test)
plt.plot(X_test, 6.948 + 0.054 * X_test, 'r')
plt.show()
Linear Regression using linear_model in sklearn
Apart from statsmodels, there is another package, sklearn, that can be used to perform linear regression. We will use the linear_model module from sklearn to build the model. We repeat the train-test split below with the same random_state, so the resulting sets are identical to the ones used earlier.
There's one small step we need to add, though. When there's only a single feature, sklearn expects a 2D array, so we reshape X into a single column for the linear regression fit to be performed successfully.
from sklearn.model_selection import train_test_split
X_train_lm, X_test_lm, y_train_lm, y_test_lm = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)
X_train_lm.shape
Output
(140,)
X_train_lm = X_train_lm.values.reshape(-1,1)
X_test_lm = X_test_lm.values.reshape(-1,1)
print(X_train_lm.shape)
print(y_train_lm.shape)
print(X_test_lm.shape)
print(y_test_lm.shape)
Output
(140, 1)
(140,)
(60, 1)
(60,)
from sklearn.linear_model import LinearRegression
# Create a LinearRegression object, lm
lm = LinearRegression()

# Fit the model using lm.fit()
lm.fit(X_train_lm, y_train_lm)
print(lm.intercept_)
print(lm.coef_)
Output
6.948683200001357
[0.05454575]
The equation we get is the same as what we got before!
Sales = 6.948 + 0.054 × TV
The sklearn linear model is useful because it is compatible with many other sklearn utilities (cross-validation, grid search, etc.).
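For instance, cross-validation takes a single extra call (a minimal sketch; the 5-fold choice here is an assumption, not part of the original walkthrough):
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated R-squared scores for the same single-feature model
scores = cross_val_score(LinearRegression(), X.values.reshape(-1, 1), y, cv=5, scoring='r2')
print(scores.mean(), scores.std())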
Scaling Methods
Min-Max Scaling
Standard Scaling
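The walkthrough below demonstrates Standard scaling; as a minimal sketch of the other method (toy values, assumed), Min-Max scaling rescales each column to the [0, 1] range:
from sklearn.preprocessing import MinMaxScaler
# (x - min) / (max - min) per column, so every value lands in [0, 1]
data = np.array([[10.0], [20.0], [40.0]])
print(MinMaxScaler().fit_transform(data))
# [[0.        ]
#  [0.33333333]
#  [1.        ]]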
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)
scikit-learn provides both of these scaling utilities:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# fit_transform works on 2D arrays only, so reshape X_train and y_train
# into single columns before performing the standardisation.
X_train_scaled = X_train.values.reshape(-1,1)
y_train_scaled = y_train.values.reshape(-1,1)
X_train_scaled.shape
Output
(140, 1)
# Create a scaler object using StandardScaler()
scaler = StandardScaler()
# Fit and transform the train set; the test set would later be transformed using the same fit
X_train_scaled = scaler.fit_transform(X_train_scaled)
y_train_scaled = scaler.fit_transform(y_train_scaled)
print("mean and sd for X_train_scaled:", np.mean(X_train_scaled), np.std(X_train_scaled))
print("mean and sd for y_train_scaled:", np.mean(y_train_scaled), np.std(y_train_scaled))
output
mean and sd for X_train_scaled: 2.5376526277146434e-17 0.9999999999999999
mean and sd for y_train_scaled: -2.5376526277146434e-16 1.0
# Let's fit the regression line following exactly the same steps as done before
X_train_scaled = sm.add_constant(X_train_scaled)
lr_scaled = sm.OLS(y_train_scaled, X_train_scaled).fit()
# Check the parameters
lr_scaled.params
Output
array([-2.44596010e-16, 9.03212773e-01])
As you might notice, the values of the parameters have changed, since we changed the scale.
Let's look at the statistics of the model, to see if any other aspect of the model has changed.
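Checking the full summary should confirm this (a sketch; for a pure linear rescaling of X and y, the coefficients change but R-squared and the F-statistic stay the same):
# Summary of the model fitted on scaled data; R-squared should match the unscaled fit
lr_scaled.summary()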