Regression 3. Linear Regression - Assumptions (With Python Code)

 Linear regression, a foundational statistical method, relies on several key assumptions. Understanding these assumptions is crucial for interpreting the results accurately. 

Assumptions

Let's go through each assumption with examples:

1. Linearity:

    Assumption: The relationship between the independent variables and the dependent variable is linear.

    Example: Let's say you're studying the relationship between temperature and ice cream sales. If a linear regression model is appropriate, an increase in temperature should consistently lead to an increase in ice cream sales at a constant rate.
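
A minimal sketch of what a violation looks like (illustrative data, not from the original notebook): fitting a straight line to a curved relationship leaves a systematic pattern in the residuals instead of random scatter.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

np.random.seed(0)
x = np.random.uniform(-10, 10, 100)
# The true relationship has a quadratic term, but we fit a straight line
y_curved = 3 * x + 0.5 * x ** 2 + np.random.normal(0, 2, 100)

fit = sm.OLS(y_curved, sm.add_constant(x)).fit()

# A U-shaped pattern in residuals vs fitted values signals a linearity violation
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Linearity violation: systematic pattern in residuals')
plt.show()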

2. Independence:

    Assumption: The residuals (or errors) are independent.

The assumption states that the residuals (the differences between observed and predicted values) should be independent of each other.

This means that the value of one error should not predict the value of another error. In other words, there should be no correlation between the residuals. Violation of this assumption, often referred to as autocorrelation, typically leads to underestimated standard errors and misleading significance tests.

    Example: In a study on the effect of study hours on exam scores, the assumption would be that each student's score is independent of the others. If the students studied together, their scores might be correlated, violating this assumption.
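
A minimal sketch (illustrative data, not from the original notebook) of how autocorrelated errors show up in the Durbin-Watson statistic used later in this post:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

np.random.seed(0)
x = np.random.uniform(-10, 10, 100)

# AR(1) errors: each error depends on the previous one (positive autocorrelation)
e = np.zeros(100)
for t in range(1, 100):
    e[t] = 0.8 * e[t - 1] + np.random.normal(0, 5)
y = 3 * x + 10 + e

model = sm.OLS(y, sm.add_constant(x)).fit()
# Independent residuals give a value near 2; positive autocorrelation pushes it well below 2
print(durbin_watson(model.resid))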

3. Homoscedasticity:

    Assumption: The residuals (the differences between observed and predicted values) have constant variance across all levels of the independent variables. In other words, the size of the error should not systematically vary as the value of the predictor changes.

    Example: In a regression model predicting house prices, homoscedasticity implies that the variability of the prediction errors is the same across all price levels. Whether the house is inexpensive or expensive, the spread of the errors should be consistent.

If heteroscedasticity is detected, it can be addressed through transformations (like log or square root transformations of the dependent variable), using weighted least squares, or by adopting robust regression methods.
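
A minimal sketch of the weighted least squares remedy (illustrative data and weights, not from the original notebook):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(1)
x = np.random.uniform(1, 10, 200)
# The error spread grows with x, violating the constant-variance assumption
y = 3 * x + 10 + np.random.normal(0, x)
df = pd.DataFrame({'X': x, 'Y': y})

# Ordinary least squares ignores the changing variance
ols_fit = smf.ols('Y ~ X', data=df).fit()

# Weighted least squares: weight each observation by 1/variance (assumed proportional to 1/X**2)
wls_fit = smf.wls('Y ~ X', data=df, weights=1.0 / df['X'] ** 2).fit()

# Compare the coefficient standard errors from the two fits
print(ols_fit.bse)
print(wls_fit.bse)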

4. Normality of Errors:

    Assumption: The residuals are normally distributed.

    Example: In a regression model predicting house prices from square footage, this assumption means that the differences between the observed house prices and the prices predicted by the model should follow a normal distribution.

EDA on Dependent Variable: If the dependent variable is heavily skewed, this might affect the distribution of residuals. Sometimes, transforming the dependent variable (e.g., using a logarithmic transformation) can help.

Understanding Data Source and Type: Knowledge about the nature of your data can sometimes give clues. For example, count data or bounded data might not naturally lead to normally distributed errors.
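
A minimal sketch (illustrative data, not from the original notebook) of how a log transformation of a skewed dependent variable can bring the residuals closer to normality:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

np.random.seed(2)
x = np.random.uniform(0, 10, 200)
# Multiplicative (log-normal) noise produces a right-skewed Y and skewed residuals
y = np.exp(0.3 * x + np.random.normal(0, 0.5, 200))
df = pd.DataFrame({'X': x, 'Y': y})

raw_fit = smf.ols('Y ~ X', data=df).fit()
log_fit = smf.ols('np.log(Y) ~ X', data=df).fit()

# Shapiro-Wilk p-values: typically small for the raw fit, much larger after the log transform
print(stats.shapiro(raw_fit.resid).pvalue)
print(stats.shapiro(log_fit.resid).pvalue)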

5. No or Little Multicollinearity:

    Assumption: In multiple regression, the independent variables should not be too highly correlated with each other.

    Example: If you’re using both ‘years of education’ and ‘years of professional training’ as independent variables to predict income, these two might be highly correlated (multicollinear), which can distort the results.
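
A minimal sketch (hypothetical variable names) of quantifying multicollinearity with the variance inflation factor (VIF), the diagnostic mentioned later under Checking Assumptions; VIFs well above roughly 5-10 are usually read as a warning sign:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

np.random.seed(3)
education = np.random.normal(14, 2, 200)
# 'training' is constructed to be strongly correlated with 'education'
training = 0.9 * education + np.random.normal(0, 0.5, 200)
X = pd.DataFrame({'education': education, 'training': training})

# Compute a VIF for each column of the design matrix (including the constant)
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)  # large VIFs for 'education' and 'training' indicate multicollinearity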

Why These Assumptions Matter:

 Violation of these assumptions can lead to biased or inaccurate results. For instance, if residuals are not independent (autocorrelation), it can lead to an underestimation of the standard error and false significance tests.

 In practice, perfect adherence to these assumptions is rare, but significant deviations can severely impact the model's predictive ability and interpretability.

Checking Assumptions:

 Residuals are only available after a model has been fit, so assumptions involving them are checked on the fitted model. Techniques such as residual plots, the variance inflation factor (VIF), and normal probability plots are used to check these assumptions.

To interpret the results from the various plots and tests used to check the assumptions of linear regression, follow these guidelines:

1. Scatter Plot (Linearity Check):

    How to Interpret: A scatter plot of the independent variable (X) against the dependent variable (Y) should show a roughly linear relationship. Look for a clear trend that isn't curvilinear. If the data points form a straight line or close to it, the linearity assumption is satisfied.

2. Residuals vs Fitted Plot (Homoscedasticity Check):

    How to Interpret: This plot shows the residuals on the y-axis and the fitted values on the x-axis. You're looking for a random scatter of points without any discernible pattern. If the residuals display a consistent spread across the range of fitted values (no funnel shape), homoscedasticity is satisfied. A pattern, such as the spread increasing or decreasing with the fitted values, suggests heteroscedasticity.

3. Q-Q Plot (Normality of Residuals Check):

    How to Interpret: This plot displays the quantiles of the residuals against the expected quantiles of a normal distribution. If the residuals are normally distributed, the points should lie approximately along the reference line. Deviations from the line suggest deviations from normality.

4. Shapiro-Wilk Test (Normality of Residuals):

    How to Interpret: This is a formal test for normality. It provides a p-value; a p-value greater than the chosen significance level (commonly 0.05) means the test finds no evidence against normality of the residuals. A small p-value (below 0.05) suggests that the residuals are not normally distributed.

5. Durbin-Watson Statistic (Independence of Residuals):

    How to Interpret: The Durbin-Watson statistic ranges from 0 to 4, with a value around 2 suggesting no autocorrelation. Values below 2 indicate positive autocorrelation, and values above 2 suggest negative autocorrelation. Ideally, you want a value as close to 2 as possible.

Example Interpretation:

 If your scatter plot shows a clear linear pattern and your residuals plot displays a random spread, you can be confident about the linearity and homoscedasticity of your model.

 A Q-Q plot closely following the reference line, along with a Shapiro-Wilk test p-value above 0.05, would confirm the normality of residuals.

 A Durbin-Watson statistic close to 2 indicates that the residuals are independent, satisfying the independence assumption.

GitHub Link:

ml-course/linear_regression_assumptions.ipynb at main · lovelynrose/ml-course (github.com)

Python Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy import stats

# Set the random seed for reproducibility
np.random.seed(0)

# Generate independent variable (x) and normally distributed errors (epsilon)
x = np.random.uniform(-10, 10, 100)
epsilon = np.random.normal(0, 5, 100)

# Create a dependent variable (y) with a linear relationship to x
# Here, y = 3x + 10 + error
y = 3 * x + 10 + epsilon

# Create a DataFrame
data = pd.DataFrame({'X': x, 'Y': y})

# Fit a linear regression model
model = ols('Y ~ X', data=data).fit()

# Plot for Linearity and Homoscedasticity
plt.figure(figsize=(12, 6))

# Linearity
plt.subplot(1, 2, 1)
sns.scatterplot(x='X', y='Y', data=data)
plt.title('Linearity Check: Scatter Plot of X vs Y')

# Homoscedasticity: residplot shows the residuals of the Y ~ X fit against X;
# with a single predictor this matches residuals vs fitted values up to a linear rescaling of the x-axis
plt.subplot(1, 2, 2)
sns.residplot(x='X', y='Y', data=data, lowess=True)
plt.title('Homoscedasticity Check: Residuals vs Fitted')

plt.tight_layout()
plt.show()

# Normality of Errors: Q-Q plot of the residuals against a normal distribution
fig, ax = plt.subplots(figsize=(6, 6))
sm.qqplot(model.resid, line='s', ax=ax)
plt.title('Normality Check: Q-Q Plot of Residuals')
plt.show()

# Shapiro-Wilk Test for Normality
shapiro_test = stats.shapiro(model.resid)
print("Shapiro-Wilk test:", shapiro_test)

# Durbin-Watson Test for Independence
durbin_watson_stat = sm.stats.stattools.durbin_watson(model.resid)
print("Durbin-Watson statistic:", durbin_watson_stat)

Output:

[Figures: scatter plot of X vs Y (linearity check), residuals vs fitted plot (homoscedasticity check), and Q-Q plot of the residuals (normality check)]

Shapiro-Wilk test: ShapiroResult(statistic=0.9672574400901794, pvalue=0.01368639711290598)
Durbin-Watson statistic: 2.0832252321235343

In this run, the Durbin-Watson statistic of about 2.08 points to independent residuals, while the Shapiro-Wilk p-value of about 0.014 falls below 0.05 even though the errors were drawn from a normal distribution, a reminder that formal tests can flag deviations in finite samples. It's important to remember that these checks are somewhat subjective and should be interpreted in the context of your specific data and research question. Additionally, in practice, it's rare for real-world data to perfectly meet all these assumptions, and slight deviations might not always significantly impact your model's validity.

Remember, the assumptions of linear regression primarily concern the properties of the residuals, not the variables themselves. Understanding these assumptions is crucial for correctly applying linear regression and interpreting its results.

 
