Regression 3. Linear Regression - Assumptions (With Python Code)
Linear regression, a foundational statistical method, relies on several key assumptions. Understanding these assumptions is crucial for interpreting the results accurately.
Assumptions
Let's go through each assumption with examples:
1. Linearity:
Assumption: The relationship between the independent variables and the dependent variable is linear.
Example: Let's say you're studying the relationship between temperature and ice cream sales. If a linear regression model is appropriate, an increase in temperature should consistently lead to an increase in ice cream sales at a constant rate.
2. Independence:
Assumption: The residuals (or errors) are independent.
This assumption states that the residuals (the differences between observed and predicted values) should be independent of each other: the value of one error should not predict the value of another, and there should be no correlation between the residuals. Violation of this assumption, often referred to as autocorrelation, can bias the estimated standard errors and distort significance tests.
Example: In a study on the effect of study hours on exam scores, the assumption would be that each student's score is independent of the others. If the students studied together, their scores might be correlated, violating this assumption.
3. Homoscedasticity:
Assumption: The residuals (the differences between the observed values and the values predicted by the model) should have constant variance across all levels of the independent variables. In other words, the size of the error should not systematically vary as the value of the predictor changes.
Example: In a regression model predicting house prices, homoscedasticity implies that the variability in the prediction errors is the same across all price levels. Whether the house is inexpensive or expensive, the spread of the errors should be consistent.
If heteroscedasticity is detected, it can be addressed through transformations (such as a log or square-root transformation of the dependent variable), by using weighted least squares, or by adopting robust regression methods.
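For illustration, here is a minimal sketch of the first two remedies using statsmodels on made-up heteroscedastic data (the data and variable names are illustrative, not taken from the notebook linked below):

```python
import numpy as np
import statsmodels.api as sm

# Made-up heteroscedastic data: the error spread grows with x
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 50.0 + 10.0 * x + rng.normal(0, 2.0 * x, 200)   # noise scales with x
X = sm.add_constant(x)                                # add the intercept column

# Option 1: log-transform the dependent variable (requires positive y)
log_fit = sm.OLS(np.log(y), X).fit()

# Option 2: weighted least squares, down-weighting the noisier observations
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()
print(wls_fit.params)
```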
4. Normality of Errors:
Assumption: The residuals are normally distributed.
Example: In a regression model predicting house prices from square footage, this assumption means that the differences between the observed house prices and the prices predicted by the model should follow a normal distribution.
EDA on the Dependent Variable: If the dependent variable is heavily skewed, this might affect the distribution of the residuals. Sometimes, transforming the dependent variable (e.g., using a logarithmic transformation) can help.
Understanding the Data Source and Type: Knowledge about the nature of your data can sometimes give clues. For example, count data or bounded data might not naturally lead to normally distributed errors.
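As a quick illustration (hypothetical data, not the notebook's), the skewness of the dependent variable can be checked before and after a log transform:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed dependent variable (e.g., house prices)
rng = np.random.default_rng(1)
prices = pd.Series(np.exp(rng.normal(12, 0.5, 500)), name="price")

print("Skewness before transform:", round(prices.skew(), 2))
print("Skewness after log transform:", round(np.log(prices).skew(), 2))
```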
5. No or Little Multicollinearity:
Assumption: In multiple regression, the independent variables should not be too highly correlated with each other.
Example: If you're using both ‘years of education’ and ‘years of professional training’ as independent variables to predict income, these two might be highly correlated (multicollinear), which can distort the results.
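Multicollinearity is commonly quantified with the variance inflation factor (VIF), mentioned again under "Checking Assumptions" below. A rough sketch with made-up, nearly collinear predictors:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors that are nearly collinear
rng = np.random.default_rng(2)
education = rng.normal(14, 2, 300)               # years of education
training = education + rng.normal(0, 0.5, 300)   # years of training, closely tied to education
X = pd.DataFrame({"education": education, "training": training})

X_const = sm.add_constant(X)   # include an intercept when computing VIF
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)
```

VIF values well above roughly 5 to 10 are usually taken as a sign of problematic multicollinearity.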
Why These Assumptions Matter:
Violation of these assumptions can lead to biased or inaccurate results. For instance, if the residuals are not independent (autocorrelation), the standard errors can be underestimated, producing false significance tests.
In practice, perfect adherence to these assumptions is rare, but significant deviations can severely impact the model's predictive ability and interpretability.
Checking Assumptions:
Since residuals are calculated only after fitting a model, it might seem challenging to check the assumptions that involve them. Techniques such as residual plots, the variance inflation factor (VIF), and normal probability plots are used to check these assumptions.
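As a concrete starting point, here is a minimal setup on synthetic data (the data and variable names are illustrative, not taken from the linked notebook). The short snippets under the guidelines below reuse the model, fitted_values, and residuals defined here:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data with a roughly linear relationship
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, 200)

X = sm.add_constant(x)          # add the intercept column
model = sm.OLS(y, X).fit()      # ordinary least squares fit

fitted_values = model.fittedvalues
residuals = model.resid
```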
To interpret the results from the various plots and tests used to check the assumptions of linear regression, follow these guidelines:
1. Scatter Plot (Linearity Check):
How to Interpret: A scatter plot of the independent variable (X) against the dependent variable (Y) should show a roughly linear relationship. Look for a clear trend that isn't curvilinear. If the data points form a straight line or close to it, the linearity assumption is satisfied.
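With the synthetic setup above, one way to draw this plot with matplotlib:

```python
import matplotlib.pyplot as plt

plt.scatter(x, y, alpha=0.6)
plt.xlabel("X (independent variable)")
plt.ylabel("Y (dependent variable)")
plt.title("Linearity check: X vs Y")
plt.show()
```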
2. Residuals vs Fitted Plot (Homoscedasticity Check):
How to Interpret: This plot shows the residuals on the y-axis and the fitted values on the x-axis. You're looking for a random scatter of points without any discernible pattern. If the residuals display a consistent spread across the range of fitted values (no funnel shape), homoscedasticity is satisfied. A pattern, such as the spread increasing or decreasing with the fitted values, suggests heteroscedasticity.
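Using the fitted_values and residuals from the setup sketch above:

```python
import matplotlib.pyplot as plt

plt.scatter(fitted_values, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")   # reference line at zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.show()
```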
3. Q-Q Plot (Normality of Residuals Check):
How to Interpret: This plot displays the quantiles of the residuals against the expected quantiles of a normal distribution. If the residuals are normally distributed, the points should lie approximately along the reference line. Deviations from the line suggest deviations from normality.
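statsmodels can draw this plot directly from the residuals in the setup sketch above:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

sm.qqplot(residuals, line="45", fit=True)   # residual quantiles vs normal quantiles
plt.title("Q-Q plot of residuals")
plt.show()
```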
4. Shapiro-Wilk Test (Normality of Residuals):
How to Interpret: This is a formal test for normality. It provides a p-value; a p-value greater than the chosen significance level (commonly 0.05) indicates no evidence against normality, so the residuals are consistent with a normal distribution. A small p-value (below 0.05) suggests that the residuals are not normally distributed.
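SciPy provides the test; applied to the residuals from the setup sketch above:

```python
from scipy.stats import shapiro

stat, p_value = shapiro(residuals)
print(f"Shapiro-Wilk statistic = {stat:.3f}, p-value = {p_value:.3f}")
```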
5. Durbin-Watson Statistic (Independence of Residuals):
How to Interpret: The Durbin-Watson statistic ranges from 0 to 4, with a value around 2 suggesting no autocorrelation. Values below 2 indicate positive autocorrelation, and values above 2 suggest negative autocorrelation. Ideally, you want a value as close to 2 as possible.
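statsmodels computes the statistic from the residuals in the setup sketch above:

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(residuals)
print(f"Durbin-Watson statistic = {dw:.3f}")   # values near 2 suggest little autocorrelation
```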
Example Interpretation:
If your scatter plot shows a clear linear pattern and your residuals plot displays a random spread, you can be confident about the linearity and homoscedasticity of your model.
A Q-Q plot closely following the reference line, along with a Shapiro-Wilk test p-value above 0.05, would support the normality of the residuals.
A Durbin-Watson statistic close to 2 indicates that the residuals are independent, satisfying the independence assumption.
GitHub Link: ml-course/linear_regression_assumptions.ipynb at main · lovelynrose/ml-course (github.com)
Python Code
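The full code and its output are in the notebook linked above. The listing below is a self-contained sketch along the same lines (synthetic data and variable names are illustrative, not the notebook's actual code):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.stattools import durbin_watson

# --- Fit a simple model on synthetic data ---
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, 200)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
fitted, resid = model.fittedvalues, model.resid
print(model.summary())

# --- Linearity: scatter plot of X vs Y ---
plt.scatter(x, y, alpha=0.6)
plt.title("Linearity check: X vs Y")
plt.show()

# --- Homoscedasticity: residuals vs fitted values ---
plt.scatter(fitted, resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residuals vs Fitted")
plt.show()

# --- Normality of residuals: Q-Q plot and Shapiro-Wilk test ---
sm.qqplot(resid, line="45", fit=True)
plt.title("Q-Q plot of residuals")
plt.show()
stat, p = shapiro(resid)
print(f"Shapiro-Wilk p-value: {p:.3f}")

# --- Independence of residuals: Durbin-Watson statistic ---
print(f"Durbin-Watson statistic: {durbin_watson(resid):.3f}")
```

Running the sketch displays the three diagnostic plots and prints the model summary, the Shapiro-Wilk p-value, and the Durbin-Watson statistic.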
It's important to remember that these checks are somewhat subjective and should be interpreted in the context of your specific data and research question. Additionally, in practice it's rare for real-world data to perfectly meet all these assumptions, and slight deviations might not always significantly impact your model's validity.
Remember, the assumptions of linear regression primarily concern the properties of the residuals, not the variables themselves. Understanding these assumptions is crucial for correctly applying linear regression and interpreting its results.