1. Regression Overview with Python Code


A detailed explanation for Regression can be viewed in the following YouTube video.


Regression vs Classification

A regression task in the context of machine learning and statistics is a type of problem where the goal is to predict a continuous outcome variable based on one or more predictor variables. It's different from classification tasks, where the goal is to predict a discrete label. Regression is used to understand relationships between variables and for predicting trends or future values.

The output of a regression model is a real number for each input, so the predictions for a set of inputs form a vector of real numbers.
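As a small illustration of the difference, the sketch below derives both kinds of target from the same readings. The temperature values and the 20-degree threshold are made up for this example and are not from any real dataset.

```python
import numpy as np

# Hypothetical temperature readings in degrees Celsius (made-up values)
temperatures = np.array([12.4, 18.9, 25.1, 30.7, 8.3])

# Regression target: the continuous value itself
regression_target = temperatures

# Classification target: a discrete label derived from the value
# (the 20-degree threshold is an arbitrary choice for illustration)
classification_target = np.where(temperatures >= 20.0, "warm", "cold")

print("Regression targets:    ", regression_target)
print("Classification targets:", classification_target)
```

A regression model would be trained to output numbers like those in the first array, while a classifier would output one of the two labels.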

 Understanding Regression

1. Continuous Outcome Variable: 

This is a variable that can take any value within a range, including decimal values and not just integers. It is the dependent variable.

Example

Temperatures, prices, and distances are continuous outcomes.

A distance such as 32 km is a real number. Even though it looks like an integer, the true value may be something like 32.0000004 km once the metres and centimetres are counted.

2. Predictor Variables: 

Also known as independent variables, these are the inputs or factors that you suspect have an impact on the outcome variable. 

Example

In predicting house prices, the predictor variables might include the size of the house, its age, and the number of bedrooms.
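In code, predictors are usually arranged as a matrix with one row per observation and one column per predictor, alongside a separate array for the outcome. The sketch below assumes a handful of hypothetical houses; every number and variable name is invented for illustration.

```python
import numpy as np

# Hypothetical houses (all numbers are made up for illustration)
size_sqft = np.array([1400.0, 1600.0, 1700.0, 1875.0])
age_years = np.array([20.0, 15.0, 30.0, 5.0])
bedrooms = np.array([3.0, 3.0, 4.0, 4.0])

# Predictor (independent) variables: one row per house, one column per feature
X = np.column_stack((size_sqft, age_years, bedrooms))

# Outcome (dependent) variable: the price we want to predict
price = np.array([245000.0, 312000.0, 279000.0, 308000.0])

print("X shape:", X.shape)
print("price shape:", price.shape)
```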

3. Modeling the Relationship: 

Regression involves finding a mathematical equation that describes the relationship between the predictor and outcome variables. 

Example

The simplest form of regression is linear regression, which assumes a straight-line relationship.

 Types of Regression

1. Simple Linear Regression: 

Involves one predictor and one outcome variable. The relationship is modeled with a straight line: 

y = mx + b

where y is the outcome, x is the predictor, m is the slope of the line, and b is the y-intercept. The model is linear in both the coefficients (m and b) and x.
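For a single predictor, the least-squares slope and intercept have a simple closed form: m is the covariance of x and y divided by the variance of x, and b is chosen so the line passes through the means. The sketch below applies this to a few points that lie exactly on y = 2x + 1 (an assumed toy dataset), so the recovered values are easy to check.

```python
import numpy as np

# Toy data lying exactly on the line y = 2x + 1 (assumed for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Closed-form least-squares estimates for y = m*x + b
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

print("slope m =", m)
print("intercept b =", b)
```

np.polyfit(x, y, 1) solves the same least-squares problem and should return the same slope and intercept.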

 2. Multiple Linear Regression: 

Uses more than one predictor variable. The model looks like:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

b0 : Intercept

b1, b2, ... bn : Coefficients

x1, x2, ... xn : Predictor Variables

The model is linear in both the coefficients and the x values.
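A model of the form y = b0 + b1*x1 + b2*x2 can be fitted by ordinary least squares once a column of ones is added for the intercept. The sketch below uses np.linalg.lstsq on a small made-up dataset generated from known coefficients (1, 2 and -3), which is an assumption chosen so the result can be verified.

```python
import numpy as np

# Two hypothetical predictors, five observations (made-up values)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])

# Outcome generated from known coefficients: y = 1 + 2*x1 - 3*x2
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept b0
A = np.column_stack((np.ones(len(X)), X))

# Solve the least-squares problem A @ coef ~ y
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coef

print("b0 =", b0, "b1 =", b1, "b2 =", b2)
```

Because the data here contain no noise, the fitted coefficients match the generating ones up to floating-point error; with real data they would only be estimates.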

3. Non-Linear Regression: 

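For more complex relationships that can't be captured with a straight line, models such as polynomial regression or a logistic (S-shaped) curve (for specific kinds of non-linear relationships) are used. Note that logistic regression, despite its name, is normally used for classification; here the logistic function serves only as an example of a non-linear curve.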

The polynomial model looks like:

y = b0 + b1*x + b2*x^2 + ... + bn*x^n

The logistic function looks like:

f(x) = 1 / (1 + e^(-x))

The polynomial model is still linear in the parameters. It is non-linear in x.
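The statement that a polynomial model is linear in its parameters can be checked directly: if the powers of x are placed in the columns of a matrix, fitting the polynomial becomes an ordinary linear least-squares problem. The sketch below assumes noise-free samples of sin(2*pi*x) and compares that approach with np.polyfit. It also defines the standard logistic function for reference.

```python
import numpy as np

# Noise-free samples of sin(2*pi*x), used only for illustration
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x)

# Design matrix with columns x^3, x^2, x^1, x^0.
# The columns are non-linear in x, but y is modeled as a
# linear combination of them.
A = np.vander(x, 4)

# Ordinary least squares on the design matrix
coef_lstsq, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

# np.polyfit solves the same problem (highest power first)
coef_polyfit = np.polyfit(x, y, 3)

print("lstsq  :", coef_lstsq)
print("polyfit:", coef_polyfit)


def logistic(z):
    """Standard logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


print("logistic(0) =", logistic(0.0))
```

The two coefficient vectors should agree up to floating-point error, since both come from the same least-squares problem.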

 Performing a Regression Task

1. Data Collection: 

Gather data on the variables of interest. In our example, there are two variables of interest, x and y.

import numpy as np
import matplotlib.pyplot as plt

# Define the number of samples
num_samples = 100

# Create an array of 100 points between 0 and 1
x = np.linspace(0, 1, num_samples)

# Compute the sine of 2*pi*x
y = np.sin(2 * np.pi * x)

# Add Gaussian noise to y
# Set a mean and a standard deviation for the Gaussian noise
noise_mean = 0
noise_std_dev = 0.1

# Generate Gaussian noise
noise = np.random.normal(noise_mean, noise_std_dev, y.shape)

# Create a new y dataset with noise added
y_noisy = y + noise

# Combine x and y into a single dataset for the original and noisy data
dataset_original = np.column_stack((x, y))
dataset_noisy = np.column_stack((x, y_noisy))

# Plot the original dataset
plt.scatter(x, y, label='Original Dataset')

# Plot the noisy dataset
plt.scatter(x, y_noisy, color='red', label='Noisy Dataset', alpha=0.6)

# Label the axes
plt.xlabel('x')
plt.ylabel('y')

# Title of the plot
plt.title('Original vs Noisy Data')

# Show legend
plt.legend()

# Show grid
plt.grid(True)

# Show the plot on the screen
plt.show()

 2. Exploratory Analysis: 

Understand the data, check for correlations, and prepare it for modeling (handling missing values, encoding categorical variables, etc.).
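A minimal sketch of two such checks with NumPy is shown below: counting missing values and measuring the linear correlation between a predictor and the outcome. The dataset is synthetic (a noisy line with one value deliberately set to NaN), and the seed and noise level are assumptions made for this example.

```python
import numpy as np

# Synthetic data: a noisy line with one simulated missing measurement
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, x.shape)
y[10] = np.nan

# Count missing values in the outcome
num_missing = int(np.isnan(y).sum())
print("Missing values in y:", num_missing)

# One simple way to handle them: drop the affected rows
mask = ~np.isnan(y)
x_clean, y_clean = x[mask], y[mask]

# Pearson correlation between predictor and outcome
corr = np.corrcoef(x_clean, y_clean)[0, 1]
print("Correlation between x and y:", corr)
```

Dropping rows is only one option. Depending on the data, imputing the missing values may be more appropriate.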

 3. Model Selection: 

Choose an appropriate regression model based on the nature of the data and the relationship between variables. 
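One common way to compare candidate models is cross-validation: each model is fitted on part of the data and scored on the part it has not seen. The sketch below is a hand-rolled 5-fold version for polynomial degrees 1, 3 and 5 on noisy sine data. The seed, the number of folds, the candidate degrees and the helper name cv_rmse are all assumptions made for illustration.

```python
import numpy as np

# Synthetic noisy sine data (seed and noise level are assumptions)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y_noisy = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, x.shape)

# Shuffle the indices and split them into 5 folds
folds = np.array_split(rng.permutation(len(x)), 5)


def cv_rmse(degree):
    """Average validation RMSE of a polynomial fit over the folds."""
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train_idx], y_noisy[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        scores.append(np.sqrt(np.mean((y_noisy[val_idx] - pred) ** 2)))
    return float(np.mean(scores))


candidate_degrees = [1, 3, 5]
scores = {d: cv_rmse(d) for d in candidate_degrees}
best_degree = min(scores, key=scores.get)

for d in candidate_degrees:
    print(f"degree {d}: cross-validated RMSE = {scores[d]:.3f}")
print("Selected degree:", best_degree)
```

A straight line cannot follow a full sine period, so degree 1 is expected to score clearly worse than the higher degrees here.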

 4. Training the Model: 

Use the collected data to train the model. This involves finding the values of the parameters (like m and b in linear regression) that best fit the data.

import numpy as np
import matplotlib.pyplot as plt

# Assuming x, y, and y_noisy are already defined as in previous steps

# Fit a polynomial of degree 3 (cubic) to the noisy data
coefficients = np.polyfit(x, y_noisy, 3)

# Print the coefficients
print("Coefficients of the polynomial:", coefficients)

# Create a polynomial function using the fitted coefficients
p = np.poly1d(coefficients)

# Generate y values using the polynomial function
y_fit = p(x)

 5. Evaluation: 

Assess the model's performance using metrics like R-squared, Mean Squared Error (MSE), or Root Mean Squared Error (RMSE).

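import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Assuming y and y_fit are already defined as in previous steps
# Note: y is the noise-free target; pass y_noisy instead to score against the observed data

# Calculate and print the R² value
r_squared = r2_score(y, y_fit)
print("R² value:", r_squared)

# Calculate MSE and RMSE
mse = mean_squared_error(y, y_fit)
rmse = np.sqrt(mse)
print("MSE:", mse)
print("RMSE:", rmse)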
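The same three metrics can be computed by hand, which makes their definitions explicit: MSE is the mean of the squared errors, RMSE is its square root, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses four made-up target and prediction values; the variable names are chosen for this example only.

```python
import numpy as np

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean Squared Error: average of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

# Root Mean Squared Error: same units as the outcome variable
rmse_manual = np.sqrt(mse_manual)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual)
print("RMSE:", rmse_manual)
print("R²:", r2_manual)
```

An R² close to 1 means the predictions explain most of the variation in the outcome, while MSE and RMSE are closer to 0 for better fits.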

 6. Prediction: 

Use the model to make predictions on new, unseen data.

import numpy as np

# Assuming p is the polynomial fitted in the training step

# Two given x values to predict
x_predict1 = 0.523  # Example value, replace with your value
x_predict2 = 0.101 # Example value, replace with your value

# Predicting y values for the given x values
y_predict1 = p(x_predict1)
y_predict2 = p(x_predict2)

# Print the predictions
print("Prediction for x =", x_predict1, "is y =", y_predict1)
print("Prediction for x =", x_predict2, "is y =", y_predict2)

 7. Refinement: 

Based on the model's performance, it might be necessary to return to previous steps, select a different model, or gather more data.
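One possible form of this loop is sketched below: start with the simplest model, score it on held-out data, and move to a more flexible model only if the score is not good enough. The 70/30 split, the target RMSE of 0.2 and the maximum degree of 6 are hypothetical choices for this example, not recommended values.

```python
import numpy as np

# Synthetic noisy sine data (seed and noise level are assumptions)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y_noisy = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, x.shape)

# Hold out 30 of the 100 points for validation
order = rng.permutation(len(x))
train_idx, val_idx = order[:70], order[70:]

target_rmse = 0.2  # hypothetical acceptance threshold
max_degree = 6     # hypothetical cap on model complexity

chosen_degree = None
for degree in range(1, max_degree + 1):
    # Refit with a more flexible model on each pass
    coeffs = np.polyfit(x[train_idx], y_noisy[train_idx], degree)
    pred = np.polyval(coeffs, x[val_idx])
    val_rmse = np.sqrt(np.mean((y_noisy[val_idx] - pred) ** 2))
    print(f"degree {degree}: validation RMSE = {val_rmse:.3f}")
    if val_rmse <= target_rmse:
        chosen_degree = degree
        break

print("Chosen degree:", chosen_degree)
```

If no degree reaches the target, chosen_degree stays None, which would be a signal to try a different model family or gather more data.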

GitHub Link for Python Code:

The following code helps to understand the concept of regression.

ml-course/polynomial.ipynb at main · lovelynrose/ml-course (github.com)

The following code helps to understand all the steps in performing regression. 

ml-course/regression_basics_with_California_dataset.ipynb at main · lovelynrose/ml-course (github.com)

 Applications

Regression tasks are ubiquitous in real-world scenarios like predicting sales in business, estimating housing prices in real estate, forecasting weather conditions in meteorology, and many others. The key is that the outcome variable we want to predict or understand varies continuously and we believe this variation can be explained by other variables.


