Posts

Showing posts from March, 2024

Cross Entropy Loss

The cross-entropy loss, also known as log loss, plays a crucial role in classification tasks, especially in logistic regression and neural networks. It measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label, making it an effective loss function for assessing the similarity between the predicted probability distribution and the actual distribution. The formulae to be used are as follows.
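For a binary label y ∈ {0, 1} and a predicted probability p = P(Y=1|X), the standard binary cross-entropy is

L(y, p) = -[ y log(p) + (1 - y) log(1 - p) ]

and for K classes with one-hot labels y_k and predicted probabilities p_k it generalizes to

L = -Σ_k y_k log(p_k)

A minimal sketch in Python (numpy; the function name and the epsilon clipping are illustrative choices):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a batch of predictions."""
    p = np.clip(y_pred, eps, 1 - eps)  # keep log() away from 0
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# The loss grows as the prediction diverges from the true label.
print(binary_cross_entropy(np.array([1.0]), np.array([0.9])))  # ~0.105
print(binary_cross_entropy(np.array([1.0]), np.array([0.1])))  # ~2.303
```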

Logistic Regression

Logistic regression is a statistical method for binary classification. The classes can be taken as Y = {0, 1}, and for a dataset X we are trying to estimate P(Y=1|X).

Logistic Regression Workflow

Linear Combination (logit): The logit is calculated first, as shown in the blog post on Softmax. It is a linear combination of the input features, given by logit = t = w0 + w1 x1 + w2 x2 + ... + wn xn. This can be any value in (-∞, +∞), and it equals the log odds of P(Y=1|X).

Odds: The odds of an event is the ratio of the probability of the event occurring to the probability of the event not occurring: Odds = P(Y=1|X) / (1 - P(Y=1|X)).

Log Odds: This is the natural logarithm of the odds. In logistic regression, the log odds of P(Y=1|X) is exactly the logit t. The proof can be found in the pdf below.

Sigmoid Transformation: The logit t is passed through a sigmoid function to get a value between 0 and 1, which is interpreted as the probability of class Y=1, i.e. P(Y=1|X) = σ(t) = 1 / (1 + e^(-t)).
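A minimal sketch of this workflow in Python (the weights below are illustrative, not fitted to any data):

```python
import numpy as np

def sigmoid(t):
    """Squash the logit into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(x, w, b):
    """P(Y=1|X): linear combination (logit) followed by the sigmoid."""
    t = np.dot(w, x) + b          # logit = w0 + w1 x1 + ... + wn xn
    return sigmoid(t)

# Hypothetical weights for a 2-feature problem.
w, b = np.array([0.8, -1.2]), 0.5
x = np.array([2.0, 1.0])
p = predict_proba(x, w, b)
print(p, "-> predicted class", int(p >= 0.5))
```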

Softmax Function - First Derivative

Logits: These refer to the raw outputs produced by a machine learning model, before we normalize them into the expected form. For example, consider an ANN for classifying into 3 classes. The output layer may have 3 output neurons with a linear activation function. These outputs are the logits; note that they form a vector of 3 real values.

Softmax function: This function is used when we want to interpret the output of a model as probabilities for the various classes, which is specifically useful in a multi-class classification problem. It is a squashing function that maps the values into [0, 1] so that they sum to 1 across all classes. It evaluates the probability of choosing a class from the logits of all the classes, and the final predicted class is the one with the highest probability. Softmax is commonly used as the last layer of an ANN in a multi-class classification problem, where the node with the highest probability represents the chosen class.

Probability Distribution: A probability distribution satisfies two conditions: every value lies in [0, 1], and the values sum to 1.
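The softmax of a logit vector t is s_i = e^(t_i) / Σ_j e^(t_j), and its first derivative (the Jacobian) is ∂s_i/∂t_j = s_i (δ_ij - s_j), where δ_ij is 1 when i = j and 0 otherwise. A minimal numerically stable sketch in Python:

```python
import numpy as np

def softmax(t):
    """Map a logit vector to a probability distribution."""
    z = t - np.max(t)              # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(t):
    """First derivative: J[i, j] = s_i * (delta_ij - s_j)."""
    s = softmax(t)
    return np.diag(s) - np.outer(s, s)

logits = np.array([2.0, 1.0, 0.1])   # raw outputs for 3 classes
s = softmax(logits)
print(s, s.sum())                    # probabilities summing to 1
print(softmax_jacobian(logits))
```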

SVM

 

Linear, Non-linear, Kernel Models

Linear Parametric

Ordinary Least Squares (OLS) and pseudo-inverse matrix methods are linear parametric models.

Ordinary Least Squares (OLS): OLS is a linear regression method used for estimating the parameters of a linear model. In simple linear regression, the model has the form y = mx + b, and OLS aims to find the values of m and b that minimize the sum of squared differences between the observed and predicted values of y. The linearity in this context refers to the linear combination of the model parameters.

Pseudo-inverse Matrix Method: The pseudo-inverse matrix method is often used for solving linear systems of equations when the matrix is not square or not invertible. In the context of linear regression, the pseudo-inverse is used to find the parameter vector in the equation Y = Xβ + ε, where Y is the response variable, X is the design matrix of predictors, β is the parameter vector, and ε is the error term. The solution involves finding the pseudo-inverse of X, giving β = X⁺Y, which coincides with the normal-equation solution β = (XᵀX)⁻¹XᵀY when XᵀX is invertible.
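A minimal sketch comparing the two routes on synthetic data (the data, seed, and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=50)  # true m=3, b=2 plus noise

# Design matrix with a column of ones so beta = [b, m].
X = np.column_stack([np.ones_like(x), x])

# Normal-equation (OLS) solution: beta = (X^T X)^{-1} X^T y
beta_ols = np.linalg.inv(X.T @ X) @ X.T @ y

# Pseudo-inverse solution: beta = X^+ y (works even when X^T X is singular)
beta_pinv = np.linalg.pinv(X) @ y

print(beta_ols, beta_pinv)  # the two estimates agree here
```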

Parametric and Non-parametric Models

Parametric models

Parametric models in machine learning refer to a class of models that make assumptions about the functional form or the underlying data distribution. These models are characterized by a finite number of parameters, which means that irrespective of the size of the data, the complexity of the model is fixed. The goal of learning in parametric models is to estimate these parameters from the data. Once the parameters are learned, the model can make predictions for new, unseen data. Here are some key points about parametric models (a small illustration follows the list):

1. Fixed Number of Parameters: The number of parameters is predetermined before the training process begins. This number does not grow as the size of the training data increases.

2. Assumptions About Data Distribution: Parametric models often make strong assumptions about the form of the data distribution. For example, a linear regression model assumes that the relationship between the dependent and independent variables is linear.
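As a concrete illustration of point 1, here is a sketch using numpy's polyfit as a stand-in for a parametric linear model: however much data we fit, the learned model is always just two numbers (slope and intercept).

```python
import numpy as np

rng = np.random.default_rng(1)

for n in (100, 10_000):
    x = rng.uniform(0, 1, size=n)
    y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=n)
    params = np.polyfit(x, y, deg=1)   # fit slope and intercept
    # The parameter count is fixed regardless of dataset size.
    print(n, "points ->", params.size, "parameters:", params)
```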

Clustering - K means Clustering

K-means clustering is a popular unsupervised learning algorithm used to partition a dataset into a set of distinct, non-overlapping groups (or clusters) based on similarity. The goal is to organize the data into clusters such that data points within a cluster are more similar to each other than to those in other clusters. The "K" in K-means represents the number of clusters to be identified from the data, and it must be specified a priori. This method is widely used in data mining, pattern recognition, image analysis, and machine learning for its simplicity and efficiency, especially in handling large datasets.

How K-means Clustering Works

K-means clustering follows a straightforward iterative procedure to partition the dataset (see the sketch after this list):

1. Initialization: Choose K initial centroids randomly or based on a heuristic. Centroids are the center points of the clusters.

2. Assignment Step: Assign each data point to the nearest centroid. The "nearest" is usually determined by Euclidean distance.

3. Update Step: Recompute each centroid as the mean of the data points currently assigned to it.

4. Repeat: Alternate the assignment and update steps until the assignments stop changing or a maximum number of iterations is reached.
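A minimal sketch of this loop in plain numpy (the data, K, and seed below are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: each point goes to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs; K=2 should recover them.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```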