Posts

Kernel Function and Kernel Trick

Parametric Models
They store only the parameters, not the entire training dataset.

Memory-Based Methods
They store the entire training dataset or a subset of it. They are fast to train, but prediction takes longer, e.g. k-NN.

Need for Kernels
Not all problems are linearly separable. However, when a feature vector is transformed to a higher-dimensional space, a problem that is not linearly separable in the original space can become linearly separable. For example, consider a point (x, y). In the 2D space the data might not be linearly separable, that is, a straight line cannot separate the points. We can add a new dimension that is a combination of the existing dimensions, say z = x^2 + y^2. The transformed feature space is now (x, y, z) and has 3 dimensions. In this space, z may spread the data points in such a way that a linear separator exists between them. That is, we move from the original feature space to a transformed feature space. We cannot always say in advance how many dimensions the transformed space will need. A sketch of this transformation is shown below.
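To make the idea concrete, here is a minimal sketch (my own illustration, not from the original post) that maps 2D points lying on two concentric circles into 3D with z = x^2 + y^2, after which a single threshold on z separates the classes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes on concentric circles: not separable by a straight line in 2D.
theta = rng.uniform(0, 2 * np.pi, 200)
inner = np.c_[1.0 * np.cos(theta[:100]), 1.0 * np.sin(theta[:100])]   # class 0
outer = np.c_[3.0 * np.cos(theta[100:]), 3.0 * np.sin(theta[100:])]   # class 1
X = np.vstack([inner, outer])
y = np.r_[np.zeros(100), np.ones(100)]

# Add the new dimension z = x^2 + y^2 (squared distance from the origin).
z = X[:, 0] ** 2 + X[:, 1] ** 2
X3 = np.c_[X, z]

# In the transformed (x, y, z) space a plane such as z = 4 separates the classes.
pred = (X3[:, 2] > 4.0).astype(float)
print("accuracy of the linear separator on z:", (pred == y).mean())  # 1.0
```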

Constructing Kernels

Properties of Kernel Functions
To perform this transformation we need to choose proper kernel functions. Functions that satisfy the following properties make good kernel functions.
1. Symmetry: K(x, y) = K(y, x). That is, the similarity measure between x and y is the same as the similarity measure between y and x. The kernel matrix will be a symmetric matrix.
2. Positive Semidefiniteness: the kernel matrix formed must be positive semidefinite.

Eigenvalues and Eigenvectors
A symmetric matrix is positive semidefinite if and only if all of its eigenvalues are non-negative, which gives a practical way to check the second property.

Mercer's Theorem
It states that a continuous, symmetric, positive semidefinite function is a valid kernel. It also states that such a kernel corresponds to a dot product (an inner product in Euclidean space) in a higher-dimensional space.

Constructing Common Kernel Functions
The following pdf shows how to construct kernels and check for valid kernels.
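As a rough numerical illustration of these two properties (my own sketch, not from the linked pdf), the following builds a Gram matrix from an RBF kernel and checks symmetry and positive semidefiniteness via the eigenvalues:

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2), a standard valid kernel.
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # 10 points in 3 dimensions

# Gram (kernel) matrix: K[i, j] = K(x_i, x_j).
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

# 1. Symmetry: K(x, y) = K(y, x).
print("symmetric:", np.allclose(K, K.T))

# 2. Positive semidefiniteness: all eigenvalues of the kernel matrix >= 0
#    (tiny negative values can appear from floating-point error).
eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigvals.min())
print("PSD:", np.all(eigvals >= -1e-10))
```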

Clustering - Agglomerative clustering

Agglomerative clustering is a bottom-up, hierarchical approach to clustering in which each data point starts in its own cluster and pairs of clusters are merged, based on certain criteria, until all points belong to a single cluster or a stopping criterion is met. The similarity between clusters is typically measured using a distance metric, such as Euclidean distance, and different linkage criteria define how this similarity is computed and hence the merging strategy. Common linkage criteria include:
1. Single Linkage (or Minimum Linkage): in single linkage clustering, the distance between two clusters is defined as the minimum distance between any point in one cluster and any point in the other. A minimal sketch using single linkage is shown after this list.
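Here is a small sketch of the bottom-up process on a toy dataset (my own example, assuming SciPy is available), using single linkage and then cutting the hierarchy into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of 2D points.
X = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # group A
    [5.0, 5.0], [5.1, 5.2], [5.2, 4.9],   # group B
])

# Bottom-up merging: each point starts as its own cluster and the two
# closest clusters are merged at every step. With single linkage the
# cluster-to-cluster distance is the minimum pairwise point distance.
Z = linkage(X, method="single", metric="euclidean")

# Cut the hierarchy so that 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```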

Measures of Similarity and Dissimilarity

All the measures are as given in the pdf.

Cross Entropy Loss

The cross-entropy loss, also known as log loss, plays a crucial role in classification tasks, especially in logistic regression and neural networks. It measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label, making it an effective loss function for assessing the similarity between the predicted probability distribution and the actual distribution. The formulae to be used are as follows.
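The excerpt stops before the formulas themselves, so as a hedged restatement of the standard definitions (notation chosen here for illustration: y is the true label, p the predicted probability, K the number of classes):

```latex
% Binary cross-entropy (log loss) for one example with true label y in {0,1}
% and predicted probability p = P(Y=1|X):
\mathcal{L}(y, p) = -\bigl[\, y \log p + (1 - y)\log(1 - p) \,\bigr]

% Multi-class cross-entropy with one-hot targets y_k and predicted
% class probabilities p_k, for k = 1, \dots, K:
\mathcal{L}(\mathbf{y}, \mathbf{p}) = -\sum_{k=1}^{K} y_k \log p_k
```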

Logistic Regression

Logistic regression is a statistical method for binary classification. The classes can be considered as Y = {0, 1}. For a dataset X, we are trying to find P(Y=1|X).

Logistic Regression Workflow

Linear Combination (logit)
The logit is calculated first, as shown in the blog on Softmax. It is a linear combination of the input features given by logit = t = w0 + w1 x1 + w2 x2 + ... + wn xn. This can be any value in (-INF, +INF). This value also reflects the log odds of P(Y=1|X).

Odds
The odds of an event occurring is the ratio of the probability of the event occurring to the probability of the event not occurring: Odds = P(Y=1|X) / (1 - P(Y=1|X)).

Log Odds
This is the natural logarithm of the odds. In logistic regression the log odds of P(Y=1|X) is the logit (t). The proof can be found in the pdf below.

Sigmoid Transformation
The logit (t) is passed through a sigmoid function to get a value between 0 and 1, which is interpreted as the probability of class Y=1, i.e. P(Y=1|X) = 1 / (1 + e^(-t)).
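A minimal numeric sketch of this workflow (my own illustration; the weights and feature values are made up for the example):

```python
import numpy as np

def sigmoid(t):
    # Squashes the logit from (-inf, +inf) into (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

# Example weights (w0 is the intercept) and a single feature vector.
w0, w = -1.0, np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

# Linear combination (logit) = log odds of P(Y=1|X).
t = w0 + w @ x                      # -1 + 1.6 - 0.5 = 0.1

p = sigmoid(t)                      # P(Y=1|X)
odds = p / (1.0 - p)                # odds of class 1
print(t, p, odds, np.log(odds))     # log(odds) recovers the logit t
```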

Softmax Function - First Derivative

Logits
They refer to the raw output produced by a machine learning model, before we normalize it to the expected form. For example, consider an ANN for classifying into 3 classes. The output layer may have 3 output neurons with a linear activation function. Their outputs are the logits. Note this will be a vector of 3 real values.

Softmax Function
This function is used when we want to interpret the output of a model as probabilities for the various classes. It is specifically useful in a multi-class classification problem. It is a squashing function that squashes the values into the range [0, 1] and makes them sum to 1 across all classes. It evaluates the probability of choosing a class by using the logits across all the classes, and the final predicted class is the one with the highest probability. It is commonly used as the last layer of an ANN in a multi-class classification problem, where the node with the highest probability represents the chosen class.

Probability Distribution
A probability distribution satisfies two properties: every value lies in [0, 1] and the values sum to 1.
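A small sketch (my own illustration) of a numerically stable softmax and its first derivative, the Jacobian ds_i/dz_j = s_i (delta_ij - s_j):

```python
import numpy as np

def softmax(z):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # First derivative of softmax: d s_i / d z_j = s_i * (delta_ij - s_j).
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

logits = np.array([2.0, 1.0, 0.1])   # raw model outputs for 3 classes
probs = softmax(logits)
print(probs, probs.sum())            # values in [0, 1] that sum to 1
print("predicted class:", np.argmax(probs))
print(softmax_jacobian(logits))      # 3x3 matrix of partial derivatives
```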