Posts

Kernel Function and Kernel Trick

Parametric Models
They store only the parameters, not the entire training dataset.

Memory-Based Methods
They store the entire training dataset or a subset of it. They are fast to train, but prediction takes longer, e.g. k-NN.

Need for Kernels
Not all problems are linearly separable. However, when a feature vector is transformed to a higher-dimensional space, a problem that is not linearly separable in the original space can become linearly separable. For example, consider a point (x, y). In the 2D space the data might not be linearly separable, that is, a straight line cannot separate the points. We can add a new dimension that is a combination of the existing dimensions, say z = x^2 + y^2. The transformed feature space is now (x, y, z) and has 3 dimensions. In this space, z may spread the data points in such a way that a linear separator exists between them. That is, we move from the original feature space to a transformed feature space. We cannot always say in advance how many dimensions the transformed space will need. A sketch of this transformation is shown below.
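To make the idea concrete, here is a minimal sketch (my own illustration, not from the original post) that maps 2D points lying on two concentric circles into 3D with z = x^2 + y^2, after which a single threshold on z separates the classes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes on concentric circles: not separable by a straight line in 2D.
theta = rng.uniform(0, 2 * np.pi, 200)
inner = np.c_[1.0 * np.cos(theta[:100]), 1.0 * np.sin(theta[:100])]   # class 0
outer = np.c_[3.0 * np.cos(theta[100:]), 3.0 * np.sin(theta[100:])]   # class 1
X = np.vstack([inner, outer])
y = np.r_[np.zeros(100), np.ones(100)]

# Add the new dimension z = x^2 + y^2 (squared distance from the origin).
z = X[:, 0] ** 2 + X[:, 1] ** 2
X3 = np.c_[X, z]

# In the transformed (x, y, z) space a plane such as z = 4 separates the classes.
pred = (X3[:, 2] > 4.0).astype(float)
print("accuracy of the linear separator on z:", (pred == y).mean())  # 1.0
```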

Constructing Kernels

Properties of Kernel Functions
To perform this transformation we need to choose proper kernel functions. Functions that satisfy the following properties make good kernel functions.
1. Symmetry: K(x, y) = K(y, x). That is, the similarity measure between x and y is the same as the similarity measure between y and x. The kernel matrix will be a symmetric matrix.
2. Positive Semidefiniteness: the kernel matrix formed must be positive semidefinite.

Eigenvalues and Eigenvectors
A symmetric matrix is positive semidefinite if and only if all of its eigenvalues are non-negative, which gives a practical way to check the second property.

Mercer's Theorem
It states that a continuous, symmetric, positive semidefinite function is a valid kernel. It also states that such a kernel corresponds to a dot product (an inner product in Euclidean space) in a higher-dimensional space.

Constructing Common Kernel Functions
The following pdf shows how to construct kernels and check for valid kernels.
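As a rough numerical illustration of these two properties (my own sketch, not from the linked pdf), the following builds a Gram matrix from an RBF kernel and checks symmetry and positive semidefiniteness via the eigenvalues:

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2), a standard valid kernel.
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # 10 points in 3 dimensions

# Gram (kernel) matrix: K[i, j] = K(x_i, x_j).
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])

# 1. Symmetry: K(x, y) = K(y, x).
print("symmetric:", np.allclose(K, K.T))

# 2. Positive semidefiniteness: all eigenvalues of the kernel matrix >= 0
#    (tiny negative values can appear from floating-point error).
eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigvals.min())
print("PSD:", np.all(eigvals >= -1e-10))
```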

Clustering - Agglomerative clustering

Agglomerative clustering is a bottom-up, hierarchical approach to clustering in which each data point starts in its own cluster and pairs of clusters are merged, based on certain criteria, until all points belong to a single cluster or a stopping criterion is met. The similarity between clusters is typically measured using a distance metric, such as Euclidean distance, and different linkage criteria define how this similarity is computed and hence the merging strategy. Common linkage criteria include:
1. Single Linkage (or Minimum Linkage): in single linkage clustering, the distance between two clusters is defined as the minimum distance between any point in one cluster and any point in the other. A minimal sketch using single linkage is shown after this list.
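Here is a small sketch of the bottom-up process on a toy dataset (my own example, assuming SciPy is available), using single linkage and then cutting the hierarchy into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of 2D points.
X = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # group A
    [5.0, 5.0], [5.1, 5.2], [5.2, 4.9],   # group B
])

# Bottom-up merging: each point starts as its own cluster and the two
# closest clusters are merged at every step. With single linkage the
# cluster-to-cluster distance is the minimum pairwise point distance.
Z = linkage(X, method="single", metric="euclidean")

# Cut the hierarchy so that 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```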

Measures of Similarity and Dissimilarity

All the measures are as given in the pdf.

Cross Entropy Loss

The cross-entropy loss, also known as log loss, plays a crucial role in classification tasks, especially in logistic regression and neural networks. It measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label, making it an effective loss function for assessing the similarity between the predicted probability distribution and the actual distribution. The formulae to be used are as follows.
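The excerpt stops before the formulas themselves, so as a hedged restatement of the standard definitions (notation chosen here for illustration: y is the true label, p the predicted probability, K the number of classes):

```latex
% Binary cross-entropy (log loss) for one example with true label y in {0,1}
% and predicted probability p = P(Y=1|X):
\mathcal{L}(y, p) = -\bigl[\, y \log p + (1 - y)\log(1 - p) \,\bigr]

% Multi-class cross-entropy with one-hot targets y_k and predicted
% class probabilities p_k, for k = 1, \dots, K:
\mathcal{L}(\mathbf{y}, \mathbf{p}) = -\sum_{k=1}^{K} y_k \log p_k
```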

Logistic Regression

Logistic regression is a statistical method for binary classification. The classes can be considered as Y = {0, 1}. For a dataset X, we are trying to find P(Y=1|X).

Logistic Regression Workflow

Linear Combination (logit)
The logit is calculated first, as shown in the blog on Softmax. It is a linear combination of the input features given by logit = t = w0 + w1 x1 + w2 x2 + ... + wn xn. This can be any value in (-INF, +INF). This value also reflects the log odds of P(Y=1|X).

Odds
The odds of an event occurring is the ratio of the probability of the event occurring to the probability of the event not occurring: Odds = P(Y=1|X) / (1 - P(Y=1|X)).

Log Odds
This is the natural logarithm of the odds. In logistic regression the log odds of P(Y=1|X) is the logit (t). The proof can be found in the pdf below.

Sigmoid Transformation
The logit (t) is passed through a sigmoid function to get a value between 0 and 1, which is interpreted as the probability of class Y=1, i.e. P(Y=1|X) = 1 / (1 + e^(-t)).
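A minimal numeric sketch of this workflow (my own illustration; the weights and feature values are made up for the example):

```python
import numpy as np

def sigmoid(t):
    # Squashes the logit from (-inf, +inf) into (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

# Example weights (w0 is the intercept) and a single feature vector.
w0, w = -1.0, np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

# Linear combination (logit) = log odds of P(Y=1|X).
t = w0 + w @ x                      # -1 + 1.6 - 0.5 = 0.1

p = sigmoid(t)                      # P(Y=1|X)
odds = p / (1.0 - p)                # odds of class 1
print(t, p, odds, np.log(odds))     # log(odds) recovers the logit t
```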

Softmax Function - First Derivative

Logits
They refer to the raw output produced by a machine learning model, before we normalize it to the expected form. For example, consider an ANN for classifying into 3 classes. The output layer may have 3 output neurons with a linear activation function. Their outputs are the logits. Note this will be a vector of 3 real values.

Softmax Function
This function is used when we want to interpret the output of a model as probabilities for the various classes. It is specifically useful in a multi-class classification problem. It is a squashing function that squashes the values into the range [0, 1] and makes them sum to 1 across all classes. It evaluates the probability of choosing a class by using the logits across all the classes, and the final predicted class is the one with the highest probability. It is commonly used as the last layer of an ANN in a multi-class classification problem, where the node with the highest probability represents the chosen class.

Probability Distribution
A probability distribution satisfies two properties: every value lies in [0, 1] and the values sum to 1.
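A small sketch (my own illustration) of a numerically stable softmax and its first derivative, the Jacobian ds_i/dz_j = s_i (delta_ij - s_j):

```python
import numpy as np

def softmax(z):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    # First derivative of softmax: d s_i / d z_j = s_i * (delta_ij - s_j).
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

logits = np.array([2.0, 1.0, 0.1])   # raw model outputs for 3 classes
probs = softmax(logits)
print(probs, probs.sum())            # values in [0, 1] that sum to 1
print("predicted class:", np.argmax(probs))
print(softmax_jacobian(logits))      # 3x3 matrix of partial derivatives
```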