Clustering - Agglomerative clustering

Agglomerative clustering is a bottom-up approach to clustering in which each data point starts in its own cluster and the two most similar clusters are merged at each step, either until all points belong to a single cluster or until a stopping criterion is met.


The similarity between clusters is measured using a distance metric, typically Euclidean distance, together with a linkage criterion that defines how the distance between two clusters is computed from the distances between their points. The linkage criterion determines the merging strategy. Common linkage criteria include:

1. Single Linkage (or Minimum Linkage):

   - In single linkage clustering, the distance between two clusters is defined as the minimum distance between any single data point in the first cluster and any single data point in the second cluster.

2. Complete Linkage (or Maximum Linkage):

   - In complete linkage clustering, the distance between two clusters is defined as the maximum distance between any single data point in the first cluster and any single data point in the second cluster.

3. Average Linkage:

   - Average linkage clustering computes the average distance between all pairs of points in the two clusters.

4. Ward's Linkage:

   - Ward's linkage minimizes the increase in total within-cluster variance. At each step it merges the pair of clusters whose union increases the within-cluster variance the least.

5. Centroid Linkage:

   - In centroid linkage clustering, the distance between two clusters is defined as the distance between the centroids (means) of the two clusters.


Each of these linkage criteria has its own strengths and weaknesses, and the choice often depends on the characteristics of the data and the specific objectives of the analysis.
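In practice these criteria are rarely coded by hand. As a rough sketch (assuming SciPy is installed), `scipy.cluster.hierarchy` exposes all five of the criteria above as `method` options; the eight sample points here are just for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small illustrative 2-D dataset
X = np.array([[2, 3], [3, 2], [5, 8], [6, 7],
              [8, 5], [7, 6], [1, 1], [4, 4]], dtype=float)

for method in ["single", "complete", "average", "ward", "centroid"]:
    Z = linkage(X, method=method)                    # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into at most 2 clusters
    print(method, labels)
```

Changing `method` is the only difference between the five runs; comparing the printed labels shows how the choice of linkage can change the resulting clusters.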






The general steps for agglomerative hierarchical clustering are outlined below, followed by a look at the dendrogram:

Steps for Agglomerative Hierarchical Clustering:

1. Initialization: Start by assigning each data point to its own cluster.

2. Compute Pairwise Distances: Calculate the distance between each pair of clusters. The choice of distance metric depends on the nature of the data and the problem at hand (e.g., Euclidean distance for continuous features).

3. Merge Closest Clusters: Identify the closest pair of clusters based on the chosen distance metric and linkage method. Merge these two clusters into a single cluster.

4. Update Distance Matrix: Recalculate the distances between the new cluster and all other clusters.

5. Repeat: Repeat steps 3 and 4 until only a single cluster containing all data points remains.

6. Dendrogram Construction: Construct a dendrogram to visualize the clustering hierarchy.
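The loop described in the steps above can be sketched in a few lines. The following is a minimal, unoptimized illustration using single linkage and Euclidean distance (pure NumPy; the function name and the four sample points are illustrative). A real implementation would maintain and update a distance matrix (step 4) rather than recomputing distances on every pass:

```python
import numpy as np

def agglomerate_single(points):
    """Merge clusters bottom-up; return the merge history as (a, b, distance)."""
    # Step 1: every point starts in its own cluster
    clusters = {i: [i] for i in range(len(points))}
    history = []
    while len(clusters) > 1:
        # Steps 2-3: find the closest pair of clusters; single linkage uses
        # the minimum point-to-point distance between the two clusters
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    d = min(np.linalg.norm(points[i] - points[j])
                            for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
        d, a, b = best
        # Step 4: merge b into a; distances are recomputed on the next pass
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        history.append((a, b, round(float(d), 2)))
    # Step 5: the loop ends when a single cluster remains
    return history

pts = np.array([[2.0, 3.0], [3.0, 2.0], [5.0, 8.0], [6.0, 7.0]])
print(agglomerate_single(pts))
# → [(0, 1, 1.41), (2, 3, 1.41), (0, 2, 5.66)]
```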

Dendrogram:

A dendrogram is a tree-like diagram that illustrates the arrangement of the clusters produced by hierarchical clustering. It displays the sequence of merges that occurred during the clustering process, providing insights into the relationships between data points and clusters.


In a dendrogram:

- The vertical axis represents the distance or dissimilarity between clusters.

- The horizontal axis represents the data points or clusters being merged.

- The height of each fusion (vertical lines) indicates the distance at which the clusters were merged.

- Data points or clusters that are closer together on the horizontal axis are more similar to each other.


Dendrograms are useful for:

- Understanding the hierarchical structure of the data.

- Determining the optimal number of clusters by identifying significant jumps in distance.

- Visualizing the relationships between clusters and individual data points.
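As a sketch (assuming SciPy is installed), the dendrogram for the eight points from the numeric example below can be computed as follows; with `no_plot=True`, `dendrogram` returns the layout data instead of drawing, so matplotlib is not required:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[2, 3], [3, 2], [5, 8], [6, 7],
              [8, 5], [7, 6], [1, 1], [4, 4]], dtype=float)

Z = linkage(X, method="single")  # merge history; heights are merge distances
info = dendrogram(Z, labels=[f"P{i}" for i in range(1, 9)], no_plot=True)
print(info["ivl"])  # leaf labels in left-to-right order on the horizontal axis
```

To actually draw the diagram, call `dendrogram(Z, labels=...)` without `no_plot` inside a matplotlib figure; it plots the tree with merge distance on the vertical axis.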



Agglomerative clustering is widely used in various fields such as machine learning, biology, and social sciences for tasks like pattern recognition, image segmentation, and taxonomy construction. 

Numeric Example

You are given the following dataset representing the coordinates of points in a 2-dimensional space:

[2, 3], [3, 2], [5, 8], [6, 7], [8, 5], [7, 6], [1, 1], [4, 4]

How can we cluster these points using agglomerative clustering with the single, complete, average, and Ward linkage methods, grouping nearby points while keeping intra-cluster variance low and inter-cluster separation high?

Step 1

Find the distance matrix, using Euclidean distance as the metric.

     P1    P2    P3    P4    P5    P6    P7    P8
P1   0.00
P2   1.41  0.00
P3   5.83  6.32  0.00
P4   5.66  5.83  1.41  0.00
P5   6.32  5.83  4.24  2.83  0.00
P6   5.83  5.66  2.83  1.41  1.41  0.00
P7   2.24  2.24  8.06  7.81  8.06  7.81  0.00
P8   2.24  2.24  4.12  3.61  4.12  3.61  4.24  0.00
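Step 1's matrix can be reproduced mechanically, as a sketch (assuming SciPy is installed):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[2, 3], [3, 2], [5, 8], [6, 7],
              [8, 5], [7, 6], [1, 1], [4, 4]], dtype=float)

# pdist returns the condensed upper triangle; squareform expands it to 8 x 8
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))  # matches Step 1's matrix, e.g. D[0, 1] = 1.41 for P1-P2
```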

Step 2: Choose the smallest off-diagonal value; in the case of a tie, any one can be taken. The minimum here is 1.41. Let us consider P1, P2.

The first cluster, C1, is formed from P1 and P2.

The following pdf shows the next step in all five methods.



Step 3:
For each remaining point, find its distance to both members of C1 = (P1, P2). For example, for P3:
(P1, P3) and (P2, P3): 5.83 and 6.32
Update the matrix with the smaller value for single linkage:

     C1    P3    P4    P5    P6    P7    P8
C1   0.00
P3   5.83  0.00
P4   5.66  1.41  0.00
P5   5.83  4.24  2.83  0.00
P6   5.66  2.83  1.41  1.41  0.00
P7   2.24  8.06  7.81  8.06  7.81  0.00
P8   2.24  4.12  3.61  4.12  3.61  4.24  0.00

Update the matrix with the larger value (6.32 for P3) for complete linkage:

     C1    P3    P4    P5    P6    P7    P8
C1   0.00
P3   6.32  0.00
P4   5.83  1.41  0.00
P5   6.32  4.24  2.83  0.00
P6   5.83  2.83  1.41  1.41  0.00
P7   2.24  8.06  7.81  8.06  7.81  0.00
P8   2.24  4.12  3.61  4.12  3.61  4.24  0.00

Note: C1 is the cluster containing points P1 and P2.

Now repeat from Step 2 with the updated matrix.

Step 4:
Find the smallest value in the updated matrix. The minimum is 1.41; let us merge P3 and P4 into a new cluster, C2.

Distance between C2 = (P3, P4) and C1 = (P1, P2):

P1-P3, P1-P4, P2-P3, P2-P4 : 5.83, 5.66, 6.32, 5.83

Smallest (for single linkage): 5.66. Largest (for complete linkage): 6.32.

Single Linkage

     C1    C2    P5    P6    P7    P8
C1   0.00
C2   5.66  0.00
P5   5.83  2.83  0.00
P6   5.66  1.41  1.41  0.00
P7   2.24  7.81  8.06  7.81  0.00
P8   2.24  3.61  4.12  3.61  4.24  0.00

Complete Linkage

     C1    C2    P5    P6    P7    P8
C1   0.00
C2   6.32  0.00
P5   6.32  4.24  0.00
P6   5.83  2.83  1.41  0.00
P7   2.24  8.06  8.06  7.81  0.00
P8   2.24  4.12  4.12  3.61  4.24  0.00

(Note that the C1 column keeps the larger of the two pairwise distances, so C1-C2 = 6.32, C1-P5 = 6.32 and C1-P6 = 5.83.)

Step 5:

Single linkage: the smallest entry is 1.41 (a tie between C2-P6 and P5-P6). Choose C2, P6.

Complete linkage: the smallest entry is P5-P6 = 1.41. Choose P5, P6.
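The hand computation can be cross-checked with SciPy's `linkage` (a sketch, assuming SciPy is installed; tie-breaking may order equal-distance merges differently from the choices made above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[2, 3], [3, 2], [5, 8], [6, 7],
              [8, 5], [7, 6], [1, 1], [4, 4]], dtype=float)

# Each row of Z: (cluster_i, cluster_j, merge_distance, new_cluster_size);
# indices >= 8 refer to clusters created by earlier merges.
for method in ["single", "complete"]:
    Z = linkage(X, method=method)
    print(method)
    print(np.round(Z, 2))
```

The first merges in both outputs occur at distance 1.41, matching the ties found in Steps 2 and 4.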

A detailed solution to the problem can be found in the following pdf.
