Clustering - K-means Clustering
K-means clustering is a popular unsupervised learning
algorithm used to partition a dataset into a set of distinct, non-overlapping
groups (or clusters) based on similarity. The goal is to organize the data into
clusters such that data points within a cluster are more similar to each other
than to those in other clusters. The "K" in K-means represents the
number of clusters to be identified from the data, and it must be specified a
priori. This method is widely used in data mining, pattern recognition, image
analysis, and machine learning for its simplicity and efficiency, especially in
handling large datasets.
How K-means Clustering Works
1. Initialization:
Choose K initial centroids randomly or based on a heuristic. Centroids are the center points of the clusters.
2. Assignment Step:
Assign each data point to the nearest centroid. The "nearest" is usually determined by the Euclidean distance between the data point and the centroid. After this step, each centroid has a group of data points associated with it, forming a cluster.
3. Update Step:
Recalculate each centroid as the mean of all data points assigned to its cluster. This moves the centroid to the center of its cluster.
4. Iteration:
Repeat the assignment and update steps until the centroids no longer change significantly, indicating that the algorithm has converged, or until a specified number of iterations is reached.
K-means Algorithm
The steps above produce clusters because k-means is an instance of alternating optimization: each pass alternately minimizes the clustering objective with respect to the assignments and then the centroids, as formalized below.
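In the standard formulation (stated here for reference), k-means minimizes the within-cluster sum of squared distances:

```latex
J(C, \mu) = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

Holding the centroids fixed, the assignment step minimizes J; holding the assignments fixed, the update step minimizes J:

```latex
\text{Assignment:}\quad c_i \leftarrow \arg\min_{j} \lVert x_i - \mu_j \rVert^2
\qquad
\text{Update:}\quad \mu_j \leftarrow \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Since neither step can increase J and there are only finitely many partitions of the data, the procedure is guaranteed to terminate.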
Solving a Numeric Problem with k-means
Question:
You are given the following dataset representing the coordinates of points in a 2-dimensional space:
[2, 3], [3, 2], [5, 8], [6, 7], [8, 5], [7, 6], [1, 1], [4, 4]
Using the k-means algorithm, perform clustering on this dataset to identify distinct groups of points. Use the initial centroids as the first two points in the dataset. Follow the steps of the k-means algorithm to iteratively assign points to clusters and update centroids until convergence.
- Initialize centroids by selecting the first two data points from the dataset.
- Calculate the squared Euclidean distance between each data point and the centroids.
- Assign each data point to the cluster with the minimum distance from its centroid.
- Update the centroids based on the mean of the data points in each cluster.
- Repeat steps 2 to 4 until convergence, i.e., until there is no movement in centroids between consecutive iterations.
Report the following:
- The initial centroids.
- The intermediate clusters at each iteration.
- The final clusters after convergence.
- The final centroids after convergence.
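All distances in the worked solutions that follow are squared Euclidean distances. As a reference for the arithmetic, here is a small helper (plain Python; the function name is chosen here for illustration):

```python
def squared_euclidean(p, q):
    # Sum of squared coordinate differences; the square root is omitted
    # because it does not change which centroid is nearest.
    return sum((a - b) ** 2 for a, b in zip(p, q))

print(squared_euclidean([2, 3], [7, 6]))  # 34, matching Sample 1 vs Centroid 1 in Sum 1
```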
Sum 1: Random Initial Centroids
Step 1: Pick two random points as the initial centroids.
Centroid 1: [7 6] Centroid 2: [3 2]
Iteration 1 - Squared Euclidean Distance Matrix:
Sample            d² to Centroid 1 [7 6]   d² to Centroid 2 [3 2]
Sample 1: [2 3]            34                        2
Sample 2: [3 2]            32                        0
Sample 3: [5 8]             8                       40
Sample 4: [6 7]             2                       34
Sample 5: [8 5]             2                       34
Sample 6: [7 6]             0                       32
Sample 7: [1 1]            61                        5
Sample 8: [4 4]            13                        5
Iteration 1 - Intermediate Clusters:
Cluster 1: [[5 8], [6 7], [8 5], [7 6]]
Centroid 1 = [(5+6+8+7)/4, (8+7+5+6)/4] = [6.5 6.5]
Cluster 2: [[2 3], [3 2], [1 1], [4 4]]
Centroid 2 = [(2+3+1+4)/4, (3+2+1+4)/4] = [2.5 2.5]
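Each centroid update is just a coordinate-wise mean; with NumPy (assumed available), the same numbers fall out directly:

```python
import numpy as np

cluster1 = np.array([[5, 8], [6, 7], [8, 5], [7, 6]])
cluster2 = np.array([[2, 3], [3, 2], [1, 1], [4, 4]])
print(cluster1.mean(axis=0))  # [6.5 6.5]
print(cluster2.mean(axis=0))  # [2.5 2.5]
```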
Iteration 2 - Squared Euclidean Distance Matrix:
Sample            d² to Centroid 1 [6.5 6.5]   d² to Centroid 2 [2.5 2.5]
Sample 1: [2 3]            32.5                         0.5
Sample 2: [3 2]            32.5                         0.5
Sample 3: [5 8]             4.5                        36.5
Sample 4: [6 7]             0.5                        32.5
Sample 5: [8 5]             4.5                        36.5
Sample 6: [7 6]             0.5                        32.5
Sample 7: [1 1]            60.5                         4.5
Sample 8: [4 4]            12.5                         4.5
Iteration 2 - Intermediate Clusters:
Cluster 1: [[5 8], [6 7], [8 5], [7 6]]
Cluster 2: [[2 3], [3 2], [1 1], [4 4]]
The clusters are unchanged, so the recomputed centroids remain [6.5 6.5] and [2.5 2.5]. Convergence reached.
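The stopping rule ("no movement in centroids between consecutive iterations") can be checked numerically, for example with np.allclose:

```python
import numpy as np

previous = np.array([[6.5, 6.5], [2.5, 2.5]])
current = np.array([[6.5, 6.5], [2.5, 2.5]])  # recomputed from the unchanged clusters
print(np.allclose(previous, current))  # True -> stop iterating
```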
Sum 2: First Two Points as Initial Centroids
Step 1: Pick the first two data points as the initial centroids, as specified in the question.
Centroid 1: [2 3] Centroid 2: [3 2]
Iteration 1 - Squared Euclidean Distance Matrix:
Sample            d² to Centroid 1 [2 3]   d² to Centroid 2 [3 2]
Sample 1: [2 3]             0                        2
Sample 2: [3 2]             2                        0
Sample 3: [5 8]            34                       40
Sample 4: [6 7]            32                       34
Sample 5: [8 5]            40                       34
Sample 6: [7 6]            34                       32
Sample 7: [1 1]             5                        5
Sample 8: [4 4]             5                        5
Iteration 1 - Intermediate Clusters:
Cluster 1: [[2 3], [5 8], [6 7], [1 1], [4 4]]
Cluster 2: [[3 2], [8 5], [7 6]]
(Samples 7 and 8 are equidistant from both centroids; ties are broken in favor of Centroid 1.)
Centroid 1 = [(2+5+6+1+4)/5, (3+8+7+1+4)/5] = [3.6 4.6]
Centroid 2 = [(3+8+7)/3, (2+5+6)/3] = [6.0 4.33]
Iteration 2 - Squared Euclidean Distance Matrix:
Sample            d² to Centroid 1 [3.6 4.6]   d² to Centroid 2 [6.0 4.33]
Sample 1: [2 3]            5.12                        17.78
Sample 2: [3 2]            7.12                        14.44
Sample 3: [5 8]           13.52                        14.44
Sample 4: [6 7]           11.52                         7.11
Sample 5: [8 5]           19.52                         4.44
Sample 6: [7 6]           13.52                         3.78
Sample 7: [1 1]           19.72                        36.11
Sample 8: [4 4]            0.52                         4.11
Iteration 2 - Intermediate Clusters:
Cluster 1: [[2 3], [3 2], [5 8], [1 1], [4 4]]
Cluster 2: [[6 7], [8 5], [7 6]]
Iteration 3 - Squared Euclidean Distance Matrix:
Sample            d² to Centroid 1 [3.0 3.6]   d² to Centroid 2 [7 6]
Sample 1: [2 3]            1.36                        34
Sample 2: [3 2]            2.56                        32
Sample 3: [5 8]           23.36                         8
Sample 4: [6 7]           20.56                         2
Sample 5: [8 5]           26.96                         2
Sample 6: [7 6]           21.76                         0
Sample 7: [1 1]           10.76                        61
Sample 8: [4 4]            1.16                        13
Iteration 3 - Intermediate Clusters:
Cluster 1: [[2 3], [3 2], [1 1], [4 4]]
Cluster 2: [[5 8], [6 7], [8 5], [7 6]]
Iteration 4 - Squared Euclidean Distance Matrix:
Sample            d² to Centroid 1 [2.5 2.5]   d² to Centroid 2 [6.5 6.5]
Sample 1: [2 3]             0.5                        32.5
Sample 2: [3 2]             0.5                        32.5
Sample 3: [5 8]            36.5                         4.5
Sample 4: [6 7]            32.5                         0.5
Sample 5: [8 5]            36.5                         4.5
Sample 6: [7 6]            32.5                         0.5
Sample 7: [1 1]             4.5                        60.5
Sample 8: [4 4]             4.5                        12.5
Iteration 4 - Intermediate Clusters:
Cluster 1: [[2 3], [3 2], [1 1], [4 4]]
Cluster 2: [[5 8], [6 7], [8 5], [7 6]]
Convergence reached. No movement in centroids between consecutive iterations.
Python Code
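Below is a minimal NumPy sketch of the algorithm as described above (argmin tie-breaking, no empty-cluster handling). Seeded with the first two data points as initial centroids, it reproduces the Sum 2 run:

```python
import numpy as np

def kmeans(X, k, init_centroids, max_iter=100):
    """Plain k-means with squared Euclidean distances."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(max_iter):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when there is no movement in centroids between iterations.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[2, 3], [3, 2], [5, 8], [6, 7], [8, 5], [7, 6], [1, 1], [4, 4]])
labels, centroids = kmeans(X, k=2, init_centroids=X[:2])
print(labels)     # 0 = Cluster 1, 1 = Cluster 2 for each sample
print(centroids)  # final centroids: [2.5 2.5] and [6.5 6.5]
```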
Key Features and Limitations of K-means Clustering
- K-means is easy to understand and implement, making it one of the most commonly used clustering algorithms.
- It is computationally efficient, especially on large datasets, although the cost grows with the number of clusters and iterations.
- It suits a wide range of applications, including customer segmentation, data compression, and feature learning.
- Determining the optimal number of clusters (K) is a critical and often challenging task. Methods like the Elbow Method, Silhouette Analysis, and Gap Statistics are commonly used to estimate K (see the sketch after this list).
- The choice of initial centroids can affect the final clustering outcome. Multiple runs with different initializations can help, and algorithms like K-means++ have been developed to improve centroid initialization.
- K-means can converge to a local minimum that is not the global optimum; this is another reason to run the algorithm multiple times.
- The algorithm assumes clusters are spherical and similarly sized, so it may perform poorly on clusters with different shapes and densities.
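For example, a typical elbow-method loop with scikit-learn (assuming scikit-learn is installed; inertia_ is the within-cluster sum of squares):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 3], [3, 2], [5, 8], [6, 7], [8, 5], [7, 6], [1, 1], [4, 4]])

# Fit k-means for a range of K and record the within-cluster sum of squares;
# the "elbow" where the curve flattens suggests a reasonable K.
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
```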