Clustering - K-Means Clustering

K-means clustering is a popular unsupervised learning algorithm used to partition a dataset into a set of distinct, non-overlapping groups (or clusters) based on similarity. The goal is to organize the data into clusters such that data points within a cluster are more similar to each other than to those in other clusters. The "K" in K-means represents the number of clusters to be identified from the data, and it must be specified a priori. This method is widely used in data mining, pattern recognition, image analysis, and machine learning for its simplicity and efficiency, especially in handling large datasets.

How K-means Clustering Works

K-means clustering follows a straightforward iterative procedure to partition the dataset:

1. Initialization:

Choose K initial centroids randomly or based on a heuristic. Centroids are the center points of the clusters.

2. Assignment Step:

Assign each data point to the nearest centroid, where "nearest" is usually measured by the Euclidean distance between the data point and the centroid. Since the square root is monotonic, minimizing the squared Euclidean distance gives the same assignments, which is why the worked example below uses squared distances. After this step, each centroid has a group of data points associated with it, forming a cluster.

3. Update Step:

Recalculate each centroid as the mean of all data points assigned to its cluster. This moves the centroid to the center of its cluster.

4. Iteration:

Repeat the assignment and update steps until the centroids no longer change significantly, indicating that the algorithm has converged, or until a specified number of iterations is reached.

K-Means Algorithm

The steps above yield clusters because k-means applies an alternating optimization technique: each step minimizes the clustering objective with one set of variables held fixed, as shown below.
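Formally (this is the standard formulation, stated here for reference), with clusters C_1, ..., C_K and centroids \mu_1, ..., \mu_K, the objective is the within-cluster sum of squared distances:

J(C, \mu) = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

Assignment step (centroids fixed):  c_i \leftarrow \arg\min_{j} \lVert x_i - \mu_j \rVert^2

Update step (assignments fixed):  \mu_j \leftarrow \frac{1}{\lvert C_j \rvert} \sum_{i : c_i = j} x_i

Each step can only decrease J or leave it unchanged, and there are only finitely many possible assignments, so the procedure always terminates.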


Solving a Numeric Problem with K-means

Question:

You are given the following dataset representing the coordinates of points in a 2-dimensional space:

[2, 3], [3, 2], [5, 8], [6, 7], [8, 5], [7, 6], [1, 1], [4, 4]

Using the k-means algorithm, perform clustering on this dataset to identify distinct groups of points. Use the first two points in the dataset as the initial centroids. Follow the steps of the k-means algorithm to iteratively assign points to clusters and update centroids until convergence.

  1. Initialize centroids by selecting the first two data points from the dataset.
  2. Calculate the squared Euclidean distance between each data point and the centroids.
  3. Assign each data point to the cluster with the minimum distance from its centroid.
  4. Update the centroids based on the mean of the data points in each cluster.
  5. Repeat steps 2 to 4 until convergence, i.e., until there is no movement in centroids between consecutive iterations.
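For example, in step 2 the squared Euclidean distance between the point [2, 3] and a centroid [7, 6] is (2 - 7)² + (3 - 6)² = 25 + 9 = 34; this is exactly the first entry of the Iteration 1 matrix in Sum 1 below.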

Report the following:

  • The initial centroids.
  • The intermediate clusters at each iteration.
  • The final clusters after convergence.
  • The final centroids after convergence.

Sum 1

Step 1: Pick two points as the initial centroids. (This run uses a different initialization than the one specified in the question; as the result shows, k-means reaches the same final clustering from this starting point as well.)

Centroid 1: [7 6]  Centroid 2: [3 2]


Iteration 1 - Squared Euclidean Distance Matrix:

                 Centroid 1: [7 6]  Centroid 2: [3 2]
Sample 1: [2 3]                 34                  2
Sample 2: [3 2]                 32                  0
Sample 3: [5 8]                  8                 40
Sample 4: [6 7]                  2                 34
Sample 5: [8 5]                  2                 34
Sample 6: [7 6]                  0                 32
Sample 7: [1 1]                 61                  5
Sample 8: [4 4]                 13                  5


Iteration 1 - Intermediate Clusters:

Cluster 1: [[5 8], [6 7], [8 5], [7 6]]
Centroid 1 = [(5+6+8+7)/4, (8+7+5+6)/4] = [6.5 6.5]

Cluster 2: [[2 3], [3 2], [1 1], [4 4]]
Centroid 2 = [(2+3+1+4)/4, (3+2+1+4)/4] = [2.5 2.5]


Iteration 2 - Squared Euclidean Distance Matrix:

                 Centroid 1: [6.5 6.5]  Centroid 2: [2.5 2.5]
Sample 1: [2 3]                   32.5                    0.5
Sample 2: [3 2]                   32.5                    0.5
Sample 3: [5 8]                    4.5                   36.5
Sample 4: [6 7]                    0.5                   32.5
Sample 5: [8 5]                    4.5                   36.5
Sample 6: [7 6]                    0.5                   32.5
Sample 7: [1 1]                   60.5                    4.5
Sample 8: [4 4]                   12.5                    4.5


Iteration 2 - Intermediate Clusters:

Cluster 1: [[5 8], [6 7], [8 5], [7 6]]
Cluster 2: [[2 3], [3 2], [1 1], [4 4]]

The cluster memberships are unchanged, so the recomputed centroids stay at [6.5 6.5] and [2.5 2.5].


Convergence reached: the clusters are identical to those of Iteration 1, so the centroids do not move between consecutive iterations.

Final clusters: Cluster 1 = [[5 8], [6 7], [8 5], [7 6]], Cluster 2 = [[2 3], [3 2], [1 1], [4 4]]

Final centroids: Centroid 1 = [6.5 6.5], Centroid 2 = [2.5 2.5]

Sum 2

Step 1: Pick the first two data points as the initial centroids, as specified in the question.

Centroid 1: [2 3]  Centroid 2: [3 2]

Iteration 1 - Squared Euclidean Distance Matrix:

                 Centroid 1: [2 3]  Centroid 2: [3 2]
Sample 1: [2 3]                  0                  2
Sample 2: [3 2]                  2                  0
Sample 3: [5 8]                 34                 40
Sample 4: [6 7]                 32                 34
Sample 5: [8 5]                 40                 34
Sample 6: [7 6]                 34                 32
Sample 7: [1 1]                  5                  5
Sample 8: [4 4]                  5                  5

(Samples 7 and 8 are equidistant from both centroids; the tie is broken in favor of the first centroid, which is what np.argmin does in the code below.)


Iteration 1 - Intermediate Clusters:

Cluster 1: [[2 3], [5 8], [6 7], [1 1], [4 4]]
Centroid 1 = [(2+5+6+1+4)/5, (3+8+7+1+4)/5] = [3.6 4.6]

Cluster 2: [[3 2], [8 5], [7 6]]
Centroid 2 = [(3+8+7)/3, (2+5+6)/3] = [6.0 4.33]


Iteration 2 - Squared Euclidean Distance Matrix (values rounded to two decimals):

                 Centroid 1: [3.6 4.6]  Centroid 2: [6.0 4.33]
Sample 1: [2 3]                   5.12                  17.78
Sample 2: [3 2]                   7.12                  14.44
Sample 3: [5 8]                  13.52                  14.44
Sample 4: [6 7]                  11.52                   7.11
Sample 5: [8 5]                  19.52                   4.44
Sample 6: [7 6]                  13.52                   3.78
Sample 7: [1 1]                  19.72                  36.11
Sample 8: [4 4]                   0.52                   4.11


Iteration 2 - Intermediate Clusters:

Cluster 1: [[2 3], [3 2], [5 8], [1 1], [4 4]]
Centroid 1 = [(2+3+5+1+4)/5, (3+2+8+1+4)/5] = [3.0 3.6]

Cluster 2: [[6 7], [8 5], [7 6]]
Centroid 2 = [(6+8+7)/3, (7+5+6)/3] = [7.0 6.0]


Iteration 3 - Squared Euclidean Distance Matrix:

                 Centroid 1: [3.0 3.6]  Centroid 2: [7.0 6.0]
Sample 1: [2 3]                   1.36                 34.0
Sample 2: [3 2]                   2.56                 32.0
Sample 3: [5 8]                  23.36                  8.0
Sample 4: [6 7]                  20.56                  2.0
Sample 5: [8 5]                  26.96                  2.0
Sample 6: [7 6]                  21.76                  0.0
Sample 7: [1 1]                  10.76                 61.0
Sample 8: [4 4]                   1.16                 13.0


Iteration 3 - Intermediate Clusters:

Cluster 1: [[2 3], [3 2], [1 1], [4 4]]
Centroid 1 = [(2+3+1+4)/4, (3+2+1+4)/4] = [2.5 2.5]

Cluster 2: [[5 8], [6 7], [8 5], [7 6]]
Centroid 2 = [(5+6+8+7)/4, (8+7+5+6)/4] = [6.5 6.5]


Iteration 4 - Squared Euclidean Distance Matrix:

                 Centroid 1: [2.5 2.5]  Centroid 2: [6.5 6.5]
Sample 1: [2 3]                    0.5                   32.5
Sample 2: [3 2]                    0.5                   32.5
Sample 3: [5 8]                   36.5                    4.5
Sample 4: [6 7]                   32.5                    0.5
Sample 5: [8 5]                   36.5                    4.5
Sample 6: [7 6]                   32.5                    0.5
Sample 7: [1 1]                    4.5                   60.5
Sample 8: [4 4]                    4.5                   12.5


Iteration 4 - Intermediate Clusters:

Cluster 1: [[2 3], [3 2], [1 1], [4 4]]
Cluster 2: [[5 8], [6 7], [8 5], [7 6]]

The cluster memberships are unchanged, so the recomputed centroids stay at [2.5 2.5] and [6.5 6.5].


Convergence reached: the clusters are identical to those of Iteration 3, so the centroids do not move between consecutive iterations.

Final clusters: Cluster 1 = [[2 3], [3 2], [1 1], [4 4]], Cluster 2 = [[5 8], [6 7], [8 5], [7 6]]

Final centroids: Centroid 1 = [2.5 2.5], Centroid 2 = [6.5 6.5]

Both initializations therefore converge to the same clustering: Sum 1 needed two iterations, while Sum 2 needed four.

Python Code

import numpy as np
import pandas as pd

def initialize_centroids(X, k):
    """
    Initialize centroids by randomly selecting k data points from X.
   
    Parameters:
        X (numpy.ndarray): Data points array.
        k (int): Number of clusters.
   
    Returns:
        numpy.ndarray: Initial centroids.
    """
    return X[np.random.choice(X.shape[0], k, replace=False)]

def calculate_squared_euclidean_distance(X, centroids):
    """
    Calculate squared Euclidean distance between data points and centroids.
   
    Parameters:
        X (numpy.ndarray): Data points array.
        centroids (numpy.ndarray): Centroids array.
   
    Returns:
        numpy.ndarray: Squared Euclidean distance matrix.
    """
    distances_sq = np.sum((X[:, np.newaxis] - centroids) ** 2, axis=2)
    return distances_sq

def assign_clusters(distances_sq):
    """
    Assign each data point to the cluster with the minimum squared Euclidean distance.
   
    Parameters:
        distances_sq (numpy.ndarray): Squared Euclidean distance matrix.
   
    Returns:
        numpy.ndarray: Cluster labels for each data point.
    """
    return np.argmin(distances_sq, axis=1)

def update_centroids(X, labels, k):
    """
    Recalculate centroids based on the assigned clusters.
   
    Parameters:
        X (numpy.ndarray): Data points array.
        labels (numpy.ndarray): Cluster labels for each data point.
        k (int): Number of clusters.
   
    Returns:
        numpy.ndarray: Updated centroids.
    """
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

def check_convergence(old_centroids, new_centroids):
    """
    Check if there is any movement in centroids between consecutive iterations.
   
    Parameters:
        old_centroids (numpy.ndarray): Centroids from previous iteration.
        new_centroids (numpy.ndarray): Centroids from current iteration.
   
    Returns:
        bool: True if there is no movement in centroids, False otherwise.
    """
    return np.array_equal(old_centroids, new_centroids)

def kmeans(X, k, filename):
    """
    Perform k-means clustering and write output to a text file.
   
    Parameters:
        X (numpy.ndarray): Data points array.
        k (int): Number of clusters.
        filename (str): Name of the output text file.
    """
    # Step 1: Initialize centroids
    centroids = initialize_centroids(X, k)
   
    iteration = 1
   
    with open(filename, 'w') as f:
        while True:
            # Step 2: Calculate squared Euclidean distance between data points and centroids
            distances_sq = calculate_squared_euclidean_distance(X, centroids)
           
            # Write intermediate distance matrix to file
            f.write(f"Iteration {iteration} - Squared Euclidean Distance Matrix:\n")
            distance_matrix = pd.DataFrame(distances_sq,
                                           columns=[f"Centroid {i+1}: {centroids[i]}" for i in range(k)],
                                           index=[f"Sample {i+1}: {X[i]}" for i in range(X.shape[0])])
            f.write(distance_matrix.to_string() + '\n\n')
           
            # Step 3: Assign each data point to the cluster with the minimum distance
            labels = assign_clusters(distances_sq)
           
            # Write intermediate clusters to file
            f.write(f"Iteration {iteration} - Intermediate Clusters:\n")
            for i in range(k):
                cluster_samples = X[labels == i]
                f.write(f"Cluster {i+1}: {cluster_samples}\n")
            f.write('\n')
           
            # Step 4: Update centroids
            new_centroids = update_centroids(X, labels, k)
           
            # Check for convergence
            if check_convergence(centroids, new_centroids):
                f.write("Convergence reached. No movement in centroids between consecutive iterations.")
                break
           
            centroids = new_centroids
            iteration += 1

# Sample data with 2 features and 8 samples
X = np.array([[2, 3],
              [3, 2],
              [5, 8],
              [6, 7],
              [8, 5],
              [7, 6],
              [1, 1],
              [4, 4]])

# Number of clusters
k = 2

# Output filename
output_filename = "kmeans_output.txt"

# Perform k-means clustering and write output to text file
kmeans(X, k, output_filename)

print("Output written to", output_filename)


Key Features of K-means Clustering

 - Simplicity: 

K-means is easy to understand and implement, making it one of the most commonly used clustering algorithms.

 - Efficiency: 

It is computationally efficient: each iteration needs on the order of n × K distance computations for n data points, so it scales well to large datasets, although the overall cost grows with the number of clusters and iterations.

 - Applicability: 

Suitable for a wide range of applications, including customer segmentation, data compression, and feature learning.

  Considerations

 - Choosing K: 

Determining the optimal number of clusters (K) is a critical and often challenging task. Methods like the Elbow Method, Silhouette Analysis, and the Gap Statistic are commonly used to estimate K, as illustrated in the sketch below.
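As a rough illustration of the Elbow Method, the sketch below reuses the helper functions from the Python code above to compute the within-cluster sum of squares (WCSS) for several values of K; the K at which the WCSS curve bends sharply (the "elbow") is a common choice. Variable names here are illustrative, and for simplicity the sketch ignores the rare case of a cluster becoming empty.

# Elbow Method sketch: WCSS for K = 1..6 on the sample dataset X
for k_try in range(1, 7):
    centroids = initialize_centroids(X, k_try).astype(float)
    for _ in range(100):  # iteration cap as a safety net
        labels = assign_clusters(calculate_squared_euclidean_distance(X, centroids))
        new_centroids = update_centroids(X, labels, k_try)
        if check_convergence(centroids, new_centroids):
            break
        centroids = new_centroids
    # WCSS: total squared distance from each point to its assigned centroid
    wcss = sum(np.sum((X[labels == i] - centroids[i]) ** 2) for i in range(k_try))
    print(f"K = {k_try}: WCSS = {wcss:.2f}")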

 - Sensitivity to Initialization: 

The choice of initial centroids can affect the final clustering outcome, as the differing iteration counts of the two worked solutions above illustrate. Multiple runs with different initializations can help, and algorithms like K-means++ have been developed to improve centroid initialization (a sketch follows below).
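For reference, here is a minimal sketch of K-means++ seeding (the function name is illustrative; it reuses calculate_squared_euclidean_distance from the Python code above): the first centroid is a uniformly random data point, and each subsequent centroid is sampled with probability proportional to a point's squared distance from its nearest already-chosen centroid.

def kmeans_plus_plus_init(X, k):
    # First centroid: a uniformly random data point
    centroids = [X[np.random.randint(X.shape[0])]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid
        d_sq = np.min(calculate_squared_euclidean_distance(X, np.array(centroids)), axis=1)
        # Sample the next centroid with probability proportional to d_sq
        # (assumes at least one point does not coincide with a chosen centroid)
        centroids.append(X[np.random.choice(X.shape[0], p=d_sq / d_sq.sum())])
    return np.array(centroids)

In practice, library implementations such as scikit-learn's sklearn.cluster.KMeans use k-means++ seeding by default and run several restarts via the n_init parameter.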

 - Convergence to Local Minima: 

K-means can converge to local minima, which might not be the global optimum. This is another reason to run the algorithm multiple times.

 - Assumptions: 

The algorithm assumes that clusters are spherical and equally sized, which might not always be the case. Thus, it may not perform well with clusters of different shapes and densities.

 Applications

 K-means clustering has diverse applications across various domains, such as:

 - Market Segmentation: Grouping customers based on purchasing behavior, demographics, or interests to tailor marketing strategies.

 - Document Clustering: Organizing articles or documents into topics.

 - Image Segmentation: Partitioning an image into segments based on pixel similarity for image compression or object recognition.

 - Anomaly Detection: Isolating outliers from normal observations in fraud detection or system health monitoring.

 Despite its limitations, K-means remains a powerful tool for exploratory data analysis and unsupervised learning due to its simplicity, efficiency, and the intuitive nature of its clustering process. 
