
Understanding K-means Clustering: A Comprehensive Guide

What is K-means Clustering?

K-means, proposed by Stuart Lloyd in 1957, is one of the most widely used unsupervised learning algorithms. It iteratively partitions data into ‘K’ non-overlapping clusters, maximizing intra-cluster similarity and inter-cluster differences. Here, ‘K’ denotes the predefined number of clusters, and "means" refers to the centroid (mean) of data points within each cluster.

The phrase "birds of a feather flock together" aptly describes clustering. However, it is crucial to distinguish clustering from classification:

  • Classification involves predefined labels (supervised learning), where data is assigned to known categories.
  • Clustering identifies hidden patterns in unlabeled data (unsupervised learning), grouping similar data points without prior knowledge of categories.

 

Applications of K-means Clustering

K-means is versatile across various domains:

Gene Expression Analysis: Clustering co-expressed genes to identify functional modules (e.g., cancer-related gene clusters). For example, RNA-seq data can group genes with similar expression patterns for GO/KEGG enrichment analysis.

Single-Cell Sequencing: Classifying cell subtypes (e.g., T cells, B cells) and annotating them. Tools like t-SNE or UMAP are often integrated for dimensionality reduction and visualizing cellular heterogeneity.

Proteomics: Clustering protein sequences or structures to predict conserved functional regions. Case studies include classifying proteins based on physicochemical properties (hydrophobicity, charge) to aid functional annotation.

Disease Subtyping: Stratifying patients into subtypes using multi-omics data (genomics, transcriptomics) to guide personalized therapies.

 

The K-means Algorithm: Step-by-Step

The algorithm consists of two core steps: Assignment and Update.

1. Initialize Centroids: Randomly select ‘K’ data points as initial centroids.

2. Assignment Step: Assign each data point to the nearest centroid using Euclidean distance.

3. Update Step: Recalculate centroids as the mean of all points in each cluster.

4. Iterate: Repeat assignment and update until centroids stabilize (convergence) or a maximum iteration threshold is reached.
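The four steps above can be sketched in a few lines of base R. This is a minimal illustration, not a replacement for the optimized built-in kmeans() function: the helper name simple_kmeans is ours, and edge cases such as empty clusters are not handled.

```r
# Minimal K-means sketch in base R (illustrative only; assumes k >= 2
# and that no cluster becomes empty during the iterations).
simple_kmeans <- function(x, k, max_iter = 100) {
  # Step 1: randomly pick k data points as initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # Step 2 (assignment): index of the nearest centroid for each point
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # Step 3 (update): new centroid = mean of the points in each cluster
    new_centroids <- t(sapply(1:k, function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    # Step 4: stop when the centroids no longer move
    if (all(abs(new_centroids - centroids) < 1e-9)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

# Example: two well-separated 2-D blobs
set.seed(1)
x <- rbind(matrix(rnorm(20), ncol = 2), matrix(rnorm(20, mean = 6), ncol = 2))
res <- simple_kmeans(x, k = 2)
table(res$cluster)
```

The distance matrix trick (computing dist() on centroids and points together, then slicing) keeps the assignment step vectorized without an explicit double loop.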

Visualizing Iterations: Initially, centroids (black dots) are randomly placed. As iterations proceed, centroids shift, and cluster assignments (color-coded) adjust until convergence.

Iterative process for K-means analysis

 

Example: Clustering Data

Consider 2D data points representing customer "spending" and "visit frequency." We aim to cluster them into K = 2 groups.

Data Point   Spending (x)   Visits (y)
Point 1      1              2
Point 2      2              1
Point 3      4              5
Point 4      5              4
Point 5      8              8

Step 1: Initialize Centroids

Randomly select Point 1 (1,2) and Point 5 (8,8) as initial centroids:

  • C₁: (1, 2)
  • C₂: (8, 8)

Step 2: Assign Clusters

Calculate Euclidean distances and assign points to the nearest centroid:

Point   Distance to C₁   Distance to C₂   Assigned Cluster
1       0.00             9.22             C₁
2       1.41             9.22             C₁
3       4.24             5.00             C₁
4       4.47             5.00             C₁
5       9.22             0.00             C₂

For example, Point 4 (5, 4) lies √((5−1)² + (4−2)²) = 4.47 from C₁ and √((8−5)² + (8−4)²) = 5.00 from C₂, so it joins C₁.

Result:

  • Cluster 1: Points 1, 2, 3, 4
  • Cluster 2: Point 5

Step 3: Recalculate Centroids

  • New C₁: ( (1+2+4+5)/4, (2+1+5+4)/4 ) = (3, 3)
  • New C₂: Point 5 is its only member, so C₂ stays at (8, 8)

Step 4: Iterate Until Convergence

Repeat the assignment and update steps. Every point remains closest to its current centroid, so the assignments no longer change and the algorithm converges, with stable centroids at (3, 3) and (8, 8).
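As a sanity check, this toy example can be reproduced with base R's kmeans(), supplying the two chosen starting centroids explicitly and selecting the classic Lloyd algorithm described above:

```r
# Five customer points: spending (x) and visit frequency (y)
pts <- rbind(c(1, 2), c(2, 1), c(4, 5), c(5, 4), c(8, 8))
init <- rbind(c(1, 2), c(8, 8))  # Point 1 and Point 5 as initial centroids

km_toy <- kmeans(pts, centers = init, algorithm = "Lloyd")
km_toy$cluster  # per-point cluster assignment
km_toy$centers  # final centroids: (3, 3) and (8, 8)
```

Passing a matrix to `centers` fixes the starting centroids, so the run is deterministic and mirrors the hand calculation step by step.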

 

Choosing the Optimal K Value

Selecting K is critical but non-trivial. Common methods include:

1. Elbow Method: Plot the sum of squared errors (SSE) against ‘K’. The "elbow", the point where the curve bends and additional clusters bring diminishing reductions in SSE, indicates a good choice of ‘K’.

2. Silhouette Coefficient: Measures cluster compactness and separation. Higher values denote better clustering.

3. Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance. Maximize this value.
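As an illustration of the silhouette approach, the average silhouette width can be computed for several candidate values of ‘K’ with the cluster package. This is a minimal sketch using the built-in USArrests data that also appears later in this guide:

```r
library(cluster)  # provides silhouette()

# Average silhouette width for K = 2..6 on the scaled USArrests data
x <- scale(USArrests)
set.seed(123)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(x))[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil  # pick the K with the highest average silhouette width
```

factoextra offers the same diagnostic as a one-liner, fviz_nbclust(x, kmeans, method = "silhouette"), if you prefer a ready-made plot.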

 

K-means Analysis & Visualization

1. Environment Setup

# Install required R packages (if not already present):

if (!requireNamespace("factoextra", quietly = TRUE)) {
  install.packages("factoextra")
}
if (!requireNamespace("cluster", quietly = TRUE)) {
  install.packages("cluster")
}

2. Data Preparation

Use the `USArrests` dataset (crime statistics by U.S. state):

library(factoextra)
library(cluster)

# Load, drop missing values, then standardize each variable
data <- na.omit(USArrests)
data <- scale(data)
head(data)

               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207

3. Determine Optimal ‘K’

Visualize SSE using the Elbow Method:

fviz_nbclust(data, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(x = "K", y = "SSE")

The elbow at K=4 suggests optimal clustering.

Optimal number of clusters

 

4. Perform K-means Clustering

Finally, we run the kmeans() function with k = 4. Setting nstart = 25 repeats the clustering from 25 random starts and keeps the best solution:

set.seed(1)
km <- kmeans(data, centers = 4, nstart = 25)
print(km)

K-means clustering with 4 clusters of sizes 13, 16, 13, 8

Cluster means:
      Murder    Assault   UrbanPop        Rape
1  0.6950701  1.0394414  0.7226370  1.27693964
2 -0.4894375 -0.3826001  0.5758298 -0.26165379
3 -0.9615407 -1.1066010 -0.9301069 -0.96676331
4  1.4118898  0.8743346 -0.8145211  0.01927104

Clustering vector:
   Alabama     Alaska    Arizona   Arkansas California
         4          1          1          4          1
  Colorado Connecticut  Delaware    Florida    Georgia
         1          2          2          1          4
    Hawaii      Idaho   Illinois    Indiana       Iowa
         2          3          1          2          3
......

Within cluster sum of squares by cluster:
[1] 19.922437 16.212213 11.952463  8.316061
 (between_SS / total_SS = 71.2 %)

Key Outputs:

  • Cluster sizes: 13, 16, 13, 8
  • Cluster means (centroids) for Murder, Assault, UrbanPop, and Rape.

5. Merge Clustering Results with the Original Data

After clustering, you can merge the cluster assignments back onto the original (unscaled) data to inspect each cluster's profile:

result <- cbind(Cluster = km$cluster, USArrests)
head(result, 10)

            Cluster Murder Assault UrbanPop Rape
Alabama           4   13.2     236       58 21.2
Alaska            1   10.0     263       48 44.5
Arizona           1    8.1     294       80 31.0
Arkansas          4    8.8     190       50 19.5
California        1    9.0     276       91 40.6
Colorado          1    7.9     204       78 38.7
Connecticut       2    3.3     110       77 11.1
Delaware          2    5.9     238       72 15.8
Florida           1   15.4     335       80 31.9
Georgia           4   17.4     211       60 25.8

6. Visualize Clusters

We can present the results as a scatterplot in two dimensions with the fviz_cluster() function:

fviz_cluster(km, data = data,
             palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
             ellipse.type = "convex",
             star.plot = TRUE,
             repel = TRUE,
             ggtheme = theme_minimal()
)

Clustering Visualization

 

Interpretation:

  • Dim1 and Dim2 are the first two principal components; fviz_cluster() projects the data onto them and labels each axis with the percentage of variance it explains.
  • Colors denote clusters, and centroids reflect their mean positions.
  • Tight convex hulls indicate high intra-cluster cohesion.

Limitations & Enhancements

While K-means is fast and interpretable, it has drawbacks:

  • Sensitivity to initial centroids and outliers.
  • Requires numerical data and predefined ‘K’.
  • Potential for imbalanced clusters.

Improvements include robust centroid initialization (e.g., K-means++), outlier detection, and hybrid approaches.
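K-means++ seeding, for example, chooses each new centroid with probability proportional to its squared distance from the nearest centroid already selected, which spreads the starting centroids apart. A minimal base-R sketch (the helper name kmeanspp_init is ours, for illustration; production code should use an optimized library implementation):

```r
# K-means++ style seeding: spread the initial centroids apart
kmeanspp_init <- function(x, k) {
  n <- nrow(x)
  # First centroid: one data point, uniformly at random
  centers <- x[sample(n, 1), , drop = FALSE]
  while (nrow(centers) < k) {
    # Squared distance from each point to its nearest chosen centroid
    d2 <- apply(x, 1, function(p) min(colSums((t(centers) - p)^2)))
    # Sample the next centroid with probability proportional to d2
    centers <- rbind(centers, x[sample(n, 1, prob = d2), , drop = FALSE])
  }
  centers
}

# Use the seeded centroids as the starting point for standard K-means:
set.seed(3)
x <- rbind(matrix(rnorm(40), ncol = 2), matrix(rnorm(40, mean = 8), ncol = 2))
km <- kmeans(x, centers = kmeanspp_init(x, 2))
```

Because already-chosen points have zero distance to themselves, they can never be drawn twice, and far-away regions of the data are favored for the remaining seeds.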

 

Metware Cloud: Simplify Your Analysis

Metware Cloud offers a one-stop bioinformatics platform with zero-code tools for K-means clustering. Whether analyzing gene expression or single-cell data, our engine delivers rapid results: upload data, standardize, cluster, and visualize—all in 30 seconds.

Explore hidden patterns effortlessly at https://cloud.metwarebio.com

Metware Cloud Platform Tools for K-means clustering
