K-Means Cluster Evaluation with Silhouette Analysis
These articles are AI-generated summaries. Please check the original sources for full details.
Clustering models require rigorous evaluation to ensure meaningful group separation. The silhouette score, ranging from −1 to 1, quantifies how well data points fit into their assigned clusters versus neighboring clusters.
Why This Matters
Silhouette analysis bridges the gap between theoretical cluster validity and real-world performance. Idealized models assume convex, well-separated clusters, but real data often contains overlapping distributions or high-dimensional noise. A poorly chosen cluster count can produce misleading results: with k = 6 on the penguins dataset, the average silhouette score drops to 0.392, reflecting weaker cohesion and separation.
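The effect of overlap on the score can be illustrated with synthetic data. The following is a minimal sketch, not taken from the original article: the blob centers, the two `cluster_std` values, and the helper name `average_silhouette` are assumptions chosen only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical blob centers, spaced 6 units apart along the x-axis.
blob_centers = [[-6.0, 0.0], [0.0, 0.0], [6.0, 0.0]]


def average_silhouette(cluster_std):
    """Generate three blobs with the given spread and score a k=3 K-means fit."""
    X, _ = make_blobs(
        n_samples=300,
        centers=blob_centers,
        cluster_std=cluster_std,
        random_state=0,
    )
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    return silhouette_score(X, labels)


score_separated = average_silhouette(0.5)    # tight, well-separated blobs
score_overlapping = average_silhouette(3.0)  # wide, heavily overlapping blobs

print(f"well-separated blobs: {score_separated:.3f}")
print(f"overlapping blobs:    {score_overlapping:.3f}")
```

With the same number of clusters in both runs, the lower score for the wide blobs comes from the overlap alone, which is the situation real data often resembles.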
Key Insights
- “Silhouette score formula: $ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} $”, from MachineLearningMastery.com (2025)
- Silhouette analysis is a preferred evaluation method for iterative clustering algorithms such as K-means.
- “scikit-learn used by researchers and practitioners for silhouette computation and clustering”
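To make the formula concrete, the sketch below computes s(i) by hand and checks it against scikit-learn's `silhouette_samples`. It assumes the standard definitions: a(i) is the mean distance from point i to the other members of its own cluster, and b(i) is the smallest mean distance from point i to the members of any other cluster. The helper name `manual_silhouette` and the six toy points are illustrative and not part of the original article.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples


def manual_silhouette(X, labels):
    """Compute s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = pairwise_distances(X)  # Euclidean distances by default
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: s(i) = 0 for a single-member cluster
        # a(i): mean distance to the other members of the same cluster
        # (the self-distance is 0, so divide by n_own - 1).
        a = dist[i, own].sum() / (n_own - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Toy data (assumed): two compact groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

manual = manual_silhouette(X_toy, labels_toy)
reference = silhouette_samples(X_toy, labels_toy)
print("matches scikit-learn:", np.allclose(manual, reference))
```

The loop is quadratic in the number of points and is meant for reading, not for large datasets; in practice the library function is the one to call.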
Working Example
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt
import numpy as np

# Load and preprocess data
penguins = pd.read_csv('https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/penguins.csv')
penguins = penguins.dropna()
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins[features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Evaluate silhouette scores for k=2 to 6
range_n_clusters = list(range(2, 7))
silhouette_avgs = []
for n_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    cluster_labels = kmeans.fit_predict(X_scaled)
    sil_avg = silhouette_score(X_scaled, cluster_labels)
    silhouette_avgs.append(sil_avg)
    print(f"For n_clusters = {n_clusters}, average silhouette_score = {sil_avg:.3f}")

# Visualize silhouette plots
fig, axes = plt.subplots(1, len(range_n_clusters), figsize=(25, 5), sharey=False)
for i, n_clusters in enumerate(range_n_clusters):
    ax = axes[i]
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    sil_vals = silhouette_samples(X_scaled, labels)
    sil_avg = silhouette_score(X_scaled, labels)
    y_lower = 10
    for j in range(n_clusters):
        # Sorted silhouette values for the points assigned to cluster j
        ith_sil_vals = sil_vals[labels == j]
        ith_sil_vals.sort()
        size_j = ith_sil_vals.shape[0]
        y_upper = y_lower + size_j
        color = plt.cm.nipy_spectral(float(j) / n_clusters)
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_sil_vals, facecolor=color, edgecolor=color, alpha=0.7)
        ax.text(-0.05, y_lower + 0.5 * size_j, str(j))
        y_lower = y_upper + 10  # leave a gap before the next cluster
    ax.set_title(f"Silhouette Plot for k = {n_clusters}")
    ax.axvline(x=sil_avg, color="red", linestyle="--")  # average silhouette score
    ax.set_xlabel("Silhouette Coefficient")
    if i == 0:
        ax.set_ylabel("Cluster Label")
    ax.set_xlim([-0.1, 1])
    ax.set_ylim([0, len(X_scaled) + (n_clusters + 1) * 10])
plt.tight_layout()
plt.show()
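A silhouette plot can also be read numerically. The sketch below is an assumed complement to the example above, not part of the original article: it reports each cluster's size, mean silhouette value, and share of points with a negative value. The helper name `summarize_silhouette` is hypothetical, and synthetic blobs stand in for the penguins data so that the snippet runs without a download.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples


def summarize_silhouette(X, labels):
    """Return per-cluster size, mean silhouette, and share of negative values."""
    sil_vals = silhouette_samples(X, labels)
    summary = {}
    for cluster in np.unique(labels):
        vals = sil_vals[labels == cluster]
        summary[int(cluster)] = {
            "size": int(vals.size),
            "mean": float(vals.mean()),
            "share_negative": float((vals < 0).mean()),
        }
    return summary


# Synthetic stand-in data (assumed), not the penguins dataset.
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=42)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

summary_demo = summarize_silhouette(X_demo, labels_demo)
for cluster, stats in summary_demo.items():
    print(
        f"cluster {cluster}: size={stats['size']}, "
        f"mean={stats['mean']:.3f}, negative share={stats['share_negative']:.2%}"
    )
```

A cluster whose mean sits well below the overall average, or whose negative share is large, corresponds to a thin or ragged band in the plot and is a candidate for a different k.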
Practical Applications
- Use Case: Marketing segmentation using customer purchase data, where silhouette analysis helps identify optimal cluster counts for targeted campaigns.
- Pitfall: Relying on silhouette scores without domain knowledge can lead to cluster counts that do not match the known structure of the data (e.g., k = 2 in the penguins dataset vs. three biological species).
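The pitfall can be reproduced on synthetic data when known group labels are available to check against. The sketch below is an illustration under stated assumptions, not a result from the penguins dataset: three blobs are generated with hypothetical centers, two of them close together, and each K-means solution is scored with both the silhouette score and the adjusted Rand index (`adjusted_rand_score`) against the generating labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Hypothetical centers: two neighboring groups and one distant group.
group_centers = [[0.0, 0.0], [3.0, 0.0], [12.0, 0.0]]
X_syn, true_groups = make_blobs(
    n_samples=300, centers=group_centers, cluster_std=1.0, random_state=0
)

results = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_syn)
    results[k] = {
        # Internal measure: uses only the data and the cluster labels.
        "silhouette": silhouette_score(X_syn, labels),
        # External measure: agreement with the known generating groups.
        "ari": adjusted_rand_score(true_groups, labels),
    }
    print(
        f"k={k}: silhouette={results[k]['silhouette']:.3f}, "
        f"ARI vs. known groups={results[k]['ari']:.3f}"
    )
```

In this setup the silhouette score favors k = 2, because merging the two neighboring groups yields clusters that are far apart, while the adjusted Rand index favors k = 3, which matches how the data was generated. Known labels are rarely available in practice, which is why domain knowledge has to play that role.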