K-Means Cluster Evaluation with Silhouette Analysis

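Clustering models require rigorous evaluation to ensure meaningful group separation. The silhouette score, ranging from −1 to 1, quantifies how well each data point fits its assigned cluster compared with the nearest neighboring cluster: values near 1 indicate a good fit, values near 0 indicate a point on the boundary between clusters, and negative values suggest the point may belong to a different cluster.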

Why This Matters

Silhouette analysis bridges the gap between theoretical cluster validity and real-world performance. While ideal models assume convex, well-separated clusters, real data often contains overlapping distributions or high-dimensional noise. A poorly chosen cluster count can produce misleading results: with k = 6 on the penguins dataset, for example, the average silhouette score drops to 0.392, reflecting weaker cohesion and separation.
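To make the effect of overlap concrete, the following minimal sketch compares the average silhouette score of K-means on tight versus wide synthetic clusters. The data comes from scikit-learn's make_blobs, and the centers, spreads, and the helper name average_silhouette are illustrative assumptions, not part of the original article.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical cluster centers, chosen only for illustration
centers = [[0, 0], [6, 0], [3, 5]]

def average_silhouette(cluster_std):
    """Fit K-means (k=3) on synthetic blobs and return the average silhouette score."""
    X, _ = make_blobs(n_samples=300, centers=centers,
                      cluster_std=cluster_std, random_state=42)
    labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
    return silhouette_score(X, labels)

score_separated = average_silhouette(cluster_std=0.5)    # tight, well-separated blobs
score_overlapping = average_silhouette(cluster_std=3.0)  # wide, overlapping blobs

print(f"well separated: {score_separated:.3f}")
print(f"overlapping:    {score_overlapping:.3f}")
```

Under these assumptions the overlapping case should score noticeably lower, which is the same pattern the penguins example shows when k is pushed too high.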

Key Insights

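“Silhouette score formula: $ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} $”, from MachineLearningMastery.com (2025). Here a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the mean distance from point i to the points in the nearest other cluster.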
Silhouette analysis is a common choice for evaluating iterative clustering methods such as K-means.
scikit-learn is used by researchers and practitioners for both clustering and silhouette computation.
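The formula can be checked by hand. The sketch below computes s(i) directly from pairwise distances and compares the result with scikit-learn's silhouette_samples. The function name silhouette_manual and the six toy points are hypothetical, introduced only for this illustration; Euclidean distance is assumed, and a point alone in its cluster is given a score of 0, following the scikit-learn convention.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_manual(X, labels):
    """Compute s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: score stays 0
        # a(i): mean distance to the other members of the same cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data: two small, clearly separated groups
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

manual = silhouette_manual(X_toy, labels_toy)
reference = silhouette_samples(X_toy, labels_toy)
print("matches scikit-learn:", np.allclose(manual, reference))
```

The explicit loop is kept for readability; for real datasets the library function is the better choice.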

Working Example

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt
import numpy as np

# Load and preprocess data
penguins = pd.read_csv('https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/penguins.csv')
penguins = penguins.dropna()
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins[features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Evaluate silhouette scores for k=2 to 6
range_n_clusters = list(range(2, 7))
silhouette_avgs = []

for n_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    cluster_labels = kmeans.fit_predict(X_scaled)
    sil_avg = silhouette_score(X_scaled, cluster_labels)
    silhouette_avgs.append(sil_avg)
    print(f"For n_clusters = {n_clusters}, average silhouette_score = {sil_avg:.3f}")

# Visualize silhouette plots
fig, axes = plt.subplots(1, len(range_n_clusters), figsize=(25, 5), sharey=False)
for i, n_clusters in enumerate(range_n_clusters):
    ax = axes[i]
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    sil_vals = silhouette_samples(X_scaled, labels)
    sil_avg = silhouette_score(X_scaled, labels)
    
    y_lower = 10
    for j in range(n_clusters):
        ith_sil_vals = sil_vals[labels == j]
        ith_sil_vals.sort()
        size_j = ith_sil_vals.shape[0]
        y_upper = y_lower + size_j
        color = plt.cm.nipy_spectral(float(j) / n_clusters)
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_sil_vals, facecolor=color, edgecolor=color, alpha=0.7)
        ax.text(-0.05, y_lower + 0.5 * size_j, str(j))
        y_lower = y_upper + 10
    
    ax.set_title(f"Silhouette Plot for k = {n_clusters}")
    ax.axvline(x=sil_avg, color="red", linestyle="--")
    ax.set_xlabel("Silhouette Coefficient")
    if i == 0:
        ax.set_ylabel("Cluster Label")
    ax.set_xlim([-0.1, 1])
    ax.set_ylim([0, len(X_scaled) + (n_clusters + 1) * 10])

plt.tight_layout()
plt.show()
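The per-k averages printed by the first loop can also be reduced to a single candidate k by taking the highest score. The following is a self-contained sketch of that selection step on synthetic data, since the penguins download is not repeated here; make_blobs, the four corner centers, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four tight blobs at the corners of a square (an assumption)
centers = [[0, 0], [0, 10], [10, 0], [10, 10]]
X_demo, _ = make_blobs(n_samples=400, centers=centers,
                       cluster_std=0.6, random_state=42)

candidate_ks = list(range(2, 7))
scores = []
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores.append(silhouette_score(X_demo, labels))

# Pick the k whose average silhouette score is highest
best_k = candidate_ks[int(np.argmax(scores))]
print(f"k with the highest average silhouette score: {best_k}")
```

The same argmax could be applied to silhouette_avgs and range_n_clusters from the listing above. The highest average is only a candidate: the silhouette plots should still be inspected for clusters that fall below the average line or vary widely in size.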

Practical Applications

Use Case: Marketing segmentation using customer purchase data, where silhouette analysis helps identify optimal cluster counts for targeted campaigns.
Pitfall: Relying on silhouette scores alone, without domain knowledge, can yield clusters that do not correspond to meaningful groups (e.g., choosing k = 2 for the penguins dataset even though it contains three biological species).
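One way to guard against this pitfall, when some known grouping is available, is to cross-tabulate the cluster labels against it. The sketch below uses synthetic data with three known groups, two of which sit close together. The data, the centers, and the variable names are illustrative assumptions and do not reproduce the penguins results.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three known groups; the first two sit close together (illustrative assumption)
centers = [[0, 0], [2.5, 0], [12, 0]]
X_demo, true_groups = make_blobs(n_samples=300, centers=centers,
                                 cluster_std=1.0, random_state=42)

scores, labelings = {}, {}
for k in (2, 3):
    labelings[k] = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labelings[k])
    print(f"k = {k}: average silhouette score = {scores[k]:.3f}")

# Cross-tabulate the k=2 clusters against the known groups
table = pd.crosstab(true_groups, labelings[2],
                    rownames=["known group"], colnames=["cluster"])
print(table)
```

In a setup like this the silhouette score tends to favor k = 2, because merging the two nearby groups gives cleaner separation, while the table makes visible that two known groups share one cluster. Whether that merge is acceptable is a domain question rather than a statistical one.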

References:

https://machinelearningmastery.com/k-means-cluster-evaluation-with-silhouette-analysis/
