LLM Embeddings vs. TF-IDF vs. Bag-of-Words: Scikit-learn Performance Deep Dive

These articles are AI-generated summaries. Please check the original sources for full details.

LLM Embeddings vs TF-IDF vs Bag-of-Words: Which Works Better in Scikit-learn?

Machine learning models require numerical text representations. This analysis compares Bag-of-Words, TF-IDF, and LLM embeddings using the BBC news dataset in scikit-learn.

Why This Matters

While LLM embeddings capture richer semantics, simpler representations like TF-IDF can be sufficient and cheaper for datasets with clear class boundaries and little noise. On the BBC news dataset, whose categories are easily separable, TF-IDF combined with an SVM reached 0.987 accuracy, outperforming LLM embeddings in this specific context. Reaching for a more complex representation on a simple problem adds cost and can even reduce accuracy, which is why the representation should be matched to the task.

For more complex, real-world scenarios involving noise, paraphrasing, or slang, LLM embeddings would likely demonstrate superior performance due to their ability to capture deeper semantic nuances. However, for tasks where keyword discriminability is high and computational resources are a concern, traditional methods like TF-IDF offer a compelling balance of performance and efficiency.
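
To make "keyword discriminability" concrete, here is a minimal sketch of how TF-IDF weighting works, written in plain Python on a hypothetical three-sentence corpus (not BBC data). It assumes the smoothed IDF formula and L2 normalisation that scikit-learn documents as the TfidfVectorizer defaults. Tokenisation is simplified to whitespace splitting, so the numbers will not match the library exactly on real text.

```python
import math
from collections import Counter

# Hypothetical toy corpus for illustration only (not taken from the BBC dataset).
docs = [
    "the match ended in a draw",
    "the market ended the day higher",
    "the striker scored in the match",
]


def tfidf(documents):
    """Smoothed TF-IDF with L2 normalisation (assumed to mirror scikit-learn's defaults)."""
    tokenized = [d.split() for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


vectors = tfidf(docs)
first = vectors[0]
# Terms that occur in fewer documents receive larger weights.
for term, weight in sorted(first.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, "draw" (found in one document) outweighs "match" (found in two), which in turn outweighs "the" (found in all three). On a corpus where each category has its own vocabulary, this weighting alone goes a long way toward separating the classes.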

Key Insights

  • TF-IDF with SVM achieved 0.987 accuracy on the BBC news classification task (2026).
  • LLM embeddings with SVM achieved the fastest classifier training time at 0.15s; this figure covers fitting the classifier only and excludes the time spent generating the embeddings (2026).
  • Logistic Regression with TF-IDF offered a strong balance of performance (0.984 accuracy) and speed (0.52s training) for classification (2026).
  • For unsupervised document clustering, LLM embeddings yielded the best Adjusted Rand Index (0.899) and Silhouette Score (0.066) on the BBC dataset (2026).
  • Bag-of-Words is recommended for very simple tasks requiring maximum interpretability or as a baseline model (2026).
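
The two clustering metrics quoted above measure different things: the Adjusted Rand Index compares cluster assignments against the true category labels, while the Silhouette Score only looks at how geometrically separated the clusters are, so a high ARI can coexist with a low silhouette. As a rough illustration of the former, the sketch below computes ARI from pair counts in plain Python. The function name and the toy label lists are hypothetical; in practice scikit-learn's adjusted_rand_score, used in the listing that follows, is the function to call.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI: 1.0 for identical partitions, around 0 for random ones."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Contingency counts: how many items share a (true label, cluster) combination.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


# Same grouping with swapped cluster ids: perfect agreement (1.0).
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Every cluster mixes both classes: worse than chance (about -0.5).
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because ARI ignores the names of the cluster ids, K-Means output can be compared directly with the label-encoded categories without any relabelling step.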

Working Examples

The Python script below loads the dataset, builds Bag-of-Words, TF-IDF, and LLM embedding representations, and then compares them on text classification and document clustering using scikit-learn and sentence-transformers.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    classification_report,
    silhouette_score,
    adjusted_rand_score
)
from sklearn.preprocessing import LabelEncoder
from sentence_transformers import SentenceTransformer

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

print("Loading BBC News dataset...")
url = "https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv"
df = pd.read_csv(url)
print(f"Dataset loaded: {len(df)} documents")
print(f"Categories: {df['category'].unique()}")
print("\nClass distribution:")
print(df['category'].value_counts())

print("\n" + "="*70)
print("DATA PREPARATION PRIOR TO GENERATING TEXT REPRESENTATIONS")
print("="*70)
texts = df['text'].tolist()
labels = df['category'].tolist()

le = LabelEncoder()
y = le.fit_transform(labels)

X_text_train, X_text_test, y_train, y_test = train_test_split(
    texts,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print(f"\nTrain set: {len(X_text_train)} | Test set: {len(X_text_test)}")

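print("\n[1] Bag-of-Words...")
start = time()
# Raw term counts, limited to the 5,000 most frequent terms seen in at least 2 documents
bow_vectorizer = CountVectorizer(
    max_features=5000,
    min_df=2,
    stop_words='english'
)
X_bow_train = bow_vectorizer.fit_transform(X_text_train)  # vocabulary learned on train only
X_bow_test = bow_vectorizer.transform(X_text_test)
bow_time = time() - start
print(f" Done in {bow_time:.2f}s")
print(f" Shape: {X_bow_train.shape} (documents × vocabulary)")
print(f" Sparsity: {(1 - X_bow_train.nnz / (X_bow_train.shape[0] * X_bow_train.shape[1])) * 100:.1f}%")
# A CSR matrix stores values, column indices, and row pointers; count all three
bow_bytes = X_bow_train.data.nbytes + X_bow_train.indices.nbytes + X_bow_train.indptr.nbytes
print(f" Memory: {bow_bytes / 1024:.1f} KB")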

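print("\n[2] TF-IDF...")
start = time()
# Same vocabulary limits as Bag-of-Words, but counts are reweighted by inverse document frequency
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=2,
    stop_words='english'
)
X_tfidf_train = tfidf_vectorizer.fit_transform(X_text_train)  # vocabulary and IDF learned on train only
X_tfidf_test = tfidf_vectorizer.transform(X_text_test)
tfidf_time = time() - start
print(f" Done in {tfidf_time:.2f}s")
print(f" Shape: {X_tfidf_train.shape}")
print(f" Sparsity: {(1 - X_tfidf_train.nnz / (X_tfidf_train.shape[0] * X_tfidf_train.shape[1])) * 100:.1f}%")
# A CSR matrix stores values, column indices, and row pointers; count all three
tfidf_bytes = X_tfidf_train.data.nbytes + X_tfidf_train.indices.nbytes + X_tfidf_train.indptr.nbytes
print(f" Memory: {tfidf_bytes / 1024:.1f} KB")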

print("\n[3] LLM Embeddings...")
start = time()
# Note: this timer also covers loading (and, on first run, downloading) the model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
X_emb_train = embedding_model.encode(
    X_text_train,
    show_progress_bar=True,
    batch_size=32
)
X_emb_test = embedding_model.encode(
    X_text_test,
    show_progress_bar=False,
    batch_size=32
)
emb_time = time() - start
print(f" Done in {emb_time:.2f}s")
print(f" Shape: {X_emb_train.shape} (documents × embedding_dim)")
print(" Sparsity: 0.0% (dense representation)")
print(f" Memory: {X_emb_train.nbytes / 1024:.1f} KB")

print("\n" + "="*70)
print("COMPARISON 1: SUPERVISED CLASSIFICATION")
print("="*70)

classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='linear', random_state=42)
}

classification_results = []

representations = {
    'BoW': (X_bow_train, X_bow_test),
    'TF-IDF': (X_tfidf_train, X_tfidf_test),
    'LLM Embeddings': (X_emb_train, X_emb_test)
}

# Fit every classifier on every representation; each fit() call retrains from scratch
for rep_name, (X_tr, X_te) in representations.items():
    print(f"\nTesting {rep_name}:")
    print("-" * 50)
    for clf_name, clf in classifiers.items():
        # Train time covers the classifier only, not vectorization or embedding
        start = time()
        clf.fit(X_tr, y_train)
        train_time = time() - start

        start = time()
        y_pred = clf.predict(X_te)
        pred_time = time() - start

        acc = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        print(f" {clf_name:20s} | Acc: {acc:.3f} | F1: {f1:.3f} | Train: {train_time:.2f}s")
        classification_results.append({
            'Representation': rep_name,
            'Classifier': clf_name,
            'Accuracy': acc,
            'F1-Score': f1,
            'Train Time': train_time,
            'Predict Time': pred_time
        })

results_df = pd.DataFrame(classification_results)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

pivot_acc = results_df.pivot(index='Classifier', columns='Representation', values='Accuracy')
pivot_acc.plot(kind='bar', ax=axes[0], width=0.8)
axes[0].set_title('Classification Accuracy by Representation', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Accuracy')
axes[0].set_xlabel('Classifier')
axes[0].legend(title='Representation')
axes[0].grid(axis='y', alpha=0.3)
axes[0].set_ylim([0.9, 1.0])

pivot_time = results_df.pivot(index='Classifier', columns='Representation', values='Train Time')
pivot_time.plot(kind='bar', ax=axes[1], width=0.8, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
axes[1].set_title('Training Time by Representation', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Time (seconds)')
axes[1].set_xlabel('Classifier')
axes[1].legend(title='Representation')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nBEST PERFORMERS:")
print("-" * 50)
best_acc = results_df.loc[results_df['Accuracy'].idxmax()]
print(f"Best Accuracy: {best_acc['Representation']} + {best_acc['Classifier']} = {best_acc['Accuracy']:.3f}")
fastest = results_df.loc[results_df['Train Time'].idxmin()]
print(f"Fastest Training: {fastest['Representation']} + {fastest['Classifier']} = {fastest['Train Time']:.2f}s")

print("\n" + "="*70)
print("COMPARISON 2: DOCUMENT CLUSTERING")
print("="*70)

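all_texts = texts
all_labels = y

print("\nGenerating representations for full dataset...")
# Clustering is unsupervised, so the vectorizers are refit on the full corpus.
# This overwrites the vocabularies learned from the training split above.
X_bow_full = bow_vectorizer.fit_transform(all_texts)
X_tfidf_full = tfidf_vectorizer.fit_transform(all_texts)
X_emb_full = embedding_model.encode(all_texts, show_progress_bar=True, batch_size=32)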

n_clusters = len(np.unique(all_labels))
clustering_results = []
representations_full = {
    'BoW': X_bow_full,
    'TF-IDF': X_tfidf_full,
    'LLM Embeddings': X_emb_full
}

for rep_name, X_full in representations_full.items():
    print(f"\nClustering with {rep_name}:")
    start = time()
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(X_full)
    cluster_time = time() - start

    silhouette = silhouette_score(X_full, cluster_labels)
    ari = adjusted_rand_score(all_labels, cluster_labels)
    print(f" Silhouette Score: {silhouette:.3f}")
    print(f" Adjusted Rand Index: {ari:.3f}")
    print(f" Time: {cluster_time:.2f}s")
    clustering_results.append({
        'Representation': rep_name,
        'Silhouette': silhouette,
        'ARI': ari,
        'Time': cluster_time
    })

clustering_df = pd.DataFrame(clustering_results)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

x = np.arange(len(clustering_df))
width = 0.35
axes[0].bar(x - width/2, clustering_df['Silhouette'], width, label='Silhouette', alpha=0.8)
axes[0].bar(x + width/2, clustering_df['ARI'], width, label='Adjusted Rand Index', alpha=0.8)
axes[0].set_xlabel('Representation')
axes[0].set_ylabel('Score')
axes[0].set_title('Clustering Quality Metrics', fontsize=14, fontweight='bold')
axes[0].set_xticks(x)
axes[0].set_xticklabels(clustering_df['Representation'])
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

axes[1].bar(clustering_df['Representation'], clustering_df['Time'], color=['#1f77b4', '#ff7f0e', '#2ca02c'], alpha=0.8)
axes[1].set_xlabel('Representation')
axes[1].set_ylabel('Time (seconds)')
axes[1].set_title('Clustering Computation Time', fontsize=14, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nBEST CLUSTERING PERFORMER:")
print("-" * 50)
best_cluster = clustering_df.loc[clustering_df['ARI'].idxmax()]
print(f"{best_cluster['Representation']}: ARI = {best_cluster['ARI']:.3f}, Silhouette = {best_cluster['Silhouette']:.3f}")
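
The listing above scores each model on a single 80/20 split and imports cross_val_score without using it. As a hedged sketch of one way to make the comparison more robust, the snippet below wraps the vectorizer and the classifier in a scikit-learn Pipeline, so the vocabulary is re-learned inside every cross-validation fold rather than once up front. The toy corpus and labels are hypothetical stand-ins for the BBC texts and their encoded categories, and the max_features and min_df settings from the listing are left out because the toy corpus is too small for them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df['text'] and the label-encoded categories.
toy_texts = [
    "the striker scored twice in the football match",
    "the team won the cup final after extra time",
    "the coach praised the goalkeeper after the match",
    "the tennis champion won the final in straight sets",
    "shares fell as the bank reported lower profits",
    "the company raised prices after strong quarterly profits",
    "the market rallied as oil prices climbed",
    "investors sold shares after the bank cut its forecast",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = sport, 1 = business

# The pipeline refits TF-IDF on the training part of each fold,
# so no vocabulary or IDF statistics leak from the held-out part.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000, random_state=42),
)

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=cv, scoring="accuracy")
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy:   {scores.mean():.3f}")
```

On the real dataset the same pattern would take the full list of texts and labels with more folds (for example 5), which gives a spread of accuracies instead of a single number per model.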

Practical Applications

  • Text Classification: BBC News articles are classified into 5 categories using Logistic Regression, Random Forest, and SVM models on each text representation.
  • Document Clustering: BBC News articles are clustered into 5 groups using K-Means with Bag-of-Words, TF-IDF, and LLM embeddings.
  • Baseline Modeling (pitfall): Bag-of-Words is an effective baseline, but it ignores word order and meaning, so it can miss semantic nuances that matter for complex tasks.
  • Representation Selection (pitfall): Choosing LLM embeddings for a dataset with highly discriminative keywords (like the BBC News categories) can yield lower accuracy than simpler methods like TF-IDF while costing more to compute.
