Text Features and Dimensionality Reduction

4.3 — Text as Features

A product review often carries more predictive signal than the structured metadata columns around it. The phrase “battery dies after two hours” can tell you more about return probability than the product category, price, and seller rating combined. But your model cannot read English; it needs a numeric representation.

Two paradigms dominate text featurization, and they encode fundamentally different assumptions about language.

TF-IDF: The Workhorse and Its Limits

Term Frequency–Inverse Document Frequency weights each word by how often it appears in a document relative to how often it appears across all documents. Words that appear frequently in one document but rarely across the corpus get high weights — they are distinctive to that document. Words that appear everywhere (“the”, “is”, “and”) get near-zero weights.

The representation is a sparse vector: one dimension per unique word (or n-gram) in the vocabulary. A corpus with 50,000 unique words produces 50,000-dimensional vectors, nearly all zeros for any single document.
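To make the weighting and the sparsity concrete, here is a minimal sketch of the textbook formulation, tf × log(N / df), in plain Python. The three-document corpus and the `tfidf` helper are illustrative assumptions, not part of this chapter's pipeline. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its weights for ubiquitous words are small rather than exactly zero.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # term frequency (share of the document) times inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus, whitespace-tokenized.
corpus = [
    "the battery dies after two hours".split(),
    "the screen is bright and the colors are vivid".split(),
    "the delivery was fast and the box was intact".split(),
]
vectors = tfidf(corpus)
vocabulary = sorted({term for doc in corpus for term in doc})

for i, vec in enumerate(vectors):
    nonzero = {t: round(w, 3) for t, w in vec.items() if w > 0}
    print(f"doc {i}: {len(nonzero)} non-zero of {len(vocabulary)} dimensions -> {nonzero}")
```

Each document stores only the terms it contains, so most of the vocabulary's dimensions are implicitly zero. The word “the” appears in every document, which makes its IDF log(1) = 0 under this formulation.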

TF-IDF has three strengths that keep it relevant. It is cheap: it requires no GPU, no pretrained model, and no dependencies beyond scikit-learn. It is interpretable: you can inspect the highest-weighted features and understand why a document was classified a certain way. And on small datasets (under ~5,000 documents), it often matches or beats embedding-based approaches, because dense embeddings need data volume to demonstrate their advantage.

The limitations are equally clear: TF-IDF has no semantic understanding. The phrases “the battery is excellent” and “the battery is terrible” share three of their four words, and nothing in the representation says that the one differing term flips the meaning. To TF-IDF, “excellent” and “terrible” are just two unrelated dimensions, no closer to or further from each other than “excellent” and “superb”. And the vocabulary curse scales with your data: more documents mean more unique words, which means higher-dimensional sparse vectors that consume memory and slow training.

Dense Embeddings: Semantic Representations

Sentence-transformer models compress text into dense vectors of fixed dimensionality (384 dimensions is typical) where geometric proximity encodes semantic similarity. “Battery dies quickly” and “short battery life” land in nearly the same region of the embedding space, despite sharing only the word “battery”.
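Surface overlap is exactly what a sparse representation measures, and a quick calculation shows how it can rank the two example pairs backwards. The sketch below uses raw word counts rather than TF-IDF weights (a simplifying assumption), and `bow_cosine` is a hypothetical helper written for this illustration.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw bag-of-words count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)  # Counter returns 0 for missing terms
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


# Opposite meanings, heavy word overlap.
opposite = bow_cosine("the battery is excellent", "the battery is terrible")
# Same meaning, almost no word overlap.
paraphrase = bow_cosine("battery dies quickly", "short battery life")

print(f"opposite meaning: {opposite:.2f}")
print(f"same meaning:     {paraphrase:.2f}")
```

The pair with opposite meanings scores higher (three shared words out of four) than the pair of paraphrases (one shared word out of three). A dense embedding is meant to reverse that ordering.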

The tradeoff is clear: TF-IDF gives you 50,000 interpretable sparse dimensions. Sentence-transformers give you 384 opaque dense dimensions. The dense representation captures meaning; the sparse representation captures vocabulary.

Head-to-Head: TF-IDF vs. Sentence-Transformers

Let’s build both pipelines and compare them on a text classification task:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sentence_transformers import SentenceTransformer


def tfidf_pipeline(
    texts: list[str],
    labels: np.ndarray,
    max_features: int = 20_000,
    ngram_range: tuple[int, int] = (1, 2),
) -> dict[str, object]:
    """TF-IDF + Logistic Regression baseline.

    Returns cross-validated accuracy and the feature dimensionality.
    """
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(
            max_features=max_features,
            ngram_range=ngram_range,
            strip_accents="unicode",
            min_df=3,
            max_df=0.95,
            sublinear_tf=True,  # apply log(1 + tf), dampens high-frequency terms
        )),
        ("clf", LogisticRegression(
            max_iter=1000,
            C=1.0,
            solver="saga",
            n_jobs=-1,
        )),
    ])
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring="accuracy")
    # Fit once to get vocabulary size
    pipe.fit(texts, labels)
    n_features = len(pipe.named_steps["tfidf"].vocabulary_)

    return {
        "method": "TF-IDF",
        "accuracy_mean": scores.mean(),
        "accuracy_std": scores.std(),
        "n_features": n_features,
    }


def embedding_pipeline(
    texts: list[str],
    labels: np.ndarray,
    model_name: str = "all-MiniLM-L6-v2",
    batch_size: int = 64,
) -> dict[str, object]:
    """Sentence-transformer embeddings + Logistic Regression.

    Returns cross-validated accuracy and the embedding dimensionality.
    """
    model = SentenceTransformer(model_name)
    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        normalize_embeddings=True,
    )

    clf = LogisticRegression(max_iter=1000, C=1.0, solver="saga", n_jobs=-1)
    scores = cross_val_score(clf, embeddings, labels, cv=5, scoring="accuracy")

    return {
        "method": f"Embeddings ({model_name})",
        "accuracy_mean": scores.mean(),
        "accuracy_std": scores.std(),
        "n_features": embeddings.shape[1],
    }


# Compare on synthetic review data
rng = np.random.default_rng(42)
positive_phrases = [
    "excellent product highly recommend",
    "works perfectly love it",
    "great quality fast shipping",
    "amazing value for money",
    "best purchase I have made",
]
negative_phrases = [
    "terrible quality broke immediately",
    "waste of money do not buy",
    "stopped working after a week",
    "poor build quality disappointed",
    "arrived damaged customer service unhelpful",
]

n_samples = 500
texts = []
labels = np.zeros(n_samples * 2, dtype=int)
for i in range(n_samples):
    base_pos = rng.choice(positive_phrases)
    base_neg = rng.choice(negative_phrases)
    # Add noise words to simulate real reviews
    noise = " ".join(rng.choice(["the", "very", "really", "so", "quite"], size=3))
    texts.append(f"{noise} {base_pos} {noise}")
    texts.append(f"{noise} {base_neg} {noise}")
    labels[i * 2] = 1      # positive
    labels[i * 2 + 1] = 0  # negative

tfidf_results = tfidf_pipeline(texts, labels)
print(f"TF-IDF:      accuracy={tfidf_results['accuracy_mean']:.3f} "
      f"± {tfidf_results['accuracy_std']:.3f}  "
      f"features={tfidf_results['n_features']:,}")

embed_results = embedding_pipeline(texts, labels)
print(f"Embeddings:  accuracy={embed_results['accuracy_mean']:.3f} "
      f"± {embed_results['accuracy_std']:.3f}  "
      f"features={embed_results['n_features']:,}")

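The dimensionality comparison is stark: on a real review corpus, TF-IDF might produce 8,000–20,000 features depending on vocabulary; all-MiniLM-L6-v2 produces exactly 384. (The synthetic corpus above has a vocabulary of only a few dozen words, so its TF-IDF feature count will be far smaller, and it is likely too easy to separate the two methods.) On datasets with fewer than 5,000 documents and clear keyword-driven distinctions, TF-IDF will often match or beat embeddings. On larger datasets with nuanced language (sarcasm, synonyms, negation), embeddings typically pull ahead.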

When TF-IDF Still Wins

Do not default to embeddings out of reflex. TF-IDF is the right choice when:

Your dataset is small. Under ~5,000 documents, a classifier built on embeddings may not have enough data to demonstrate its advantage, and TF-IDF’s simplicity is a virtue.
Interpretability is a requirement. Regulated industries (finance, healthcare) often require model explanations. “The word ‘asbestos’ contributed 0.34 to the risk score” is auditable. “Dimension 217 of the embedding vector contributed 0.02” is not.
You need speed at inference. TF-IDF vectorization is orders of magnitude faster than running text through a transformer. For real-time scoring of millions of documents, the speed difference matters.
Your vocabulary is domain-specific. Pretrained embeddings may underperform on specialized corpora (legal filings, patent claims, medical notes) unless fine-tuned. TF-IDF naturally adapts to any vocabulary.
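The interpretability argument rests on a simple fact: in a linear model over TF-IDF features, each term's contribution to the score is its TF-IDF weight times its learned coefficient. The sketch below shows that bookkeeping. The helper `term_contributions` and every number in it are made up for illustration; they do not come from a fitted model.

```python
def term_contributions(
    tfidf_vector: dict[str, float],
    coefficients: dict[str, float],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Rank terms by the absolute size of weight * coefficient."""
    contributions = {
        term: weight * coefficients.get(term, 0.0)
        for term, weight in tfidf_vector.items()
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]


# Hypothetical TF-IDF weights for one document and hypothetical model coefficients.
document_vector = {"asbestos": 0.68, "insulation": 0.41, "renovation": 0.35, "kitchen": 0.22}
model_coefficients = {"asbestos": 0.50, "insulation": 0.10, "renovation": -0.05, "kitchen": -0.20}

top_terms = term_contributions(document_vector, model_coefficients)
for term, contribution in top_terms:
    print(f"{term:>12}: {contribution:+.3f}")
```

Every line of that report names a word a human reviewer can check against the document, which is what makes it auditable.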

4.4 — Dimensionality Reduction

Your feature matrix has 5,000 columns. Some are target-encoded categoricals. Some are lag features at different horizons. Some are TF-IDF terms. The instinct is to feed everything to the model and let it sort out what matters.

For tree-based models (gradient-boosted trees, random forests), this instinct is often correct — trees perform implicit feature selection by splitting on the most informative features and ignoring the rest. For linear models, distance-based models (KNN, SVM), and neural networks, high dimensionality is a direct problem.

The Curse of Dimensionality

The phrase “curse of dimensionality” is thrown around without enough precision. Here is the geometric argument that makes it concrete.

Consider a unit hypercube in $d$ dimensions: each side from 0 to 1. If you want a sample of points that “covers” the space — that represents all regions of the feature space well enough for your model to learn local patterns — the number of points you need grows exponentially with $d$.

In 2 dimensions, a 10×10 grid (100 points) covers the unit square reasonably. In 10 dimensions, a 10×10×…×10 grid requires $10^{10}$ = 10 billion points. In 100 dimensions, you need $10^{100}$ points — more than the number of atoms in the observable universe.
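The arithmetic is easy to check. This sketch uses two hypothetical helpers, `grid_points` and `max_cell_coverage`, and an illustrative dataset size of 50,000 rows to show how quickly a fixed sample stops covering the grid.

```python
def grid_points(n_dims: int, points_per_axis: int = 10) -> int:
    """Number of points in a regular grid with points_per_axis per dimension."""
    return points_per_axis ** n_dims


def max_cell_coverage(n_rows: int, n_dims: int, points_per_axis: int = 10) -> float:
    """Upper bound on the fraction of grid positions that n_rows can occupy."""
    return min(1.0, n_rows / grid_points(n_dims, points_per_axis))


n_rows = 50_000  # illustrative dataset size
for d in (2, 10, 100):
    print(f"d={d:>3}: grid points = 10^{d}, "
          f"best-case coverage = {max_cell_coverage(n_rows, d):.1e}")
```

Even in the best case, where every row lands in a different grid position, the covered fraction collapses once the dimension exceeds a handful.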

With a fixed dataset of, say, 50,000 rows, increasing dimensionality from 50 to 5,000 means each data point is surrounded by vastly more empty space. Distances between points converge — the nearest neighbor and the farthest neighbor become nearly equidistant. Models that rely on distance or density (KNN, kernel SVM, Gaussian mixture models) break down entirely.
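Distance concentration can be observed directly. The sketch below is a small experiment under stated assumptions: points drawn uniformly from the unit hypercube, Euclidean distance, and a hypothetical helper `relative_contrast` that measures how much farther the farthest point is than the nearest one.

```python
import numpy as np


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from one query to uniform random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return float((distances.max() - distances.min()) / distances.min())


contrasts = {d: relative_contrast(1_000, d) for d in (2, 10, 100, 1_000)}
for d, contrast in contrasts.items():
    print(f"d={d:>5}: relative contrast = {contrast:.2f}")
```

As the dimension grows, the contrast shrinks toward zero: the nearest and farthest neighbors sit at nearly the same distance, so “nearest” carries less and less information.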

The practical consequence: beyond a critical number of features relative to your sample size, adding more features degrades model performance even if those features contain genuine signal. The noise they introduce overwhelms the model’s ability to separate signal from noise.
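A small experiment illustrates the simpler half of this claim, where the extra features are pure noise. The sketch assumes a two-class Gaussian problem with 5 informative features and a 1-nearest-neighbor classifier implemented directly in NumPy; `one_nn_accuracy` and all sizes are illustrative choices.

```python
import numpy as np


def one_nn_accuracy(n_noise: int, seed: int = 0) -> float:
    """Test accuracy of 1-NN with 5 informative features plus n_noise noise features."""
    rng = np.random.default_rng(seed)
    n_train, n_test, n_signal = 500, 200, 5

    def sample(n: int) -> tuple[np.ndarray, np.ndarray]:
        y = rng.integers(0, 2, size=n)
        # Informative columns: class means at -1 and +1, unit variance.
        signal = rng.normal(size=(n, n_signal)) + (2 * y[:, None] - 1)
        noise = rng.normal(size=(n, n_noise))  # unrelated to y
        return np.hstack([signal, noise]), y

    X_train, y_train = sample(n_train)
    X_test, y_test = sample(n_test)

    # Squared Euclidean distance between every test and training point.
    d2 = ((X_test ** 2).sum(axis=1)[:, None]
          - 2 * X_test @ X_train.T
          + (X_train ** 2).sum(axis=1)[None, :])
    nearest = d2.argmin(axis=1)
    return float((y_train[nearest] == y_test).mean())


acc_clean = one_nn_accuracy(n_noise=0)
acc_noisy = one_nn_accuracy(n_noise=500)
print(f"5 informative features only:     {acc_clean:.3f}")
print(f"5 informative + 500 noise feats: {acc_noisy:.3f}")
```

The informative columns are identical in both runs; only the number of uninformative dimensions changes, and the distance-based classifier loses accuracy because noise dominates every distance it computes.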

PCA: Linear Projection

Principal Component Analysis finds the orthogonal directions of maximum variance in your feature space and projects your data onto the top $k$ of those directions. If the first 50 principal components explain 95% of the variance, you can discard the remaining 4,950 dimensions while giving up only 5% of the variance. Whether that 5% matters depends on where the predictive signal lives.
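Written out, the variance bookkeeping looks like this. The notation is an assumption introduced here for illustration: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ are the eigenvalues of the covariance matrix of the $p$ input features, and $R(k)$ is the share of variance retained by the top $k$ components.

$$
R(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j},
\qquad
k^{*} = \min \{\, k : R(k) \ge 0.95 \,\}
$$

In the example above, $p = 5{,}000$ and $k^{*} = 50$, so the discarded components account for $1 - R(50) = 0.05$ of the total variance.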

PCA makes a strong assumption: the directions of maximum variance are the directions of maximum predictive signal. This is often approximately true — features that vary more tend to carry more discriminative information. It fails when the signal lives in dimensions of low variance (rare but important patterns).
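The failure case is easy to construct. In this sketch (an illustrative setup, not the chapter's dataset), one column has large variance and no relationship to the label, while a second column has small variance and determines the label almost perfectly. PCA is run via SVD on centered, deliberately unscaled data so that the variance gap is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
y = rng.integers(0, 2, size=n)

# Assumption: one loud column unrelated to y, one quiet column that determines y.
x_loud = rng.normal(0.0, 10.0, size=n)
x_quiet = (2 * y - 1) + rng.normal(0.0, 0.1, size=n)
X = np.column_stack([x_loud, x_quiet])

# PCA via SVD on centered data; rows of `components` are the principal directions.
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
pc1_scores = X_centered @ components[0]


def sign_accuracy(scores: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy of thresholding at zero, allowing either sign orientation."""
    acc = float(((scores > 0).astype(int) == labels).mean())
    return max(acc, 1.0 - acc)


acc_pc1 = sign_accuracy(pc1_scores, y)
acc_quiet = sign_accuracy(x_quiet, y)
print(f"variance explained by PC1: {explained_ratio[0]:.3f}")
print(f"accuracy using PC1 only:   {acc_pc1:.3f}")
print(f"accuracy using x_quiet:    {acc_quiet:.3f}")
```

The first component captures nearly all the variance and almost none of the signal. Keeping only that component would look excellent on an explained-variance report and perform near chance on the task.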

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def pca_compression_pipeline(
    X: np.ndarray,
    y: np.ndarray,
    variance_threshold: float = 0.95,
) -> dict:
    """PCA-based feature compression with explained variance analysis.

    Finds the minimum number of components that explain the target
    variance threshold, then evaluates a downstream classifier.

    Args:
        X: Feature matrix (n_samples, n_features).
        y: Target labels.
        variance_threshold: Cumulative explained variance to retain.

    Returns:
        Dictionary with component count, explained variance, and accuracy.
    """
    # Step 1: Fit full PCA to analyze explained variance
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    pca_full = PCA().fit(X_scaled)
    cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

    # Find minimum components for threshold
    n_components = int(np.searchsorted(cumulative_variance, variance_threshold) + 1)

    # Step 2: Build pipeline with selected components
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n_components)),
        ("clf", LogisticRegression(max_iter=1000, C=1.0)),
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

    # Step 3: Compare to full-dimensional model
    full_pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000, C=1.0)),
    ])
    full_scores = cross_val_score(full_pipe, X, y, cv=5, scoring="accuracy")

    return {
        "original_features": X.shape[1],
        "pca_components": n_components,
        "variance_explained": cumulative_variance[n_components - 1],
        "pca_accuracy": scores.mean(),
        "pca_accuracy_std": scores.std(),
        "full_accuracy": full_scores.mean(),
        "full_accuracy_std": full_scores.std(),
        "compression_ratio": X.shape[1] / n_components,
    }


# Demonstrate: 500 features, but only ~30 carry real signal
rng = np.random.default_rng(42)
n_samples = 2_000
n_informative = 30
n_noise = 470
n_features = n_informative + n_noise

# Informative features: correlated with target
true_weights = rng.standard_normal(n_informative)
X_informative = rng.standard_normal((n_samples, n_informative))
logits = X_informative @ true_weights
y = (logits > np.median(logits)).astype(int)

# Noise features: no relationship to target
X_noise = rng.standard_normal((n_samples, n_noise))
X = np.hstack([X_informative, X_noise])

results = pca_compression_pipeline(X, y, variance_threshold=0.95)

print(f"Original features:     {results['original_features']}")
print(f"PCA components (95%):  {results['pca_components']}")
print(f"Compression ratio:     {results['compression_ratio']:.1f}x")
print(f"Variance explained:    {results['variance_explained']:.3f}")
print(f"PCA accuracy:          {results['pca_accuracy']:.3f} ± {results['pca_accuracy_std']:.3f}")
print(f"Full accuracy:         {results['full_accuracy']:.3f} ± {results['full_accuracy_std']:.3f}")

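When PCA compresses a feature matrix to a small number of components and the accuracy improves, you have direct evidence that the discarded directions were hurting the model. Do not expect that outcome from the synthetic data above, though: its 500 columns are independent with equal variance, so there is no low-dimensional variance structure for PCA to find. Reaching 95% of the variance takes several hundred components, and because the 30 informative columns vary no more than the 470 noise columns, PCA cannot tell them apart. PCA removes noise only when the signal is concentrated in correlated, high-variance directions.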

The scree plot (explained variance per component, usually read alongside the cumulative curve) is your primary diagnostic tool. Look for an “elbow” where adding more components yields diminishing variance returns. When one exists, that elbow is a reasonable estimate of your data’s natural dimensionality.
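Whether an elbow exists at all depends on the covariance structure of the data. The sketch below compares two synthetic matrices of the same shape: one with 500 independent columns, and one generated from 30 latent factors plus a little noise. The helper `components_for_variance` and both datasets are assumptions made for illustration; only NumPy is used.

```python
import numpy as np


def components_for_variance(X: np.ndarray, threshold: float = 0.95) -> int:
    """Smallest number of principal components whose cumulative variance >= threshold."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratios = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, threshold) + 1)


rng = np.random.default_rng(0)
n_samples, n_features, n_latent = 2_000, 500, 30

# Case 1: independent columns, no shared structure.
X_independent = rng.standard_normal((n_samples, n_features))

# Case 2: 30 latent factors mixed into 500 columns, plus small noise.
latent = rng.standard_normal((n_samples, n_latent))
mixing = rng.standard_normal((n_latent, n_features))
X_low_rank = latent @ mixing + 0.1 * rng.standard_normal((n_samples, n_features))

k_independent = components_for_variance(X_independent)
k_low_rank = components_for_variance(X_low_rank)
print(f"independent columns: {k_independent} components for 95% variance")
print(f"30 latent factors:   {k_low_rank} components for 95% variance")
```

The low-rank matrix reaches the threshold with no more than its 30 underlying factors, which is the sharp elbow the diagnostic is looking for. The independent matrix has no elbow, and its cumulative curve climbs slowly across most of the 500 components.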

UMAP: Non-Linear Dimensionality Reduction

PCA assumes the important structure in your data is linear. When your data lives on a curved manifold — clusters, spirals, non-linear separations — PCA flattens that structure away.

UMAP (Uniform Manifold Approximation and Projection) aims to preserve the neighborhood structure of high-dimensional data: clusters tend to remain clusters, and the relative arrangement of clusters is usually preserved better than with t-SNE, although distances between clusters in the embedding should still be read with caution. It is a standard tool for visualizing high-dimensional data and can also create non-linear features for downstream models.

import numpy as np
from umap import UMAP
from sklearn.datasets import make_classification


def umap_feature_extraction(
    X: np.ndarray,
    y: np.ndarray | None = None,
    n_components: int = 2,
    n_neighbors: int = 15,
    min_dist: float = 0.1,
    metric: str = "euclidean",
    seed: int = 42,
) -> tuple[np.ndarray, UMAP]:
    """Extract UMAP features for visualization or downstream modeling.

    For visualization, use n_components=2.
    For feature extraction, use n_components=5-50 and feed into
    a downstream classifier.

    Args:
        X: High-dimensional feature matrix.
        y: Optional labels (used for supervised UMAP).
        n_components: Output dimensionality.
        n_neighbors: Controls local vs. global structure.
            Small values (5-15) emphasize local clusters.
            Large values (50-200) preserve more global structure.
        min_dist: Controls point packing in the embedding.
            Smaller values create tighter clusters.
        metric: Distance metric for the input space.
        seed: Random seed, passed to UMAP as random_state.

    Returns:
        Tuple of (embedded array, fitted UMAP model).
    """
    reducer = UMAP(
        n_components=n_components,
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        metric=metric,
        random_state=seed,
    )

    if y is not None:
        # Supervised UMAP: uses label information to guide the embedding
        embedding = reducer.fit_transform(X, y)
    else:
        embedding = reducer.fit_transform(X)

    return embedding, reducer


# Generate high-dimensional data with cluster structure
X_hd, y_hd = make_classification(
    n_samples=3_000,
    n_features=200,
    n_informative=20,
    n_redundant=30,
    n_clusters_per_class=3,
    n_classes=4,
    random_state=42,
)

# 2D UMAP for visualization
embedding_2d, reducer_2d = umap_feature_extraction(
    X_hd, y_hd, n_components=2, n_neighbors=30
)
print(f"UMAP 2D embedding shape: {embedding_2d.shape}")
print(f"Original shape: {X_hd.shape}")

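# Higher-dimensional UMAP as features for a downstream model.
# Unsupervised here; when evaluating a classifier, fit the reducer on the
# training split only and call reducer.transform() on held-out data.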
embedding_10d, reducer_10d = umap_feature_extraction(
    X_hd, n_components=10, n_neighbors=30
)
print(f"UMAP 10D feature shape:  {embedding_10d.shape}")

t-SNE vs. UMAP

Both t-SNE and UMAP produce 2D visualizations of high-dimensional data. UMAP is generally preferred for three reasons:

Speed. In practice UMAP scales close to $O(n \log n)$, versus $O(n^2)$ for exact t-SNE (Barnes-Hut t-SNE is also $O(n \log n)$, but with larger constants). On datasets of around 100,000 points, UMAP is typically several times faster.
Global structure. t-SNE preserves local neighborhoods but distorts global structure — the distances between clusters in a t-SNE plot are meaningless. UMAP better preserves the relative positions of clusters.
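Stability and reproducibility. Both methods are stochastic, and both give reproducible output when the seed is fixed (in umap-learn, setting random_state also disables multithreading). The practical difference is sensitivity: t-SNE layouts can change substantially with perplexity and initialization, while UMAP layouts tend to be more consistent across reasonable settings.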

The one case where t-SNE is defensible: when you care exclusively about local neighborhood structure and your dataset is small enough (~10,000 points) that speed is not a concern.

When NOT to Reduce Dimensions

Dimensionality reduction is not always the answer. Tree-based models (XGBoost, LightGBM, CatBoost, random forests) perform implicit feature selection at every split. They evaluate each feature independently and choose the most informative one. High-dimensional sparse features do not cause the same geometric problems for trees as they do for distance-based or linear models.

In practice:

Linear models, KNN, SVM, neural networks: Dimensionality reduction often helps. These models are sensitive to noise features and suffer from the distance convergence problem in high dimensions.
Gradient-boosted trees, random forests: Dimensionality reduction rarely helps and can hurt. By projecting features into principal components, you destroy the interpretable feature boundaries that trees exploit. A tree can split on “zip_code_target_enc > 0.15” — a meaningful decision boundary. It cannot meaningfully split on “PC_37 > 0.42”.
Ensemble approaches: If you are stacking models (Chapter 8), you might use PCA-reduced features for the linear model in your stack and raw features for the tree-based model. Let each model family work with the representation it handles best.

Model Family	Handles High Dims	Benefits from PCA	Benefits from UMAP
Logistic Regression	Poorly (regularization required)	Yes	Sometimes
KNN / SVM (RBF)	Poorly (distance collapse)	Yes	Yes
Gradient-Boosted Trees	Well (implicit selection)	Rarely	Rarely
Neural Networks	Moderately (with dropout)	Sometimes	As preprocessing

The decision framework: if your model is distance-based or linear, and your feature-to-sample ratio exceeds ~1:10, consider PCA. If you need 2D visualization for exploratory analysis, use UMAP. If your model is tree-based and performing well, leave the feature space alone.
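The framework is simple enough to write down as a function. This is a sketch of the rule of thumb stated above, not a substitute for validation; the function name, the model-family labels, and the exact 0.1 cutoff are all hypothetical choices made for illustration.

```python
TREE_FAMILIES = {"gradient_boosted_trees", "random_forest"}
DISTANCE_OR_LINEAR = {"logistic_regression", "knn", "svm"}


def suggest_reduction(
    model_family: str,
    n_samples: int,
    n_features: int,
    purpose: str = "modeling",
) -> str:
    """Return a suggested reduction technique following the chapter's rule of thumb."""
    if purpose == "visualization":
        return "UMAP (2 components)"
    if model_family in TREE_FAMILIES:
        return "none"  # trees select features implicitly at each split
    if model_family in DISTANCE_OR_LINEAR and n_features / n_samples > 0.1:
        return "PCA"  # feature-to-sample ratio exceeds roughly 1:10
    return "none"


print(suggest_reduction("knn", n_samples=2_000, n_features=500))
print(suggest_reduction("gradient_boosted_trees", n_samples=2_000, n_features=500))
print(suggest_reduction("logistic_regression", n_samples=50_000, n_features=500))
print(suggest_reduction("knn", n_samples=2_000, n_features=500, purpose="visualization"))
```

Treat the output as a starting hypothesis: the only reliable test is to compare cross-validated performance with and without the reduction step.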

[Figure: PCA and UMAP projections of the same high-dimensional dataset]

The figure shows the same high-dimensional dataset through PCA and UMAP projections. PCA captures the global variance directions; UMAP reveals the cluster structure that PCA’s linear assumptions cannot recover. Neither is universally better: your choice depends on the downstream consumer of the reduced features.

The features and reduction techniques from this chapter (target encoding, temporal features, text embeddings, PCA, UMAP) compose into pipelines. The critical discipline is to validate each feature engineering step independently: check for leakage, measure the signal-to-noise contribution of each new feature, and resist the temptation to add features without evidence that they improve out-of-sample performance. A lean feature matrix with 30 validated features will usually outperform a bloated one with 3,000 unchecked features.