Vectors, Dimensions, and Feature Spaces: The Geometric Foundation of Machine Learning

Samuel Akopyan frames machine learning as representing real-world objects as numbers that can then be processed mathematically. A vector is an ordered list of numbers in which each position describes one aspect of an object; a user, for example, can be described by age, purchase count, and order value.

Why This Matters

In production systems, feature engineering turns diverse data types such as strings and dates into numeric coordinates in a space of fixed dimensionality. Formal linear algebra supplies the theory, but in code a vector is best treated as a strict contract: if the dimensionality or element order drifts, or if features are left unscaled, the model ends up driven by noise or arbitrary numeric ranges rather than by informative signal.

Key Insights

Vector dimensionality is a fixed contract: a model that expects 10 features must receive exactly 10 numbers, in the agreed order, or the geometry it learned no longer applies.
Feature scaling is a practical necessity because many machine learning algorithms, especially distance-based and gradient-based ones, are sensitive to numeric scale; features with large values can dominate and drown out more informative ones.
Categorical data has to be converted to numbers, commonly via one-hot encoding, which turns one logical feature with k categories into k numeric coordinates and can quickly inflate the dimensionality of the space.
The ‘curse of dimensionality’ arises in high-dimensional spaces: volume grows exponentially with the number of dimensions, a fixed number of points becomes sparse, and distances between points become less informative because they tend to look alike.
Linear models split the feature space with a hyperplane: the sign of the linear function w · x + b determines which side of the hyperplane a point falls on, and therefore its class.
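
To make the one-hot encoding point concrete, here is a minimal sketch. The helper name `oneHot`, the `device` category list, and the sample values are assumptions chosen for illustration, not part of the original article.

```php
<?php
// Hypothetical helper: maps one categorical value to a one-hot vector.
// $categories is the fixed, ordered vocabulary agreed on at training time.
function oneHot(string $value, array $categories): array
{
    $index = array_search($value, $categories, true);
    if ($index === false) {
        throw new InvalidArgumentException("Unknown category: {$value}");
    }

    // One coordinate per category, all zeros except the matching position.
    $vector = array_fill(0, count($categories), 0.0);
    $vector[$index] = 1.0;

    return $vector;
}

// Assumed example vocabulary for a "device" feature.
$devices = ['desktop', 'mobile', 'tablet'];

// Three numeric features followed by the encoded category.
$userVector = array_merge([35.0, 12.0, 78.5], oneHot('mobile', $devices));

// One logical feature became three coordinates: 3 + 3 = 6 dimensions.
print_r($userVector);
```

A feature with hundreds of categories adds hundreds of coordinates in the same way, which is how one-hot encoding inflates dimensionality.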

Working Examples

A basic vector representation of a user in PHP.

    $userVector = [35, 12, 78.5]; // [age, purchase count, order value]

Enforcing dimensionality constraints in a prediction function.

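    function predict(array $features): float
    {
        // Reject any vector that breaks the 10-feature contract.
        if (count($features) !== 10) {
            throw new InvalidArgumentException("Expected a vector of dimensionality 10");
        }

        /* further computations */
        return 0.0; // placeholder so the declared float return type is satisfied
    }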

Normalizing a feature to a range of 0 to 1.

    function normalize(float $value, float $min, float $max): float
    {
        $range = $max - $min;
        // A constant feature has zero range; avoid dividing by zero.
        if ($range === 0.0) {
            return 0.0;
        }
        return ($value - $min) / $range;
    }
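
As a rough illustration of why this matters for distance-based methods, the sketch below applies min-max scaling column by column and compares Euclidean distances before and after. The helper names `minMaxScaleColumns` and `euclidean` and the three sample users are made up for this example.

```php
<?php
// Hypothetical helper: min-max scale each column (feature) of a dataset.
// Rows are objects, columns are features; assumes a non-empty, rectangular array.
function minMaxScaleColumns(array $rows): array
{
    $dims = count($rows[0]);
    $scaled = $rows;

    for ($j = 0; $j < $dims; $j++) {
        $column = array_column($rows, $j);
        $min = min($column);
        $range = max($column) - $min;

        foreach ($rows as $i => $row) {
            // A constant column carries no information; map it to 0.0.
            $scaled[$i][$j] = $range == 0 ? 0.0 : ($row[$j] - $min) / $range;
        }
    }

    return $scaled;
}

// Hypothetical helper: Euclidean distance between two equal-length vectors.
function euclidean(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $ai) {
        $sum += ($ai - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

// Made-up users: [age, order value].
$users = [[25.0, 1000.0], [60.0, 1010.0], [26.0, 1500.0]];

// Raw distances are driven almost entirely by order value.
$rawToSecond = euclidean($users[0], $users[1]); // sqrt(35^2 + 10^2), about 36.4
$rawToThird  = euclidean($users[0], $users[2]); // sqrt(1^2 + 500^2), about 500.0

// After scaling, a 35-year age gap and a 500-unit order gap weigh about the same.
$scaled = minMaxScaleColumns($users);
$scaledToSecond = euclidean($scaled[0], $scaled[1]);
$scaledToThird  = euclidean($scaled[0], $scaled[2]);

printf("raw: %.1f vs %.1f\n", $rawToSecond, $rawToThird);
printf("scaled: %.4f vs %.4f\n", $scaledToSecond, $scaledToThird);
```

In a real pipeline the minimum and maximum would normally be computed on the training data and reused for new inputs, rather than recomputed per batch.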

Standardizing features to have zero mean and unit standard deviation.

    function standardize(float $value, float $mean, float $std): float
    {
        // A constant feature has zero spread; avoid dividing by zero.
        if ($std == 0.0) {
            return 0.0;
        }
        return ($value - $mean) / $std;
    }
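
The mean and standard deviation have to come from somewhere, typically the training data. The sketch below shows one way to estimate them for a single feature column; `fitStandardizer` is a hypothetical name, the sample ages are invented, and the population form of the standard deviation (dividing by n) is an assumption.

```php
<?php
// Hypothetical helper: estimate mean and population standard deviation
// (dividing by n, not n - 1) for one feature column.
function fitStandardizer(array $column): array
{
    $n = count($column);
    if ($n === 0) {
        throw new InvalidArgumentException('Cannot fit on an empty column');
    }

    $mean = array_sum($column) / $n;

    $sumSquares = 0.0;
    foreach ($column as $value) {
        $sumSquares += ($value - $mean) ** 2;
    }

    return ['mean' => $mean, 'std' => sqrt($sumSquares / $n)];
}

// Made-up training values for an "age" feature.
$ages = [20.0, 30.0, 40.0];
$params = fitStandardizer($ages);

// Apply the fitted parameters; the same $params would be reused at prediction time.
$scaledAges = array_map(
    fn ($value) => $params['std'] == 0.0 ? 0.0 : ($value - $params['mean']) / $params['std'],
    $ages
);

print_r($scaledAges);
```

Reusing the fitted parameters for new inputs keeps training and prediction vectors in the same coordinate system.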

A linear model implementation computing a dot product with a bias.

    function linearModel(array $x, array $w, float $b): float
    {
        $n = count($x);
        if ($n !== count($w)) {
            throw new InvalidArgumentException('Arguments x and w must have the same length');
        }

        // Start from the bias and accumulate the dot product of x and w.
        $sum = $b;
        for ($i = 0; $i < $n; $i++) {
            $sum += $x[$i] * $w[$i];
        }
        return $sum;
    }
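
To connect the linear function to classification, the following sketch assigns a class from the sign of the score. The function name `classify`, the weights, the bias, and the convention that a score of exactly zero counts as the positive class are all assumptions made for illustration; the weights are hand-picked, not learned.

```php
<?php
// Hypothetical helper: classify a point by which side of the hyperplane
// w . x + b = 0 it falls on. Returns 1 or -1.
function classify(array $x, array $w, float $b): int
{
    if (count($x) !== count($w)) {
        throw new InvalidArgumentException('Arguments x and w must have the same length');
    }

    $score = $b;
    foreach ($x as $i => $xi) {
        $score += $xi * $w[$i];
    }

    // Assumed convention: a score of exactly 0 goes to the positive class.
    return $score >= 0 ? 1 : -1;
}

// Hand-picked parameters: the hyperplane is the line x1 = x2.
$w = [1.0, -1.0];
$b = 0.0;

$above = classify([3.0, 1.0], $w, $b); // score =  2, positive side
$below = classify([1.0, 3.0], $w, $b); // score = -2, negative side

echo $above, ' ', $below, PHP_EOL;
```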

Practical Applications

Use Case: Online store user profiling where vectors store age, purchases, and order value. Pitfall: Swapping the order of vector elements, which raises no error but causes the model to silently misinterpret the data.
Use Case: k-Nearest Neighbors (k-NN) classification based on Euclidean distance. Pitfall: Neglecting feature scaling, which causes features with larger numeric ranges to dominate the distance calculation.
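Use Case: High-dimensional text embeddings compared via cosine similarity. Pitfall: Using a magnitude-sensitive metric such as raw Euclidean distance when only direction matters, so vectors that point the same way but differ in length are treated as dissimilar; this can give misleading results in sparse, high-dimensional spaces.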
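
The difference between magnitude-based and directional comparison can be sketched with two small functions. The names `euclideanDistance` and `cosineSimilarity` and the two-dimensional sample vectors are illustrative assumptions; real embeddings have hundreds or thousands of dimensions.

```php
<?php
// Hypothetical helper: Euclidean distance, sensitive to vector length.
function euclideanDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $ai) {
        $sum += ($ai - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

// Hypothetical helper: cosine similarity, depends only on direction.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $ai) {
        $dot += $ai * $b[$i];
        $normA += $ai ** 2;
        $normB += $b[$i] ** 2;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        throw new InvalidArgumentException('Cosine similarity is undefined for a zero vector');
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Made-up vectors: $long points the same way as $base but is 10x longer;
// $tilted is close in length but points in a different direction.
$base   = [1.0, 2.0];
$long   = [10.0, 20.0];
$tilted = [2.0, 1.0];

// Euclidean distance ranks $tilted as the closer neighbor of $base...
printf("euclidean: long=%.2f tilted=%.2f\n",
    euclideanDistance($base, $long), euclideanDistance($base, $tilted));

// ...while cosine similarity ranks $long as the better match.
printf("cosine: long=%.2f tilted=%.2f\n",
    cosineSimilarity($base, $long), cosineSimilarity($base, $tilted));
```

Which ranking is right depends on whether vector length carries meaning in the data at hand.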
