PCA from Scratch
Dimensionality reduction via SVD in pure NumPy, validated against scikit-learn
Introduction
Most datasets have more features than they need. Some features are correlated, some are near-linear combinations of others, and some are just noise. PCA (Principal Component Analysis) finds a new coordinate system where the axes — called principal components — are ordered by how much variance they capture. Projecting data onto the top k components gives you the most information possible in k dimensions.
The notebook implements PCA from scratch using only NumPy, then validates it numerically against scikit-learn. The dataset is synthetic: 500 samples and 10 features, of which 5 are informative, 3 are redundant linear combinations, and 2 are pure noise. PCA's job is to find the informative directions and deprioritize the rest.
One thing that catches people off guard: you must standardize the data before running PCA. Without it, features with larger numeric scales dominate the principal components regardless of their actual information content. A feature measured in thousands of dollars will dwarf one measured in single-digit probabilities, even if the probability feature is far more predictive. Standardizing to zero mean and unit variance per feature puts everyone on equal footing before the decomposition runs.
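The scale-dominance problem is easy to demonstrate. The sketch below builds a hypothetical two-feature dataset (dollar amounts vs. probabilities; the feature names and scales are illustrative, not from the notebook) and shows that nearly all of the raw variance comes from the dollar column until the data is standardized:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-feature example: dollars (scale ~10,000s) vs. probability (scale ~0.1)
dollars = rng.normal(50_000, 10_000, size=500)
prob = rng.normal(0.5, 0.1, size=500)
X = np.column_stack([dollars, prob])

# Without standardization, the dollar column accounts for essentially all variance,
# so it would dominate the principal components regardless of predictive value
var_share = X.var(axis=0) / X.var(axis=0).sum()

# Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After standardization each column contributes variance 1, so the decomposition is driven by correlation structure rather than by units.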
The Math & Implementation
PCA is a six-step process:
- Center: Subtract the column means so the data cloud sits at the origin: X_c = X - mean(X).
- Covariance matrix: Compute C = X_c^T X_c / (n-1). Each entry C[i,j] is the covariance between feature i and feature j. High off-diagonal values mean those features are correlated and PCA will collapse them into fewer components.
- Eigen-decomposition: Factorize C into eigenvectors (the principal directions) and eigenvalues (the variance along each direction). The first eigenvector points toward maximum variance.
- Sort: Order eigenvectors by descending eigenvalue so the first component always captures the most variance.
- Project: Multiply the centered data by the top-k eigenvectors: Z = X_c @ V_k. The result has shape (n_samples, k).
- Reconstruct: Map back to original space with X_hat = Z @ V_k^T + mean. The difference between X and X_hat is the reconstruction error, the information that was dropped.
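The six steps can be sketched directly in NumPy. This version goes through the covariance matrix with `np.linalg.eigh` for clarity; it is a minimal illustration, not the notebook's implementation (which takes the SVD route described next):

```python
import numpy as np

def pca_project_reconstruct(X, k):
    # 1. Center: subtract column means
    mean = X.mean(axis=0)
    X_c = X - mean
    # 2. Covariance matrix: C = X_c^T X_c / (n-1)
    C = X_c.T @ X_c / (X.shape[0] - 1)
    # 3. Eigen-decomposition (eigh, since C is symmetric)
    eigvals, eigvecs = np.linalg.eigh(C)
    # 4. Sort by descending eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. Project onto the top-k eigenvectors
    V_k = eigvecs[:, :k]
    Z = X_c @ V_k                      # shape (n_samples, k)
    # 6. Reconstruct back in the original space
    X_hat = Z @ V_k.T + mean
    return Z, X_hat, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
Z, X_hat, eigvals = pca_project_reconstruct(X, k=10)
```

With k equal to the full dimensionality, the reconstruction is exact up to floating-point error, which is a useful sanity check.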
The implementation uses np.linalg.svd on the centered data matrix directly rather than computing the covariance and then calling np.linalg.eigh. Both routes produce the same eigenvectors, but SVD is numerically more stable because it avoids squaring the data matrix. The relationship between them: the right singular vectors of X_c are the eigenvectors of C, and eigenvalues relate to singular values by lambda_i = sigma_i^2 / (n-1).
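The lambda_i = sigma_i^2 / (n-1) relationship is easy to check numerically. A small sketch on random data (not the notebook's code):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
X_c = X - X.mean(axis=0)
n = X_c.shape[0]

# SVD of the centered data: rows of Vt are the principal directions,
# singular values S come out in descending order
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)

# Eigenvalues of the covariance matrix (eigvalsh returns ascending; reverse them)
C = X_c.T @ X_c / (n - 1)
eigvals = np.linalg.eigvalsh(C)[::-1]

# lambda_i = sigma_i^2 / (n - 1) should hold entry by entry
```

Note that the eigenvector route squares the condition number of the data (via X_c^T X_c), which is exactly why SVD on X_c directly is the numerically safer choice.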
The class exposes the same API as sklearn — fit, transform, fit_transform, inverse_transform — plus components_, explained_variance_, explained_variance_ratio_, and singular_values_. This makes side-by-side validation straightforward.
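A minimal class with that attribute surface might look like the following. This is an illustrative sketch mirroring sklearn's naming conventions, not the notebook's actual class:

```python
import numpy as np

class ScratchPCA:
    """Illustrative PCA via SVD, mimicking sklearn's attribute names."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        X_c = X - self.mean_
        U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
        n, k = X.shape[0], self.n_components
        self.components_ = Vt[:k]                        # (k, n_features)
        self.singular_values_ = S[:k]
        self.explained_variance_ = S[:k] ** 2 / (n - 1)  # lambda_i = sigma_i^2 / (n-1)
        self.explained_variance_ratio_ = (
            self.explained_variance_ / (S ** 2 / (n - 1)).sum()
        )
        return self

    def transform(self, X):
        return (X - self.mean_) @ self.components_.T

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, Z):
        return Z @ self.components_ + self.mean_

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 8))
pca = ScratchPCA(n_components=3)
Z = pca.fit_transform(X)
```

Matching the attribute names is what makes the side-by-side validation a one-liner per quantity.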
Results & Notebook
Reducing from 10 dimensions to 2: PC1 captures 30.6% of variance, PC2 captures 20.3%, for a combined 51.0%. Despite PCA being completely unsupervised — it never sees the class labels — the 2D projection shows clear separation between all three classes. The informative features happened to align well with the directions of maximum variance, which is not guaranteed in general but is common when the informative features genuinely drive most of the data's spread.
Numerical validation: explained variance, variance ratios, and projections all match sklearn to within 1e-6. Eigenvectors may differ by sign (negating a direction is still the same axis), so the check is done on absolute values of the projections. All three comparisons return True.
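The sign ambiguity can be demonstrated without sklearn at all: negating a principal direction negates its scores but describes the same axis, so absolute values of the projections are the stable quantity to compare. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X_c = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_c, full_matrices=False)

v = Vt[0]            # first principal direction
z_pos = X_c @ v      # scores along +v
z_neg = X_c @ (-v)   # scores along -v: same axis, flipped sign
```

Both sign choices are valid SVD outputs, which is why two correct implementations can disagree on sign while agreeing on everything that matters.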
The reconstruction error analysis is the most instructive part. At k=2, relative error is 49% — half the information is lost. At k=5, it drops to around 25%. At k=10 (all components), error is effectively zero, confirming the math is self-consistent. The scree plot shows the elbow around PC3, which is a practical signal that 2 or 3 components are the natural choice for this dataset. For downstream ML tasks where accuracy matters more than visualization, the cumulative variance curve suggests 7 components to exceed 95% retained variance.
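The relative reconstruction error can be computed as a Frobenius-norm ratio between the dropped residual and the centered data. The sketch below runs on plain Gaussian data, so the exact percentages differ from the notebook's dataset, but the monotone decrease with k and the near-zero error at full rank hold in general:

```python
import numpy as np

def relative_reconstruction_error(X, k):
    # Project onto the top-k principal directions and measure what is lost
    mean = X.mean(axis=0)
    X_c = X - mean
    _, _, Vt = np.linalg.svd(X_c, full_matrices=False)
    V_k = Vt[:k].T
    X_hat = X_c @ V_k @ V_k.T + mean
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X_c)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
errors = [relative_reconstruction_error(X, k) for k in (2, 5, 10)]
```

Squaring this ratio gives the fraction of variance discarded, i.e. one minus the cumulative explained-variance ratio at k.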