Topic
Principal component analysis (PCA) is a fundamental technique for reducing the dimensionality of data while preserving as much of its structure as possible. Through an orthogonal transformation it converts correlated variables into linearly uncorrelated principal components. It is useful for tasks such as data visualization and feature extraction.
Core idea
Given a data matrix X with n samples (rows) and p features (columns), PCA finds new axes (principal components) that maximize the variance of the projected data. These axes are the eigenvectors of the covariance matrix C = Xcᵀ Xc / (n - 1), where Xc is the mean-centered data. Each entry of C is the covariance between a pair of features, so the calculation is a direct generalization of the variance of a single variable.
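To make that concrete, here is a minimal numpy sketch (using a toy random X, not data from these notes) that builds the covariance matrix from the centered data and checks it against numpy's built-in np.cov:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # toy data: 100 samples, 3 features
X_c = X - X.mean(axis=0)                         # mean-center each feature
C = X_c.T @ X_c / (X.shape[0] - 1)               # sample covariance matrix
print(np.allclose(C, np.cov(X, rowvar=False)))   # True: matches numpy's built-in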
Key steps
- Center data by subtracting mean from each feature
- Compute covariance matrix of centered data
- Perform eigendecomposition of covariance matrix
- Sort eigenvectors by eigenvalues in descending order
- Select top eigenvectors as principal components
- Project data onto the new subspace using Y = Xc W, where W contains the selected eigenvectors as columns and Xc is the centered data
Implementation options
# Using numpy
import numpy as np

X_centered = X - X.mean(axis=0)                    # center each feature (X: samples x features)
cov_matrix = np.cov(X_centered, rowvar=False)      # feature-by-feature covariance matrix
eigenvals, eigenvecs = np.linalg.eigh(cov_matrix)  # eigh is the stable choice for symmetric matrices
order = np.argsort(eigenvals)[::-1]                # sort eigenvalues in descending order
eigenvals, eigenvecs = eigenvals[order], eigenvecs[:, order]
# project the centered data onto the principal components
P = X_centered @ eigenvecs
print(P)
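Building on the snippet above, the eigenvalues also tell you how much variance each component captures, which is one way to pick how many components to keep (the 95% threshold below is just an illustrative choice):

# fraction of total variance captured by each component
explained_ratio = eigenvals / eigenvals.sum()
# smallest k that retains at least 95% of the variance
k = int(np.searchsorted(np.cumsum(explained_ratio), 0.95)) + 1
P_k = X_centered @ eigenvecs[:, :k]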
# Using scikit-learn
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_transformed = pca.fit_transform(X)
# principal axes (eigenvectors) and the variance explained by each
print(pca.components_)
print(pca.explained_variance_)
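If both snippets are run on the same data matrix X (which these notes leave undefined), the two-component sklearn result should typically agree with the first two columns of the numpy projection up to a sign flip per component; a quick check:

# sanity check: columns match up to sign (assumes both snippets above used the same X)
print(np.allclose(np.abs(X_transformed), np.abs(P[:, :2])))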
Warning
PCA is sensitive to outliers and assumes linear relationships. For non-linear dimensionality reduction, consider manifold learning techniques.
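As one concrete option, scikit-learn's manifold module provides Isomap; a minimal sketch (the neighbor and component counts here are illustrative, not prescribed by these notes):

from sklearn.manifold import Isomap

iso = Isomap(n_neighbors=10, n_components=2)   # neighborhood graph + geodesic distances
X_iso = iso.fit_transform(X)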