Topic
Principal component analysis (PCA) is a fundamental technique for reducing the dimensionality of data while preserving as much of its structure as possible. Through an orthogonal transformation it converts correlated variables into linearly uncorrelated principal components. It is useful for tasks such as data visualization and feature extraction.
Core idea
Given a data matrix X with n samples (rows) and p features (columns), PCA finds new axes (principal components) that maximize the variance of the projected data. These axes are the eigenvectors of the covariance matrix C = Xcᵀ Xc / (n - 1), where Xc is the mean-centered data. Each entry of C is the covariance between a pair of features, so the calculation is a direct generalization of the variance of a single variable.
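To make that concrete, here is a minimal numpy sketch (using a toy random X, not data from these notes) that builds the covariance matrix from the centered data and checks it against numpy's built-in np.cov:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # toy data: 100 samples, 3 features
X_c = X - X.mean(axis=0)                         # mean-center each feature
C = X_c.T @ X_c / (X.shape[0] - 1)               # sample covariance matrix
print(np.allclose(C, np.cov(X, rowvar=False)))   # True: matches numpy's built-in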
Key steps
- Center data by subtracting mean from each feature
- Compute covariance matrix of centered data
- Perform eigendecomposition of covariance matrix
- Sort eigenvectors by eigenvalues in descending order
- Select top eigenvectors as principal components
- Project data onto the new subspace using Y = Xc W, where W contains the selected eigenvectors as columns and Xc is the centered data
Implementation options
# Using numpy
import numpy as np

X_centered = X - X.mean(axis=0)                    # center each feature (X: samples x features)
cov_matrix = np.cov(X_centered, rowvar=False)      # feature-by-feature covariance matrix
eigenvals, eigenvecs = np.linalg.eigh(cov_matrix)  # eigh is the stable choice for symmetric matrices
order = np.argsort(eigenvals)[::-1]                # sort eigenvalues in descending order
eigenvals, eigenvecs = eigenvals[order], eigenvecs[:, order]
# project the centered data onto the principal components
P = X_centered @ eigenvecs
print(P)
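Building on the snippet above, the eigenvalues also tell you how much variance each component captures, which is one way to pick how many components to keep (the 95% threshold below is just an illustrative choice):

# fraction of total variance captured by each component
explained_ratio = eigenvals / eigenvals.sum()
# smallest k that retains at least 95% of the variance
k = int(np.searchsorted(np.cumsum(explained_ratio), 0.95)) + 1
P_k = X_centered @ eigenvecs[:, :k]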
# Using scikit-learn
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_transformed = pca.fit_transform(X)
# principal axes (eigenvectors) and the variance explained by each
print(pca.components_)
print(pca.explained_variance_)
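If both snippets are run on the same data matrix X (which these notes leave undefined), the two-component sklearn result should typically agree with the first two columns of the numpy projection up to a sign flip per component; a quick check:

# sanity check: columns match up to sign (assumes both snippets above used the same X)
print(np.allclose(np.abs(X_transformed), np.abs(P[:, :2])))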
Warning
PCA is sensitive to outliers and assumes linear relationships. For non-linear dimensionality reduction, consider manifold learning techniques.
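As one concrete option, scikit-learn's manifold module provides Isomap; a minimal sketch (the neighbor and component counts here are illustrative, not prescribed by these notes):

from sklearn.manifold import Isomap

iso = Isomap(n_neighbors=10, n_components=2)   # neighborhood graph + geodesic distances
X_iso = iso.fit_transform(X)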