### Principal Component Analysis (PCA)

#### Introduction
Principal Component Analysis (PCA) is a widely used statistical technique for dimensionality reduction in machine learning and data analysis. By transforming a dataset of potentially correlated variables into a set of linearly uncorrelated variables called principal components, PCA reduces the complexity of high-dimensional data while retaining as much of its variance as possible.

#### How PCA Works
PCA involves several steps, each contributing to the goal of dimensionality reduction:

1. **Standardize the Data**:
- Ensure the dataset is standardized so that each feature has a mean of zero and a variance of one. Standardization is crucial because PCA is sensitive to the scale of the variables.

```python
from sklearn.preprocessing import StandardScaler

# X is the raw (n_samples, n_features) data matrix
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
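
For reference, standardization rescales each feature value \( x \) using that feature's mean \( \mu \) and standard deviation \( \sigma \):

$$
z = \frac{x - \mu}{\sigma}
$$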

2. **Covariance Matrix Computation**:
- Compute the covariance matrix to capture how the variables in the dataset vary together, i.e., their pairwise covariances around the mean.

```python
import numpy as np

# np.cov expects variables in rows, hence the transpose
covariance_matrix = np.cov(X_scaled.T)
```
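
For standardized data with \( n \) samples, this is equivalent to:

$$
\Sigma = \frac{1}{n - 1} X_{\text{scaled}}^\top X_{\text{scaled}}
$$

which matches the \( n - 1 \) normalization that `np.cov` uses by default.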

3. **Eigenvalues and Eigenvectors Calculation**:
- Calculate the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors define the directions of the new feature space, while the corresponding eigenvalues give the amount of variance captured along each direction (their importance).

```python
# eigh is suited to symmetric matrices such as the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
```
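
Each eigenpair satisfies the standard eigenvalue equation:

$$
\Sigma v_i = \lambda_i v_i
$$

where \( \lambda_i \) is the variance along the direction \( v_i \).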

4. **Sort Eigenvalues and Eigenvectors**:
- Sort the eigenvalues and their corresponding eigenvectors in descending order. The eigenvectors corresponding to the largest eigenvalues are the principal components.

```python
# Indices that sort the eigenvalues in descending order
idx = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, idx]
eigenvalues = eigenvalues[idx]
```

5. **Principal Components Selection**:
- Select the top \( k \) eigenvectors to form the projection matrix that maps the data onto the new feature subspace. A common way to choose \( k \) is shown after this snippet.

```python
k = 2 # for example, selecting the top 2 components
principal_components = eigenvectors[:, :k]
```
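
As an illustrative aside (not part of the original steps), the explained-variance ratio derived from the eigenvalues computed above is a common guide for choosing \( k \):

```python
# Fraction of the total variance captured by each component
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
```

Choosing \( k \) so that the cumulative ratio exceeds a threshold such as 0.95 is a common rule of thumb.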

6. **Transform the Data**:
- Project the standardized data onto the selected principal components to obtain the reduced representation.

```python
# Project the standardized data onto the top k components
X_pca = np.dot(X_scaled, principal_components)
```
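
As a quick sanity check (an illustrative addition, using the variables from the previous steps), the covariance matrix of the transformed data should be approximately diagonal, confirming that the components are linearly uncorrelated:

```python
# Off-diagonal entries should be close to zero
print(np.cov(X_pca.T))
```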

#### Applications of PCA
PCA is extensively used in various fields to simplify data analysis and visualization:

- **Image Compression**: Reducing the dimensionality of image data to store images with less memory.
- **Noise Reduction**: Filtering out noise from data by keeping only the most important components (see the sketch after this list).
- **Data Visualization**: Projecting high-dimensional data to 2D or 3D for easier visualization.
- **Feature Extraction**: Identifying the most significant features in a dataset for use in other machine learning models.
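
To illustrate the noise-reduction idea, here is a minimal sketch, assuming a synthetic noisy dataset `X_noisy` (a name introduced here purely for illustration); the data is projected onto its strongest component and then reconstructed in the original space:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic rank-1 signal (5 correlated features) plus Gaussian noise
rng = np.random.default_rng(0)
signal = np.linspace(0, 1, 100).reshape(-1, 1) @ np.ones((1, 5))
X_noisy = signal + 0.1 * rng.standard_normal((100, 5))

# Keep the single strongest component, then map back to the original space
pca = PCA(n_components=1)
X_denoised = pca.inverse_transform(pca.fit_transform(X_noisy))
```

Because the underlying signal here lies along a single direction, the reconstruction discards most of the noise while preserving the signal.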

#### Example of PCA in Python
Here’s an example demonstrating PCA using the `scikit-learn` library:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample data: the two features are perfectly correlated,
# so a single component captures all of the variance
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_scaled)

print("Original Data:\n", X)
print("Transformed Data:\n", X_pca)
```

#### Conclusion
Principal Component Analysis (PCA) is an indispensable tool for reducing the dimensionality of datasets while preserving as much variance as possible. It is particularly useful in exploratory data analysis and as a preprocessing step for other machine learning algorithms.