Clustering in R

Clustering is an unsupervised learning technique used to group similar data points together. R provides several packages and functions to perform clustering. In this tutorial, we'll explore three popular clustering methods: k-means, hierarchical clustering, and DBSCAN.

1. K-means Clustering

K-means partitions the data into k non-overlapping groups (clusters), where k is chosen in advance. Each point is assigned to the cluster whose center (the mean of its points) is nearest.

Example using the kmeans function:

    # Simulate some data
    set.seed(123)
    data <- rbind(matrix(rnorm(100), ncol = 2),
                  matrix(rnorm(100, mean = 3), ncol = 2))

    # Apply k-means clustering
    clusters <- kmeans(data, centers = 2)

    # Plot the points colored by cluster and mark the cluster centers
    plot(data, col = clusters$cluster)
    points(clusters$centers, col = 1:2, pch = 8, cex = 2)
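Because k-means starts from randomly chosen centers, a single run can settle on a poor solution. The following is a minimal sketch, assuming the same simulated data as in the example and an arbitrary choice of nstart = 25 (the name fit is also just illustrative), of how to rerun the algorithm from several starts and inspect the fitted object:

```r
# Recreate the simulated data so this block runs on its own
set.seed(123)
data <- rbind(matrix(rnorm(100), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))

# nstart = 25 is an illustrative choice: kmeans() tries 25 random
# starts and keeps the run with the lowest total within-cluster SS
fit <- kmeans(data, centers = 2, nstart = 25)

fit$size          # number of points in each cluster
fit$centers       # one row of coordinates per cluster center
fit$tot.withinss  # total within-cluster sum of squares
```

Cluster numbers are arbitrary labels, so cluster 1 in one run may correspond to cluster 2 in another.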

2. Hierarchical Clustering

Hierarchical clustering builds a tree of clusters. You can visualize this tree using a dendrogram.

Example using the hclust function:

    # Compute the distance matrix (Euclidean by default)
    dist_matrix <- dist(data)

    # Hierarchical clustering (complete linkage by default)
    h_cluster <- hclust(dist_matrix)

    # Plot the dendrogram
    plot(h_cluster)

To cut the tree into k clusters:

groups <- cutree(h_cluster, k=2) 
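The shape of the tree depends on the linkage rule, which hclust takes through its method argument. As a rough sketch, assuming the simulated data from the k-means example (the object names hc_complete, hc_average, and hc_ward are illustrative), different linkages can be compared by cutting each tree and cross-tabulating the results:

```r
# Recreate the simulated data so this block runs on its own
set.seed(123)
data <- rbind(matrix(rnorm(100), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))
dist_matrix <- dist(data)

# hclust() uses complete linkage by default; "average" and "ward.D2"
# are two of the other values accepted by the method argument
hc_complete <- hclust(dist_matrix)
hc_average  <- hclust(dist_matrix, method = "average")
hc_ward     <- hclust(dist_matrix, method = "ward.D2")

# Cut each tree into two groups and look at the cluster sizes
table(cutree(hc_complete, k = 2))
table(cutree(hc_average, k = 2))

# Cross-tabulate two cuts to see where the linkages disagree
table(complete = cutree(hc_complete, k = 2),
      ward     = cutree(hc_ward, k = 2))
```

If most counts fall in two cells of the cross-table, the two linkages largely agree, up to a relabeling of the clusters.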

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

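DBSCAN groups together points that lie in dense regions. It takes two parameters: a neighborhood radius (eps) and a minimum number of points (minPts). A point with at least minPts points within distance eps, counting itself, is a core point. Points reachable from core points join the same cluster, and points that belong to no dense region are labeled as noise.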

Example using the dbscan package:

First, install and load the necessary package:

    install.packages("dbscan")
    library(dbscan)

Then apply DBSCAN:

    set.seed(123)
    db <- dbscan(data, eps = 0.5, minPts = 5)

    # Plot; noise points are labeled 0, so add 1 to give them a visible color
    plot(data, col = db$cluster + 1)
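The value eps = 0.5 is a starting guess. One common heuristic, sketched here under the assumption that the dbscan package is installed and using the same simulated data, is to sort each point's distance to its k-th nearest neighbor (with k = minPts - 1) and look for a knee in the curve. The names min_pts and knn_d are illustrative.

```r
library(dbscan)

# Recreate the simulated data so this block runs on its own
set.seed(123)
data <- rbind(matrix(rnorm(100), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))

min_pts <- 5  # same minPts as in the example

# Distance from each point to its (minPts - 1)-th nearest neighbor, sorted
knn_d <- sort(kNNdist(data, k = min_pts - 1))
plot(knn_d, type = "l",
     xlab = "Points sorted by distance", ylab = "4-NN distance")

# Noise points get label 0; count how many points land in each cluster
db <- dbscan(data, eps = 0.5, minPts = min_pts)
table(db$cluster)
```

A candidate eps is the distance at which the sorted curve bends sharply upward. Reading the knee off the plot is a judgment call, so it is worth trying a few nearby values.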

4. Determining the Number of Clusters

For k-means, the Elbow method is commonly used:

    # Total within-cluster sum of squares (WSS) for k = 1 to 10
    wss <- numeric(10)
    for (k in 1:10) {
      model <- kmeans(data, centers = k)
      wss[k] <- model$tot.withinss
    }
    plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "WSS")

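The 'elbow' of the curve is the point after which adding another cluster no longer produces a large drop in the within-cluster sum of squares (WSS). It suggests a reasonable number of clusters: a balance between how tightly the clusters fit the data and how many clusters are used.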
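When the bend is hard to judge by eye, the successive drops in WSS can be tabulated. This is only a sketch, assuming the simulated data from the k-means example; nstart = 10 and the name wss_drop are illustrative choices.

```r
# Recreate the simulated data so this block runs on its own
set.seed(123)
data <- rbind(matrix(rnorm(100), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))

# WSS for k = 1..10, using several random starts per k for stability
wss <- sapply(1:10, function(k) {
  kmeans(data, centers = k, nstart = 10)$tot.withinss
})

# Reduction in WSS gained by moving from k - 1 to k clusters
wss_drop <- -diff(wss)
data.frame(k            = 2:10,
           drop         = round(wss_drop, 1),
           pct_of_total = round(100 * wss_drop / wss[1], 1))
```

The elbow is roughly the last k whose drop is still large relative to the ones that follow.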

5. Evaluation

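For clustering, evaluation can be challenging, especially when true labels are not known. Silhouette analysis measures how well each point fits its assigned cluster by comparing its average distance to the points in its own cluster with its average distance to the points in the nearest other cluster. Silhouette widths range from -1 to 1, and values near 1 indicate compact, well-separated clusters.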

    library(cluster)
    silhouette_score <- silhouette(groups, dist_matrix)
    plot(silhouette_score)
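The per-point widths can be averaged into a single score, which makes it possible to compare different cuts of the same tree. A minimal sketch, assuming the simulated data and the default hclust linkage used earlier (the range k = 2 to 6 and the name avg_width are arbitrary):

```r
library(cluster)

# Recreate the simulated data and tree so this block runs on its own
set.seed(123)
data <- rbind(matrix(rnorm(100), ncol = 2),
              matrix(rnorm(100, mean = 3), ncol = 2))
dist_matrix <- dist(data)
h_cluster <- hclust(dist_matrix)

# Average silhouette width for cuts of the tree into k = 2..6 clusters
avg_width <- sapply(2:6, function(k) {
  sil <- silhouette(cutree(h_cluster, k = k), dist_matrix)
  mean(sil[, "sil_width"])
})
names(avg_width) <- 2:6
avg_width
```

A higher average width points to a better-separated clustering, though it is one indicator among several and not a definitive test.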

Summary:

Clustering is a powerful tool in unsupervised machine learning. R provides various methods and packages to perform clustering. Always ensure that you're preprocessing your data (e.g., scaling) appropriately and evaluating your clustering results using appropriate metrics or visual methods.

Examples

  1. How to Perform Clustering Analysis in R:

    # Load a sample dataset
    data(iris)

    # Select features for clustering
    features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

    # Standardize the features (if necessary)
    standardized_features <- scale(features)
  2. Unsupervised Learning for Clustering in R:

    # Using unsupervised learning for clustering
    kmeans_model <- kmeans(standardized_features, centers = 3)
    clusters_kmeans <- kmeans_model$cluster
  3. Popular Clustering Packages in R:

    # Popular clustering packages
    library(cluster)     # silhouette(), agnes()
    library(factoextra)  # fviz_cluster(), fviz_nbclust()
    library(dbscan)      # dbscan()
  4. K-Means Clustering in R:

    # Using k-means clustering
    kmeans_model <- kmeans(standardized_features, centers = 3)
    clusters_kmeans <- kmeans_model$cluster
  5. Hierarchical Clustering in R:

    # Using hierarchical clustering
    hierarchical_model <- hclust(dist(standardized_features))
    clusters_hierarchical <- cutree(hierarchical_model, k = 3)
  6. DBSCAN Clustering in R:

    # Using DBSCAN clustering (cluster 0 marks noise points)
    dbscan_model <- dbscan(standardized_features, eps = 0.5, minPts = 5)
    clusters_dbscan <- dbscan_model$cluster
  7. Agglomerative Clustering in R:

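    # Using agglomerative clustering (agnes() is in the cluster package)
    agglomerative_model <- agnes(standardized_features)

    # cutree() expects an hclust-style tree, so convert the agnes object first
    clusters_agglomerative <- cutree(as.hclust(agglomerative_model), k = 3)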
  8. Comparing Clustering Methods in R:

    # Compare the average silhouette width of k-means across candidate values of k
    fviz_nbclust(standardized_features, kmeans, method = "silhouette")
  9. Visualizing Clustering Results in R:

    # Visualizing clustering results
    fviz_cluster(list(data = standardized_features, cluster = clusters_kmeans))
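The cluster assignments from the examples can also be compared directly with base R tables. The sketch below assumes the standardized iris features from the first example. The seed and nstart value are arbitrary, and the species labels are used only as a reference, since clustering itself never sees them.

```r
# Rebuild the standardized features so this block runs on its own
data(iris)
standardized_features <- scale(iris[, 1:4])

set.seed(42)  # arbitrary seed, only for reproducibility
clusters_kmeans <- kmeans(standardized_features, centers = 3,
                          nstart = 25)$cluster
clusters_hierarchical <- cutree(hclust(dist(standardized_features)), k = 3)

# How often do the two methods put a flower in matching groups?
table(kmeans = clusters_kmeans, hierarchical = clusters_hierarchical)

# How do the k-means clusters line up with the known species?
table(species = iris$Species, kmeans = clusters_kmeans)
```

Cluster numbers are arbitrary, so agreement shows up as one dominant cell per row, not necessarily on the diagonal.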
