dbscan - Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package


This R package provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes:

Clustering

  • DBSCAN: Density-based spatial clustering of applications with noise.
  • HDBSCAN: Hierarchical DBSCAN with simplified hierarchy extraction.
  • OPTICS/OPTICSXi: Ordering points to identify the clustering structure, with Xi-based cluster extraction.
  • FOSC: Framework for Optimal Selection of Clusters for unsupervised and semi-supervised clustering of hierarchical cluster trees.
  • Jarvis-Patrick clustering
  • SNN Clustering: Shared Nearest Neighbor Clustering.

Outlier Detection

  • LOF: Local outlier factor algorithm.
  • GLOSH: Global-Local Outlier Score from Hierarchies algorithm.

Fast Nearest-Neighbor Search (using kd-trees)

  • kNN search
  • Fixed-radius NN search

The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search, and are typically faster than the native R implementations (e.g., dbscan in package fpc), or the implementations in WEKA, ELKI and Python's scikit-learn.
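The nearest-neighbor search listed above can also be called directly. The following is a minimal sketch, assuming the package's `kNN()` and `frNN()` functions and the iris data from the Usage section; the values `k = 5` and `eps = 0.4` are illustrative choices, not recommendations.

```r
library("dbscan")

# Numeric columns of iris, as in the Usage section.
x <- as.matrix(iris[, 1:4])

# k-nearest-neighbor search: nn$id holds neighbor indices and
# nn$dist the matching distances, one row per point.
nn <- kNN(x, k = 5)
dim(nn$id)
head(nn$dist, 3)

# Fixed-radius search: one vector of neighbor indices per point,
# so the vectors can differ in length.
fr <- frNN(x, eps = 0.4)
head(lengths(fr$id))
```

Points in dense regions have many neighbors within the radius, while isolated points have few or none.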

Installation

Stable CRAN version: install from within R with

install.packages("dbscan")

Current development version: download the package from AppVeyor or install it from GitHub (requires devtools).

library("devtools")
install_github("mhahsler/dbscan")

Usage

Load the package and use the numeric variables in the iris dataset

library("dbscan")
data("iris")
x <- as.matrix(iris[, 1:4])

Run DBSCAN

db <- dbscan(x, eps = .4, minPts = 4)
db
DBSCAN clustering for 150 objects.
Parameters: eps = 0.4, minPts = 4
The clustering contains 4 cluster(s) and 25 noise points.

 0  1  2  3  4 
25 47 38 36  4 

Available fields: cluster, eps, minPts
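The printed summary can also be reproduced from the `cluster` field, where id 0 marks noise. Here is a short sketch of how one might inspect the result; the comparison with `iris$Species` is only an illustration, since the species labels are not used for clustering.

```r
library("dbscan")

x <- as.matrix(iris[, 1:4])
db <- dbscan(x, eps = 0.4, minPts = 4)

# Cluster sizes; id 0 collects the noise points.
table(db$cluster)

# Pull out the rows that were labeled as noise.
noise <- x[db$cluster == 0, , drop = FALSE]
nrow(noise)

# Cross-tabulate clusters against the known species labels.
table(cluster = db$cluster, species = iris$Species)
```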

Visualize results (noise is shown in black)

pairs(x, col = db$cluster + 1L)

Calculate LOF (local outlier factor) and visualize (larger bubbles in the visualization have a larger LOF)

lof <- lof(x, k = 4)
pairs(x, cex = lof)
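Besides plotting, the LOF scores can be ranked to list the most outlying points. The sketch below assumes the same `lof(x, k = 4)` call as above; some package releases may name the neighborhood-size argument differently, so check `?lof` for the installed version. The variable name `scores` and the cutoff of five points are illustrative.

```r
library("dbscan")

x <- as.matrix(iris[, 1:4])

# One LOF score per point; larger values indicate stronger outliers.
scores <- lof(x, k = 4)

# Indices of the five points with the largest scores.
top <- order(scores, decreasing = TRUE)[1:5]

# Show those rows next to their scores.
cbind(iris[top, 1:4], lof = scores[top])
```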

Run OPTICS

opt <- optics(x, eps = 1, minPts = 4)
opt
OPTICS clustering for 150 objects.
Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = NA
Available fields: order, reachdist, coredist, predecessor, minPts, eps, eps_cl, xi

Extract DBSCAN-like clustering from OPTICS and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored)

opt <- extractDBSCAN(opt, eps_cl = .4)
plot(opt)
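The reachability plot is drawn from the `order` and `reachdist` fields listed in the OPTICS output. As a rough sketch, these fields and the extracted labels can be inspected directly; the name `res` is used here only to keep the extracted result separate from the original ordering.

```r
library("dbscan")

x <- as.matrix(iris[, 1:4])
opt <- optics(x, eps = 1, minPts = 4)

# Processing order of the points (a permutation of 1..150).
head(opt$order)

# Reachability distances arranged in processing order;
# the first entry is typically Inf.
head(opt$reachdist[opt$order])

# DBSCAN-like labels cut from the ordering at eps_cl = 0.4.
res <- extractDBSCAN(opt, eps_cl = 0.4)
table(res$cluster)
```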

Extract a hierarchical clustering using the Xi method (captures clusters of varying density)

opt <- extractXi(opt, xi = .05)
opt
plot(opt)

Run HDBSCAN (captures stable clusters)

hdb <- hdbscan(x, minPts = 4)
hdb
HDBSCAN clustering for 150 objects.
Parameters: minPts = 4
The clustering contains 2 cluster(s) and 0 noise points.

  1   2 
100  50 

Available fields: cluster, minPts, cluster_scores, membership_prob, outlier_scores, hc

Visualize the results as a simplified tree

plot(hdb, show_flat = TRUE)

Visualize how strongly each point belongs to its assigned cluster (points are shaded by membership probability)

colors <- mapply(function(col, i) adjustcolor(col, alpha.f = hdb$membership_prob[i]),
                 palette()[hdb$cluster + 1], seq_along(hdb$cluster))
plot(x, col = colors, pch = 20)
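The same fields can be examined numerically rather than through color. This is a small sketch using the `membership_prob` and `outlier_scores` fields listed in the HDBSCAN output; the 0.5 threshold and the name `weak` are arbitrary choices for illustration.

```r
library("dbscan")

x <- as.matrix(iris[, 1:4])
hdb <- hdbscan(x, minPts = 4)

# Points whose membership probability falls below an arbitrary threshold.
weak <- which(hdb$membership_prob < 0.5)
length(weak)

# Five points with the largest outlier (GLOSH) scores.
head(order(hdb$outlier_scores, decreasing = TRUE), 5)

# Distribution of membership probabilities per cluster.
tapply(hdb$membership_prob, hdb$cluster, summary)
```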

License

The dbscan package is licensed under the GNU General Public License (GPL) Version 3. The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with explicit permission granted by the original author, Erich Schubert.

Further Information

Maintainer: Michael Hahsler
