A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapReduce Capability

Paper presentation at the International Conference on Social Networking and Computational Intelligence (Paper ID: 173)
Kamlesh Kumar Pandey, Research Scholar
Dept. of Computer Science & Applications, Dr. HariSingh Gour Vishwavidyalaya (A Central University), Sagar, M.P.
E-mail: kamleshamk@gmail.com

Content
• Objectives
• Big Data
• Big Data Mining
• Clustering Taxonomy
• Analysis of Clustering Algorithm for Big Data Mining
• Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data
• Proposed MapReduce Framework for the Clustering Algorithm
• Experimental

Objectives
• The objective of this study is to identify traditional clustering algorithms suited to big data with respect to volume, variety, and velocity, and to build a common executable framework for clustering algorithms using the MapReduce approach for big data mining.

Big Data
• Technology is growing very fast. Organizations, industries, and individuals are moving towards the Internet of Things, cloud computing, wireless sensor networks, social media, and the internet. These sources generate data that grows every second, minute, or hour, reaching terabytes or petabytes in size.
• Diebold et al. (2000) were the first to discuss the term Big Data in a research paper. According to these authors, a data set larger than a gigabyte is known as Big Data.
• Doug Laney (2001) was the first person to give a proper definition of Big Data. He gave three characteristics, Volume, Variety, and Velocity, which are known as the 3 V's of Big Data management. If traditional data meets two of these basic characteristics at the same time, it comes under Big Data.
• Gartner (2012): "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

Big Data V's
• At present, seven V's are used for Big Data. The first three, Volume, Variety, and Velocity, are the main characteristics of big data. The additional four, Veracity, Variability, Value, and Visualization, depend on the organization.

Big Data Mining
• Big Data Mining means fetching requested information, uncovering hidden relationships or patterns, or extracting needed information or knowledge from datasets that meet the three V's of Big Data and have higher complexity.

Clustering
• Clustering is one of the approaches for analyzing and discovering complex relations, patterns, and data in the form of underlying groups of unlabeled objects. From the Big Data perspective, a clustering algorithm must deal with high volume, high variety, and high velocity in a scalable way.

Clustering Taxonomy
• Partitioning-based clustering: These methods divide the dataset into K partitions based on a distance function.
• Hierarchical-based clustering: Large data are organized hierarchically by means of proximity, which makes relationships between data points easy to detect.
• Density-based clustering: These methods divide the dataset based on regions of higher density in the data space.
• Grid-based clustering: The core idea is that the original data space is converted into a grid format, and the grid defines the cell size used for clustering.
• Model-based clustering: These methods divide the dataset based on models such as mathematical models and statistical distributions.
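The partitioning idea in the first bullet can be illustrated with a minimal sketch. It assumes Euclidean distance as the distance function and K = 2 fixed centers; the function name, the sample points, and the centers are hypothetical and chosen only for illustration.

```python
import math


def assign_to_partitions(points, centers):
    """Place each point in the partition of its nearest center (Euclidean distance)."""
    partitions = {i: [] for i in range(len(centers))}
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        partitions[nearest].append(p)
    return partitions


# Hypothetical sample data: four 2-D points and K = 2 centers.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.0, 1.5), (8.5, 9.0)]
partitions = assign_to_partitions(points, centers)
```

A different distance function (for example Manhattan distance) could be swapped in without changing the structure of the sketch.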

Analysis of Clustering Algorithm for Big Data Mining
• The design of clustering algorithms for big data mining needs criteria defined with respect to Volume, Velocity, and Variety, which increase the efficiency of the clustering.
• Volume-related criteria: the algorithm must deal with huge dataset size, high-dimensional data, and noisy data.
• Variety-related criteria: the algorithm must recognize the dataset type and the cluster shape.
• Velocity-related criteria: these define the complexity, scalability, and performance of the clustering algorithm during execution on a real dataset.

Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data
Category: Partition-based clustering
Column groups: Volume (dataset size, high-dimensional data, handling noisy data); Variety (dataset type, cluster shape); Velocity (scalability, time complexity)

Algorithm | Dataset size | High-dimensional data | Handling noisy data | Dataset type | Cluster shape | Scalability | Time complexity
K-Means | Large | No | High | Numerical | Convex | Medium | O(knt)
K-Medoids | Small | No | Low | Categorical | Convex | Low | O(k(n-k)^2)
PAM | Small | No | Low | Numerical | Convex | Low | O(k^3 * n^2)
CLARA | Large | No | Low | Numerical | Convex | High | O(ks^2 + k(n-k))
CLARANS | Large | No | Low | Numerical | Convex | Medium | O(n^2)

Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data (2)
Category: Hierarchical-based clustering
Column groups: Volume (dataset size, high-dimensional data, handling noisy data); Variety (dataset type, cluster shape); Velocity (scalability, time complexity)

Algorithm | Dataset size | High-dimensional data | Handling noisy data | Dataset type | Cluster shape | Scalability | Time complexity
BIRCH | Large | No | Low | Numerical | Convex | High | O(n)
CURE | Large | Yes | High | Numerical | Arbitrary | High | O(n^2 log n)
ROCK | Small | Yes | Low | Numerical/Categorical | Arbitrary | Medium | O(n^2 log n)
Chameleon | Small | No | Low | All data types | Arbitrary | High | O(n^2)
ECHIDNA | Large | No | Low | Multivariate | Convex | High | O(nb(1 + log_b m))

Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data (3)
Category: Density-based clustering
Column groups: Volume (dataset size, high-dimensional data, handling noisy data); Variety (dataset type, cluster shape); Velocity (scalability, time complexity)

Algorithm | Dataset size | High-dimensional data | Handling noisy data | Dataset type | Cluster shape | Scalability | Time complexity
DBSCAN | Large | No | Low | Numerical | Arbitrary | Medium | O(n log n)
OPTICS | Large | No | Low | Numerical | Arbitrary | Medium | O(n log n)
Mean-shift | Small | No | Low | Numerical | Arbitrary | Low | O(kernel)
DENCLUE | Large | Yes | High | Numerical | Arbitrary | Medium | O(log |d|)
GDBSCAN | Large | No | Low | Numerical | Arbitrary | Medium | -

Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data (4)
Category: Grid-based clustering
Column groups: Volume (dataset size, high-dimensional data, handling noisy data); Variety (dataset type, cluster shape); Velocity (scalability, time complexity)

Algorithm | Dataset size | High-dimensional data | Handling noisy data | Dataset type | Cluster shape | Scalability | Time complexity
STING | Large | Yes | Small | Spatial | Arbitrary | High | O(n)
CLIQUE | Small | Yes | Medium | Numerical | Convex | High | O(n + k^2)
WaveCluster | Large | No | High | Spatial | Arbitrary | Medium | O(n)
OptiGrid | Large | Yes | High | Spatial | Arbitrary | Medium | O(nd) to O(nd log n)
MAFIA | Large | No | High | Numerical | Arbitrary | High | O(c^p + pn)

Summarization of Clustering Algorithms Based on the Three Dimensions of Big Data (5)
Category: Model-based clustering
Column groups: Volume (dataset size, high-dimensional data, handling noisy data); Variety (dataset type, cluster shape); Velocity (scalability, time complexity)

Algorithm | Dataset size | High-dimensional data | Handling noisy data | Dataset type | Cluster shape | Scalability | Time complexity
COBWEB | Large | No | Medium | Numerical | Arbitrary | Medium | O(n^2)
SLINK | Large | No | Medium | Numerical | Arbitrary | Medium | O(n^2)
SOM | Small | Yes | Low | Multivariate | Arbitrary | Low | O(n^2 m)
ART | Large | No | High | Multivariate | Arbitrary | High | O(type + layer)
EM | Large | Yes | Low | Spatial | Convex | | O(knp)

Proposed MapReduce Framework for the Clustering Algorithm
• If a clustering algorithm scales to huge or high-dimensional datasets and handles heterogeneous data with clusters of arbitrary shape, it is suitable for big data mining.
• A clustering algorithm designed for big data mining should be capable of parallel and distributed computing. MapReduce is one of the programming models for implementing big data mining.
• MapReduce techniques are inspired by the Map and Reduce functions.
• The Map function breaks a task down into phases and executes these phases in parallel without one phase disturbing another. The Map function also assigns an appropriate key/value pair to every data item.
• The Reduce function collects all Map results, combines all values that share the same key, and gives the final result of the MapReduce computation. This reduces the computational time for big data mining.
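The Map and Reduce roles described above can be sketched in a few lines. This is a single-process, in-memory illustration, not Hadoop code; the names map_fn, shuffle, and reduce_fn and the sample records are hypothetical. It counts how many records share each key.

```python
from collections import defaultdict


def map_fn(record):
    """Emit one (key, value) pair per record; the key here is the record's label."""
    label, _measurement = record
    return (label, 1)


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Combine all values that share the same key into one result."""
    return (key, sum(values))


# Hypothetical records: (label, measurement).
records = [("a", 0.4), ("b", 1.2), ("a", 0.9), ("b", 0.1), ("a", 2.2)]

# Each map_fn call depends only on its own record, so the calls could run in parallel.
mapped = [map_fn(r) for r in records]
result = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
```

Because no Map call reads another call's output, the mapping step is the part that a real cluster would distribute across nodes.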

Proposed MapReduce Framework for the Clustering Algorithm (2)
Step 1: The big data set is transformed into <key, value> pairs, because MapReduce works on HDFS with parallel and distributed computing.
Step 2: The Mapper function takes <key, value> pairs as input and executes in parallel according to the existing clustering algorithm.
Step 3: The Combiner function combines all Map results, sorts every <value> according to its <key>, and gives output in <key, list(value)> format.
Step 4: The Reduce function takes the output of the Combiner function, maps one <key, list(value)> to another <key, list(value)> according to the existing clustering algorithm, and calculates the final cluster result.
Step 5: The Reduce function gives the accurate and unique number of clusters.
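As one possible reading of Steps 1 to 4, the sketch below runs a single K-Means iteration in the mapper / combiner / reducer shape. It is an in-memory, single-process illustration under assumptions: K-Means stands in for "the existing clustering algorithm", Euclidean distance is used, and all function names, points, and starting centers are hypothetical. It does not use HDFS or any Hadoop API.

```python
import math
from collections import defaultdict


def mapper(record, centers):
    """Step 2: take a <key, value> record and emit <nearest cluster index, point>."""
    _record_id, point = record
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point


def combiner(pairs):
    """Step 3: gather Map results into <key, list(value)>, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reducer(key, values):
    """Step 4: compute the new center of one cluster as the mean of its points."""
    dims = len(values[0])
    center = tuple(sum(v[d] for v in values) / len(values) for d in range(dims))
    return key, center


def kmeans_iteration(points, centers):
    records = list(enumerate(points))            # Step 1: <key, value> pairs
    pairs = [mapper(r, centers) for r in records]
    new_centers = list(centers)                  # clusters with no points keep their center
    for key, values in combiner(pairs).items():
        _, new_centers[key] = reducer(key, values)
    return new_centers


# Hypothetical data: four 2-D points and two starting centers.
points = [(1.0, 1.0), (2.0, 1.0), (8.0, 8.0), (10.0, 8.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
new_centers = kmeans_iteration(points, centers)
```

A full run would repeat kmeans_iteration until the centers stop changing; in a distributed setting each iteration would typically be one MapReduce job.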

Proposed MapReduce Framework for the Clustering Algorithm(3)

Experimental
• K-Means, BIRCH, CLARA, CURE, DBSCAN, DENCLUE, and WaveCluster are good clustering algorithms for big data mining because they fulfill the criteria of big data clustering.
• Dataset: Power (512,320 real data points with 7 dimensions)
• System: Intel i3 processor, 4 GB RAM, 320 GB hard disk, Windows 7
• Execution time of the existing K-Means and the MapReduce-based K-Means clustering algorithm:

Algorithm | Execution time (seconds)
K-Means (existing) | 60
K-Means (proposed, MapReduce-based) | 20
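Taking the two reported execution times at face value, the speedup and the relative time reduction follow directly. The symbols T_existing and T_MapReduce are introduced here only for illustration.

```latex
\[
\text{speedup} = \frac{T_{\text{existing}}}{T_{\text{MapReduce}}}
               = \frac{60\ \text{s}}{20\ \text{s}} = 3,
\qquad
\text{time reduction} = 1 - \frac{20}{60} \approx 66.7\%
\]
```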

