A Tool for Big Data Analysis using Apache Spark
● Ganesha Yadiyala ● Big data consultant at datamantra.io ● Consults in Spark and Scala ● ganeshayadiyala@gmail.com
Agenda ● Problem Statement ● Business view ● Why Spark ● Thinking REST ● Load API ● Transform API ● Machine learning ● Pipeline API ● Save API
Problem Statement Build a generic solution that can be used to transform data and then analyse it to extract useful results.
Business view ● This is an era of big data. ● All companies are trying to get something useful out of their data and solve problems with it. ● There exist many big data frameworks, but we need a tool that leverages most of them and can solve problems easily. ● A general solution or tool that can solve many of these problems would be a big plus.
Why we used Spark There are many big data frameworks out there that can be used to analyse data, but we chose Spark because of its: ● Capability to handle multiple data sources ● Easy binding with external data ● Good support for machine learning through Spark ML and Spark MLlib
Thinking REST To expose all this transformation and analysis we provide a REST API because: ● It minimises the coupling between client and server ● Different clients can use the REST API to interact with the tool ● We used Akka HTTP for the REST service
Akka HTTP An actor-based toolkit for interacting with web services and clients. ● It is also written in Scala and uses the same configuration management library (Typesafe Config) as Spark ● It is actor- and future-based
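The slides do not show the server code; below is a minimal sketch of what an Akka HTTP based REST service could look like, assuming the classic routing DSL with bindAndHandle (the endpoint names and port are hypothetical).

    import akka.actor.ActorSystem
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.server.Directives._
    import akka.stream.ActorMaterializer

    object RestServer extends App {
      implicit val system = ActorSystem("analysis-rest-server")
      implicit val materializer = ActorMaterializer()
      implicit val executionContext = system.dispatcher

      // Hypothetical endpoints standing in for the tool's load/transform routes.
      val route =
        path("load") {
          post { complete("load request accepted") }
        } ~
        path("transform") {
          post { complete("transform request accepted") }
        }

      Http().bindAndHandle(route, "0.0.0.0", 8080)
    }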
REST server interaction (diagram): the client calls the REST API on the REST server, which in turn calls the Spark API on the Spark cluster.
REST server design ● Instead of going with spark-jobserver we built our own REST server ● When the REST server starts, a Spark context is created ● All configuration is passed to the Spark context through Typesafe Config during its creation ● The same context is used for all operations
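Since the slides only say that the Spark context is configured through Typesafe Config at startup, here is a minimal sketch of that idea; the "spark" config path, the config layout and the factory object name are assumptions.

    import com.typesafe.config.ConfigFactory
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.collection.JavaConverters._

    object SparkContextFactory {
      // Expects a "spark" block in application.conf that includes at least
      // master and app.name entries (hypothetical layout).
      private val config = ConfigFactory.load().getConfig("spark")

      // Copy every entry from the config block into the SparkConf as spark.* settings.
      private val sparkConf = config.entrySet().asScala.foldLeft(new SparkConf()) {
        (conf, entry) =>
          conf.set(s"spark.${entry.getKey}", entry.getValue.unwrapped.toString)
      }

      // Single context created when the REST server starts and reused for every operation.
      lazy val sparkContext: SparkContext = new SparkContext(sparkConf)
    }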
Loading from different sources We supported different types of data: ● CSV ● JSON ● Parquet ● XML
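As a rough illustration, loading these file formats through Spark's DataFrame reader looks like the following (the paths are placeholders; the XML part assumes the external spark-xml package is on the classpath).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("loader").master("local[*]").getOrCreate()

    // CSV, JSON and Parquet readers are built into Spark SQL.
    val csvDf     = spark.read.option("header", "true").option("inferSchema", "true").csv("data/input.csv")
    val jsonDf    = spark.read.json("data/input.json")
    val parquetDf = spark.read.parquet("data/input.parquet")

    // XML relies on the com.databricks:spark-xml package.
    val xmlDf = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")   // hypothetical row element name
      .load("data/input.xml")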
Loading from different sources We also supported external sources such as: ● MongoDB ● Kafka ● JDBC ● Cassandra
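Reading from these external systems goes through the same DataFrame reader; a hedged sketch follows (connection strings, keyspaces and table names are placeholders, and the Cassandra/MongoDB formats assume their respective Spark connectors are on the classpath).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("external-loader").master("local[*]").getOrCreate()

    // JDBC support is built into Spark SQL.
    val jdbcDf = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "reporting")
      .option("password", "secret")
      .load()

    // spark-cassandra-connector
    val cassandraDf = spark.read.format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "events"))
      .load()

    // mongo-spark-connector
    val mongoDf = spark.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", "mongodb://mongohost/analytics.events")
      .load()

    // Kafka is typically consumed through Spark's kafka source (batch or Structured Streaming in Spark 2.x+).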
Transformation In the big data world, the data coming into the system often cannot be used as-is; we may have to transform it as the operation requires. We exposed REST APIs to do these transformations, which internally call Spark DataFrame APIs.
Example Some of the transformations we provided are: ● Cast - cast the datatype of a column ● Filter - filter based on a formula or condition ● Aggregation - max, min, sum, median, etc. ● Joins - joining two datasets
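In Spark DataFrame terms, those four transformations boil down to calls like the ones below (the sample data and column names are made up for illustration).

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("transforms").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input standing in for data loaded through the Load API.
    val orders    = Seq(("o1", "c1", "120.5"), ("o2", "c2", "75.0")).toDF("order_id", "customer_id", "amount")
    val customers = Seq(("c1", "IN"), ("c2", "US")).toDF("customer_id", "country")

    // Cast: change the datatype of a column.
    val typed = orders.withColumn("amount", col("amount").cast("double"))

    // Filter: keep rows matching a condition.
    val bigOrders = typed.filter(col("amount") > 100.0)

    // Aggregation: max, min, sum, etc. per group.
    val totals = typed.groupBy("customer_id").agg(sum("amount").as("total"), max("amount").as("max_order"))

    // Join: combine two datasets on a key.
    val enriched = typed.join(customers, Seq("customer_id"))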
Machine learning - Spark ML Spark ML provides a higher-level API built on top of DataFrames. ● We did not use MLlib because it is built on top of RDDs. ● We provided REST APIs that talk to these ML APIs
Example Some of the ML APIs we provided are: ● Linear regression ● Decision tree (regressor and classifier) ● Ridge regression ● KMeans, etc.
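For reference, calling one of these estimators directly through Spark ML looks roughly like this; the tiny training set and column names are invented, and setting regParam with elasticNetParam = 0 on LinearRegression is how Spark ML expresses ridge regression.

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ml-example").master("local[*]").getOrCreate()
    import spark.implicits._

    // Tiny hypothetical training set; real data would come from the Load and Transform APIs.
    val training = Seq((1.0, 2.0, 5.1), (2.0, 1.0, 7.2), (3.0, 4.0, 9.8)).toDF("f1", "f2", "label")

    // Spark ML estimators expect a single vector column of features.
    val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
    val prepared  = assembler.transform(training)

    // regParam > 0 with elasticNetParam = 0 gives L2 (ridge) regularisation.
    val lr    = new LinearRegression().setRegParam(0.1).setElasticNetParam(0.0)
    val model = lr.fit(prepared)
    val predictions = model.transform(prepared)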
Challenges in Spark ML ● It was very difficult to write a generic API because not all ML algorithms expect similar inputs ● Not all the APIs are documented properly ● Validating the types of the columns that can be passed to these APIs is really difficult
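One way to tackle the column-type problem is to validate the requested columns against the DataFrame schema before calling the estimator; the helper below is only a sketch of that idea (the function name and error handling are not from the slides).

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.NumericType

    // Returns an error message if any requested column is missing or not numeric.
    def validateNumericColumns(df: DataFrame, columns: Seq[String]): Either[String, Unit] = {
      val invalid = columns.filter { name =>
        df.schema.find(_.name == name) match {
          case Some(field) => !field.dataType.isInstanceOf[NumericType]
          case None        => true // a missing column is also invalid
        }
      }
      if (invalid.isEmpty) Right(())
      else Left(s"Columns not usable as numeric features: ${invalid.mkString(", ")}")
    }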
Save API Once the transformation is done or the ML step produces output, the user may want to save the result. We support: ● text ● JSON ● Parquet ● MongoDB ● Cassandra, etc.
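Behind the Save API these all map onto Spark's DataFrame writer; a hedged sketch follows (paths, keyspace and table names are placeholders, and the Cassandra/MongoDB formats again assume their connectors are on the classpath).

    import org.apache.spark.sql.{DataFrame, SaveMode}

    def save(result: DataFrame): Unit = {
      // JSON and Parquet writers are built into Spark SQL;
      // result.write.text(...) also works when the result has a single string column.
      result.write.mode(SaveMode.Overwrite).json("output/result.json")
      result.write.mode(SaveMode.Overwrite).parquet("output/result.parquet")

      // spark-cassandra-connector
      result.write.format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "analytics", "table" -> "results"))
        .mode(SaveMode.Append)
        .save()

      // mongo-spark-connector
      result.write.format("com.mongodb.spark.sql.DefaultSource")
        .option("uri", "mongodb://mongohost/analytics.results")
        .mode(SaveMode.Append)
        .save()
    }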
Pipeline and scheduling We also implemented a pipeline API that pipes together the load, transform, and ML APIs. If the user wants to run these operations at a scheduled time, that is possible through the schedule API we provide.
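Conceptually, one pipeline request composes the same load → transform → save chain that the individual APIs expose; here is a minimal end-to-end sketch in plain Spark (file paths and columns are hypothetical, and the actual REST payload format is not shown in the slides).

    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()

    // Load
    val loaded = spark.read.option("header", "true").csv("data/input.csv")

    // Transform
    val transformed = loaded
      .withColumn("amount", col("amount").cast("double"))
      .filter(col("amount") > 0)
      .groupBy("customer_id")
      .agg(sum("amount").as("total"))

    // Save
    transformed.write.mode(SaveMode.Overwrite).parquet("output/customer_totals")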
Summary No single solution can solve all big data problems, but we tried to build a tool that is generic enough to let you write your own transformations on data and analyse it, and with it we can solve many of those problems.
