A Tool for Big Data Analysis using Apache Spark
● Ganesha Yadiyala ● Big data consultant at datamantra.io ● Consults in Spark and Scala ● ganeshayadiyala@gmail.com
Agenda ● Problem Statement ● Business view ● Why Spark ● Thinking REST ● Load API ● Transform API ● Machine learning ● Pipeline API ● Save API
Problem Statement Build a generic solution that can be used to transform data and then analyse it to extract useful results.
Business view ● This is an era of big data. ● All companies are trying to get something useful out of their data and solve problems with it. ● There exist many big data frameworks, but we need a tool that leverages most of them and can solve problems easily. ● A general solution or tool that can solve many of these problems would be a big plus.
Why we used Spark There are many big data frameworks out there that can be used to analyse data, but we chose Spark because of its: ● Capability to handle multiple data sources ● Easy binding with external data ● Good support for machine learning through Spark ML and Spark MLlib
Thinking REST To expose all this transformation and analysis we provide a REST API because: ● It minimises the coupling between client and server ● Different clients can use the REST API to interact with the tool ● We used Akka HTTP for the REST service
Akka HTTP An actor-based toolkit for interacting with web services and clients. ● It is also written in Scala and uses the same configuration management library (Typesafe Config) as Spark ● It is actor- and future-based
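The slides do not show the server code; below is a minimal sketch of what an Akka HTTP based REST service could look like, assuming the classic routing DSL with bindAndHandle (the endpoint names and port are hypothetical).

    import akka.actor.ActorSystem
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.server.Directives._
    import akka.stream.ActorMaterializer

    object RestServer extends App {
      implicit val system = ActorSystem("analysis-rest-server")
      implicit val materializer = ActorMaterializer()
      implicit val executionContext = system.dispatcher

      // Hypothetical endpoints standing in for the tool's load/transform routes.
      val route =
        path("load") {
          post { complete("load request accepted") }
        } ~
        path("transform") {
          post { complete("transform request accepted") }
        }

      Http().bindAndHandle(route, "0.0.0.0", 8080)
    }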
REST server interaction (diagram): the client calls the REST API on the REST server, which in turn calls the Spark API on the Spark cluster.
REST server design ● Instead of going with spark-jobserver we built our own REST server ● When the REST server starts, a Spark context is created ● All configuration is passed to the Spark context through Typesafe Config during its creation ● The same context is used for all operations
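Since the slides only say that the Spark context is configured through Typesafe Config at startup, here is a minimal sketch of that idea; the "spark" config path, the config layout and the factory object name are assumptions.

    import com.typesafe.config.ConfigFactory
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.collection.JavaConverters._

    object SparkContextFactory {
      // Expects a "spark" block in application.conf that includes at least
      // master and app.name entries (hypothetical layout).
      private val config = ConfigFactory.load().getConfig("spark")

      // Copy every entry from the config block into the SparkConf as spark.* settings.
      private val sparkConf = config.entrySet().asScala.foldLeft(new SparkConf()) {
        (conf, entry) =>
          conf.set(s"spark.${entry.getKey}", entry.getValue.unwrapped.toString)
      }

      // Single context created when the REST server starts and reused for every operation.
      lazy val sparkContext: SparkContext = new SparkContext(sparkConf)
    }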
Loading from different sources We supported different types of data: ● CSV ● JSON ● Parquet ● XML
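As a rough illustration, loading these file formats through Spark's DataFrame reader looks like the following (the paths are placeholders; the XML part assumes the external spark-xml package is on the classpath).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("loader").master("local[*]").getOrCreate()

    // CSV, JSON and Parquet readers are built into Spark SQL.
    val csvDf     = spark.read.option("header", "true").option("inferSchema", "true").csv("data/input.csv")
    val jsonDf    = spark.read.json("data/input.json")
    val parquetDf = spark.read.parquet("data/input.parquet")

    // XML relies on the com.databricks:spark-xml package.
    val xmlDf = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")   // hypothetical row element name
      .load("data/input.xml")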
Loading from different sources We also supported external sources such as: ● MongoDB ● Kafka ● JDBC ● Cassandra
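Reading from these external systems goes through the same DataFrame reader; a hedged sketch follows (connection strings, keyspaces and table names are placeholders, and the Cassandra/MongoDB formats assume their respective Spark connectors are on the classpath).

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("external-loader").master("local[*]").getOrCreate()

    // JDBC support is built into Spark SQL.
    val jdbcDf = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "reporting")
      .option("password", "secret")
      .load()

    // spark-cassandra-connector
    val cassandraDf = spark.read.format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics", "table" -> "events"))
      .load()

    // mongo-spark-connector
    val mongoDf = spark.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", "mongodb://mongohost/analytics.events")
      .load()

    // Kafka is typically consumed through Spark's kafka source (batch or Structured Streaming in Spark 2.x+).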
Transformation In the big data world, the data coming into the system often cannot be used as-is; we may have to transform it as the operation requires. We exposed REST APIs to do these transformations, which internally call Spark DataFrame APIs.
Example Some of the transformations we provided are: ● Cast - cast the datatype of a column ● Filter - filter based on a formula or condition ● Aggregation - max, min, sum, median, etc. ● Joins - joining two datasets
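In Spark DataFrame terms, those four transformations boil down to calls like the ones below (the sample data and column names are made up for illustration).

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("transforms").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input standing in for data loaded through the Load API.
    val orders    = Seq(("o1", "c1", "120.5"), ("o2", "c2", "75.0")).toDF("order_id", "customer_id", "amount")
    val customers = Seq(("c1", "IN"), ("c2", "US")).toDF("customer_id", "country")

    // Cast: change the datatype of a column.
    val typed = orders.withColumn("amount", col("amount").cast("double"))

    // Filter: keep rows matching a condition.
    val bigOrders = typed.filter(col("amount") > 100.0)

    // Aggregation: max, min, sum, etc. per group.
    val totals = typed.groupBy("customer_id").agg(sum("amount").as("total"), max("amount").as("max_order"))

    // Join: combine two datasets on a key.
    val enriched = typed.join(customers, Seq("customer_id"))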
Machine learning - Spark ML Spark ML provides a higher-level API built on top of DataFrames. ● We did not use MLlib because it is built on top of RDDs. ● We provided REST APIs that talk to these ML APIs
Example Some of the ML APIs we provided are: ● Linear regression ● Decision tree (regressor and classifier) ● Ridge regression ● KMeans, etc.
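For reference, calling one of these estimators directly through Spark ML looks roughly like this; the tiny training set and column names are invented, and setting regParam with elasticNetParam = 0 on LinearRegression is how Spark ML expresses ridge regression.

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ml-example").master("local[*]").getOrCreate()
    import spark.implicits._

    // Tiny hypothetical training set; real data would come from the Load and Transform APIs.
    val training = Seq((1.0, 2.0, 5.1), (2.0, 1.0, 7.2), (3.0, 4.0, 9.8)).toDF("f1", "f2", "label")

    // Spark ML estimators expect a single vector column of features.
    val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
    val prepared  = assembler.transform(training)

    // regParam > 0 with elasticNetParam = 0 gives L2 (ridge) regularisation.
    val lr    = new LinearRegression().setRegParam(0.1).setElasticNetParam(0.0)
    val model = lr.fit(prepared)
    val predictions = model.transform(prepared)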
Challenges in Spark ML ● It was very difficult to write a generic API because not all ML algorithms expect similar inputs ● Not all the APIs are documented properly ● Validating the types of the columns that can be passed to these APIs is really difficult
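One way to tackle the column-type problem is to validate the requested columns against the DataFrame schema before calling the estimator; the helper below is only a sketch of that idea (the function name and error handling are not from the slides).

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.NumericType

    // Returns an error message if any requested column is missing or not numeric.
    def validateNumericColumns(df: DataFrame, columns: Seq[String]): Either[String, Unit] = {
      val invalid = columns.filter { name =>
        df.schema.find(_.name == name) match {
          case Some(field) => !field.dataType.isInstanceOf[NumericType]
          case None        => true // a missing column is also invalid
        }
      }
      if (invalid.isEmpty) Right(())
      else Left(s"Columns not usable as numeric features: ${invalid.mkString(", ")}")
    }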
Save API Once the transformation is done or the ML step produces output, the user may want to save the result. We support: ● text ● JSON ● Parquet ● MongoDB ● Cassandra, etc.
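Behind the Save API these all map onto Spark's DataFrame writer; a hedged sketch follows (paths, keyspace and table names are placeholders, and the Cassandra/MongoDB formats again assume their connectors are on the classpath).

    import org.apache.spark.sql.{DataFrame, SaveMode}

    def save(result: DataFrame): Unit = {
      // JSON and Parquet writers are built into Spark SQL;
      // result.write.text(...) also works when the result has a single string column.
      result.write.mode(SaveMode.Overwrite).json("output/result.json")
      result.write.mode(SaveMode.Overwrite).parquet("output/result.parquet")

      // spark-cassandra-connector
      result.write.format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "analytics", "table" -> "results"))
        .mode(SaveMode.Append)
        .save()

      // mongo-spark-connector
      result.write.format("com.mongodb.spark.sql.DefaultSource")
        .option("uri", "mongodb://mongohost/analytics.results")
        .mode(SaveMode.Append)
        .save()
    }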
Pipeline and scheduling We also implemented a pipeline API that pipes together the load, transform, and ML APIs. If the user wants to run these operations at a scheduled time, that is possible through the schedule API we provide.
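Conceptually, one pipeline request composes the same load → transform → save chain that the individual APIs expose; here is a minimal end-to-end sketch in plain Spark (file paths and columns are hypothetical, and the actual REST payload format is not shown in the slides).

    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()

    // Load
    val loaded = spark.read.option("header", "true").csv("data/input.csv")

    // Transform
    val transformed = loaded
      .withColumn("amount", col("amount").cast("double"))
      .filter(col("amount") > 0)
      .groupBy("customer_id")
      .agg(sum("amount").as("total"))

    // Save
    transformed.write.mode(SaveMode.Overwrite).parquet("output/customer_totals")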
Summary No single solution can solve all big data problems, but we tried to build a tool that is generic enough to let you write your own transformations on data and analyse it, and with it we can solve many of those problems.
