Apache Phoenix with the Actor Model (Akka.io) for a Real-time Big Data Programming Stack
Why do we still need SQL for Big Data? How can we make Big Data more responsive and faster?
By http://nguyentantrieu.info, Tech Lead at the eClick team, FPT Online
Contents
1. What is Big Data, and why?
2. When a standard relational database (Oracle, MySQL, ...) is not good enough
3. Common problems in Big Data systems
4. Introducing open-source tools in the Big Data system
   a. Apache Phoenix for ad-hoc queries
   b. Actor Model and Akka.io for reactive data processing
What Does Big Data Actually Mean?
“Big data means data that cannot fit easily into a standard relational database.”
Hal Varian, Chief Economist, Google
http://www.brookings.edu/blogs/techtank/posts/2014/09/11-big-data-definition
When a standard relational database (Oracle, MySQL, ...) is not good enough
The “analytic system”: a MySQL database at a startup, tracking every user action in mobile games on iOS, Android, ...
Complex analytic system and the “scale” pain
Definition from the crowd
“Big data is a term describing the storage and analysis of large and/or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.”
Jonathan Stuart Ward and Adam Barker
Source: http://arxiv.org/abs/1309.5821
http://www.technologyreview.com/view/519851/the-big-data-conundrum-how-to-define-it/
The “chaotic” fact and the demand
About 80% of the world's data is unstructured or “chaotic”: photos, videos and social media posts, data that says so much about us but cannot be analyzed via traditional methods.
The demand: “finding order among chaos”
3 common problems in Big Data systems
1. Size: the volume of the dataset is a critical factor.
2. Complexity: the structure, behaviour and permutations of the dataset are a critical factor.
3. Technologies: the tools and techniques used to process a sizable or complex dataset are a critical factor.
Introducing open-source tools in the Big Data system
● Apache Phoenix as a SQL ad-hoc query engine
● The Actor Model as a nano-service for reactive data computation in the dawn of “Fast Data”
Some innovative tools were born in the dawn of the Big Data age.
But could an elephant fly without wings?
But a phoenix can fly!
What is Apache Phoenix?
Apache Phoenix is a SQL skin over HBase: it compiles SQL into native HBase calls, so scaling Phoenix means scaling HBase itself, both up and out.
Phoenix SQL Engine
Interesting features of Apache Phoenix
● Embedded JDBC driver implements the majority of java.sql interfaces, including the metadata APIs.
● Allows columns to be modeled as a multi-part row key or key/value cells.
● Full query support with predicate push-down and optimal scan key formation.
● DDL support: CREATE TABLE, DROP TABLE, and ALTER TABLE for adding/removing columns.
● Versioned schema repository: snapshot queries use the schema that was in place when the data was written.
● DML support: UPSERT VALUES for row-by-row insertion, UPSERT SELECT for mass data transfer between the same or different tables, and DELETE for deleting rows.
● Limited transaction support through client-side batching.
● Single table only: no joins yet, and secondary indexes are a work in progress.
● Follows ANSI SQL standards whenever possible.
● Requires HBase v0.94.2 or above.
● 100% Java.
The Phoenix table schema
Setting up the Phoenix JDBC driver
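As a sketch only: the snippet below assumes the Apache Phoenix driver class org.apache.phoenix.jdbc.PhoenixDriver is on the classpath and that HBase's ZooKeeper quorum runs on localhost:2181; the EVENT table and its columns are made up for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.Timestamp;

public class PhoenixJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the Phoenix JDBC driver explicitly (the classic, safe idiom).
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");

        // The JDBC URL names the ZooKeeper quorum of the HBase cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
            conn.setAutoCommit(true);

            // DDL: a hypothetical event table. GAME_ID + EVENT_TS form a
            // multi-part row key; ACTION is stored as a key/value cell.
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS EVENT ("
                        + " GAME_ID  VARCHAR NOT NULL,"
                        + " EVENT_TS TIMESTAMP NOT NULL,"
                        + " ACTION   VARCHAR,"
                        + " CONSTRAINT PK PRIMARY KEY (GAME_ID, EVENT_TS))");
            }

            // DML: Phoenix uses UPSERT instead of INSERT/UPDATE.
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPSERT INTO EVENT (GAME_ID, EVENT_TS, ACTION) VALUES (?, ?, ?)")) {
                ps.setString(1, "game-42");
                ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
                ps.setString(3, "level_up");
                ps.executeUpdate();
            }

            // Ad-hoc query: plain SQL, turned into HBase scans by Phoenix.
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT GAME_ID, COUNT(*) FROM EVENT GROUP BY GAME_ID")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}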
Phoenix and a SQL tool in Eclipse 4
Phoenix vs Hive (running over HDFS and HBase) http://phoenix.apache.org/performance.html
Actor Model in the dawn of “Fast Data”
http://youtu.be/TnLiEWglqHk - Google I/O 2014 - The dawn of "Fast Data"
The paper: MillWheel: Fault-Tolerant Stream Processing at Internet Scale
What is the Actor Model?
● Carl Hewitt defined the Actor Model in 1973 as a mathematical theory that treats “Actors” as the universal primitives of concurrent digital computation.
● It is a fitting model for heavily parallel processing in a cloud environment.
Akka is the framework for implementing Actor computation.
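A minimal sketch using the classic Akka Java API (UntypedActor); the PageViewCounter actor and its messages are invented for illustration. Each actor keeps its state private and handles one message at a time, so no locks are needed.

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.actor.UntypedActor;

// A page-view counter as an actor: mutable state, but confined to the actor.
public class PageViewCounter extends UntypedActor {
    private long count = 0; // never touched by more than one thread at a time

    @Override
    public void onReceive(Object message) {
        if (message instanceof String) {
            count++;
            System.out.println("page view #" + count + ": " + message);
        } else {
            unhandled(message);
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");
        ActorRef counter = system.actorOf(Props.create(PageViewCounter.class), "counter");

        // tell() is fire-and-forget: the message goes into the actor's
        // mailbox and the caller continues immediately.
        counter.tell("/home", ActorRef.noSender());
        counter.tell("/games/level-1", ActorRef.noSender());
    }
}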
Inspired by Google's MillWheel and Twitter's Storm, I have developed my own framework, “Rfx” (Reactive Functor Extension), with Akka at its core.
The pipeline for finding social trends in real-time analytics
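Not the Rfx implementation, just a sketch of the pipeline idea with the classic Akka Java API: two chained actors, one extracting hashtags from raw posts, one counting them. Actor names and logic are hypothetical.

import java.util.HashMap;
import java.util.Map;

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.actor.UntypedActor;

public class TrendPipeline {

    // Stage 1: extract hashtags from a raw post and forward them downstream.
    public static class HashtagExtractor extends UntypedActor {
        private final ActorRef next;
        public HashtagExtractor(ActorRef next) { this.next = next; }

        @Override
        public void onReceive(Object message) {
            if (message instanceof String) {
                for (String word : ((String) message).split("\\s+")) {
                    if (word.startsWith("#")) {
                        next.tell(word.toLowerCase(), getSelf());
                    }
                }
            } else {
                unhandled(message);
            }
        }
    }

    // Stage 2: count hashtag frequencies; the map stays inside the actor.
    public static class TrendCounter extends UntypedActor {
        private final Map<String, Long> counts = new HashMap<String, Long>();

        @Override
        public void onReceive(Object message) {
            if (message instanceof String) {
                String tag = (String) message;
                Long old = counts.get(tag);
                long n = (old == null) ? 1 : old + 1;
                counts.put(tag, n);
                System.out.println(tag + " -> " + n);
            } else {
                unhandled(message);
            }
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("trends");
        ActorRef counter = system.actorOf(Props.create(TrendCounter.class), "counter");
        ActorRef extractor = system.actorOf(
                Props.create(HashtagExtractor.class, counter), "extractor");

        extractor.tell("Loving the new #phoenix release #hbase", ActorRef.noSender());
        extractor.tell("Stream processing with #akka is fun", ActorRef.noSender());
    }
}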
Facebook Social Trending from a website
Quick demo
Using Akka (Rfx) and Apache Phoenix for social media real-time analytics
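This is not the actual demo code, only a minimal sketch of how the two tools could meet: an actor that upserts each incoming event into Phoenix over JDBC. The EVENT table and connection URL are the same hypothetical ones used earlier.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.actor.UntypedActor;

// An actor that persists each incoming event into Phoenix over JDBC.
// An actor processes one message at a time, so the connection it owns
// needs no extra synchronization.
public class PhoenixWriter extends UntypedActor {
    private Connection conn;

    @Override
    public void preStart() throws Exception {
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
        conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
        conn.setAutoCommit(true);
    }

    @Override
    public void postStop() throws Exception {
        if (conn != null) conn.close();
    }

    @Override
    public void onReceive(Object message) throws Exception {
        if (message instanceof String) {
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPSERT INTO EVENT (GAME_ID, EVENT_TS, ACTION) VALUES (?, ?, ?)")) {
                ps.setString(1, "demo");
                ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
                ps.setString(3, (String) message);
                ps.executeUpdate();
            }
        } else {
            unhandled(message);
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");
        ActorRef writer = system.actorOf(Props.create(PhoenixWriter.class), "phoenix-writer");
        writer.tell("click", ActorRef.noSender());
        writer.tell("level_up", ActorRef.noSender());
    }
}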
Links for self-study and research

Actor Model and Programming:
● http://nguyentantrieu.info/blog/the-architecture-for-real-time-event-processing-with-reactive-actor-model
● http://www.slideshare.net/drorbr/the-actor-model-towards-better-concurrency
● http://www.infoq.com/articles/reactive-cloud-actors
● http://www.mc2ads.com/p/rfx-for-big-data-developer.html

Apache Phoenix:
● http://java.dzone.com/articles/apache-phoenix-sql-driver
● http://phoenix.apache.org/Phoenix-in-15-minutes-or-less.html

Big Data and Data Science:
● http://www.mc2ads.com and http://www.mc2ads.org
● http://datascience101.wordpress.com
● http://lambda-architecture.net
● http://www.bigdata-startups.com
● https://www.coursera.org/course/datasci
