1. Real Time Data Processing using Spark Streaming
Hari Shreedharan, Software Engineer @ Cloudera
• Committer/PMC Member, Apache Flume
• Committer, Apache Sqoop
• Contributor, Apache Spark
• Author, Using Flume (O'Reilly)
2. Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates:
• Exponential data growth from mobile, web, and social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at a CAGR of 25%
How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...
3. Use Cases Across Industries
• Credit: identify fraudulent transactions as soon as they occur
• Transportation: dynamic re-routing of traffic or vehicle fleets
• Retail: dynamic inventory management; real-time in-store offers and recommendations
• Consumer Internet & Mobile: optimize user engagement based on the user's current behavior
• Healthcare: continuously monitor patient vital signs and proactively identify at-risk patients
• Manufacturing: identify equipment failures and react instantly; perform proactive maintenance
• Surveillance: identify threats and intrusions in real time
• Digital Advertising & Marketing: optimize and personalize content based on real-time information
4. From Volume and Variety to Velocity
Big Data has evolved, and the Hadoop ecosystem has evolved with it:
• Past: Big Data = Volume + Variety; batch processing; time to insight of hours
• Present: Big Data = Volume + Variety + Velocity; batch + stream processing; time to insight of seconds
5. Key Components of Streaming Architectures
• Data ingestion and transportation service (e.g., Kafka, Flume)
• Real-time stream processing engine
• Real-time data serving
• Supporting services: system management, security, data management and integration
6. Canonical Stream Processing Architecture
Data sources feed an ingest layer (Flume, Kafka); processing applications (App 1, App 2, ...) consume from Kafka; results flow through Kafka or Flume into HDFS and HBase.
7. What is Spark?
Spark is a general-purpose computational framework that offers more flexibility than MapReduce.
Key properties:
• Leverages distributed memory
• Full directed-graph expressions for data-parallel computations
• Improved developer experience
Yet it retains linear scalability, fault tolerance, and data-locality-based computation.
8. Spark: Easy and Fast Big Data
• Easy to develop: rich APIs in Java, Scala, and Python; interactive shell; 2-5× less code
• Fast to run: general execution graphs and in-memory storage; up to 10× faster on disk, 100× in memory
9. Easy: High-Productivity Language Support
• Native support for multiple languages with identical APIs
• Closures, iteration, and other common language constructs minimize code

Python:
    lines = sc.textFile(...)
    lines.filter(lambda s: "ERROR" in s).count()

Scala:
    val lines = sc.textFile(...)
    lines.filter(s => s.contains("ERROR")).count()

Java:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) { return s.contains("ERROR"); }
    }).count();
10. Easy: Use Interactively
• Interactive exploration of data for data scientists; no need to develop "applications"
• Developers can prototype applications on a live system as they build them
11. Easy: Expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
12. Easy: Example – Word Count (M/R)

    public static class WordCountMapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }

    public static class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }
13. Easy: Example – Word Count (Spark)

    // [sparkHome] and [jars] are optional arguments
    val spark = new SparkContext(master, appName, [sparkHome], [jars])
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
14. Easy: Out-of-the-Box Functionality
• Hadoop integration: works with Hadoop data; runs on YARN
• Libraries: MLlib, Spark Streaming, GraphX (alpha)
• Language support: improved Python support, SparkR, Java 8
• Schema support in Spark's APIs
15. Spark Architecture
A driver program coordinates a set of workers; each worker caches a partition of the data in RAM.
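As a minimal sketch of how an application plugs into this architecture (the master URL and application name below are hypothetical), the driver is just a program that creates a SparkContext:

    import org.apache.spark.{SparkConf, SparkContext}

    // Connect the driver to a (hypothetical) standalone cluster master.
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("ExampleApp")
    val sc = new SparkContext(conf)

    // From here on, the driver schedules tasks on the workers,
    // and cached data partitions live in the workers' RAM.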
16. RDDs
RDD = Resilient Distributed Dataset
• Immutable representation of data; operations on one RDD create a new one
• Memory-caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization
Two observations:
a. Can fall back to disk when the dataset does not fit in memory
b. Provides fault tolerance through the concept of lineage
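A short Scala sketch of these properties (the HDFS path is a placeholder): each transformation lazily defines a new RDD, persistence can spill to disk, and the lineage only runs when an action is called.

    import org.apache.spark.storage.StorageLevel

    val lines  = sc.textFile("hdfs://.../input")    // RDD backed by stable storage
    val errors = lines.filter(_.contains("ERROR"))  // new immutable RDD; nothing runs yet
    errors.persist(StorageLevel.MEMORY_AND_DISK)    // cache in RAM, spill to disk if needed
    errors.count()                                  // action: materializes the lineage
    // If a cached partition is lost, Spark recomputes it from the lineage
    // (textFile -> filter) rather than replicating the data.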
17. Fast: Using RAM, Operator Graphs
• In-memory caching: data partitions are read from RAM instead of disk
• Operator graphs: enable scheduling optimizations and fault tolerance
(Diagram: a DAG of map, join, filter, groupBy, and take operators over RDDs A through F, with some partitions cached.)
18. Spark Streaming
An extension of Apache Spark's core API for stream processing. The framework provides:
• Fault tolerance
• Scalability
• High throughput
19. Spark Streaming
• Incoming data is represented as Discretized Streams (DStreams)
• The stream is broken down into micro-batches
• Each micro-batch is an RDD, so code can be shared between batch and streaming
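A minimal sketch of what this looks like in code (the socket source, host, and port are hypothetical): the batch interval given to the StreamingContext defines the micro-batch size, and each batch surfaces as an ordinary RDD.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(2))       // each micro-batch covers 2 seconds
    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical text source
    lines.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))  // every batch is an RDD
    ssc.start()
    ssc.awaitTermination()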
20. "Micro-batch" Architecture

    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")

The stream is composed of small (1-10s) batch computations: each batch (@ t, t+1, t+2, ...) of the tweets DStream is transformed by flatMap into a batch of the hashTags DStream, which is then saved.
21. Use DStreams for Windowing Functions
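The slide itself is image-only; as a hedged sketch of what a windowed computation can look like (reusing the hashTags DStream from the previous slide; the window and slide durations are arbitrary):

    import org.apache.spark.streaming.Seconds

    // Checkpointing is required for the incremental form of reduceByKeyAndWindow.
    ssc.checkpoint("hdfs://.../checkpoints")   // placeholder path

    val tagCounts = hashTags.map(tag => (tag, 1))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))
    // Every 10 seconds, emits counts over the trailing 60 seconds, adding the
    // newest batch and subtracting the batch that slid out of the window.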
22. Spark Streaming
• Runs as a Spark job
• YARN or standalone mode for scheduling; YARN has KDC integration
• Use the same code for real-time Spark Streaming and for batch Spark jobs
• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ, ...
• Easy to write "receivers" for custom messaging systems (a sketch follows below)
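A minimal sketch of a custom receiver, under the assumption that pollMessage() wraps your messaging system's client (both the class name and pollMessage are hypothetical):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class QueueReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Receive on a separate thread so onStart() returns immediately.
        new Thread("queue-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store(pollMessage())   // hand each message to Spark Streaming
            }
          }
        }.start()
      }

      def onStop(): Unit = {}        // the loop above exits once isStopped() is true

      // Hypothetical: blocking read from your messaging system.
      private def pollMessage(): String = ???
    }

    // Usage: val stream = ssc.receiverStream(new QueueReceiver)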
23. Sharing Code between Batch and Streaming
A library function that filters "ERROR" lines:

    def filterErrors(rdd: RDD[String]): RDD[String] = {
      rdd.filter(s => s.contains("ERROR"))
    }

• Streaming generates RDDs periodically
• Any code that operates on RDDs can therefore be used in streaming as well
24. Sharing Code between Batch and Streaming

Spark:

    val lines = sc.textFile(...)
    val filtered = filterErrors(lines)
    filtered.saveAsTextFile(...)

Spark Streaming:

    val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
    // Flume events arrive as SparkFlumeEvent; extract each body as a String.
    val lines = dStream.map(e => new String(e.event.getBody.array()))
    // Use transform (not foreachRDD, which returns Unit) to get a new DStream.
    val filtered = lines.transform(rdd => filterErrors(rdd))
    filtered.saveAsTextFiles(...)
25. Spark Streaming Use Cases
• Real-time dashboards: show approximate results in real time, and reconcile periodically with the source of truth using Spark
• Joins of multiple streams, using time-based or count-based "windows" (see the sketch below)
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs
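A hedged sketch of a windowed two-stream join (the sources, hosts, ports, and the comma-separated record format are all hypothetical):

    import org.apache.spark.streaming.Seconds

    // Two (hypothetical) streams of "id,price" records keyed by id.
    val bids = ssc.socketTextStream("host1", 9999).map { line =>
      val Array(id, price) = line.split(","); (id, price.toDouble)
    }
    val asks = ssc.socketTextStream("host2", 9998).map { line =>
      val Array(id, price) = line.split(","); (id, price.toDouble)
    }

    // Join asks against the last 30 seconds of bids, recomputed every 10 seconds.
    val matched = bids.window(Seconds(30), Seconds(10)).join(asks)
    // matched: DStream[(String, (Double, Double))] of (id, (bidPrice, askPrice))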
26. Hadoop in the Spark World
• Core Hadoop: YARN; HDFS, HBase; Hive, Pig; Impala; MapReduce2; Search
• Supported Spark components: Spark, Spark Streaming
• Unsupported add-ons: GraphX, MLlib, Shark
27. Current Project Status
• 100+ contributors and 25+ contributing companies, including Databricks, Cloudera, Intel, and Yahoo
• Dozens of production deployments
• Included in CDH!
28. More Info
• CDH docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html
• Cloudera blog: http://blog.cloudera.com/blog/category/spark/
• Apache Spark homepage: http://spark.apache.org/
• GitHub: https://github.com/apache/spark
29. Thank you


Editor's Notes

  • #3 The Narrative: Vast quantities of streaming data are being generated, and more will be generated thanks to phenomena such as the Internet of Things. The motivation for real-time stream processing is to turn all this data into valuable insights and actions as soon as the data is generated. Instant processing of the data also opens the door to new use cases that were not possible before. NOTE: Feel free to remove the image of "The Flash" if it feels unprofessional or overly cheesy.
  • #5 The Narrative: As you can see from the previous slides, lots of streaming data will be generated. Making this data actionable in real time is very valuable across industries. Our very own Hadoop is all you need. Previously, Hadoop was associated just with "big unstructured data"; that was Hadoop's selling point. But now, Hadoop can also handle real-time data (in addition to big unstructured data). So think Hadoop when you think real-time streaming. Purpose of the slide: associate Hadoop with real-time, so that people think Hadoop when they think real-time streaming data.
  • #19 Purpose of this slide: Make sure to associate Spark Streaming with Apache Spark, so folks know it is part of THE Apache Spark that everyone is talking about, and touch upon the key properties that make Spark Streaming a good platform for stream processing. Note: If required, we can mention low latency as well.
  • #27
    • Spark Streaming: Storm-like streaming solution
    • BlinkDB: approximate query results for Hive
    • Shark SQL: Spark-based SQL engine, but Impala is better
    • GraphX: GraphLab/Giraph alternative that provides graph processing
    • MLbase and MLlib: ML algorithm implementations
    • Tachyon: shared filesystem cache so multiple users/frameworks can share Spark data
    For now, we will only support Spark and Spark Streaming. Over time, Pig will move to Spark natively, and Spark will provide MR APIs.