Efficient and portable data processing with Apache Beam and HBase
Eugene Kirpichov, Google (HBaseCon 2017)
Agenda
1. History of Beam
2. Philosophy of the Beam programming model
3. Apache Beam project
4. Beam and HBase
The Evolution of Apache Beam
Google-internal systems (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) led to Google Cloud Dataflow, which in turn became Apache Beam.
(2004) MapReduce: SELECT + GROUP BY
(2008) FlumeJava: high-level API
(2013) Millwheel: deterministic streaming
(2014) Dataflow: batch/streaming agnostic, infinite out-of-order data, portable
(2016) Apache Beam: open ecosystem, community-driven, vendor-independent
Beam model: unbounded, temporal, out-of-order data
● Unified: no concept of "batch" / "streaming" at all
● Time: event time (when it happened, not when we saw it)
● Windowing: aggregation within time windows
● Keys: windows scoped to a key (e.g. user sessions)
● Triggers: when is a window "complete enough", and what to do when late data arrives
The four questions of the Beam model:
● What are you computing? Transforms
● Where in event time? Windowing
● When in processing time? Triggers
● How do refinements relate? Accumulation
What: transforms
● Element-wise
● Aggregating
● Composite
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(
     line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(
     count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://.../..."));
p.run();
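(Note: this is the pre-2.0 SDK style. Since Beam 2.0 the entry points are TextIO.read().from(...) / TextIO.write().to(...), Filter.byPredicate became Filter.by(...), and lambdas passed to MapElements/FlatMapElements need MapElements.into(typeDescriptor).via(...) so Java can infer the output type.)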
Where: windowing
● Windowing divides data into event-time-based finite chunks.
● Required when doing aggregations over unbounded data.
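For instance, a minimal sketch of one-minute fixed windows before a per-key sum (Beam 2.x Java; "input" is an assumed PCollection<KV<String, Integer>>, Duration is org.joda.time.Duration):

PCollection<KV<String, Integer>> windowedSums = input
    .apply(Window.<KV<String, Integer>>into(
        FixedWindows.of(Duration.standardMinutes(1))))  // event-time windows
    .apply(Sum.integersPerKey());  // aggregation is now per key AND per window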
When: triggers
● Control when a window emits results of aggregation.
● Often relative to the watermark (a promise about the lateness of a source).
[Figure: the watermark, plotted as a curve relating event time to processing time]
How: how do refinements relate?

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .trigger(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetracting())
    .apply(Sum.integersPerKey());
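The snippet above is the shorthand notation from the "World Beyond Batch" articles. In the released Beam 2.x Java SDK the same idea is spelled out more verbosely; a sketch (the allowed-lateness value is an assumption, and since retractions were not yet implemented, accumulatingFiredPanes() is the closest available mode):

PCollection<KV<String, Integer>> output = input
    .apply(Window.<KV<String, Integer>>into(
            Sessions.withGapDuration(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardDays(1))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());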
Customizing What / Where / When / How:
1. Classic batch
2. Batch with fixed windows
3. Streaming
4. Streaming with speculative + late data
Apache Beam project
What is Apache Beam?
1. The Beam model: What / Where / When / How
2. SDKs for writing Beam pipelines: Java, Python
3. Runners for existing distributed processing backends:
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ (WIP) Gearpump and others
○ Local (in-process) runner for testing
The Apache Beam vision
1. End users: want to write pipelines in a language that's familiar.
2. SDK writers: want to make Beam concepts available in new languages.
3. Runner writers: have a distributed processing environment and want to support Beam pipelines.
[Diagram: language SDKs (Beam Java, Beam Python, other languages) construct pipelines against the Beam model; the Fn Runners layer hands them to execution engines such as Apache Flink, Apache Spark, and Cloud Dataflow]
Apache Beam ecosystem (layered, top to bottom):
● End user's pipeline
● Libraries: transforms, sources/sinks, etc.
● Language-specific SDK (Java, Python, ...)
● Beam model (ParDo, GBK, windowing, ...)
● Runner
● Execution environment
02/01/2016: entered the Apache Incubator
02/25/2016: first commit to the ASF repository
Early 2016: internal API redesign and relative chaos
Mid 2016: stabilization of new APIs
Late 2016: multiple runners
Early 2017: polish and stability
05/2017: Beam 2.0, first stable release
Apache Beam community:
● 178 contributors
● 24 committers from 8 orgs (none >50%)
● >3300 PRs, >8600 commits, 27 releases
● >20 IO (storage system) connectors
● 5 runners
Beam and HBase
Beam IO connector ecosystem
Many uses of Beam = importing data from one place to another.
● Files: Text, Avro, XML, TFRecord (pluggable FS: local, HDFS, GCS)
● Hadoop ecosystem: HBase, HadoopInputFormat, Hive (HCatalog)
● Streaming systems: Kafka, Kinesis, MQTT, JMS, (WIP) AMQP
● Google Cloud: Pubsub, BigQuery, Datastore, Bigtable, Spanner
● Other: JDBC, Cassandra, Elasticsearch, MongoDB, GridFS
HBaseIO

PCollection<Result> data = p.apply(
    HBaseIO.read()
        .withConfiguration(conf)
        .withTableId(table)
        /* … withScan, withFilter … */);

PCollection<KV<byte[], Iterable<Mutation>>> mutations = …;
mutations.apply(
    HBaseIO.write()
        .withConfiguration(conf)
        .withTableId(table));
IO connectors = just Beam transforms
● Made of Beam primitives: ParDo, GroupByKey, …
● Write = often a simple ParDo
● Read = a couple of ParDos; a "Source API" for power users
⇒ straightforward to develop, clean API, very flexible, batch/streaming agnostic
Beam write with HBase
A bundle is a group of elements processed and committed together; the bundle is the "transaction" boundary for the write.
DoFn lifecycle (ParDo/DoFn):
● setup() -> create the Connection
● startBundle() -> get a BufferedMutator
● processElement() -> apply the Mutation(s)
● finishBundle() -> flush the BufferedMutator
● tearDown() -> close the Connection
Mutations must be idempotent, e.g. Put or Delete; Increment and Append should not be used.
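A minimal sketch of such a write DoFn (a hypothetical class, not the actual HBaseIO internals; assumes the HBase client configuration is available on the workers):

import java.io.IOException;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Mutation;

class HBaseWriteFn extends DoFn<Mutation, Void> {
  private final String tableId;
  private transient Connection connection;
  private transient BufferedMutator mutator;

  HBaseWriteFn(String tableId) { this.tableId = tableId; }

  @Setup
  public void setup() throws IOException {
    // One Connection per DoFn instance, reused across bundles.
    connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
  }

  @StartBundle
  public void startBundle() throws IOException {
    mutator = connection.getBufferedMutator(TableName.valueOf(tableId));
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    // Buffered and possibly retried, so mutations must be idempotent (Put/Delete only).
    mutator.mutate(c.element());
  }

  @FinishBundle
  public void finishBundle() throws IOException {
    mutator.flush();  // commit this bundle's mutations
    mutator.close();
  }

  @Teardown
  public void tearDown() throws IOException {
    if (connection != null) {
      connection.close();
    }
  }
}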
Beam Source API (similar to Hadoop InputFormat, but cleaner / more general):
● Estimate size
● Split into sub-sources (of ~given size)
● Read: iterate, get progress, dynamic split
Note: separate API for unbounded sources + (WIP) a new unified API
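The shape of the bounded half of that API, as a skeleton (Beam 2.x Java; a sketch of the methods named above, not the full contract):

import java.io.IOException;
import java.util.List;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.hadoop.hbase.client.Result;

abstract class SketchSource extends BoundedSource<Result> {
  // Estimate size.
  @Override
  public abstract long getEstimatedSizeBytes(PipelineOptions options) throws Exception;

  // Split into sub-sources of roughly the desired size.
  @Override
  public abstract List<? extends BoundedSource<Result>> split(
      long desiredBundleSizeBytes, PipelineOptions options) throws Exception;

  // Read: the returned BoundedReader supports start()/advance() to iterate,
  // getFractionConsumed() for progress, and splitAtFraction() for dynamic splitting.
  @Override
  public abstract BoundedReader<Result> createReader(PipelineOptions options) throws IOException;
}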
HBase on the Beam Source API (HBaseSource, driven by a Scan):
● Estimate size → RegionSizeCalculator
● Split → RegionLocation (sub-sources align with regions across region servers)
● Read / iterate → ResultScanner
● Get progress → key interpolation
● Dynamic split* → RangeTracker
* Dynamic split for HBaseIO: PR in progress
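A sketch of what the size-estimation step can look like with HBase's RegionSizeCalculator (HBase 1.x-style constructor; a plausible reconstruction, not HBaseIO's exact code):

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.RegionSizeCalculator;

static long estimateTableSizeBytes(Connection connection, String tableId) throws IOException {
  try (RegionLocator locator = connection.getRegionLocator(TableName.valueOf(tableId));
       Admin admin = connection.getAdmin()) {
    RegionSizeCalculator calculator = new RegionSizeCalculator(locator, admin);
    long totalBytes = 0;
    // Sum the on-disk size of every region of the table.
    for (Map.Entry<byte[], Long> region : calculator.getRegionSizeMap().entrySet()) {
      totalBytes += region.getValue();
    }
    return totalBytes;
  }
}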
Digression: stragglers
[Chart: per-worker timelines; a few slow workers ("stragglers") dominate total job time]
Beam approach: dynamic splitting*
[Chart: remaining work of stragglers is split and redistributed, pulling completion time toward the average]
* Currently implemented only by Dataflow
Autoscaling
Learn more!
● Apache Beam: https://beam.apache.org
● The World Beyond Batch: Streaming 101 & 102
  https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
● "No Shard Left Behind: Straggler-Free Data Processing in Cloud Dataflow"
● Join the mailing lists: user-subscribe@beam.apache.org, dev-subscribe@beam.apache.org
● Follow @ApacheBeam on Twitter
Thank you
