Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza ​Khai Tran ​Staff Software Engineer
Agenda ● Overview of LinkedIn metrics platform ● Moving from offline to nearline ● Under the hood of the nearline architecture ● Nearline production usecase ● Conclusion
Overview of LinkedIn metrics platform
Metrics @ LinkedIn ● Metrics = Measurements over tracking data ● Crucial for decision making: ○ Experimentation - test everything ○ Reporting - monitor and alert ○ In production, site-facing applications
We provide: ● A trusted repository of metrics ● A self-serve platform for sustainable lifecycle of metrics In production Experimentation Reporting Primary Data Unified Metrics Platform LinkedIn unified metrics platform (UMP)
Growth of UMP Metrics 2016 20172015 6800 4680 1100 Current: 8000+ metrics
# code LOAD … # data # transformation # code STORE … # config Metrics: - A = SUM(A’) - B = Unique(id) Downstream: - XLNT - Raptor UMP User Code Platform Generated Code To App To App DefineDeclare Onboard Data Metadata Onboarding process User
Moving from offline to nearline
Offline computation flows Hourly job latency: 3-6 hours -> want realtime/nearline ...... Metric union User code User code Cubing, Rollup Dimension augmentation HDFS tables Dali views Pinot, Presto Azkaban execution Espresso, Oracle, MySQL
... What we want for nearline flows Metric unionUser code User code Samza job Dimension augmentation Pinot
Latency is not the only requirement Easy to onboard ● Minimum effort to convert existing offline into nearline ● Easy to write user code for new nearline flows Easy to maintain ● Just one version of user code - single source of truth ● Run as a service Latency ● ~5 - 30 mins
Samza jobs Putting things together Pinot Batch jobs UMP realtime platform UMP offline platform HDFS Raptor Lambda architecture with a single codebase code configMetrics definition
Current support User code in Pig ● LOAD, STORE ● FILTER, SAMPLE, SPLIT, UNION ● Simple FOREACH ● JOIN - all semantics ● GROUP/COGROUP, DISTINCT ● Record/Array FLATTEN ● Java UDFs, Python UDFs ● Pig Nested FOREACH and sort/limit (in Windows) ● Hive Not yet
Under the hood of the nearline architecture
Pig to Samza through SQL processing Open source framework for building dynamic data management systems. Including: ➢ SQL Parser ➢ Relational algebra APIs ➢ Query planning engine We built UMP nearline with: ➢ Pig’s Grunt parser ➢ Calcite relational algebra ➢ Calcite query planning engine
Architecture ... Metric union User code User code Dimension augmentation Calcite relational algebra as an IR convert generate Samza code optimize Samza physical plan Samza configuration Pig to Calcite Calcite to Samza
Pig to Calcite # code LOAD … LOAD ... COGROUP ... STORE … GruntParser CO- GROUP LOAD LOAD PigRelConverter FULL OUTERJ OIN AGGRE GATE AGGRE GATE TABLE SCAN TABLE SCAN PRO- JECT User scripts Pig Logical Plan Calcite relational algebra
Example
Example
Example INNER JOIN FILTER FILTER PROJECT PROJECT PROJECT TABLE SCAN TABLE SCAN Calcite logical plan
Planning/Optimization ➢ Calcite logical plans: ○ Relational algebra: What to do ➢ Samza physical plans: ○ Samza physical node: How to do it ➢ Calcite Samza planner: ○ Calcite logical plan -> optimized Samza physical plan
Example Stream- Stream Self Join Samza Project Samza Project Samza Filter Samza Filter Samza Project Input Stream INNER JOIN FILTER FILTER PROJECT PROJECT PROJECT TABLE SCAN TABLE SCAN Calcite Samza planner Calcite logical plan Samza physical plan
Code-gen From Samza physical plans: ➢ Generate Samza code for constructing the stream graph using Samza Fluent APIs . Mapping: ➢ Samza physical nodes -> corresponding stream APIs: ○ Samza project -> stream.map() ○ Samza filter -> stream.filter() ○ ... ➢ Relational expressions -> lamba functions: ○ Filter expressions -> filter() functions ○ Project expressions -> map() functions ○ ...
Example Schema and UDF declarations Operator mapping Filter functions Map functions Produce to Kafka ...
Config-gen Stream Stream Join Samza Project Samza Project Samza Filter Samza Filter Samza Project Input Stream # dataset.conf app-src app-def
Nearline production use case - Storylines
Top stories picked up by editors
Feedback to editor - powered by UMP realtime
Conclusion
Samza jobs From improved Lambda architecture... Pinot Batch jobs UMP realtime platform UMP offline platform HDFS Raptor Lambda architecture with a single codebase code configMetrics definition
… to our bigger picture Pig Latin Calcite relational algebra HiveQL SparkSQL/ RDD Presto SQL Portable UDFs AORA (Author Once, Run Anywhere) architecture
Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza

Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza

  • 1.
    Conquering the Lambdaarchitecture in LinkedIn metrics platform with Apache Calcite and Apache Samza ​Khai Tran ​Staff Software Engineer
  • 2.
    Agenda ● Overview ofLinkedIn metrics platform ● Moving from offline to nearline ● Under the hood of the nearline architecture ● Nearline production usecase ● Conclusion
  • 3.
    Overview of LinkedInmetrics platform
  • 4.
    Metrics @ LinkedIn ●Metrics = Measurements over tracking data ● Crucial for decision making: ○ Experimentation - test everything ○ Reporting - monitor and alert ○ In production, site-facing applications
  • 5.
    We provide: ● Atrusted repository of metrics ● A self-serve platform for sustainable lifecycle of metrics In production Experimentation Reporting Primary Data Unified Metrics Platform LinkedIn unified metrics platform (UMP)
  • 6.
    Growth of UMPMetrics 2016 20172015 6800 4680 1100 Current: 8000+ metrics
  • 7.
    # code LOAD … #data # transformation # code STORE … # config Metrics: - A = SUM(A’) - B = Unique(id) Downstream: - XLNT - Raptor UMP User Code Platform Generated Code To App To App DefineDeclare Onboard Data Metadata Onboarding process User
  • 8.
  • 9.
    Offline computation flows Hourlyjob latency: 3-6 hours -> want realtime/nearline ...... Metric union User code User code Cubing, Rollup Dimension augmentation HDFS tables Dali views Pinot, Presto Azkaban execution Espresso, Oracle, MySQL
  • 10.
    ... What we wantfor nearline flows Metric unionUser code User code Samza job Dimension augmentation Pinot
  • 11.
    Latency is notthe only requirement Easy to onboard ● Minimum effort to convert existing offline into nearline ● Easy to write user code for new nearline flows Easy to maintain ● Just one version of user code - single source of truth ● Run as a service Latency ● ~5 - 30 mins
  • 12.
    Samza jobs Putting thingstogether Pinot Batch jobs UMP realtime platform UMP offline platform HDFS Raptor Lambda architecture with a single codebase code configMetrics definition
  • 13.
    Current support User codein Pig ● LOAD, STORE ● FILTER, SAMPLE, SPLIT, UNION ● Simple FOREACH ● JOIN - all semantics ● GROUP/COGROUP, DISTINCT ● Record/Array FLATTEN ● Java UDFs, Python UDFs ● Pig Nested FOREACH and sort/limit (in Windows) ● Hive Not yet
  • 14.
    Under the hoodof the nearline architecture
  • 15.
    Pig to Samzathrough SQL processing Open source framework for building dynamic data management systems. Including: ➢ SQL Parser ➢ Relational algebra APIs ➢ Query planning engine We built UMP nearline with: ➢ Pig’s Grunt parser ➢ Calcite relational algebra ➢ Calcite query planning engine
  • 16.
    Architecture ... Metric union User code Usercode Dimension augmentation Calcite relational algebra as an IR convert generate Samza code optimize Samza physical plan Samza configuration Pig to Calcite Calcite to Samza
  • 17.
    Pig to Calcite #code LOAD … LOAD ... COGROUP ... STORE … GruntParser CO- GROUP LOAD LOAD PigRelConverter FULL OUTERJ OIN AGGRE GATE AGGRE GATE TABLE SCAN TABLE SCAN PRO- JECT User scripts Pig Logical Plan Calcite relational algebra
  • 18.
  • 19.
  • 20.
  • 21.
    Planning/Optimization ➢ Calcite logicalplans: ○ Relational algebra: What to do ➢ Samza physical plans: ○ Samza physical node: How to do it ➢ Calcite Samza planner: ○ Calcite logical plan -> optimized Samza physical plan
  • 22.
    Example Stream- Stream Self Join Samza Project Samza Project Samza Filter Samza Filter Samza Project Input Stream INNER JOIN FILTER FILTER PROJECTPROJECT PROJECT TABLE SCAN TABLE SCAN Calcite Samza planner Calcite logical plan Samza physical plan
  • 23.
    Code-gen From Samza physicalplans: ➢ Generate Samza code for constructing the stream graph using Samza Fluent APIs . Mapping: ➢ Samza physical nodes -> corresponding stream APIs: ○ Samza project -> stream.map() ○ Samza filter -> stream.filter() ○ ... ➢ Relational expressions -> lamba functions: ○ Filter expressions -> filter() functions ○ Project expressions -> map() functions ○ ...
  • 24.
    Example Schema and UDFdeclarations Operator mapping Filter functions Map functions Produce to Kafka ...
  • 25.
  • 26.
    Nearline production usecase - Storylines
  • 27.
  • 28.
    Feedback to editor- powered by UMP realtime
  • 29.
  • 30.
    Samza jobs From improvedLambda architecture... Pinot Batch jobs UMP realtime platform UMP offline platform HDFS Raptor Lambda architecture with a single codebase code configMetrics definition
  • 31.
    … to ourbigger picture Pig Latin Calcite relational algebra HiveQL SparkSQL/ RDD Presto SQL Portable UDFs AORA (Author Once, Run Anywhere) architecture