Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza
This document discusses LinkedIn's transition from an offline metrics platform to a near-real-time ("nearline") architecture built on Apache Calcite and Apache Samza. It gives an overview of LinkedIn's metrics platform and its requirements, then details how the nearline architecture translates Pig jobs into optimized Samza jobs using Calcite's relational algebra and query planning. It also presents a production use case that analyzes storylines on the LinkedIn platform. With the nearline architecture, metrics are computed with latencies of 5-30 minutes, down from 3-6 hours.
Introduction to LinkedIn metrics platform and its significance in decision-making and experimentation. UMP provides a trusted metrics repository, supporting over 8000 metrics.
Discussion on the shift from offline to nearline data processing to reduce latency and improve efficiency. Focus on maintaining easy onboarding and a unified codebase.
Technical insights into the nearline architecture, emphasizing the transformation from Pig to Samza using Calcite. Details on planning, optimization and code generation.
Real-world application of UMP in editorial feedback and storyline selection for improved operational efficiency.
Summary of Lambda architecture improvements and future integration with various data processing languages and systems, suggesting a broader architectural vision.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache Calcite and Apache Samza
Khai Tran, Staff Software Engineer
Agenda
● Overview of LinkedIn metrics platform
● Moving from offline to nearline
● Under the hood of the nearline architecture
● Nearline production use case
● Conclusion
Metrics @ LinkedIn
● Metrics = Measurements over tracking data
● Crucial for decision making:
○ Experimentation - test everything
○ Reporting - monitor and alert
○ In production, site-facing applications
LinkedIn unified metrics platform (UMP)
We provide:
● A trusted repository of metrics
● A self-serve platform for sustainable lifecycle of metrics
[Diagram labels: Primary Data, Unified Metrics Platform, In production, Experimentation, Reporting]
Onboarding process
● User steps: Define, Declare, Onboard
● User code: LOAD … (data), transformation, STORE …
● Config:
○ Metrics: A = SUM(A’), B = Unique(id)
○ Downstream: XLNT, Raptor
● UMP: User Code, Platform Generated Code
● Output: Data and Metadata, each delivered to an app
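The two metric declarations in the config, A = SUM(A’) and B = Unique(id), can be illustrated with a small Python sketch. This is not UMP's generated code; the row layout, the `country` dimension, and the function name `compute_metrics` are assumptions made only for illustration.

```python
from collections import defaultdict

# Hypothetical output of the user code: one row per tracking event.
# The field names (country, a_prime, id) are illustrative assumptions.
rows = [
    {"country": "us", "a_prime": 3, "id": "m1"},
    {"country": "us", "a_prime": 2, "id": "m2"},
    {"country": "us", "a_prime": 5, "id": "m1"},
    {"country": "de", "a_prime": 1, "id": "m3"},
]


def compute_metrics(rows, dimension):
    """Compute A = SUM(a_prime) and B = Unique(id) per dimension value."""
    sums = defaultdict(int)
    ids = defaultdict(set)
    for row in rows:
        key = row[dimension]
        sums[key] += row["a_prime"]   # A = SUM(A')
        ids[key].add(row["id"])       # collect distinct ids for B
    return {key: {"A": sums[key], "B": len(ids[key])} for key in sums}


metrics = compute_metrics(rows, "country")
```

The user only declares the two aggregations; in this sketch the grouping loop stands in for the code the platform would generate around them.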
Offline computation flows
Hourly job latency: 3-6 hours -> want realtime/nearline
[Diagram labels: HDFS tables; Dali views; Espresso, Oracle, MySQL; User code ... User code; Metric union; Dimension augmentation; Cubing, Rollup; Pinot, Presto; Azkaban execution]
What we want for nearline flows
[Diagram labels: User code ... User code; Metric union; Dimension augmentation; Samza job; Pinot]
Latency is not the only requirement
Easy to onboard
● Minimum effort to convert existing offline into nearline
● Easy to write user code for new nearline flows
Easy to maintain
● Just one version of user code - single source of truth
● Run as a service
Latency
● ~5 - 30 mins
Putting things together
Lambda architecture with a single codebase
[Diagram labels: Metrics definition (code, config); UMP realtime platform: Samza jobs, Pinot; UMP offline platform: Batch jobs, HDFS; Raptor]
Current support
User code in Pig
● LOAD, STORE
● FILTER, SAMPLE, SPLIT, UNION
● Simple FOREACH
● JOIN - all semantics
● GROUP/COGROUP, DISTINCT
● Record/Array FLATTEN
● Java UDFs, Python UDFs
Not yet
● Pig Nested FOREACH and sort/limit (in Windows)
● Hive
Pig to Samza through SQL processing
Apache Calcite: open source framework for building dynamic data management systems. Including:
➢ SQL Parser
➢ Relational algebra APIs
➢ Query planning engine
We built UMP nearline with:
➢ Pig’s Grunt parser
➢ Calcite relational algebra
➢ Calcite query planning engine
Architecture
● Input: User code ... User code, Metric union, Dimension augmentation
● Pig to Calcite: convert into Calcite relational algebra as an IR
● Calcite to Samza: optimize into a Samza physical plan, then generate Samza code and Samza configuration
Pig to Calcite
● User scripts: LOAD …, LOAD ..., COGROUP ..., STORE …
● GruntParser -> Pig Logical Plan: COGROUP over two LOADs
● PigRelConverter -> Calcite relational algebra: PROJECT, FULL OUTER JOIN, two AGGREGATEs, two TABLE SCANs
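The plan shapes on this slide suggest that a Pig COGROUP is expressed as a FULL OUTER JOIN over two per-input AGGREGATEs. The Python sketch below illustrates that equivalence on in-memory rows. It is not PigRelConverter code; the function names and sample data are assumptions, and the slides do not show the converter's exact rules.

```python
from collections import defaultdict


def aggregate(rows, key):
    """AGGREGATE step: collect the rows of one input into a bag per key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key]].append(row)
    return dict(bags)


def full_outer_join(left_bags, right_bags):
    """FULL OUTER JOIN step: keep every key seen on either side.

    A key missing on one side gets an empty bag, matching COGROUP semantics.
    Building the (key, left_bag, right_bag) tuple plays the role of PROJECT.
    """
    keys = sorted(set(left_bags) | set(right_bags))
    return [(k, left_bags.get(k, []), right_bags.get(k, [])) for k in keys]


def cogroup(left, right, key):
    """COGROUP left BY key, right BY key, built from the two steps above."""
    return full_outer_join(aggregate(left, key), aggregate(right, key))


# Hypothetical inputs standing in for the two LOAD statements.
views = [
    {"member": 1, "page": "feed"},
    {"member": 1, "page": "jobs"},
    {"member": 2, "page": "feed"},
]
clicks = [
    {"member": 2, "item": "x"},
    {"member": 3, "item": "y"},
]

result = cogroup(views, clicks, "member")
```

Member 1 has views but no clicks and member 3 has clicks but no views, so both appear in the result with one empty bag, which is why an inner join would not be enough here.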
Planning/Optimization
➢ Calcite logical plans:
○ Relational algebra: What to do
➢ Samza physical plans:
○ Samza physical node: How to do it
➢ Calcite Samza planner:
○ Calcite logical plan -> optimized Samza physical plan
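The split between "what to do" and "how to do it" can be sketched as a toy planner in Python. It does not use the Calcite or Samza APIs. The node classes, the physical operator names (`StreamScan`, `StreamFilter`, `WindowedAggregate`), and the single filter-merging rule are all hypothetical, chosen only to show a logical plan being optimized and then mapped to physical nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Logical:
    """Logical node: what to do ("scan", "filter", or "aggregate")."""
    op: str
    arg: str
    child: Optional["Logical"] = None


@dataclass
class Physical:
    """Physical node: how to do it (a hypothetical streaming operator)."""
    op: str
    arg: str
    child: Optional["Physical"] = None


# Hypothetical mapping from logical operators to physical operators.
PHYSICAL_OPS = {
    "scan": "StreamScan",
    "filter": "StreamFilter",
    "aggregate": "WindowedAggregate",
}


def merge_filters(node):
    """Optimization rule: collapse adjacent filters into one conjunction."""
    if node is None:
        return None
    child = merge_filters(node.child)
    if node.op == "filter" and child is not None and child.op == "filter":
        return Logical("filter", f"({node.arg}) AND ({child.arg})", child.child)
    return Logical(node.op, node.arg, child)


def to_physical(node):
    """Conversion: replace each logical node with its physical counterpart."""
    if node is None:
        return None
    return Physical(PHYSICAL_OPS[node.op], node.arg, to_physical(node.child))


def plan(node):
    """Logical plan -> optimized physical plan."""
    return to_physical(merge_filters(node))


logical = Logical(
    "aggregate", "SUM(a_prime) BY country",
    Logical("filter", "a_prime > 0",
            Logical("filter", "country IS NOT NULL",
                    Logical("scan", "page_views"))),
)
physical = plan(logical)
```

A real planner applies many rules and may choose between alternative plans; this sketch applies one fixed rule and a one-to-one operator mapping.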
From improved Lambda architecture...
Lambda architecture with a single codebase
[Diagram labels: Metrics definition (code, config); UMP realtime platform: Samza jobs, Pinot; UMP offline platform: Batch jobs, HDFS; Raptor]
… to our bigger picture
AORA (Author Once, Run Anywhere) architecture
[Diagram labels: Pig Latin, HiveQL, SparkSQL/RDD, Presto SQL, Portable UDFs, Calcite relational algebra]