Introduction to Apache Drill

Michael Hausenblas, Chief Data Engineer EMEA, MapR
6th Swiss Big Data User Group Meeting, Zurich, 2013-03-25

Workloads
• Batch processing (MapReduce)
• Light-weight OLTP (HBase, Cassandra, etc.)
• Stream processing (Storm, S4)
• Search (Solr, Elasticsearch)
• Interactive, ad-hoc query and analysis (?)

Interactive Query at Scale
[Figure; recoverable labels: Impala, low-latency]

Use Case I
• Jane, a marketing analyst
• Determine target segments
• Data from different sources

Use Case II
• Logistics – supplier status
• Queries
  – How many shipments from supplier X?
  – How many shipments in region Y?

Shipment records (JSON):
  { "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
  { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" }
  …

Supplier table:
  SUPPLIER_ID   NAME                  REGION
  ACM           ACME Corp             US
  GAL           GotALot Inc           US
  BAP           Bits and Pieces Ltd   Europe
  ZUP           Zu Pli                Asia
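
The two queries in this use case combine a nested JSON source with a relational table. As a rough illustration only (plain Python, not Drill), the sketch below answers both questions over the two shipment records and the supplier table from the slide. The function names are hypothetical and chosen for this example.

```python
from collections import Counter

# Shipment records as shown on the slide (JSON side).
shipments = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]

# Supplier table as shown on the slide (relational side), keyed by SUPPLIER_ID.
suppliers = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}


def shipments_per_supplier(records):
    """Count shipments grouped by supplier id (hypothetical helper)."""
    return Counter(r["supplier"] for r in records)


def shipments_per_region(records, supplier_table):
    """Join each shipment to its supplier and count by region (hypothetical helper)."""
    return Counter(supplier_table[r["supplier"]]["region"] for r in records)


by_supplier = shipments_per_supplier(shipments)
by_region = shipments_per_region(shipments, suppliers)
print(by_supplier["ACM"])    # shipments from supplier ACM
print(by_region["Europe"])   # shipments in region Europe
```

The second function is the interesting one: it needs a join across two different kinds of source, which is the step that today requires ETL into a single system.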

Today’s Solutions
• RDBMS-focused
  – ETL data from MongoDB and Hadoop
  – Query data using SQL
• MapReduce-focused
  – ETL from RDBMS and MongoDB
  – Use Hive, etc.

Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable

Google’s Dremel
http://research.google.com/pubs/pub36632.html

Apache Drill Overview
• Inspired by Google’s Dremel
• Standard SQL 2003 support
• Other query languages possible
• Pluggable data sources
• Support for nested data
• Schema is optional
• Community driven, open, hundreds involved

High-level Architecture
• Each node runs a Drillbit, to maximize data locality
• Coordination, query planning, execution, etc. are distributed
• By default Drillbits hold all roles
• Any node can act as endpoint for a query
[Diagram: four nodes, each running a Drillbit next to a storage process]

High-level Architecture
• Zookeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality information, etc.
[Diagram: Zookeeper above four nodes, each running a Drillbit with a distributed cache next to a storage process]

High-level Architecture
• Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
• Streaming data communication avoiding SerDe
[Diagram: Zookeeper above four nodes, each running a Drillbit with a distributed cache next to a storage process]
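
To make the foreman role more concrete, here is a toy model in Python. It is not Drill code: the `Drillbit` class, the `plan_scan` function, and the block and node names are all hypothetical. It only illustrates the idea that whichever node receives the query plans the work and prefers nodes that already hold the data.

```python
from dataclasses import dataclass, field


@dataclass
class Drillbit:
    """Toy stand-in for a Drillbit; local_blocks models data stored on the same node."""
    name: str
    local_blocks: set = field(default_factory=set)


def plan_scan(foreman, cluster, blocks):
    """Assign each block to a Drillbit that holds it locally.

    Blocks that no Drillbit holds locally fall back to the foreman itself.
    """
    assignment = {}
    for block in blocks:
        holders = [d for d in cluster if block in d.local_blocks]
        target = holders[0] if holders else foreman
        assignment[block] = target.name
    return assignment


cluster = [
    Drillbit("node1", {"b1", "b2"}),
    Drillbit("node2", {"b3"}),
    Drillbit("node3"),
]

# Any node can receive the query; the one that does becomes the foreman.
foreman = cluster[2]
assignment = plan_scan(foreman, cluster, ["b1", "b2", "b3", "b4"])
print(assignment)
```

In the real system the locality information would come from the distributed cache and cluster membership from Zookeeper; here both are collapsed into an in-memory list to keep the sketch short.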

Principled Query Execution
[Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
[Diagram labels: SQL 2003, DrQL, MongoQL, DSL (source query languages); parser API; topology; scanner API]

Logical plan excerpt:
  query: [ {
    @id: "log",
    op: "sequence",
    do: [
      { op: "scan", source: "logs" },
      { op: "filter", condition: "x > 3" },
      …
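
The logical plan excerpt is a sequence of operators, each consuming the output of the previous one. The following Python sketch interprets just the two operators visible on the slide (`scan` and `filter`). It is an illustration under assumptions, not Drill's reference interpreter: the condition grammar (field, comparator, number), the `compile_condition` and `run_sequence` names, and the sample rows are all hypothetical.

```python
import operator
import re

# Comparators supported by this sketch (an assumption, not Drill's expression language).
COMPARATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def compile_condition(text):
    """Turn a string such as "x > 3" into a predicate over a row dict."""
    match = re.fullmatch(r"\s*(\w+)\s*(==|>|<)\s*(-?\d+(?:\.\d+)?)\s*", text)
    if match is None:
        raise ValueError(f"unsupported condition: {text!r}")
    field_name, symbol, literal = match.groups()
    compare = COMPARATORS[symbol]
    value = float(literal)
    return lambda row: compare(row[field_name], value)


def run_sequence(plan, sources):
    """Run the operators of a 'sequence' plan in order."""
    rows = []
    for step in plan["do"]:
        if step["op"] == "scan":
            # Read all rows from the named source.
            rows = list(sources[step["source"]])
        elif step["op"] == "filter":
            # Keep only rows that satisfy the condition.
            predicate = compile_condition(step["condition"])
            rows = [row for row in rows if predicate(row)]
        else:
            raise ValueError(f"unsupported op: {step['op']!r}")
    return rows


# The two operators visible on the slide; the rest of the plan is truncated there.
plan = {
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}

# Hypothetical rows standing in for the "logs" source.
sources = {"logs": [{"x": 1}, {"x": 5}, {"x": 3}, {"x": 8}]}

result = run_sequence(plan, sources)
print(result)
```

Keeping the plan as plain data is what lets different front ends (SQL 2003, DrQL, MongoQL, a DSL) target the same downstream optimizer and execution stages.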

Drillbit Modules
[Diagram; components shown: RPC Endpoint; Parser (SQL, HiveQL, Pig, Mongo); Logical Plan; Optimizer; Physical Plan; Foreman; Scheduler; Operators; Storage Engine Interface (DFS Engine, HBase Engine); Distributed Cache]

Key Features
• Full SQL 2003
• Nested data
• Optional schema
• Extensibility points

Full SQL – ANSI SQL 2003
• SQL-like is often not enough
• Integration with existing tools
  – Datameer, Tableau, Excel, SAP Crystal Reports
  – Use standard ODBC/JDBC driver

Nested Data
• Nested data becoming prevalent
  – JSON/BSON, XML, ProtoBuf, Avro
  – Some data sources support it natively (MongoDB, etc.)
• Flattening nested data is error-prone
• Extension to ANSI SQL 2003
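
One way to see why flattening is error-prone: a parent record with a nested list turns into several rows, each repeating the parent's fields. The Python sketch below uses the donut record from the deck's example, abbreviated to the two batters visible there. The `flatten_batters` helper and the `batter_id` / `batter_type` column names are hypothetical.

```python
# Donut record from the deck's example, limited to the two batters shown there.
donut = {
    "id": "0001",
    "type": "donut",
    "ppu": 0.55,
    "batters": {
        "batter": [
            {"id": "1001", "type": "Regular"},
            {"id": "1002", "type": "Chocolate"},
        ]
    },
}


def flatten_batters(record):
    """Produce one flat row per nested batter, copying the parent fields into each."""
    parent = {key: value for key, value in record.items() if key != "batters"}
    return [
        {**parent, "batter_id": batter["id"], "batter_type": batter["type"]}
        for batter in record["batters"]["batter"]
    ]


rows = flatten_batters(donut)

# Reading the nested structure directly keeps one donut as one record.
nested_types = [batter["type"] for batter in donut["batters"]["batter"]]

print(len(rows))       # number of flat rows produced from a single donut
print(nested_types)
```

After flattening, anything computed per donut (a count, a sum over `ppu`) has to account for the duplicated parent rows, which is easy to get wrong. Querying the nested form directly avoids that step.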

Optional Schema
• Many data sources don’t have rigid schemas
  – Schema changes rapidly
  – Different schema per record (e.g. HBase)
• Supports queries against unknown schema
• User can define the schema, or it can be discovered
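
Schema discovery can be pictured as scanning records and collecting which fields occur with which types. The Python sketch below is a minimal illustration of that idea only; it does not reflect how Drill implements discovery, and the `discover_schema` function and the sample records are hypothetical.

```python
def discover_schema(records):
    """Map each field name to the set of Python type names observed for it."""
    schema = {}
    for record in records:
        for name, value in record.items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema


# Hypothetical records whose fields differ from one record to the next.
records = [
    {"row": "r1", "clicks": 3},
    {"row": "r2", "clicks": "n/a", "referrer": "example.org"},
]

schema = discover_schema(records)
for name in sorted(schema):
    print(name, sorted(schema[name]))
```

Note that `clicks` ends up with two observed types and `referrer` appears in only one record. A query engine with optional schema has to cope with both situations at query time.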

Extensibility Points
• Source query – parser API
• Custom operators, UDF – logical plan
• Optimizer
• Data sources and formats – scanner API
[Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]

… and Hadoop?
• HDFS can be a data source
• Complementary use cases …
• … use Apache Drill
  – Find record with specified condition
  – Aggregation under dynamic conditions
• … use MapReduce
  – Data mining with multiple iterations
  – ETL
https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Example

data source: donuts.json
  { "id": "0001", "type": "donut", "ppu": 0.55,
    "batters": { "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      …

logical plan: simple_plan.json
  query:[ {
    op:"sequence",
    do:[
      { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} },
      { op: "filter", expr: "donuts.ppu < 2.00" },
      …

result: out.json
  { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
  { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
  { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }

https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
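
The plan on the slide is truncated after the filter step, so the operators that produce `sales`, `typeCount` and `quantity` are not shown. What can be checked from the slide alone is that the three result records are internally consistent. The Python sketch below tests two observations: every `ppu` passes the filter `donuts.ppu < 2.00`, and `sales` appears to equal `quantity × ppu`. The second point is an observation about the numbers, not a statement of how the plan computes them. The `is_consistent` helper is hypothetical.

```python
import math

# Result records exactly as listed on the slide (out.json).
results = [
    {"sales": 700.0, "typeCount": 1, "quantity": 700, "ppu": 1.0},
    {"sales": 109.71, "typeCount": 2, "quantity": 159, "ppu": 0.69},
    {"sales": 184.25, "typeCount": 2, "quantity": 335, "ppu": 0.55},
]


def is_consistent(row, tolerance=0.005):
    """Check sales against quantity * ppu, allowing for rounding to cents."""
    return math.isclose(row["sales"], row["quantity"] * row["ppu"], abs_tol=tolerance)


sales_match = [is_consistent(row) for row in results]
all_pass_filter = all(row["ppu"] < 2.00 for row in results)

print(sales_match)
print(all_pass_filter)
```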

Status
• Heavy development by multiple organizations
• Available
  – Logical plan (ADSP)
  – Reference interpreter
  – Basic SQL parser
  – Basic demo
  – Basic HBase back-end

Status March/April
• Larger SQL syntax
• Physical plan
• In-memory compressed data interfaces
• Distributed execution, focused on high-performance sort, aggregation and join on large clusters

Contributing
• Dremel-inspired columnar format: Twitter’s Parquet and Hive’s ORC file
• Integration with Hive metastore (?)
• DRILL-13 Storage Engine: Define Java Interface
• DRILL-15 Build HBase storage engine implementation

Contributing
• DRILL-48 RPC interface for query submission and physical plan execution
• DRILL-53 Setup cluster configuration and membership mgmt system
  – ZK for coordination
  – Helix for partition and resource assignment (?)
• Further schedule
  – Alpha Q2
  – Beta Q3

Kudos to …
• Julian Hyde, Pentaho
• Timothy Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR

Engage!
• Follow @ApacheDrill on Twitter
• Sign up for the mailing lists (user | dev): http://incubator.apache.org/drill/mailing-lists.html
• Learn where and how to contribute: https://cwiki.apache.org/confluence/display/DRILL/Contributing
• Keep an eye on http://drill-user.org/
