Insight Data Engineering: Open source data ingestion

Open Source Data Collection/Ingestion Treasure Data, Inc. www.treasuredata.com

Hello!
- “Committer” of Fluentd
- Treasure Data, Inc.
- Former Algorithmic Trader
- Stanford Math and CS

Table of Contents
1. Why you should care
2. Data Collection v. Data Ingestion
3. Examples: Data Collection Tools
4. Examples: Data Ingestion Tools
5. Case Study: Async App Logging
Links to be added after the talk.

Data Collection/Ingestion is HARD

(Big) Data Pipeline
[Diagram: Data Sources → (Data Collection and Ingestion) → Raw Data Storage → (Data Pre-processing) → Processed Data → (Data Fetching) → Analysis Environment. Label: Data Engineers.]

If Data Collection Goes Awry...
[Diagram: the same pipeline as on the previous slide: Data Sources → (Data Collection and Ingestion) → Raw Data Storage → (Data Pre-processing) → Processed Data → (Data Fetching) → Analysis Environment. Label: Data Engineers.]

Data Collection
- Happens where data originates
- “logging code”
- Batch v. Streaming
- Pull v. Push
Examples:
log.error("FUUUUU....WHY!?")
cln.send({"uid": 1, "action": "died"})
200 GET a.com/?utm=big%20data
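The push/pull distinction on this slide can be sketched in a few lines of Python. This is only an illustration: `ListCollector` is a hypothetical stand-in for the `cln` client in the slide's example, and `read_new_lines` is a made-up helper that mimics a collector tailing a log. Neither is a real library API.

```python
import json
import time


def make_event(uid, action, now=None):
    """Build a structured event like the slide's {"uid": 1, "action": "died"}."""
    timestamp = int(now if now is not None else time.time())
    return {"uid": uid, "action": action, "time": timestamp}


class ListCollector:
    """Hypothetical stand-in for a collector client ("cln" on the slide)."""

    def __init__(self):
        self.received = []

    def send(self, event):
        # Push: the application initiates the transfer to the collector.
        self.received.append(json.dumps(event, sort_keys=True))


def read_new_lines(lines, offset):
    """Pull: the collector asks for whatever appeared after its last offset.

    Returns the new lines and the offset to resume from next time.
    """
    return lines[offset:], len(lines)
```

In the push case the application decides when data moves; in the pull case the collector does, and it has to remember how far it has read.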

Data Ingestion
- Receives data
- Sometimes coupled with storage
- Routing data
[Diagram label: Data Ingestion Layer]

rsyslog
- The grandfather of data collectors
- Streaming
- Installed by default, widely understood
- Not as easy to extend/configure

rsyslog https://github.com/rsyslog/rsyslog/blob/master/ChangeLog

Scribe
- Written originally at Facebook
- Streaming
- Fast (C++)
- Nightmare to build, largely abandoned

Flume-ng
- Written and maintained by Cloudera (successor to Flume)
- Commercial support by Cloudera
- Track record for Hadoop
- Java can be heavy-handed for some orgs/cases

Logstash
- Pluggable architecture, rich ecosystem
- The “L” of the ELK stack by Elastic
- JRuby
- HA uses Redis as a queue (diagram source: http://apuntesdetrabajo.es/?p=263)

Heka
- Developed at Mozilla
- Written in Go, extensible w/ Lua
- Plugin system, but compilation needed (Go’s limitation, may change)

Fluentd
- Plugin architecture
- Built-in HA
- CRuby (JRuby on the roadmap)
- google-fluentd, td-agent
- Lightweight multi-source, multi-destination log routing

Embulk
- Plugin architecture
- Focuses on Batch workloads
- Java/JRuby
- Very new! (looking for contributors!)

RabbitMQ
- Written in Erlang, supported by Pivotal
- Implements AMQP

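Kafka
- Begun at LinkedIn, now an Apache project backed by Confluent
- Topic-based Message Broker: Producer/Broker/Consumer
- Distributed design
- Delivery semantics (at-least-once or at-most-once) are decided on the consumer side, by when the consumer records its position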
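The difference between at-least-once and at-most-once comes down to whether the consumer records its position before or after handling a message. The sketch below simulates that with an in-memory list; `FakePartition`, `consume`, and `run_with_one_crash` are hypothetical names for illustration and do not use or mirror any Kafka client API.

```python
class FakePartition:
    """In-memory stand-in for one topic partition plus a committed offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset to resume from after a restart


def consume(partition, processed, commit_first, crash_at=None):
    """Consume from the committed offset, optionally crashing mid-message.

    commit_first=True  -> commit, then process (at-most-once)
    commit_first=False -> process, then commit (at-least-once)
    """
    offset = partition.committed
    while offset < len(partition.messages):
        msg = partition.messages[offset]
        if commit_first:
            partition.committed = offset + 1
        else:
            processed.append(msg)
        if offset == crash_at:
            raise RuntimeError("simulated crash between the two steps")
        if commit_first:
            processed.append(msg)
        else:
            partition.committed = offset + 1
        offset += 1


def run_with_one_crash(messages, commit_first, crash_at):
    """Run a consumer that crashes once, restarts, and finishes the partition."""
    partition = FakePartition(messages)
    processed = []
    try:
        consume(partition, processed, commit_first, crash_at=crash_at)
    except RuntimeError:
        pass  # the consumer restarts and resumes from the committed offset
    consume(partition, processed, commit_first)
    return processed
```

With messages `["a", "b", "c"]` and a crash while handling the second one, processing before committing handles `"b"` twice, while committing before processing never handles `"b"` at all.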

Fluentd!?
- Used (abused?) as a bus/MQ
- Tag-based event routing
- Can be combined with RabbitMQ/Kafka, etc.
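Tag-based routing means each event carries a dot-separated tag, and the router picks a destination by matching that tag against patterns. The following is a simplified sketch of the idea, not Fluentd's implementation: it assumes `*` matches exactly one tag part, `**` matches zero or more parts, and the first matching rule wins. The function names and the destinations in the usage note are made up for illustration.

```python
def _match(pattern_parts, tag_parts):
    """Recursively match pattern parts against tag parts."""
    if not pattern_parts:
        return not tag_parts
    head, rest = pattern_parts[0], pattern_parts[1:]
    if head == "**":
        # "**" may swallow zero or more tag parts.
        return any(_match(rest, tag_parts[i:]) for i in range(len(tag_parts) + 1))
    if not tag_parts:
        return False
    if head == "*" or head == tag_parts[0]:
        return _match(rest, tag_parts[1:])
    return False


def tag_matches(pattern, tag):
    """Return True if a dot-separated tag matches a dot-separated pattern."""
    return _match(pattern.split("."), tag.split("."))


def route(rules, tag):
    """Return the destination of the first rule whose pattern matches the tag."""
    for pattern, destination in rules:
        if tag_matches(pattern, tag):
            return destination
    return None
```

For example, with rules `[("app.error", "alerts"), ("app.**", "storage")]`, the tag `app.error` goes to `alerts`, `app.access.web` goes to `storage`, and `db.query` matches nothing.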

Application Logging
- Common ask: “How’s our new feature doing?”
[Diagram: a client sends GET /foobar to the API Server, which responds 200 {...}]

Application Logging
- What NOT to do: synchronous logging
[Diagram: a client sends GET /foobar to the API Server; the API Server writes to the Data Backend and waits for the ack before responding 200 {...}]

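Application Logging
- What to do instead: asynchronous logging through a local data collector
[Diagram: a client sends GET /foobar to the API Server; the API Server writes to a Local Data Collector and gets an ack, then responds 200 {...}; the collector holds events in a Buffer and flushes them to the Data Backend]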
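The buffered flow in the diagram above (write to a local buffer, return immediately, flush to the backend later) can be sketched with a queue and a background thread. This is a minimal in-process illustration under stated assumptions: `BufferedLogger` and its parameters are hypothetical, `flush` is any callable that takes a list of events, and a real collector would also need retries and persistence, which are omitted here.

```python
import queue
import threading


class BufferedLogger:
    """Buffer events in memory and flush them in batches off the request path."""

    def __init__(self, flush, batch_size=100, flush_interval=1.0, max_buffer=10000):
        self._flush = flush  # callable taking a list of events (assumed reliable)
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._queue = queue.Queue(maxsize=max_buffer)
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, event):
        """Enqueue an event without waiting on the backend."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False  # drop instead of stalling the request

    def _run(self):
        while True:
            try:
                first = self._queue.get(timeout=self._flush_interval)
            except queue.Empty:
                if self._stop.is_set():
                    return  # buffer is empty and close() was called
                continue
            batch = [first]
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            self._flush(batch)  # no retry here; an exception would stop the worker

    def close(self):
        """Flush whatever is still buffered, then stop the worker."""
        self._stop.set()
        self._worker.join()
```

A request handler would call `logger.log({...})` and return its response right away; only the background thread ever waits on the backend. The bounded queue is a deliberate choice: when the backend falls behind, events are dropped rather than letting the buffer grow without limit.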

But wait...
- Is writing to a local log collector safe?
- What if the log collector retries on error?
- A lot of problems to think about!

“Much of the blame, little of the glory” (Just kidding. The entire data team relies on YOU!)

Thank you! (...and we are hiring!) www.treasuredata.com/careers

Bibliography
- Software
  - www.fluentd.org
  - hekad.readthedocs.org
  - logstash.org
  - kafka.apache.org
  - embulk.org
  - www.rabbitmq.com
- Ideas
  - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  - http://radar.oreilly.com/2015/04/the-log-the-lifeblood-of-your-data-pipeline.html
