Log Data Analysis Platform

LOG DATA ANALYSIS PLATFORM May, 2015

Agenda
1) User-Group Introduction
2) Problem Statement
3) Log Data Analysis System Overview
4) Task Analysis
5) Solution Architecture
6) Trade-off Analysis
7) Automation
8) Performance Testing
9) Outcome & Plans

Demo Lab: Why did we start this project?
1) Increase Internal Experience
2) Create Reference Solution w/o NDA Limitations
3) Get Playground for Tests
4) Provide Demo Environment for Customers (using their data)
5) Decrease Time to Market (by introducing automation)

LOG DATA ANALYSIS PLATFORM : OVERVIEW

Log Data Analysis Platform Details

Key Facts:
• ~270-300 Web Servers
• Log Types: HTTPD Access logs, Error logs, Application Server Servlet, OS Service Logs
• ~500K events per minute
• 150GB of data per day

Technologies:
• Flume
• Hadoop/HDFS, MapReduce
• Hive, Impala
• Oozie
• Elasticsearch, Kibana 3
• Tableau Analytics platform
• Puppet + Vagrant
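The Key Facts can be turned into per-second and per-event figures with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes 1 GB = 10^9 bytes and that the ~500K events per minute rate holds evenly over 24 hours, neither of which the slide states.

```python
# Figures taken from the Key Facts slide.
EVENTS_PER_MINUTE = 500_000
DATA_PER_DAY_GB = 150

# Assumptions (not stated in the slides): decimal gigabytes and an
# evenly distributed load across the whole day.
BYTES_PER_GB = 10**9
MINUTES_PER_DAY = 24 * 60

events_per_second = EVENTS_PER_MINUTE / 60
events_per_day = EVENTS_PER_MINUTE * MINUTES_PER_DAY
avg_bytes_per_event = DATA_PER_DAY_GB * BYTES_PER_GB / events_per_day

print(f"{events_per_second:.0f} events/second")
print(f"{events_per_day:,} events/day")
print(f"{avg_bytes_per_event:.0f} bytes/event on average")
```

Under these assumptions the platform handles roughly 8,300 events per second at about 200 bytes per event.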

Log Data Examples

Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome

Vmstat:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache  si  so  bi  bo  in  cs us sy id wa st
 0  0 305416 260688  29160 2356920   2   2   4   1   0   0  6  1 92  2  0

iostat:
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu:  %user  %nice %system %iowait %steal  %idle
           5.68   0.00    0.52    2.03   0.00  91.76
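As an illustration of what parsing the access log involves, here is a minimal Python sketch that turns the sample access-log line into structured fields. It assumes the Apache Common Log Format; the regular expression, the `parse_access_line` helper, and the field names are illustrative and are not taken from the platform itself.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
ACCESS_LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = ACCESS_LOG_RE.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Timestamp looks like 10/Oct/2000:13:55:36 -0700
    event["time"] = datetime.strptime(event["time"], "%d/%b/%Y:%H:%M:%S %z")
    event["status"] = int(event["status"])
    # Apache writes "-" when no response body was sent.
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    return event

sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
event = parse_access_line(sample)
print(event["host"], event["user"], event["status"], event["size"])
```

Lines that do not match return `None`, so malformed events can be counted or routed elsewhere rather than stopping the pipeline.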

Architecture Drivers: Use Cases

Architecture Drivers: Quality Attributes (1/3)

Architecture Drivers: Limitations

Solution Architecture

[Diagram: three layers (Batch Layer, Serving Layer, Speed Layer). Components: Apache HTTP Servers, Raw Data Storage, Pre-computing, Batch Views (static and ad-hoc), Data Stream, Real-Time Processing and Aggregations, Real-Time Views, Dashboard/Search, Corporate BI Tool. Legend: layer boundary, data flow (with direction indicated), query flow.]

• Avro as the Raw Data Storage file format
• Parquet as the Batch Views file format
• Star schema as the Batch Views data model

Hive Stinger vs Impala

[Charts: Compression Ratio; Access Speed]

Automation (saves time and money)

[Chart labels: 80% / 20%; Development and Debugging / F&P Testing, Demo; Local Development / Cloud Development]

Automation Process

Phase | Tool | Notes
VM Provisioning | Vagrant | Supports: VirtualBox, VMware ESX, Amazon AWS
VM Bootstrapping | Puppet | Installs Cloudera Manager, Cloudera Distribution Hadoop, ElasticSearch+Kibana, Flume, Microstrategy, Log Generator. Creates Cluster using Cloudera Manager API.
Configure ETL and BI | Puppet | Configures Flume, Oozie, ElasticSearch, Impala, Hive, Microstrategy Dashboards
Integration Tests | Puppet | Generates Workload and ensures data goes through. Checks Logs for errors. Calculates timing/throughput.

Log Generator

1 thread can generate:
• 4200 events / second (File source)
• 5500 events / second (TCP source)
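These per-thread rates give a rough idea of how many generator threads a test workload needs. The sketch below assumes throughput scales linearly with thread count, which the slides do not claim; `threads_needed` is a hypothetical helper, not part of the Log Generator.

```python
import math

# Per-thread rates from the Log Generator slide (events/second).
RATE_PER_THREAD = {"file": 4200, "tcp": 5500}

def threads_needed(target_events_per_minute, source):
    """Estimate generator threads, assuming linear scaling per thread."""
    target_per_second = target_events_per_minute / 60
    return math.ceil(target_per_second / RATE_PER_THREAD[source])

# Target: the ~500K events per minute from the Key Facts.
for source in RATE_PER_THREAD:
    print(source, threads_needed(500_000, source))
```

For the ~500K events per minute production rate, this estimate comes to two threads for either source type.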

Accurate Sizing

[Diagram: workloads of 100k/min, 50k/min, 20k/min, 200k/min. Calculator!]

Outcome
1) Demo lab, playground, testing platform (in 1 hour)
2) Sizing Calculator
3) Helped to get 3 new customers (one is really, really huge)
4) Strategic Partnership with Cloudera
5) Tons of experience and fun

Plans
1) Add support for other Hadoop Distributions (Hortonworks, MapR)
2) Make Project Open-Source

Thank You!

SoftServe US Office
One Congress Plaza, 111 Congress Avenue, Suite 2700
Austin, TX 78701
Tel: 512.516.8880

Contacts
Valentyn Kropov
vkrop@softserveinc.com
Tel: 866.687.3588 x4341
