Agile Analytics Applications
Russell Jurney (@rjurney) - Hadoop Evangelist @Hortonworks
Formerly Viz & Data Science at Ning and LinkedIn: HBase Dashboards, Career Explorer, InMaps
Agile Data - The Book (March 2013). Read it now on OFPS. A philosophy, not the only way. But still, it's good! Really!
Situation in Brief
Agile Application Development
• LAMP stack mature
• Post-Rails frameworks to choose from
• We enable rapid feedback and agility + NoSQL
Data Warehousing
Scientific Computing / HPC
• ‘Smart kid’ only: MPI, Globus, etc.
• Math heavy
Tubes and Mercury (old school); Cores and Spindles (new school)
Data Science? [Diagram: one-third each - Application Development, Data Warehousing, and Scientific Computing / HPC]
Data Center as Computer
• Warehouse Scale Computers and applications
“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here.
Hadoop to the Rescue! Big data refinery / modernize ETL.
[Architecture diagram: new data sources (web logs, clicks; social, feeds; sensors, devices, RFID; audio, video, images; docs, text, XML; spatial, GPS; events, other) flow into HDFS and the Apache Hadoop “Big Data Refinery,” which feeds business transactions & interactions (CRM, ERP, SCM via SQL, NoSQL, NewSQL) and business intelligence & analytics (ETL, EDW, MPP; dashboards, reports, visualization).]
I stole this slide from Eric.
Hadoop to the Rescue!
• A department can afford a Hadoop cluster, let alone an org
• Dump all your data in one place: HDFS
• JOIN like crazy!
• ETL like whoah!
• An army of mappers and reducers at your command
• Now what?
Analytics Apps: It takes a Team
• Broad skill-set to make useful apps
• Basically nobody has them all
• Application development is inherently collaborative
How to get insight into product?
• Back-end has gotten t-h-i-c-k-e-r
• Generating $$$ insight can take 10-100x app dev
• Timeline disjoint: analytics vs agile app-dev/design
• How do you ship insights efficiently?
• How do you collaborate on research vs developer timeline?
The Wrong Way - Part One: “We made a great design. Your job is to predict the future for it.”
The Wrong Way - Part Two: “What's taking you so long to reliably predict the future?”
The Wrong Way - Part Three: “The users don’t understand what 86% true means.”
The Wrong Way - Part Four: GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!!
The Wrong Way - Inevitable Conclusion [Image: plane meets mountain]
Reminds me of...
Chief Problem
You can’t design insight in analytics applications. You discover it. You discover by exploring.
-> Strategy
So make an app for exploring your data. Iterate and publish intermediate results. Which becomes a palette for what you ship.
Data Design
• Not the 1st query that = insight, it's the 15th, or the 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch...
• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in the wrong statistical models
• Semantics of presenting predictions are complex, delicate
• Opportunity lies at intersection of data & design
How do we get back to Agile?
Statement of Principles (then tricks, with code)
Set up an environment where...
• Insights are repeatedly produced
• Iterative work is shared with the entire team
• Interactive from day 0
• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship
• Until the application pays for itself and more
Value document > relation
Most data is dirty. Most data is semi-structured or un-structured. Rejoice!
Value document > relation
Note: Hive/ArrayQL/NewSQL’s support of documents/array types blurs this distinction.
Value imperative > declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively. (See the sketch below.)
• 85% of a data scientist’s time is spent munging. See: ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters, self optimize
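A tiny sketch, in Python, of what “check each step, clean iteratively” looks like in practice; the file name and field names are illustrative, not from the deck:

# Count what a cleaning rule would drop before committing to it.
import json

emails = [json.loads(line) for line in open('emails.json')]  # one document per line

def has_sender(email):
    from_field = email.get('from') or {}
    return bool(from_field.get('address'))

valid = [e for e in emails if has_sender(e)]
print('dropped %d of %d records with no sender' % (len(emails) - len(valid), len(emails)))

# Eyeball a few rejects, refine the rule, re-run: snowballing insight.
print([e for e in emails if not has_sender(e)][:3])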
Value dataflow > SELECT
Example dataflow: ETL + email sent count. (I can’t read this either. Get a big version here.)
Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, Javascript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive... See: HCatalog for Pig/Hive integration, and this post.
Localhost vs Petabyte scale: same tools
• Simplicity essential to scalability: use the highest-level tools we can
• Prepare a good sample - tricky with joins, easy with documents (see the sketch below)
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3.
• Everything we serve in our app is re-creatable via Hadoop.
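A minimal sketch of preparing a local sample, assuming one JSON document per line (with self-contained documents a uniform random sample of lines is enough; with relational inputs you would have to sample all sides of a join consistently):

import random

# Keep roughly 1% of documents for localhost development.
# 'emails.json' and 'emails_sample.json' are illustrative paths.
with open('emails.json') as full, open('emails_sample.json', 'w') as sample:
    for line in full:
        if random.random() < 0.01:
            sample.write(line)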
Data-Value Pyramid. Climb it. Do not skip steps. See here.
0/1) Display atomic records on the web
0.0) Document-serialize events
• Protobuf
• Thrift
• JSON
• Avro - I use Avro because the schema is onboard. (See the sketch below.)
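A minimal sketch of writing one email document with the Python avro package; the schema file name and record fields are assumptions for illustration. The container file embeds the schema in its header - that is what “the schema is onboard” means: any reader can deserialize without fetching a schema from elsewhere.

from avro.datafile import DataFileWriter
from avro.io import DatumWriter
import avro.schema

# 'email.avsc' is a hypothetical Avro schema matching the fields below.
schema = avro.schema.parse(open('email.avsc').read())  # avro.schema.Parse in avro-python3

writer = DataFileWriter(open('emails.avro', 'wb'), DatumWriter(), schema)
writer.append({
    'message_id': '<1731.10095812390082.JavaMail.evans@thyme>',
    'date': '2000-08-28T01:50:00.000Z',
    'from': {'address': 'bob.dobbs@enron.com', 'name': None},
    'subject': 'Re: Enron trade for frop futures',
    'body': 'scamming people, blah blah',
    'tos': [{'address': 'connie@enron.com', 'name': None}],
    'ccs': [], 'bccs': [],
})
writer.close()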
0.1) Documents via Relation ETL

enron_messages = load '/enron/enron_messages.tsv' as (
    message_id:chararray,
    sql_date:chararray,
    from_address:chararray,
    from_name:chararray,
    subject:chararray,
    body:chararray
);

enron_recipients = load '/enron/enron_recipients.tsv' as (
    message_id:chararray,
    reciptype:chararray,
    address:chararray,
    name:chararray
);

split enron_recipients into
    tos IF reciptype=='to',
    ccs IF reciptype=='cc',
    bccs IF reciptype=='bcc';

headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;

emails = foreach with_headers generate
    enron_messages::message_id as message_id,
    CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
    TOTUPLE(enron_messages::from_address, enron_messages::from_name)
        as from:tuple(address:chararray, name:chararray),
    enron_messages::subject as subject,
    enron_messages::body as body,
    headers::tos.(address, name) as tos,
    headers::ccs.(address, name) as ccs,
    headers::bccs.(address, name) as bccs;

store emails into '/enron/emails.avro' using AvroStorage();

Example here.
0.2) Serialize events from streams

class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      self.imap.shutdown()  # close any prior connection (was 'imap.shutdown()', a bug)
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)

Scrape your own gmail in Python and Ruby.
0.3) ETL Logs

log_data = LOAD 'access_log'
    USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
    AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);
1) Plumb atomic events -> browser (Example stack that enables high productivity)
Lots of Stack Options with Examples
• Pig with Voldemort, Ruby, Sinatra: example
• Pig with ElasticSearch: example
• Pig with MongoDB, Node.js: example
• Pig with Cassandra, Python Streaming, Flask: example
• Pig with HBase, JRuby, Sinatra: example
• Pig with Hive via HCatalog: example (trivial on HDP)
• Up next: Accumulo, Redis, MySQL, etc.
1.1) cat our Avro serialized events

me$ cat_avro ~/Data/enron.avro
{
  u'bccs': [],
  u'body': u'scamming people, blah blah',
  u'ccs': [],
  u'date': u'2000-08-28T01:50:00.000Z',
  u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
  u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
  u'subject': u'Re: Enron trade for frop futures',
  u'tos': [ {u'address': u'connie@enron.com', u'name': None} ]
}

Get cat_avro in python, ruby
1.2) Load our events in Pig

me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails

emails: {
  message_id: chararray,
  datetime: chararray,
  from: tuple(address: chararray, name: chararray),
  subject: chararray,
  body: chararray,
  tos: {to: (address: chararray, name: chararray)},
  ccs: {cc: (address: chararray, name: chararray)},
  bccs: {bcc: (address: chararray, name: chararray)}
}
1.3) ILLUSTRATE our events in Pig

grunt> illustrate enron_emails

emails:
  message_id:chararray                                    <1731.10095812390082.JavaMail.evans@thyme>
  datetime:chararray                                      2001-01-09T06:38:00.000Z
  from:tuple(address:chararray,name:chararray)            (bob.dobbs@enron.com, J.R. Bob Dobbs)
  subject:chararray                                       Re: Enron trade for frop futures
  body:chararray                                          scamming people, blah blah
  tos:bag{to:tuple(address:chararray,name:chararray)}     {(connie@enron.com,)}
  ccs:bag{cc:tuple(address:chararray,name:chararray)}     {}
  bccs:bag{bcc:tuple(address:chararray,name:chararray)}   {}

Upgrade to Pig 0.10+
1.4) Publish our events to a ‘database’

From Avro to MongoDB in one command:

pig -l /tmp -x local -v -w -param avros=enron.avro \
    -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig

Which does this:

/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false

/* Shortcut */
define MongoStorage com.mongodb.hadoop.pig.MongoStorage();

/* By default, let's have 5 reducers */
set default_parallel 5

avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();

Full instructions here.
1.5) Check events in our ‘database’

$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
  "_id" : ObjectId("502b4ae703643a6a49c8d180"),
  "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
  "date" : "2001-01-09T06:38:00.000Z",
  "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
  "subject" : "Re: Enron trade for frop futures",
  "body" : "Scamming more people...",
  "tos" : [ { "address" : "connie@enron.com", "name" : null } ],
  "ccs" : [ ],
  "bccs" : [ ]
}
1.6) Publish events on the web

require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'

connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']

get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end
1.6) Publish events on the web [screenshot of the rendered email page]
What's the point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!
1.7) Wrap events with Bootstrap

<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
  <div class="container" style="margin-top: 100px;">
    <table class="table table-striped table-bordered table-condensed">
      <thead>
        {% for key in data['keys'] %}
          <th>{{ key }}</th>
        {% endfor %}
      </thead>
      <tbody>
        <tr>
          {% for value in data['values'] %}
            <td>{{ value }}</td>
          {% endfor %}
        </tr>
      </tbody>
    </table>
  </div>
</body>

Complete example here with code here.
1.7) Wrap events with Bootstrap [screenshot of the styled table]
Refine. Add links between documents. Not the Mona Lisa, but coming along... See: here
The Mona Lisa. In pure CSS. See: here
1.8) List links to sorted events

Use Pig, serve/cache a bag/array of email documents:

pig -l /tmp -x local -v -w

emails_per_user = foreach (group emails by from.address) {
    sorted = order emails by date;
    last_1000 = limit sorted 1000;
    generate group as from_address, last_1000 as emails; /* was 'emails as emails', which ignored the limit */
};

store emails_per_user into '$mongourl' using MongoStorage();

Or use your ‘database’, if it can sort:

mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
  "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
  "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
  "from" : [
...
1.8) List links to sorted documents [screenshot]
1.9) Make it searchable...

If you have a list, search is easy with ElasticSearch and Wonderdog...

/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();

emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch(
    '/me/elasticsearch-0.18.6/config/elasticsearch.yml',
    '/me/elasticsearch-0.18.6/plugins');

Test it with curl:

curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'

ElasticSearch has no security features. Take note. Isolate.
From now on we speed up... Don’t worry, it's in the book and on the blog. http://hortonworks.com/blog/
2) Create Simple Charts
2) Create Simple Tables and Charts
2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying them is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc. (see the serving sketch below)
• Fill a page with charts.
• Add a chart to your event page.
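A hedged sketch of the serving side in Python/Flask, assuming a hypothetical emails_per_month collection that a Pig job published to MongoDB; an nvd3.js chart on the page would fetch this JSON:

from flask import Flask, jsonify
from pymongo import MongoClient  # pymongo 2.4+; older versions use pymongo.Connection

app = Flask(__name__)
collection = MongoClient()['agile_data']['emails_per_month']  # hypothetical collection

@app.route('/api/emails_per_month/<address>')
def emails_per_month(address):
    # One document per sender, e.g. {'from_address': ..., 'timeseries': [{'month': ..., 'total': ...}]}
    record = collection.find_one({'from_address': address}, {'_id': False})
    return jsonify(record or {})

if __name__ == '__main__':
    app.run(debug=True)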
2.1) Top N (of anything) in Pig

pig -l /tmp -x local -v -w

top_things = foreach (group things by key) {
    sorted = order things by arbitrary_rank desc;
    top_10_things = limit sorted 10;
    generate group as key, top_10_things as top_10_things;
};

store top_things into '$mongourl' using MongoStorage();

Remember, this is the same structure the browser gets as json. This would make a good Pig Macro.
2.2) Time Series (of anything) in Pig

pig -l /tmp -x local -v -w

/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
    generate flatten(group) as (key, month), COUNT_STAR(things) as total;

/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
    timeseries = order things_by_month by month;
    generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();

Yet another good Pig Macro.
Data processing in our stack
A new feature in our application might begin at any layer... great!
[Cartoon: team members at every layer of the stack chime in - “I know Pig!”, “I <3 Javascript!”, “I’m creative!”, “I’m creative too!” - plus lolcat chatter (“omghi2u”, “where r my legs?”, “send halp”).]
Any team member can add new features, no problemo!
Data processing in our stack
... but we shift the data-processing towards batch, as we are able. See real example here.
Ex: Overall total emails calculated in each layer (sketch below)
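To make the example concrete, a sketch of the same “overall total emails” stat computed at two different layers; the stats collection and its fields are assumptions:

from pymongo import MongoClient

db = MongoClient()['agile_data']

# In the web tier, on every request - fine while exploring, slow at scale.
total = db['emails'].count()  # pymongo 2.x API; newer versions use count_documents({})

# Shifted to batch: a nightly Pig job publishes one precomputed document,
# and the web tier just reads it.
doc = db['stats'].find_one({'stat': 'total_emails'})  # e.g. {'stat': 'total_emails', 'value': 250000}
total = doc['value'] if doc else 0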
3) Exploring with Reports
3) Exploring with Reports [screenshot]
3.0) From charts to reports...
• Extract entities from properties we aggregated by in charts (Step 2)
• Each entity gets its own type of web page (see the sketch below)
• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)
• Link most related entities together, same and between types.
• More visualizations!
• Parameterize results via forms.
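A sketch of “each unique entity gets its own web page” in Python/Flask, reusing the emails collection from earlier; the route and template name are illustrative:

from flask import Flask, render_template
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient()['agile_data']

# One page type per entity type (email address); one URL per unique entity.
@app.route('/address/<address>')
def address_page(address):
    sent = db['emails'].find({'from.address': address}).limit(20)
    return render_template('address.html', address=address, emails=list(sent))  # hypothetical template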
3.1) Looks like this... [screenshot]
3.2) Cultivate common keyspaces
3.3) Get people clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what's interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.
‘People’ could be just your team, if data is sensitive.
4) Predictions and Recommendations
4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!
4.1) Smooth sparse data. See here. (Sketch below.)
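One common way to smooth a sparse, noisy series is a trailing moving average; a minimal sketch in Python (the technique choice here is mine for illustration, not necessarily the one in the linked post):

def smooth(series, window=3):
    """Trailing moving average over a list of numeric values."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / float(len(chunk)))
    return out

# e.g. a sparse per-month email count
print(smooth([0, 12, 0, 0, 9, 30, 0], window=3))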
4.2) Think in different perspectives
• Networks
• Time Series
• Distributions
• Natural Language
• Probability / Bayes
See here.
4.3) Sink more time in deeper analysis

TF-IDF:

import 'tfidf.macro';
tfidf_all = tf_idf(id_body, 'message_id', 'body');

/* Get the top 10 TF-IDF scores per message */
per_message_cassandra = foreach (group tfidf_all by message_id) {
    sorted = order tfidf_all by value desc;
    top_10_topics = limit sorted 10;
    generate group, top_10_topics.(score, value);
};

Probability / Bayes:

sent_replies = join sent_counts by (from, to), reply_counts by (from, to);
reply_ratios = foreach sent_replies generate
    sent_counts::from as from,
    sent_counts::to as to,
    (float)reply_counts::total / (float)sent_counts::total as ratio;
reply_ratios = foreach reply_ratios generate
    from, to,
    (ratio > 1.0 ? 1.0 : ratio) as ratio; /* cap the ratio at 1.0 */

Example with code here and macro here.
4.4) Add predictions to reports
5) Enable new actions
Example: Packetpig and PacketLoop

snort_alerts = LOAD '$pcap'
    USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
    GENERATE
        com.packetloop.packetpig.udf.geoip.Country(src) as country,
        priority;

countries = GROUP countries BY country;

countries = FOREACH countries
    GENERATE
        group,
        AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');

Code here.
Example: Packetpig and PacketLoop [screenshot]
Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor, manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future-proof your cluster growth
Benefits: reduce risks and cost of adoption; lower the total cost to administer and provision; integrate with your existing ecosystem.
Hortonworks Training
The expert source for Apache Hadoop training & certification

Role-based Developer and Administration training
– Coursework built and maintained by the core Apache Hadoop development team.
– The “right” course, with the most extensive and realistic hands-on materials
– Provide an immersive experience into real-world Hadoop scenarios
– Public and Private courses available

Comprehensive Apache Hadoop Certification
– Become a trusted and valuable Apache Hadoop expert
Next Steps?
1. Download Hortonworks Data Platform: hortonworks.com/download
2. Use the getting started guide: hortonworks.com/get-started
3. Learn more... get support

Hortonworks Training (hortonworks.com/training):
• Expert role-based training
• Courses for admins, developers and operators
• Certification program
• Custom onsite options

Hortonworks Support (hortonworks.com/support):
• Full lifecycle technical support across four service levels
• Delivered by Apache Hadoop Experts/Committers
• Forward-compatible
Thank You! Questions & Answers
Slides: http://slidesha.re/O8kjaF
Follow: @hortonworks and @rjurney
Read: hortonworks.com/blog