Cheap Parlor Tricks, Counting, and Clustering. Derek Gottfrid, The New York Times, October 2009
Evolution of Hadoop @ NYTimes.com
Early Days (2007): A solution looking for a problem
Solution: Wouldn't it be cool to use lots of EC2 instances? (It's cheap; nobody will notice.) Wouldn't it be cool to use Hadoop? (Google-style MapReduce is awesome.)
Found a Problem: Freeing up the historical archives of NYTimes.com, 1851-1922
Problem Bits: Articles are served as PDFs. We really need PDFs from 1851-1981. PDFs are dynamically generated. Free = more traffic. Real deadline.
Background: What goes into making a PDF of a NYTimes.com article? Each article is made up of many different pieces: multiple columns, different-sized headings, multiple pages, photos.
Simple Answer: Pre-generate all 11 million PDFs and serve them statically.
Solution: Copy all the source data to S3. Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs. Store the output PDFs in S3. Serve the PDFs out of S3 with a signed query string (sketched below).
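For context, here is a minimal sketch of what serving out of S3 with a signed query string involved at the time: S3's query-string authentication (the pre-2010 "signature version 2" scheme) appends an expiry time and an HMAC-SHA1 signature to the URL. The bucket and key names are hypothetical, not the actual NYTimes layout.

```java
import java.net.URLEncoder;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class S3SignedUrl {

    // Builds an S3 GET URL with query-string authentication (AWS signature
    // v2, as used circa 2009). Bucket and key names are placeholders.
    public static String sign(String accessKey, String secretKey,
                              String bucket, String key,
                              long expiresEpochSeconds) throws Exception {
        String stringToSign = "GET\n\n\n" + expiresEpochSeconds + "\n/" + bucket + "/" + key;
        Mac hmac = Mac.getInstance("HmacSHA1");
        hmac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA1"));
        String signature = Base64.getEncoder()
                .encodeToString(hmac.doFinal(stringToSign.getBytes("UTF-8")));
        return "http://" + bucket + ".s3.amazonaws.com/" + key
                + "?AWSAccessKeyId=" + accessKey
                + "&Expires=" + expiresEpochSeconds
                + "&Signature=" + URLEncoder.encode(signature, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Expire one hour from now; credentials here are dummy values.
        long expires = System.currentTimeMillis() / 1000 + 3600;
        System.out.println(sign("AKIDEXAMPLE", "secret", "nyt-pdfs",
                "1851/01/01/sample-article.pdf", expires));
    }
}
```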
A Few Details: Limited HDFS use; everything was loaded in and out of S3 (see the driver sketch below). Reduce = 0; reducers were used only for some stats and error reporting.
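A rough sketch of what that map-only setup could look like, written against the Hadoop 0.20 API mentioned later in the deck. The s3n:// paths, bucket names, and the PdfMapper stub are assumptions for illustration, not the actual NYTimes code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PdfJobDriver {

    // Hypothetical mapper: each input record identifies one article whose
    // pieces get assembled into a PDF and written back to S3.
    public static class PdfMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text articleRecord, Context ctx)
                throws IOException, InterruptedException {
            // ... render the PDF, upload it, emit a status line for the stats ...
            ctx.write(articleRecord, new Text("OK"));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials let Hadoop read and write s3n:// paths directly,
        // so HDFS never has to hold the data set.
        conf.set("fs.s3n.awsAccessKeyId", args[0]);
        conf.set("fs.s3n.awsSecretAccessKey", args[1]);

        Job job = new Job(conf, "pdf-generation");
        job.setJarByClass(PdfJobDriver.class);
        job.setMapperClass(PdfMapper.class);
        job.setNumReduceTasks(0);  // map-only: "Reduce = 0"
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("s3n://source-bucket/articles/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://output-bucket/status/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```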
Breakdown: 4.3 TB of source data into S3. 11M PDFs, 1.5 TB of output. $240 for EC2 (24 hrs x 100 machines).
TimesMachine http://timesmachine.nytimes.com
Currently (2009): All that darn data (web analytics)
Data: Registration/demographic data. Articles, 1851 to today. Usage data/web logs.
Counting: Classic cookie tracking; let's add it up. Total PV (page views). Total unique users. PV per user. (A MapReduce sketch follows the details below.)
A Few Details: Using EC2 (20 machines). Hadoop 0.20.0. 12+ TB of data. Straight MapReduce in Java.
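The counting maps naturally onto a single MapReduce pass. A minimal sketch, assuming one page view per log line with the cookie id as the first tab-separated field (the log layout is a guess): the reducer output is PV per user, and counters accumulate the two totals.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageViewCount {

    // One page view per log line; the cookie id is assumed to be the first
    // tab-separated field.
    public static class PvMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text cookie = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            cookie.set(line.toString().split("\t")[0]);
            ctx.write(cookie, ONE);  // one page view for this cookie
        }
    }

    // Summing per cookie gives PV per user; counters accumulate the two
    // global totals (each reduce call sees exactly one unique cookie).
    public static class PvReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text cookie, Iterable<LongWritable> views, Context ctx)
                throws IOException, InterruptedException {
            long pv = 0;
            for (LongWritable v : views) pv += v.get();
            ctx.getCounter("usage", "uniqueUsers").increment(1);
            ctx.getCounter("usage", "totalPageViews").increment(pv);
            ctx.write(cookie, new LongWritable(pv));  // PV per user
        }
    }
}
```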
Usage Data, July 2009: ???M page views, ??M unique users
Merging Data: Usage data combined with demographic data (a reduce-side join sketch follows).
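One standard way to do this combination is a reduce-side join: tag each record with its source so the reducer sees a cookie's demographic record next to all of that cookie's page views. The file layouts and the "demographic" path convention here are assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class UsageDemoJoin {

    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Tag each record by which data set its input file came from.
            String path = ((FileSplit) ctx.getInputSplit()).getPath().toString();
            String tag = path.contains("demographic") ? "D" : "U";
            String cookieId = line.toString().split("\t")[0];  // assumed key position
            ctx.write(new Text(cookieId), new Text(tag + "\t" + line));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text cookieId, Iterable<Text> records, Context ctx)
                throws IOException, InterruptedException {
            String demographics = null;
            List<String> pageViews = new ArrayList<String>();
            for (Text r : records) {
                String s = r.toString();
                if (s.startsWith("D")) demographics = s.substring(2);
                else pageViews.add(s.substring(2));
            }
            if (demographics == null) return;  // anonymous user: no registration record
            for (String pv : pageViews)
                ctx.write(cookieId, new Text(demographics + "\t" + pv));
        }
    }
}
```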
Twitter Click-Backs by Age Group, July 2009
Merging Data: Usage data combined with article metadata (a map-side join sketch follows).
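Article metadata is much smaller than the usage logs, so this join can also be done map-side: ship the metadata table to every task (via the distributed cache, for example) and look it up by article id. The file name and field positions here are illustrative.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: each mapper loads the (small) article metadata table into
// memory once, then decorates every page view with that article's metadata.
public class ArticleJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> metaByArticle = new HashMap<String, String>();

    @Override
    protected void setup(Context ctx) throws IOException {
        // "article-meta.tsv" is assumed to have been distributed to each
        // task's working directory (e.g. via the distributed cache).
        BufferedReader in = new BufferedReader(new FileReader("article-meta.tsv"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split("\t", 2);  // article id -> the rest
            metaByArticle.put(f[0], f[1]);
        }
        in.close();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] f = line.toString().split("\t");  // assumed: cookie, article id, ...
        String meta = metaByArticle.get(f[1]);
        if (meta != null)
            ctx.write(new Text(f[1]), new Text(line + "\t" + meta));
    }
}
```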
Usage Data Combined with Article Data, July 2009: 40 Articles
Products: Coming soon...
Clustering: Moving beyond simple counting and joining. Join usage data, demographic information, and article metadata, then apply simple k-means clustering (sketched below).
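A minimal in-memory sketch of the k-means step (Lloyd's algorithm) over user feature vectors, e.g. page views, sections visited, age bucket. In practice each pass could itself run as a MapReduce job over the joined data; the seeding and the feature choices here are placeholders.

```java
import java.util.Arrays;

public class KMeansSketch {

    // Squared Euclidean distance from point p to each centroid; returns
    // the index of the nearest one.
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // Lloyd's algorithm: alternate assigning points to their nearest
    // centroid and moving each centroid to the mean of its points.
    static double[][] cluster(double[][] points, int k, int iterations) {
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[c].clone();  // naive seeding
        for (int iter = 0; iter < iterations; iter++) {
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (double[] p : points) {              // assignment step
                int c = nearest(p, centroids);
                counts[c]++;
                for (int i = 0; i < p.length; i++) sums[c][i] += p[i];
            }
            for (int c = 0; c < k; c++)              // update step
                if (counts[c] > 0)
                    for (int i = 0; i < sums[c].length; i++)
                        centroids[c][i] = sums[c][i] / counts[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Toy feature vectors: [page views, sections visited, age bucket].
        double[][] users = { {1, 1, 2}, {2, 1, 3}, {40, 8, 4}, {38, 9, 5} };
        for (double[] c : cluster(users, 2, 10))
            System.out.println(Arrays.toString(c));
    }
}
```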
Conclusion: Large-scale computing is transformative for NYTimes.com.
Questions? [email_address] @derekg http://open.nytimes.com/


Editor's Notes

  • #3 The ability to compute across large data sets is a key to success. Accumulating data is easy; computing on it is hard. Personify, Omniture, and Webtrends are specialized tools, not general computing tools.
  • #14 Fixed reports are great, and we can do that, but the really interesting part is asking questions after the fact in an ad hoc, exploratory manner.
  • #15 We have always had data. Collecting data is not an issue for us; we are okay at that part.
  • #18 Distribution of page views per user for July 2009. Most users view one page. Not new, but it shows we have some mastery over the data. This is based on usage data alone: 380 GB of compressed data covering over 700 million page views.
  • #22 Mapping, collaborative filtering, segmentation.