Cheap Parlor Tricks, Counting, and Clustering. Derek Gottfrid, The New York Times, October 2009
Evolution of Hadoop @ NYTimes.com
Early Days (2007): A solution looking for a problem
Solution: Wouldn't it be cool to use lots of EC2 instances? (It's cheap; nobody will notice.) Wouldn't it be cool to use Hadoop? (Google-style MapReduce is awesome.)
Found a Problem: Freeing up the historical archives of NYTimes.com, 1851-1922
Problem Bits: Articles are served as PDFs. We really need PDFs from 1851-1981. PDFs are dynamically generated. Free = more traffic. Real deadline.
Background: What goes into making a PDF of a NYTimes.com article? Each article is made up of many different pieces: multiple columns, different-sized headings, multiple pages, photos.
Simple Answer: Pre-generate all 11 million PDFs and serve them statically.
Solution: Copy all the source data to S3. Use a bunch of EC2 instances and some Hadoop code to generate all the PDFs. Store the output PDFs in S3. Serve the PDFs out of S3 with a signed query string (sketched below).
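For context, here is a minimal sketch of what serving out of S3 with a signed query string involved at the time: S3's query-string authentication (the pre-2010 "signature version 2" scheme) appends an expiry time and an HMAC-SHA1 signature to the URL. The bucket and key names are hypothetical, not the actual NYTimes layout.

```java
import java.net.URLEncoder;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class S3SignedUrl {

    // Builds an S3 GET URL with query-string authentication (AWS signature
    // v2, as used circa 2009). Bucket and key names are placeholders.
    public static String sign(String accessKey, String secretKey,
                              String bucket, String key,
                              long expiresEpochSeconds) throws Exception {
        String stringToSign = "GET\n\n\n" + expiresEpochSeconds + "\n/" + bucket + "/" + key;
        Mac hmac = Mac.getInstance("HmacSHA1");
        hmac.init(new SecretKeySpec(secretKey.getBytes("UTF-8"), "HmacSHA1"));
        String signature = Base64.getEncoder()
                .encodeToString(hmac.doFinal(stringToSign.getBytes("UTF-8")));
        return "http://" + bucket + ".s3.amazonaws.com/" + key
                + "?AWSAccessKeyId=" + accessKey
                + "&Expires=" + expiresEpochSeconds
                + "&Signature=" + URLEncoder.encode(signature, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Expire one hour from now; credentials here are dummy values.
        long expires = System.currentTimeMillis() / 1000 + 3600;
        System.out.println(sign("AKIDEXAMPLE", "secret", "nyt-pdfs",
                "1851/01/01/sample-article.pdf", expires));
    }
}
```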
A Few Details: Limited HDFS use; everything was loaded in and out of S3 (see the driver sketch below). Reduce = 0; reducers were used only for some stats and error reporting.
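A rough sketch of what that map-only setup could look like, written against the Hadoop 0.20 API mentioned later in the deck. The s3n:// paths, bucket names, and the PdfMapper stub are assumptions for illustration, not the actual NYTimes code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PdfJobDriver {

    // Hypothetical mapper: each input record identifies one article whose
    // pieces get assembled into a PDF and written back to S3.
    public static class PdfMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text articleRecord, Context ctx)
                throws IOException, InterruptedException {
            // ... render the PDF, upload it, emit a status line for the stats ...
            ctx.write(articleRecord, new Text("OK"));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials let Hadoop read and write s3n:// paths directly,
        // so HDFS never has to hold the data set.
        conf.set("fs.s3n.awsAccessKeyId", args[0]);
        conf.set("fs.s3n.awsSecretAccessKey", args[1]);

        Job job = new Job(conf, "pdf-generation");
        job.setJarByClass(PdfJobDriver.class);
        job.setMapperClass(PdfMapper.class);
        job.setNumReduceTasks(0);  // map-only: "Reduce = 0"
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("s3n://source-bucket/articles/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://output-bucket/status/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```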
Breakdown: 4.3 TB of source data into S3. 11M PDFs, 1.5 TB of output. $240 for EC2 (24 hrs x 100 machines).
TimesMachine http://timesmachine.nytimes.com
Currently (2009): All that darn data (web analytics)
Data: Registration/demographic data. Articles, 1851 to today. Usage data/web logs.
Counting: Classic cookie tracking; let's add it up. Total PV (page views). Total unique users. PV per user. (A MapReduce sketch follows the details below.)
A Few Details: Using EC2 (20 machines). Hadoop 0.20.0. 12+ TB of data. Straight MapReduce in Java.
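The counting maps naturally onto a single MapReduce pass. A minimal sketch, assuming one page view per log line with the cookie id as the first tab-separated field (the log layout is a guess): the reducer output is PV per user, and counters accumulate the two totals.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageViewCount {

    // One page view per log line; the cookie id is assumed to be the first
    // tab-separated field.
    public static class PvMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text cookie = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            cookie.set(line.toString().split("\t")[0]);
            ctx.write(cookie, ONE);  // one page view for this cookie
        }
    }

    // Summing per cookie gives PV per user; counters accumulate the two
    // global totals (each reduce call sees exactly one unique cookie).
    public static class PvReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text cookie, Iterable<LongWritable> views, Context ctx)
                throws IOException, InterruptedException {
            long pv = 0;
            for (LongWritable v : views) pv += v.get();
            ctx.getCounter("usage", "uniqueUsers").increment(1);
            ctx.getCounter("usage", "totalPageViews").increment(pv);
            ctx.write(cookie, new LongWritable(pv));  // PV per user
        }
    }
}
```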
Usage Data, July 2009: ???M page views, ??M unique users
Merging Data: Usage data combined with demographic data (a reduce-side join sketch follows).
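One standard way to do this combination is a reduce-side join: tag each record with its source so the reducer sees a cookie's demographic record next to all of that cookie's page views. The file layouts and the "demographic" path convention here are assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class UsageDemoJoin {

    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Tag each record by which data set its input file came from.
            String path = ((FileSplit) ctx.getInputSplit()).getPath().toString();
            String tag = path.contains("demographic") ? "D" : "U";
            String cookieId = line.toString().split("\t")[0];  // assumed key position
            ctx.write(new Text(cookieId), new Text(tag + "\t" + line));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text cookieId, Iterable<Text> records, Context ctx)
                throws IOException, InterruptedException {
            String demographics = null;
            List<String> pageViews = new ArrayList<String>();
            for (Text r : records) {
                String s = r.toString();
                if (s.startsWith("D")) demographics = s.substring(2);
                else pageViews.add(s.substring(2));
            }
            if (demographics == null) return;  // anonymous user: no registration record
            for (String pv : pageViews)
                ctx.write(cookieId, new Text(demographics + "\t" + pv));
        }
    }
}
```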
Twitter Click-Backs by Age Group, July 2009
Merging Data: Usage data combined with article metadata (a map-side join sketch follows).
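Article metadata is much smaller than the usage logs, so this join can also be done map-side: ship the metadata table to every task (via the distributed cache, for example) and look it up by article id. The file name and field positions here are illustrative.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: each mapper loads the (small) article metadata table into
// memory once, then decorates every page view with that article's metadata.
public class ArticleJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> metaByArticle = new HashMap<String, String>();

    @Override
    protected void setup(Context ctx) throws IOException {
        // "article-meta.tsv" is assumed to have been distributed to each
        // task's working directory (e.g. via the distributed cache).
        BufferedReader in = new BufferedReader(new FileReader("article-meta.tsv"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.split("\t", 2);  // article id -> the rest
            metaByArticle.put(f[0], f[1]);
        }
        in.close();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] f = line.toString().split("\t");  // assumed: cookie, article id, ...
        String meta = metaByArticle.get(f[1]);
        if (meta != null)
            ctx.write(new Text(f[1]), new Text(line + "\t" + meta));
    }
}
```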
Usage Data Combined with Article Data, July 2009: 40 Articles
Products: Coming soon...
Clustering: Moving beyond simple counting and joining. Join usage data, demographic information, and article metadata, then apply simple k-means clustering (sketched below).
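A minimal in-memory sketch of the k-means step (Lloyd's algorithm) over user feature vectors, e.g. page views, sections visited, age bucket. In practice each pass could itself run as a MapReduce job over the joined data; the seeding and the feature choices here are placeholders.

```java
import java.util.Arrays;

public class KMeansSketch {

    // Squared Euclidean distance from point p to each centroid; returns
    // the index of the nearest one.
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // Lloyd's algorithm: alternate assigning points to their nearest
    // centroid and moving each centroid to the mean of its points.
    static double[][] cluster(double[][] points, int k, int iterations) {
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[c].clone();  // naive seeding
        for (int iter = 0; iter < iterations; iter++) {
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (double[] p : points) {              // assignment step
                int c = nearest(p, centroids);
                counts[c]++;
                for (int i = 0; i < p.length; i++) sums[c][i] += p[i];
            }
            for (int c = 0; c < k; c++)              // update step
                if (counts[c] > 0)
                    for (int i = 0; i < sums[c].length; i++)
                        centroids[c][i] = sums[c][i] / counts[c];
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Toy feature vectors: [page views, sections visited, age bucket].
        double[][] users = { {1, 1, 2}, {2, 1, 3}, {40, 8, 4}, {38, 9, 5} };
        for (double[] c : cluster(users, 2, 10))
            System.out.println(Arrays.toString(c));
    }
}
```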
Conclusion: Large-scale computing is transformative for NYTimes.com.
Questions? [email_address] @derekg http://open.nytimes.com/


Editor's Notes

  • #3 The ability to compute across large data sets is a key to success. Accumulating data is easy; computing on it is hard. Personify, Omniture, and Webtrends are specialized tools, not general computing tools.
  • #14 Fixed reports are great, and we can do that, but the really interesting part is asking questions after the fact in an ad hoc, exploratory manner.
  • #15 We have always had data. Collecting data is not an issue for us; we are okay at that part.
  • #18 Distribution of page views per user for July 2009. Most users view one page. Not new, but it shows we have some mastery over the data. This is based on usage data alone: 380 GB of compressed data covering over 700 million page views.
  • #22 Mapping, collaborative filtering, segmentation.