Modeling Social Data, Lecture 3: Counting at Scale
APAM E4990, Jake Hofman, Columbia University, February 6, 2013
Last week Claim: Solving the counting problem at scale enables you to investigate many interesting questions in the social sciences
Learning to count Last week: Counting at small/medium scales on a single machine. This week: Counting at large scales in parallel
What?
What? “... to create building blocks for programmers who just happen to have lots of data to store, lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.” -Tom White, Hadoop: The Definitive Guide
What? Hadoop contains many subprojects: We’ll focus on distributed computation with MapReduce.
Who/when? An overly brief history
Who/when? pre-2004: Doug Cutting and Mike Cafarella develop open source projects for web-scale indexing, crawling, and search
Who/when? 2004: Dean and Ghemawat publish the MapReduce programming model, used internally at Google
Who/when? 2006: Hadoop becomes an official Apache project, Cutting joins Yahoo!, and Yahoo adopts Hadoop
Where? http://wiki.apache.org/hadoop/PoweredBy
Why? Why yet another solution? (I already use too many languages/environments)
Why? Why a distributed solution? (My desktop has TBs of storage and GBs of memory)
Why? Roughly how long to read 1 TB from a commodity hard disk? (1/2) Gb/sec × (1/8) B/b × 3600 sec/hr ≈ 225 GB/hr, so 1 TB takes ≈ 4 hrs.
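A quick back-of-the-envelope check of that estimate (a sketch in Python; the 1/2 Gb/sec read rate is the slide's assumption):

    # Disk rate: 1/2 Gb/sec -> GB/sec (divide by 8 bits/byte) -> GB/hr.
    rate_gb_per_hr = 0.5 / 8 * 3600          # = 225 GB/hr
    hours_per_tb = 1000 / rate_gb_per_hr     # 1 TB = 1000 GB
    print(rate_gb_per_hr, hours_per_tb)      # 225.0 GB/hr, ~4.4 hours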
Why? http://bit.ly/petabytesort
Typical scenario Store, parse, and analyze high-volume server logs, e.g. how many search queries match “icwsm”?
MapReduce: 30k ft Break a large problem into smaller parts, solve in parallel, combine results
Typical scenario “Embarrassingly parallel” (or nearly so): each of several nodes (nodes 1–4 in the slide's figure) performs a local read and filter over its share of the data, and the filtered results are then collected. A sketch of the pattern appears below.
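A minimal single-machine illustration of that pattern, with worker processes standing in for the four nodes (a sketch: the shard filenames are hypothetical, and this is not how Hadoop itself is invoked):

    import multiprocessing

    def count_matches(path):
        """The 'local read + filter' step: count lines in one shard matching 'icwsm'."""
        with open(path) as f:
            return sum(1 for line in f if "icwsm" in line)

    if __name__ == "__main__":
        shards = ["log.0", "log.1", "log.2", "log.3"]  # hypothetical per-node files
        with multiprocessing.Pool(4) as pool:          # solve the parts in parallel
            counts = pool.map(count_matches, shards)
        print(sum(counts))                             # collect the results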
Typical scenario++ How many search queries match “icwsm”, grouped by month?
MapReduce: example Input records (date, IP address, query):
20091201,4.2.2.1,"icwsm 2010"
20100523,2.4.1.2,"hadoop"
20100101,9.7.6.5,"tutorial"
20091125,2.4.6.1,"data"
20090708,4.2.2.1,"open source"
20100124,1.2.2.4,"washington dc"
20100522,2.4.1.2,"conference"
20091008,4.2.2.1,"2009 icwsm"
20090807,4.2.2.1,"apache.org"
20100101,9.7.6.5,"mapreduce"
20100123,1.2.2.4,"washington dc"
20091121,2.4.6.1,"icwsm dates"
20090807,4.2.2.1,"distributed"
20091225,4.2.2.1,"icwsm"
20100522,2.4.1.2,"media"
20100123,1.2.2.4,"social"
20091114,2.4.6.1,"d.c."
20100101,9.7.6.5,"new year's"
Map matching records to (YYYYMM, count=1): (200912, 1), (200910, 1), (200911, 1), (200912, 1). Shuffle to collect all records with the same key (month). Reduce results by adding count values for each key: (200910, 1), (200911, 1), (200912, 2).
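In code, the map and reduce steps of this example might look like the following sketch (the three-field record format follows the slide; map_record and reduce_counts are names of my choosing):

    def map_record(line):
        """Emit (YYYYMM, 1) for each record whose query matches 'icwsm'."""
        date, ip, query = line.split(",", 2)
        if "icwsm" in query:
            yield date[:6], 1

    def reduce_counts(month, counts):
        """Add the count values collected for one month."""
        return month, sum(counts)

On the slide's records this produces (200910, 1), (200911, 1), and (200912, 2).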
MapReduce: paradigm Programmer specifies map and reduce functions
MapReduce: paradigm Map: transforms an input record into an intermediate (key, value) pair
MapReduce: paradigm Shuffle: collects all intermediate records by key. Records are assigned to reducers by hash(key) % num_reducers, and reducers perform a merge sort to collect records with the same key.
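The reducer assignment is a plain hash partition; a minimal sketch of the idea (Python's builtin hash stands in for Hadoop's default partitioner, and is only stable within a single process):

    def partition(key, num_reducers):
        """Assign a key to a reducer: hash(key) % num_reducers."""
        return hash(key) % num_reducers

    # All records sharing a key land on the same reducer, e.g.
    # partition("200912", 4) is the same for every record with that month.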
MapReduce: paradigm Reduce: transforms all records for a given key into the final output
MapReduce: paradigm Distributed read, shuffle, and write are transparent to the programmer
MapReduce: principles • Move code to data (local computation) • Allow programs to scale transparently w.r.t. the size of the input • Abstract away fault tolerance, synchronization, etc.
MapReduce: strengths • Batch, offline jobs • Write-once, read-many across full data set • Usually, though not always, simple computations • I/O bound by disk/network bandwidth
!MapReduce What it’s not: • High-performance parallel computing, e.g. MPI • Low-latency random access relational database • Always the right solution
Word count Input:
the quick brown fox
jumps over the lazy dog
who jumped over that
lazy dog -- the fox ?
Output: -- 1, ? 1, brown 1, dog 2, fox 2, jumped 1, jumps 1, lazy 2, over 2, quick 1, that 1, the 3, who 1
Word count Map: for each line, output each word with a count of 1
the quick brown fox → the 1, quick 1, brown 1, fox 1
jumps over the lazy dog → jumps 1, over 1, the 1, lazy 1, dog 1
who jumped over that → who 1, jumped 1, over 1, that 1
lazy dog -- the fox ? → lazy 1, dog 1, -- 1, the 1, fox 1, ? 1
Word count Shuffle: collect all records for each word
-- 1 | ? 1 | brown 1 | dog 1, dog 1 | fox 1, fox 1 | jumped 1 | jumps 1 | lazy 1, lazy 1 | over 1, over 1 | quick 1 | that 1 | the 1, the 1, the 1 | who 1
Word count Reduce: add counts for each word
dog 1, dog 1 → dog 2; fox 1, fox 1 → fox 2; lazy 1, lazy 1 → lazy 2; over 1, over 1 → over 2; the 1, the 1, the 1 → the 3; every other word keeps its count of 1
Final output: -- 1, ? 1, brown 1, dog 2, fox 2, jumped 1, jumps 1, lazy 2, over 2, quick 1, that 1, the 3, who 1
Word count (recap): the full pipeline on the same four input lines, from map through shuffle and reduce to the final counts above.
WordCount.java
Hadoop streaming
Hadoop streaming MapReduce for *nix geeks [1]: # cat data | map | sort | reduce • Mapper reads input data from stdin • Mapper writes output to stdout • Reducer receives input, sorted by key, on stdin • Reducer writes output to stdout [1] http://bit.ly/michaelnoll
wordcount.sh Locally: # cat data | tr " " "\n" | sort | uniq -c ⇓ Distributed: (a sketch of the streaming invocation follows below)
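The distributed command on this slide didn't survive the transcript; a typical Hadoop streaming invocation looks roughly like the following (the jar name and HDFS paths are placeholders, and the stages reuse the local pipeline as mapper and reducer):

    # hadoop jar hadoop-streaming.jar \
        -input /path/to/data -output /path/to/counts \
        -mapper 'tr " " "\n"' -reducer 'uniq -c'

Streaming sorts the mapper output by key before the reduce stage, so uniq -c sees each word's records grouped together, just as sort provides in the local pipeline.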
Transparent scaling Use the same code on MBs locally or TBs across thousands of machines.
wordcount.py
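The Python on this slide isn't in the transcript; a minimal streaming word count in the same spirit might look like the sketch below (one hypothetical script acting as mapper or reducer depending on its argument, reading stdin and writing tab-separated key/value pairs to stdout per the streaming contract above):

    import sys
    from itertools import groupby

    def mapper():
        """Emit 'word<TAB>1' for every word on stdin."""
        for line in sys.stdin:
            for word in line.split():
                print(word + "\t1")

    def reducer():
        """Add counts per word; streaming delivers input sorted by key."""
        pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(word + "\t" + str(sum(int(count) for _, count in group)))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

Run locally as # cat data | python wordcount.py map | sort | python wordcount.py reduce, or hand the same two commands to Hadoop streaming as -mapper and -reducer.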
