Counting @ Scale Sharad Goel Columbia University Computational Social Science: Lecture 3 February 8, 2013
Descriptive statistics (as opposed to inferential statistics) is about counting: contingency tables; means, variances, quantiles; summaries of conditional distributions
Long tail: video consumption on YouTube. Digital divide: time spent across various online properties. Viral diffusion: propagation of tweets on Twitter
Counting @ scale: conceptually easy, computationally hard
I/O bound: difficult to read terabytes of data. Network bound: hard to transfer terabytes of data. Memory bound: cannot randomly access data points. CPU bound: even simple manipulations add up
Rank videos by popularity: local video store (1K movies, 100K viewings)
Rank videos by popularity: local video store (1K movies, 100K viewings). Load the dataset into memory
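A minimal sketch of the in-memory approach in Python, assuming a hypothetical log file with one movie ID per line. At 100K viewings, the whole dataset fits comfortably in memory at once:

```python
from collections import Counter

# Hypothetical input: one movie ID per line, ~100K lines total.
with open("viewings.txt") as f:
    views = [line.strip() for line in f]  # load everything into memory

# Count all viewings at once and rank movies by popularity.
ranking = Counter(views).most_common()
for movie_id, count in ranking[:10]:
    print(movie_id, count)
```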
Rank videos by popularity: Netflix (100K movies, 1B viewings)
Rank videos by popularity: Netflix (100K movies, 1B viewings). Store a counter for each movie in memory and stream through the dataset
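A sketch of the streaming version (same assumed one-ID-per-line format as above): memory now scales with the number of distinct movies, roughly 100K counters, not with the 1B raw viewings.

```python
from collections import Counter

counts = Counter()
with open("viewings.txt") as f:    # hypothetical log: one movie ID per line
    for line in f:                 # stream one viewing at a time
        counts[line.strip()] += 1  # one counter per distinct movie

ranking = counts.most_common()     # sort movies by view count
```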
Rank videos by popularity: YouTube (10B videos, 10T viewings)
Rank videos by popularity: YouTube (10B videos, 10T viewings). Trouble, with a capital ‘T’
Parallel computation: distribute work across several machines
10 parallel workers: 1T views per worker, maybe 5B unique videos on each. 100 parallel workers: 100B views per worker, maybe 1B unique videos on each
split → count → sort by video → merge → sort by popularity
Core problem: the same movie appears on multiple machines. Solution: do not split viewing data at random; ensure individual movies are never split apart
split → count → sort by video → merge → sort by popularity
Shuffle (1st attempt): create a new file for every movie; append viewing data to the appropriate file
Shuffle (2nd attempt): the first time you see a movie, append it randomly to one of 10K files; the next time you see the movie, append it to the same file
Shuffle (3rd attempt): hash the movie ID to determine which file to append it to. (A hash function maps a large input space to a small output space, approximately uniformly.)
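A toy single-process sketch of the hash-based shuffle together with the split → count → merge pipeline from the earlier slides. The partition count, the toy movie IDs, and the use of an MD5 digest as a stable hash are all illustrative choices; Python's built-in hash() is salted per process, so a deterministic digest is used here to guarantee every worker routes the same movie to the same partition.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 4  # stand-in for the 10K files on the slide

def partition_of(movie_id: str) -> int:
    # Stable hash: the same movie always lands in the same partition,
    # on every machine and every run.
    digest = hashlib.md5(movie_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Shuffle: route each viewing to its partition file.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for movie_id in ["a", "b", "a", "c", "b", "a"]:  # toy viewing stream
    partitions[partition_of(movie_id)].append(movie_id)

# Count each partition independently (in practice, on separate
# machines), then merge. Because no movie straddles two partitions,
# merging the per-partition counts is trivial.
merged = Counter()
for part in partitions:
    merged.update(Counter(part))

print(merged.most_common())  # [('a', 3), ('b', 2), ('c', 1)]
```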
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004
Map: assign each input line to one or more groups. Shuffle: aggregate groups. Reduce: operate on grouped data
Map: assign each input line to one or more groups, v → [(k1, v1), …, (km, vm)]. Shuffle: aggregate groups. Reduce: operate on grouped data, (k, [v1, …, vn]) → [w1, …, wp]
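The slide's signatures translate directly into a minimal single-machine simulation; this is an illustrative sketch of the programming model, not the distributed implementation from the paper.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """mapper: v -> [(k1, v1), ..., (km, vm)]
    reducer: (k, [v1, ..., vn]) -> output for key k"""
    # Map + shuffle: group emitted values by key. In a real cluster,
    # this grouping is the hash-partitioned shuffle described earlier.
    groups = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: process each key's group independently (parallelizable).
    return {key: reducer(key, values) for key, values in groups.items()}
```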
The Insight of MapReduce: one can efficiently group identical items, and many tasks are computationally easier on grouped data
Word Count. Input: text corpus. Output: number of occurrences of each word
Word Count. Map: line → words. Reduce: word group → size of group
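Word count expressed with the map_reduce sketch above (toy corpus and whitespace tokenization are assumptions): the mapper emits one (word, 1) pair per word, and the reducer returns the size of each group.

```python
def mapper(line):
    return [(word, 1) for word in line.split()]  # line -> (word, 1) pairs

def reducer(word, ones):
    return sum(ones)  # size of the group = occurrences of the word

corpus = ["to be or not to be", "to see or not to see"]
counts = map_reduce(corpus, mapper, reducer)
print(counts)  # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```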
MapReduce: the unreasonable effectiveness of aggregation
