Counting @ Scale Sharad Goel Columbia University Computational Social Science: Lecture 3 February 8, 2013
Descriptive statistics (as opposed to inferential statistics) is about counting: contingency tables; means, variances, quantiles; summaries of conditional distributions
Long tail: video consumption on YouTube. Digital divide: time spent across various online properties. Viral diffusion: propagation of tweets on Twitter
Counting @ scale: conceptually easy, computationally hard
I/O bound: difficult to read terabytes of data. Network bound: hard to transfer terabytes of data. Memory bound: cannot randomly access data points. CPU bound: even simple manipulations add up
Rank videos by popularity: local video store (1K movies, 100K viewings)
Rank videos by popularity: local video store (1K movies, 100K viewings). Load the dataset into memory
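A minimal sketch of the in-memory approach in Python, assuming a hypothetical log file with one movie ID per line. At 100K viewings, the whole dataset fits comfortably in memory at once:

```python
from collections import Counter

# Hypothetical input: one movie ID per line, ~100K lines total.
with open("viewings.txt") as f:
    views = [line.strip() for line in f]  # load everything into memory

# Count all viewings at once and rank movies by popularity.
ranking = Counter(views).most_common()
for movie_id, count in ranking[:10]:
    print(movie_id, count)
```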
Rank videos by popularity: Netflix (100K movies, 1B viewings)
Rank videos by popularity: Netflix (100K movies, 1B viewings). Store a counter for each movie in memory and stream through the dataset
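A sketch of the streaming version (same assumed one-ID-per-line format as above): memory now scales with the number of distinct movies, roughly 100K counters, not with the 1B raw viewings.

```python
from collections import Counter

counts = Counter()
with open("viewings.txt") as f:    # hypothetical log: one movie ID per line
    for line in f:                 # stream one viewing at a time
        counts[line.strip()] += 1  # one counter per distinct movie

ranking = counts.most_common()     # sort movies by view count
```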
Rank videos by popularity: YouTube (10B videos, 10T viewings)
Rank videos by popularity: YouTube (10B videos, 10T viewings). Trouble, with a capital ‘T’
Parallel computation: distribute work across several machines
10 parallel workers: 1T views per worker, maybe 5B unique videos on each. 100 parallel workers: 100B views per worker, maybe 1B unique videos on each
split → count → sort by video → merge → sort by popularity
Core problem: the same movie appears on multiple machines. Solution: do not split viewing data at random; ensure individual movies are never split apart
split → count → sort by video → merge → sort by popularity
Shuffle (1st attempt): create a new file for every movie; append viewing data to the appropriate file
Shuffle (2nd attempt): the first time you see a movie, append it randomly to one of 10K files; the next time you see the movie, append it to the same file
Shuffle (3rd attempt): hash the movie ID to determine which file to append it to. (A hash function maps a large input space to a small output space, approximately uniformly.)
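A toy single-process sketch of the hash-based shuffle together with the split → count → merge pipeline from the earlier slides. The partition count, the toy movie IDs, and the use of an MD5 digest as a stable hash are all illustrative choices; Python's built-in hash() is salted per process, so a deterministic digest is used here to guarantee every worker routes the same movie to the same partition.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 4  # stand-in for the 10K files on the slide

def partition_of(movie_id: str) -> int:
    # Stable hash: the same movie always lands in the same partition,
    # on every machine and every run.
    digest = hashlib.md5(movie_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Shuffle: route each viewing to its partition file.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for movie_id in ["a", "b", "a", "c", "b", "a"]:  # toy viewing stream
    partitions[partition_of(movie_id)].append(movie_id)

# Count each partition independently (in practice, on separate
# machines), then merge. Because no movie straddles two partitions,
# merging the per-partition counts is trivial.
merged = Counter()
for part in partitions:
    merged.update(Counter(part))

print(merged.most_common())  # [('a', 3), ('b', 2), ('c', 1)]
```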
MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004
Map: assign each input line to one or more groups. Shuffle: aggregate groups. Reduce: operate on grouped data
Map: assign each input line to one or more groups, v → [(k1, v1), …, (km, vm)]. Shuffle: aggregate groups. Reduce: operate on grouped data, (k, [v1, …, vn]) → [w1, …, wp]
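The slide's signatures translate directly into a minimal single-machine simulation; this is an illustrative sketch of the programming model, not the distributed implementation from the paper.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """mapper: v -> [(k1, v1), ..., (km, vm)]
    reducer: (k, [v1, ..., vn]) -> output for key k"""
    # Map + shuffle: group emitted values by key. In a real cluster,
    # this grouping is the hash-partitioned shuffle described earlier.
    groups = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: process each key's group independently (parallelizable).
    return {key: reducer(key, values) for key, values in groups.items()}
```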
The Insight of MapReduce: one can efficiently group identical items, and many tasks are computationally easier on grouped data
Word Count. Input: text corpus. Output: number of occurrences of each word
Word Count. Map: line → words. Reduce: word group → size of group
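Word count expressed with the map_reduce sketch above (toy corpus and whitespace tokenization are assumptions): the mapper emits one (word, 1) pair per word, and the reducer returns the size of each group.

```python
def mapper(line):
    return [(word, 1) for word in line.split()]  # line -> (word, 1) pairs

def reducer(word, ones):
    return sum(ones)  # size of the group = occurrences of the word

corpus = ["to be or not to be", "to see or not to see"]
counts = map_reduce(corpus, mapper, reducer)
print(counts)  # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```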
MapReduce: the unreasonable effectiveness of aggregation
