Hadoop with Python

HADOOP WITH PYTHON Donald Miner @donaldpminer DC Python Meetup 3/10/15

Agenda • Introduction to Hadoop • MapReduce with mrjob • Pig with Python UDFs • snakebite for HDFS • HBase and python clients • Spark and PySpark

Hadoop Distributed File System (HDFS) • Stores files in folders (that’s it) • Nobody cares what’s in your files • Chunks large files into blocks (~64MB-2GB) • 3 replicas of each block (better safe than sorry) • Blocks are scattered all over the place
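To make the block and replica bullets concrete, here is a back-of-the-envelope sketch. The 128 MB block size and the helper name `hdfs_footprint` are assumptions chosen for illustration; the slide only says that blocks range from roughly 64 MB to 2 GB and that each block has 3 replicas.

```python
import math

MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, total block replicas, raw bytes stored).

    Hypothetical helper: it models only the arithmetic, not real HDFS behaviour.
    """
    # The last block may be partial, so round the block count up.
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    total_replicas = num_blocks * replication
    # A partial block only occupies the bytes it actually holds.
    raw_bytes = file_size_bytes * replication
    return num_blocks, total_replicas, raw_bytes

blocks, replicas, raw = hdfs_footprint(1024 * MB)  # a 1 GB file
print(blocks, replicas, raw // MB)  # 8 24 3072
```

Under these assumptions a 1 GB file becomes 8 blocks and 24 block replicas spread across the cluster.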

MapReduce • Analyzes raw data in HDFS where the data is • Jobs are split into Mappers and Reducers • Mappers (you code this): load data from HDFS; filter, transform, parse; output (key, value) pairs • Reducers (you code this, too): input is automatically grouped by the mapper’s output key; aggregate, count, statistics; output to HDFS
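The mapper, group-by-key, and reducer flow can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop: `run_mapreduce`, `word_mapper`, and `count_reducer` are hypothetical names used only for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy, single-process stand-in for the map -> group-by-key -> reduce flow."""
    groups = defaultdict(list)
    # Map phase: every input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle: values are grouped by the mapper's output key.
            groups[key].append(value)
    # Reduce phase: each key is reduced together with all of its values.
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

def word_mapper(line):
    for word in line.lower().split():
        yield word, 1

def count_reducer(word, counts):
    yield word, sum(counts)

lines = ["the quick brown fox", "the lazy dog"]
print(run_mapreduce(lines, word_mapper, count_reducer))
```

In a real job the framework performs the grouping step across machines; only the mapper and reducer bodies are yours to write.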

Hadoop Ecosystem • Higher-level languages like Pig and Hive • HDFS Data systems like HBase and Accumulo • Alternative execution engines like Storm and Spark • Close friends like ZooKeeper, Flume, Avro, Kafka

Cool Thing #1: Linear Scalability • HDFS and MapReduce scale linearly • If you have twice as many computers, jobs run twice as fast • If you have twice as much data, jobs take twice as long • If you have twice as many computers, you can store twice as much data • DATA LOCALITY!!
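The scaling bullets amount to simple proportionality. The sketch below is an idealized model only: the function name and the per-node throughput figure are assumptions, and real clusters add overhead that this ignores.

```python
def estimated_runtime(data_tb, nodes, tb_per_node_hour=1.0):
    """Idealized runtime in hours: the work is divided evenly across nodes."""
    return data_tb / (nodes * tb_per_node_hour)

base = estimated_runtime(data_tb=10, nodes=5)
print(base)                                     # 2.0
print(estimated_runtime(data_tb=10, nodes=10))  # 1.0, twice the computers
print(estimated_runtime(data_tb=20, nodes=5))   # 4.0, twice the data
```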

Cool Thing #2: Schema on Read • LOAD DATA FIRST, ASK QUESTIONS LATER • Data is parsed/interpreted as it is loaded out of HDFS • What implications does this have? • BEFORE: ETL, schema design upfront, tossing out original data, comprehensive data study • WITH HADOOP: Keep original data around! Have multiple views of the same data! Work with unstructured data sooner! Store first, figure out what to do with it later!
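A small sketch of what "schema on read" looks like in practice: the raw lines are stored untouched, and each question applies its own parsing when the data is read. The log format, the sample records, and both function names are made up for illustration.

```python
# Raw data is stored exactly as it arrived (hypothetical access-log lines).
RAW_LINES = [
    "2015-03-10 12:00:01 alice GET /index.html 200",
    "2015-03-10 12:00:02 bob GET /missing 404",
    "2015-03-10 12:00:05 alice POST /login 200",
]

def view_requests_per_user(lines):
    """View 1: interpret only the user field (third column)."""
    counts = {}
    for line in lines:
        user = line.split()[2]
        counts[user] = counts.get(user, 0) + 1
    return counts

def view_error_paths(lines):
    """View 2: interpret the path and status fields of the same raw lines."""
    errors = []
    for line in lines:
        fields = line.split()
        if int(fields[5]) >= 400:
            errors.append(fields[4])
    return errors

print(view_requests_per_user(RAW_LINES))
print(view_error_paths(RAW_LINES))
```

Neither view required a schema to be designed before the data was stored, and a third view can be added later without reloading anything.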

Cool Thing #3: Transparent Parallelism • Network programming? Inter-process communication? Threading? Distributed stuff? Fault tolerance? Code deployment? RPC? Message passing? Locking? Data storage? Scalability? Data center fires? • With MapReduce, I DON’T CARE • I just have to be sure my solution fits into this tiny box (your solution, inside the MapReduce framework)

Cool Thing #4: Unstructured Data • Unstructured data: media, text, forms, log data, lumped structured data • Query languages like SQL and Pig assume some sort of “structure” • MapReduce is just Java: You can do anything Java can do in a Mapper or Reducer

Why Python? • Python vs. Java • Compiled vs. scripts • Python libraries we all love • Integration with other things

Why Not? • Python vs. Java • Almost nothing is native • Performance • Being out of date • Being “weird” • Smaller community, almost no official support

mrjob • Write MapReduce jobs in Python! • Open sourced and maintained by Yelp • Wraps “Hadoop Streaming” in CPython (Python 2.5+) • Well documented • Can run locally, in Amazon EMR, or on Hadoop

Canonical Word Count

    from mrjob.job import MRJob
    import re

    # A word is a run of word characters or apostrophes.
    WORD_RE = re.compile(r"[\w']+")

    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            # Emit (word, 1) for every word on the input line.
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def reducer(self, word, counts):
            # counts holds every 1 emitted for this word.
            yield (word, sum(counts))

    if __name__ == '__main__':
        MRWordFreqCount.run()

Canonical Word Count: mapper example • Input line: The quick brown fox jumps over the lazy dog • Mapper output: (the, 1), (quick, 1), (brown, 1), (fox, 1), (jumps, 1), (over, 1), (the, 1), (lazy, 1), (dog, 1)

Canonical Word Count: mapper example • Input line: I like this Hadoop thing • Mapper output: (i, 1), (like, 1), (this, 1), (hadoop, 1), (thing, 1)

Canonical Word Count: reducer example • Reducer input: dog, [1, 1, 1, 1, 1, 1] • Reducer output: dog, 6

Canonical Word Count: reducer example • Reducer input: cat, [1, 1, 1, 1, 1, 1, 1, 1] • Reducer output: cat, 8

Other options • Survey: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/ • Hadoop Streaming: more manual but faster • Hadoopy and Dumbo haven’t seen commits in years; mrjob has seen commits in the past 12 hours • Pydoop is the main competitor (not in this list)
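To show what "more manual" means for Hadoop Streaming, here is a rough sketch of a word count in that style. Streaming passes lines of text between stages, conventionally as tab-separated key and value, and the reducer sees its input sorted by key. The function names are hypothetical, and the stages are wired together in-process here rather than through stdin and stdout.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper stage: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reduce_lines(sorted_lines):
    """Reducer stage: input is sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

# Local simulation of the pipeline: map, sort (the shuffle), then reduce.
mapped = sorted(map_lines(["the cat", "The dog"]))
print(list(reduce_lines(mapped)))
```

In an actual streaming job each stage would live in its own script that reads `sys.stdin` and prints to standard output. Parsing and grouping are your responsibility, which is the bookkeeping that mrjob hides.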

Pydoop • Write MapReduce jobs in Python! • Uses Hadoop C++ Pipes, which should be faster than wrapping streaming • Actively being worked on • I’m not sure which is better

Pydoop Word Count

    with open('stop.txt') as f:
        STOP_WORDS = set(l.strip() for l in f if not l.isspace())

    def mapper(_, v, writer):
        for word in v.split():
            if word in STOP_WORDS:
                writer.count("STOP_WORDS", 1)
            else:
                writer.emit(word, 1)

    def reducer(word, icounts, writer):
        writer.emit(word, sum(map(int, icounts)))

    $ pydoop script wc.py hdfs_input hdfs_output --upload-file-to-cache stop.txt

Pig • Pig is a higher-level platform and language for analyzing data that happens to run MapReduce underneath

    a = LOAD 'inputdata.txt';
    b = FOREACH a GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    c = GROUP b BY word;
    d = FOREACH c GENERATE group, COUNT(b);
    STORE d INTO 'wc';

Pig UDFs • Users can write user-defined functions to extend the functionality of Pig • Can use Jython (faster) or CPython (access to more libs)

    -- Pig script calling the UDFs
    b = FOREACH a GENERATE revstr(phonenum);
    ...
    m = GROUP j BY username;
    n = FOREACH m GENERATE group, sortedconcat(j.tags);

    # Python UDFs
    @outputSchema("tags:chararray")
    def sortedconcat(bag):
        out = set()
        for tag in bag:
            out.add(tag)
        return '-'.join(sorted(out))

    @outputSchema("rev:chararray")
    def revstr(instr):
        return instr[::-1]
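UDF logic like this can be exercised in plain Python before it goes near a cluster. The sketch below makes two assumptions that are not on the slide: `outputSchema` is replaced by a local stand-in decorator so the file runs outside Pig, and the bag is assumed to arrive as a list of one-field tuples, so each tag is unpacked rather than added directly.

```python
def outputSchema(schema):
    """Local stand-in for Pig's decorator (assumption, for testing only)."""
    def decorator(func):
        func.outputSchema = schema
        return func
    return decorator

@outputSchema("tags:chararray")
def sortedconcat(bag):
    # Assumption: each bag element is a one-field tuple such as ("python",).
    out = set(item[0] for item in bag)
    return "-".join(sorted(out))

@outputSchema("rev:chararray")
def revstr(instr):
    return instr[::-1]

print(sortedconcat([("spark",), ("hadoop",), ("spark",)]))  # hadoop-spark
print(revstr("5551234"))                                    # 4321555
```

If the bag elements turn out to be plain strings in your setup, drop the `item[0]` unpacking.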

snakebite • A pure Python client for HDFS • Handles most NameNode ops (moving/renaming files, deleting files) • Handles most DataNode reading ops (reading files, getmerge) • Doesn’t handle writing to DataNodes yet • Two ways to use: library and command line interface

snakebite - Library

    from snakebite.client import Client

    client = Client("1.2.3.4", 54310, use_trash=False)

    # List everything under /data
    for x in client.ls(['/data']):
        print(x)

    # Concatenate the contents of all matching reference files
    print(''.join(client.cat('/data/ref/refdata*.csv')))

• Useful for doing HDFS file manipulation in data flows or job setups • Can be used to read reference data from MapReduce jobs

snakebite - CLI

    $ snakebite get /path/in/hdfs/mydata.txt /local/path/data.txt
    $ snakebite rm /path/in/hdfs/mydata.txt
    $ for fp in `snakebite ls /data/new/`; do
          snakebite mv "/data/new/$fp" "/data/in/`date '+%Y/%m/%d/'`$fp"
      done

• The “hadoop” CLI client is written in Java and spins up a new JVM every time (1-3 sec) • Snakebite doesn’t have that problem, making it good for lots of programmatic interactions with HDFS

HBase • From the website: Apache HBase is the Hadoop database, a distributed, scalable, big data store. When Would I Use Apache HBase? Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

HBase Python clients • Starbase or Happybase • Uses the HBase Thrift gateway interface (slow) • Last commit 6 months ago • Appears to be fully featured • Not really there yet and have failed to gain community momentum • Java is still king

Spark • From the website: Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. It also supports a rich set of higher-level tools including Shark (Hive on Spark), MLlib for machine learning, GraphX for graph processing, and Spark Streaming. • In general, Spark is faster than MapReduce and easier to write.

PySpark • Spark’s native language is Scala, but it also supports Java and Python • Python API is always a tad behind Scala • Programming in Spark (and PySpark) is in the form of chaining transformations and actions on RDDs • RDDs are “Resilient Distributed Datasets” • RDDs are kept in memory for the most part
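The difference between transformations and actions is easier to see in a toy model. The class below is not PySpark and does not mirror its internals; `ToyRDD` is a hypothetical, single-machine illustration of the idea that transformations only describe work, while an action triggers it.

```python
class ToyRDD(object):
    """Toy illustration of lazy chaining; this is not the PySpark API."""

    def __init__(self, source):
        self._source = source  # zero-argument callable returning an iterator

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, func):
        return ToyRDD(lambda: (func(x) for x in self._source()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: pull the data through the whole chain.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

calls = []

def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x

rdd = ToyRDD.from_list([1, 2, 3, 4]).map(square).filter(lambda x: x % 2 == 0)
print(calls)          # [] because no work has happened yet
print(rdd.collect())  # [4, 16]
print(calls)          # [1, 2, 3, 4]
```

The chain is only a recipe until `collect()` or `count()` is called, which is the pattern the word count example follows with `flatMap`, `map`, `reduceByKey`, and finally `collect`.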

PySpark Word Count Example

    from __future__ import print_function

    import sys
    from operator import add

    from pyspark import SparkContext

    if __name__ == "__main__":
        if len(sys.argv) != 2:
            print("Usage: wordcount <file>", file=sys.stderr)
            sys.exit(-1)

        sc = SparkContext(appName="PythonWordCount")
        lines = sc.textFile(sys.argv[1], 1)
        # Chain the transformations; collect() is the action that runs them.
        counts = (lines.flatMap(lambda x: x.split(' '))
                       .map(lambda x: (x, 1))
                       .reduceByKey(add))
        output = counts.collect()
        for (word, count) in output:
            print("%s: %i" % (word, count))

        sc.stop()
