Hadoop Ecosystem Framework: Hadoop in a Live Environment - Ashish Agrawal
Outline
- Introduction to Hadoop & distributed file systems
- Architecture of the Hadoop ecosystem (HBase/Pig) & setting up Hadoop single/multi-node clusters
- Introduction to MapReduce & running sample programs on Hadoop
- Hadoop ecosystem framework: Hadoop in a live environment
Hadoop Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Mahout, ZooKeeper
HDFS Architecture
MapReduce Flow (by Ricky Ho)
HBase Architecture
Job Schedulers: cron jobs, Chain MapReduce, Azkaban (by LinkedIn), Oozie (by Yahoo!)
Overview of Oozie
- Manages data processing jobs
- Offers a scalable, data-oriented service
- Manages dependencies between jobs
- Supports job execution in topological order
- Provides time- and event-driven triggering mechanisms
Overview of Oozie
- Supports MapReduce, Pig, filesystem, and Java applications, as well as MapReduce Streaming and Pipes, as action nodes
- Action nodes are connected through dependency edges
- Decision, fork, and join nodes are used for flow control
Overview of Oozie
- Actions and decisions depend on job properties, Hadoop counters, or file/directory status
- A workflow application contains the workflow definition file, jar files, native and third-party libraries, resource files, and Pig scripts (a submission sketch follows)
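As a rough illustration of how such a workflow application might be submitted and monitored, the sketch below uses the Oozie Java client API. The server URL, HDFS application path, and the jobTracker/nameNode properties are placeholders for this example, not values taken from these slides.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // Connect to the Oozie server (URL is a placeholder).
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        // Build the job configuration; APP_PATH points to the HDFS directory
        // holding workflow.xml, jars, libraries, and any Pig scripts.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/wf-app");
        conf.setProperty("jobTracker", "jobtracker:8021");
        conf.setProperty("nameNode", "hdfs://namenode:8020");

        // Submit and start the workflow, then poll its status until it leaves RUNNING.
        String jobId = client.run(conf);
        while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10 * 1000);
        }
        System.out.println("Workflow " + jobId + " finished: " + client.getJobInfo(jobId).getStatus());
    }
}
```

The workflow definition itself (workflow.xml, with its action, decision, fork, and join nodes) lives under the application path on HDFS alongside the jars and scripts listed above.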
Oozie vs. Azkaban
- Oozie can be restarted from the point of failure; Azkaban cannot
- Oozie keeps workflow state in a database, while Azkaban keeps it in memory
- Azkaban fixes the execution path before starting a job, while Oozie lets decision nodes decide at runtime
- Azkaban does not support event triggers
- Azkaban is used for simpler workflows
Chain MR
- Chains multiple mapper classes in a single map task, which saves a lot of I/O
- The output of the immediately preceding mapper is fed as input to the current mapper
- The output of the last mapper is written as the task output
- Supports passing key/value pairs to the next mapper by reference, to save [de]serialization time
- ChainReducer supports chaining multiple mapper classes after the reducer within the reducer task
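A minimal sketch of wiring up such a chain with the old mapred API's ChainMapper/ChainReducer; AMap, BMap, Reduce, and CMap are hypothetical mapper/reducer implementations, and the input/output paths come from the command line.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainedJob {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ChainedJob.class);
        job.setJobName("chained job");
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job, args[0]);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Map phase: AMap then BMap run back-to-back inside the same map task.
        // The boolean selects pass-by-value (true) vs. by-reference (false);
        // by-reference skips [de]serialization but the next mapper must not mutate the pair.
        ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class,
                              Text.class, Text.class, true, new JobConf(false));
        ChainMapper.addMapper(job, BMap.class, Text.class, Text.class,
                              LongWritable.class, Text.class, false, new JobConf(false));

        // Reduce phase: Reduce runs first, then CMap is chained after it in the same reduce task.
        ChainReducer.setReducer(job, Reduce.class, LongWritable.class, Text.class,
                                Text.class, Text.class, true, new JobConf(false));
        ChainReducer.addMapper(job, CMap.class, Text.class, Text.class,
                               LongWritable.class, Text.class, false, new JobConf(false));

        JobClient.runJob(job);
    }
}
```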
Oozie Flow: an example workflow with Start, MapReduce, Fork, MR Streaming, Pig, Join, Decision, MR Pipes, Java, FileSystem, and End nodes
Performance Tuning Parameters
- Network bandwidth: gigabit network
- Disk throughput: SCSI drives
- Memory usage: ECC RAM
- CPU overhead for thread handling
- HDFS block size
- Maximum number of requests allowed in progress
- Per-user file descriptor limit: needs to be set high
- Running the balancer
Performance Tuning Parameters
- Sufficient space for the temp directory
- Compressed data storage
- Speculative task execution
- Use of a combiner function (must be associative & commutative)
- Selection of job scheduler: FIFO/Capacity/Fair
- Number of mappers: fewer, larger files are preferred
- Number of reducers: slightly fewer than the number of nodes
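To illustrate the combiner and reducer-count points above, here is a minimal word-count-style job using the old mapred API; TokenizeMapper and SumReducer are hypothetical classes, and the reducer count is only illustrative.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(WordCountJob.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // TokenizeMapper and SumReducer are hypothetical; summing counts is
        // associative and commutative, so the reducer doubles as the combiner.
        job.setMapperClass(TokenizeMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);

        // Slightly fewer reducers than available reduce slots, so one wave of
        // reduces finishes together; 9 is illustrative for a ~10-slot cluster.
        job.setNumReduceTasks(9);

        FileInputFormat.setInputPaths(job, args[0]);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
    }
}
```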
Performance Tuning Parameters
- Compression of intermediate data from mappers
- Sort buffer size (io.sort.mb): larger if a mapper has to write a lot of data
- Sort factor (io.sort.factor): set higher for larger jobs (the number of input streams that can be merged at once)
- mapred.reduce.parallel.copies: higher for large jobs
- dfs.namenode.handler.count & dfs.datanode.handler.count: higher for large clusters
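A sketch of how these knobs might be set per job; the values are placeholders to show the mechanism, not tuning recommendations.

```java
import org.apache.hadoop.mapred.JobConf;

public class TunedJobConf {
    // Illustrative values only; tune against your own jobs and cluster.
    public static JobConf create() {
        JobConf job = new JobConf();

        // Compress intermediate map output to cut shuffle I/O.
        job.setCompressMapOutput(true);

        job.setInt("io.sort.mb", 200);                    // map-side sort buffer, in MB
        job.setInt("io.sort.factor", 50);                 // streams merged in one pass
        job.setInt("mapred.reduce.parallel.copies", 20);  // parallel map-output fetches per reducer

        // dfs.namenode.handler.count and dfs.datanode.handler.count are cluster-wide
        // settings that belong in hdfs-site.xml on the NameNode/DataNodes, not per job.
        return job;
    }
}
```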
Tips: use an appropriate MapReduce language
- Java: speed, control, and binary data; working with existing libraries
- Pipes: working with existing C++ libraries
- Streaming: writing MR in scripting languages, e.g. Dumbo (Python), Happy (Jython), Wukong (Ruby)
- Pig, Hive, Cascading: for nested data, joins, etc.
- Thumb rule: pure Java for large, recurring jobs; Hive for SQL-style analysis; Pig/Streaming for ad-hoc analysis
Tips
- A few larger files are preferred over many smaller files
- Report progress: for CPU-intensive jobs, increase mapred.task.timeout (default 10 minutes)
- Use the distributed cache: to make data available to all mappers/reducers (for example, a lookup hash map), and to make auxiliary jars available to mappers/reducers
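A small sketch of the distributed-cache usage described above, using the old mapred API; the HDFS paths are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
    public static void configure(JobConf job) throws Exception {
        // Ship a lookup file to every task node; tasks can open it locally
        // via DistributedCache.getLocalCacheFiles(conf).
        DistributedCache.addCacheFile(new URI("/user/demo/lookup.dat"), job);

        // Put an auxiliary jar on the task classpath.
        DistributedCache.addFileToClassPath(new Path("/user/demo/libs/helper.jar"), job);
    }
}
```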
Tips: use SequenceFile and MapFile
- Splittable: unlike other compressed formats they are MapReduce-friendly, and each map gets an independent split to work on
- Compressible: with block compression you get the benefits of compression (less disk space, faster reads and writes) while keeping the file splittable
- Compact: SequenceFiles are usually used with Hadoop Writable objects, which have a fairly compact format
- A MapFile is an indexed SequenceFile, useful if you want to do lookups by key
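A short sketch of writing a block-compressed SequenceFile; the output path and records below are made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/events.seq");  // placeholder output path

        // Block compression compresses batches of records together while
        // keeping the file splittable for MapReduce.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, IntWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK);
        try {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            writer.close();
        }
    }
}
```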
Mahout (machine learning library)
- Collaborative filtering: user- and item-based recommenders
- K-Means and Fuzzy K-Means clustering
- Mean Shift clustering
- Dirichlet process clustering
- Latent Dirichlet Allocation
- Singular value decomposition
- Parallel frequent pattern mining
- Complementary Naive Bayes classifier
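As an illustration of the user-based recommenders listed above, here is a rough sketch using Mahout's Taste API; ratings.csv (userID,itemID,preference rows) is an assumed input file.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
    public static void main(String[] args) throws Exception {
        // CSV of userID,itemID,preference triples (placeholder file).
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based recommender: the nearest similar users vote for items.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 5 recommendations for user 1.
        List<RecommendedItem> recs = recommender.recommend(1, 5);
        for (RecommendedItem item : recs) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```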
Different minds, different interpretations: http://www.youtube.com/watch?v=9izUKE5bN0U
Hadoop in a live environment: Google, Yahoo!, Amazon, LinkedIn, Facebook, StumbleUpon, Nokia, Last.fm, Clickable
@Google: Google uses it for indexing the web, computing PageRank, processing geographic information in Google Maps, clustering news articles, machine translation, Google Trends, etc.
@Google: an example: 403,152 TB of data, 394 machines allocated, completion time of six and a half minutes. The Google indexing system uses 20 TB of data. Bigtable (the model for HBase) is used for many Google products such as Orkut and Finance. Sawzall is used for massive log processing.
@Yahoo!: the two-quadrillionth bit of π is 0! One of the largest computations took 23 days of wall-clock time and 503 years of CPU time on a 1000-node cluster. Yahoo! has 4000 nodes in a Hadoop cluster. The following slides have been taken from the Open Cirrus Summit 2009.
Hadoop is critical to Yahoo!'s business: when you visit Yahoo!, you are interacting with data processed with Hadoop! Ads optimization, content optimization, search index, content feed processing, machine learning (e.g. spam filters).
Tremendous Impact on Productivity
- Makes developers & scientists more productive: key computations solved in days, not months; projects move from research to production in days; easy to learn, even our rocket scientists use it!
- The major factors: you don't need to find new hardware to experiment; you can work with all your data; production and research are based on the same framework; no need for R&D to do IT (it just works)
Search & Advertising Sciences, Hadoop application: Search Assist™
- The database for Search Assist™ is built using Hadoop: 3 years of log data, a 20-step MapReduce pipeline
- Before Hadoop → after Hadoop: time 26 days → 20 minutes; language C++ → Python; development time 2-3 weeks → 2-3 days
Largest Hadoop Clusters in the Universe
- 25,000+ nodes (~200,000 cores), with clusters of up to 4,000 nodes
- 4 tiers of clusters:
  - Development, testing, and QA (~10%)
  - Proof of concept and ad-hoc work (~10%): runs the latest version of Hadoop, currently 0.20
  - Science and research (~60%): runs more stable versions
  - Production (~20%): currently Hadoop 0.18.3
Large Hadoop-Based Applications (2008 → 2009)
- Webmap: ~70 hours runtime, ~300 TB shuffled, ~200 TB output, 1480 nodes → ~73 hours runtime, ~490 TB shuffled, ~280 TB output, 2500 nodes
- Sort benchmarks (Jim Gray contest): 1 TB sorted in 209 seconds on 900 nodes → 1 TB sorted in 62 seconds on 1500 nodes, and 1 PB sorted in 16.25 hours on 3700 nodes
- Largest cluster: 2000 nodes, 6 PB raw disk, 16 TB of RAM, 16K CPUs → 4000 nodes, 16 PB raw disk, 64 TB of RAM, 32K CPUs (40% faster CPUs too)
@Facebook
- Claims to have the largest single Hadoop cluster in the world
- Has multiple clusters at separate data centers
- The largest warehouse cluster currently spans 3000 machines
- Scans around 2 petabytes per day
- 300 people throughout the company query this warehouse every month
@Facebook
- Facebook Messages uses HBase in production
- Collects click logs in near real time from web servers and streams them directly into Hadoop clusters
- Medium-term archiving of MySQL databases
- Fast backup and recovery from data stored in HDFS
- Reduces maintenance and deployment costs for archiving petabyte-size datasets
@Nokia
- Started using Hadoop in August 2009 in the search analytics team, with a 15-machine cluster
- Used to analyse large-scale search logs for various analytics purposes: search relevance calculation; duplicate-place handling and data cleaning; fuzzy query parsing and tagging for spelling correction and the look-ahead suggestion model
@Clickable
- Uses HBase, HDFS, and MapReduce for various purposes such as data storage, analytics, reporting, and recommendations
- 7-machine cluster for production
- Uses HBase to handle continuous data updates from networks or any other user action at our end
@StumbleUpon
- Log early, log often, log everything: no piece of data is too small or too noisy to be used in the future
- Uses Hadoop for Apache log file processing, session analysis, and spam detection
@StumbleUpon
- Uses Scribe to collect data directly into HDFS, where it is reviewed and processed by a number of systems
- Uses MapReduce to extract data from logs for click counts
- Also used for search index updates, thumbnail creation, and recommendation systems
Questions?