Guagua: An Iterative Computing Framework on Hadoop Zhang Pengshan(David), PayPal
AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
ALIPAY vs. PAYPAL Q: Where is risk control in PayPal? A: Risk control is everywhere in paypal.com.
FRAUD TYPES IN PAYPAL Fraud Types in PayPal Account Take Over Stolen Financials INR/SNAD Credit Cards INR: Item Not Received SNAD: Significantly Not as Described
RISK CONTROL IN PAYPAL Models Rules Agents
RISK MODELING IN PAYPAL MODELING CHALLENGES Thousands of Features Algorithms (LR, NN, DT) Big Training Data SLA (Online) Simulation
RISK MODELING IN PAYPAL MODELING CHALLENGES Thousands of Features Algorithms (LR, NN, DT) Big Training Data SLA (Online) Simulation Q: How to train models with TB data and thousands of features?
AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
DISTRIBUTED NEURAL NETWORK ALGORITHM* Worker Worker Worker GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] Master ACCUMULATE GRADIENTS UPDATE WEIGHTS Worker Worker Worker GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] Master … 1st iteration 2nd iteration … GRADIENTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] ACCUMULATE GRADIENTS UPDATE WEIGHTS WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] * Distributed batch gradient descent algorithm.
DISTRIBUTED NEURAL NETWORK ALGORITHM Worker Worker Worker GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] Master ACCUMULATE GRADIENTS UPDATE WEIGHTS Worker Worker Worker GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] Master … 1st iteration 2nd iteration … GRADIENTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] ACCUMULATE GRADIENTS UPDATE WEIGHTS WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] Q: How to implement it?
WHY NOT MAHOUT OR SPARK? Mahout • No distributed logistic regression & neural network. • Iterative through Hadoop jobs, bad performance. Spark • No independent Spark cluster. • Hadoop cluster is still 1.0 based, not YARN. Q: How to implement it in Hadoop?
POSSIBLE SOLUTIONS Hadoop YARN Hadoop MapReduce Pros Flexible framework for framework Works well on all Hadoop versions Self resource management Mature computing model Internal fault tolerance, splits, UI … Cons 2.0.3-Alpha Different computing model PayPal Clusters: Hadoop 0.20.2 How to do iterative coordination? Extra fault tolerance, splits, UI … Q: How to implement it in Hadoop MapReduce?
AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
ITERATIVE COMPUTING MODEL IN GUAGUA Worker Worker Worker WORKER RESULT WORKER RESULT Master Worker Worker Worker WORKER RESULT WORKER RESULT WORKER RESULT Master … 1st iteration 2nd iteration … WORKER RESULT MASTER RESULT MASTER RESULT MASTER RESULT MASTER RESULT MASTER RESULT MASTER RESULT Guagua is a framework over such iterative computing model, compared with Hadoop 1.0 over MapReduce.
GUAGUA API MasterComputable WorkerComputable
GUAGUA OVERVIEW Iterative Computing Framework CORE Master- Workers Core MapReduce Adapter (For Hadoop 1.0) Coordination YARN Adapter (For Hadoop 2.0) Fault Tolerance Consistent Client Distributed Neural Network Application
PLUGGABLE, SCALABLE INTERCEPTORS Master Fault Tolerance Interceptor ZooKeeper Coordinator Perf Interceptor Timer User Defined Interceptors Master Computation Worker Fault Tolerance Interceptor ZooKeeper Coordinator Perf Interceptor Timer User Defined Interceptors Worker Computation * These two graphs are aspects for each iteration.
GUAGUA RUNTIME Master: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container)
GUAGUA RUNTIME Master: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) ZooKeeper Cluster REGISTER REGISTER REGISTER REGISTER REGISTER REGISTER 1. Master is listening znodes of workers. 2. Workers are listening znode of master.
GUAGUA RUNTIME Master: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) ZooKeeper Cluster UPDATE ITER UPDATE ITER UPDATE IITER UPDATE ITER UPDATE ITER UPDATE ITER 1. Data is loaded in worker memory in the first iteration. 2. Whole process is done when reaches maximal iteration or halt condition is triggered.
AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
FAULT TOLERANCE Master Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker … … … … … … 1 2 3 4 … n
FAULT TOLERANCE Master Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker … … … … … 1 2 3 4 … n * The same on workers.
STRAGGLER MITIGATION Master Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Master Worker 1 2 3 Worker Worker Worker Worker
STRAGGLER MITIGATION Master Worker Worker Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Master Worker Worker Worker Worker 1 2 3
STRAGGLER MITIGATION Master Worker Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Master Worker Worker Worker Worker 1 2 3
PROGRESS AND STATE REPORT 0.86 = 432/501 (Current Iteration) / (Total Iteration)
GUAGUA UNIT
AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
WHAT IS SHIFU? Shifu* is an open-source, end-to-end machine learning and data mining framework built on top of Hadoop. NEW INIT STATS VARSELECT NORMALIZE POSTTRAIN TRAIN EVAL Built on Guagua *Want to try Shifu? Please visit http://shifu.ml.
SHIFU ON GUAGUA (TRAIN STEP) NNMaster MasterInterceptor NNWorker NNOutput MasterComputable WorkerComputable AbstractWorkerComputable BasicMasterInterceptor Gradients Weights GUAGUA API SHIFU CODE ENCOG CODE
SHIFU NN vs. SPARK LR Run Time Comparison Shifu-NN: 1102*20*1 Network, 319 Mappers * 1G Spark-LR: 1102 features, 120 executors * 3G 45 40 35 30 25 20 15 10 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Time Iteration Shifu-NN Spark-LR
SHIFU NN BENCHMARK RESULTS All data are located in memory. At most we used 2400 mappers. 20 epochs are used. 1400 1200 1000 800 600 400 200 0 125G 375G 500G 625G 750G 875G 1000G Run Time (Seconds) Size of Input Time(Seconds)
AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
WHAT’S NEXT? • More open source docs • Support more (distributed) machine learning algorithms • Improve YARN (Beta) implementation • Support more input formats • Big model support • Deep learning support
Q&A
APPENDIX • Website • http://shifu.ml • http://shifu.ml/docs/guagua/ • Guagua issue website • https://github.com/shifuml/shifu/issues • https://github.com/shifuml/guagua/issues • Shifu & Guagua source code: • https://github.com/shifuml/shifu/ • https://github.com/shifuml/guagua/
@InfoQ infoqchina

Guagua an iterative computing framework on hadoop

  • 2.
    Guagua: An IterativeComputing Framework on Hadoop Zhang Pengshan(David), PayPal
  • 3.
    AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
  • 4.
    AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
  • 5.
    ALIPAY vs. PAYPAL Q: Where is risk control in PayPal? A: Risk control is everywhere in paypal.com.
  • 6.
    FRAUD TYPES INPAYPAL Fraud Types in PayPal Account Take Over Stolen Financials INR/SNAD Credit Cards INR: Item Not Received SNAD: Significantly Not as Described
  • 7.
    RISK CONTROL INPAYPAL Models Rules Agents
  • 8.
    RISK MODELING INPAYPAL MODELING CHALLENGES Thousands of Features Algorithms (LR, NN, DT) Big Training Data SLA (Online) Simulation
  • 9.
    RISK MODELING INPAYPAL MODELING CHALLENGES Thousands of Features Algorithms (LR, NN, DT) Big Training Data SLA (Online) Simulation Q: How to train models with TB data and thousands of features?
  • 10.
    AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
  • 11.
    DISTRIBUTED NEURAL NETWORKALGORITHM* Worker Worker Worker GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] Master ACCUMULATE GRADIENTS UPDATE WEIGHTS Worker Worker Worker GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] Master … 1st iteration 2nd iteration … GRADIENTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] ACCUMULATE GRADIENTS UPDATE WEIGHTS WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] * Distributed batch gradient descent algorithm.
  • 12.
    DISTRIBUTED NEURAL NETWORKALGORITHM Worker Worker Worker GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] Master ACCUMULATE GRADIENTS UPDATE WEIGHTS Worker Worker Worker GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] GRADIENTS: DOUBLE [] Master … 1st iteration 2nd iteration … GRADIENTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] ACCUMULATE GRADIENTS UPDATE WEIGHTS WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] WEIGHTS: DOUBLE [] Q: How to implement it?
  • 13.
    WHY NOT MAHOUTOR SPARK? Mahout • No distributed logistic regression & neural network. • Iterative through Hadoop jobs, bad performance. Spark • No independent Spark cluster. • Hadoop cluster is still 1.0 based, not YARN. Q: How to implement it in Hadoop?
  • 14.
    POSSIBLE SOLUTIONS HadoopYARN Hadoop MapReduce Pros Flexible framework for framework Works well on all Hadoop versions Self resource management Mature computing model Internal fault tolerance, splits, UI … Cons 2.0.3-Alpha Different computing model PayPal Clusters: Hadoop 0.20.2 How to do iterative coordination? Extra fault tolerance, splits, UI … Q: How to implement it in Hadoop MapReduce?
  • 15.
    AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
  • 16.
    ITERATIVE COMPUTING MODELIN GUAGUA Worker Worker Worker WORKER RESULT WORKER RESULT Master Worker Worker Worker WORKER RESULT WORKER RESULT WORKER RESULT Master … 1st iteration 2nd iteration … WORKER RESULT MASTER RESULT MASTER RESULT MASTER RESULT MASTER RESULT MASTER RESULT MASTER RESULT Guagua is a framework over such iterative computing model, compared with Hadoop 1.0 over MapReduce.
  • 17.
    GUAGUA API MasterComputable WorkerComputable
  • 18.
    GUAGUA OVERVIEW Iterative Computing Framework CORE Master- Workers Core MapReduce Adapter (For Hadoop 1.0) Coordination YARN Adapter (For Hadoop 2.0) Fault Tolerance Consistent Client Distributed Neural Network Application
  • 19.
    PLUGGABLE, SCALABLE INTERCEPTORS Master Fault Tolerance Interceptor ZooKeeper Coordinator Perf Interceptor Timer User Defined Interceptors Master Computation Worker Fault Tolerance Interceptor ZooKeeper Coordinator Perf Interceptor Timer User Defined Interceptors Worker Computation * These two graphs are aspects for each iteration.
  • 20.
    GUAGUA RUNTIME Master:Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container)
  • 21.
    GUAGUA RUNTIME Master:Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) ZooKeeper Cluster REGISTER REGISTER REGISTER REGISTER REGISTER REGISTER 1. Master is listening znodes of workers. 2. Workers are listening znode of master.
  • 22.
    GUAGUA RUNTIME Master:Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) Worker: Mapper (Container) ZooKeeper Cluster UPDATE ITER UPDATE ITER UPDATE IITER UPDATE ITER UPDATE ITER UPDATE ITER 1. Data is loaded in worker memory in the first iteration. 2. Whole process is done when reaches maximal iteration or halt condition is triggered.
  • 23.
    AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
  • 24.
    FAULT TOLERANCE Master Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker … … … … … … 1 2 3 4 … n
  • 25.
    FAULT TOLERANCE Master Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker … … … … … 1 2 3 4 … n * The same on workers.
  • 26.
    STRAGGLER MITIGATION Master Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Worker Master Worker 1 2 3 Worker Worker Worker Worker
  • 27.
    STRAGGLER MITIGATION Master Worker Worker Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Master Worker Worker Worker Worker 1 2 3
  • 28.
    STRAGGLER MITIGATION Master Worker Worker Worker Worker Worker Worker Master Worker Worker Worker Worker Master Worker Worker Worker Worker 1 2 3
  • 29.
    PROGRESS AND STATEREPORT 0.86 = 432/501 (Current Iteration) / (Total Iteration)
  • 30.
  • 31.
    AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
  • 32.
    WHAT IS SHIFU? Shifu* is an open-source, end-to-end machine learning and data mining framework built on top of Hadoop. NEW INIT STATS VARSELECT NORMALIZE POSTTRAIN TRAIN EVAL Built on Guagua *Want to try Shifu? Please visit http://shifu.ml.
  • 33.
    SHIFU ON GUAGUA(TRAIN STEP) NNMaster MasterInterceptor NNWorker NNOutput MasterComputable WorkerComputable AbstractWorkerComputable BasicMasterInterceptor Gradients Weights GUAGUA API SHIFU CODE ENCOG CODE
  • 34.
    SHIFU NN vs.SPARK LR Run Time Comparison Shifu-NN: 1102*20*1 Network, 319 Mappers * 1G Spark-LR: 1102 features, 120 executors * 3G 45 40 35 30 25 20 15 10 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Time Iteration Shifu-NN Spark-LR
  • 35.
    SHIFU NN BENCHMARKRESULTS All data are located in memory. At most we used 2400 mappers. 20 epochs are used. 1400 1200 1000 800 600 400 200 0 125G 375G 500G 625G 750G 875G 1000G Run Time (Seconds) Size of Input Time(Seconds)
  • 36.
    AGENDA • Introduction • Distributed Neural Network Algorithm • What is Guagua? • Guagua Advanced Features • Shifu on Guagua • Future Plans
  • 37.
    WHAT’S NEXT? •More open source docs • Support more (distributed) machine learning algorithms • Improve YARN (Beta) implementation • Support more input formats • Big model support • Deep learning support
  • 38.
  • 39.
    APPENDIX • Website • http://shifu.ml • http://shifu.ml/docs/guagua/ • Guagua issue website • https://github.com/shifuml/shifu/issues • https://github.com/shifuml/guagua/issues • Shifu & Guagua source code: • https://github.com/shifuml/shifu/ • https://github.com/shifuml/guagua/
  • 41.