Scalable Data Science in Python and R on Apache Spark Felix Cheung Principal Engineer & Apache Spark Committer
Disclaimer: Apache Spark community contributions
Agenda • Intro to Spark, PySpark, SparkR • ML with Spark • Data Science PySpark – ML Pipeline API – Integrating with packages (tensorframes, BigDL, …) • Data Science SparkR – ML API – Integrating with packages (svm, randomForest, tensorflow, …)
What is Spark? • General-purpose cluster computing system
Distributed Computation
What are the ways to use Spark? • Ingestion – Batch (static data) – Streaming (low latency) • ETL – SQL – Leveraging the Catalyst Optimizer, Tungsten, and the CBO (cost-based optimizer in Spark 2.2/2.3) • Distributed ML
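A minimal PySpark sketch of the ETL/SQL path above; the file paths and the date column are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Ingest: read semi-structured data into a DataFrame
events = spark.read.json("/data/events.json")      # hypothetical path

# ETL with SQL; Catalyst optimizes the query plan
events.createOrReplaceTempView("events")
daily = spark.sql("SELECT date, COUNT(*) AS n FROM events GROUP BY date")
daily.write.parquet("/data/daily_counts")          # hypothetical path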
Language Bindings
PySpark • Python API with Pandas-like DataFrame • Interface to Pandas, numpy • cloudpickle to serialize functions/closures
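A small sketch of the Pandas interop mentioned above, assuming pandas is installed and a local Spark session; column names are illustrative:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas DataFrame -> Spark DataFrame (schema inferred from pandas dtypes)
pdf = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.5, 0.9]})
df = spark.createDataFrame(pdf)

# distributed computation, then collect back to pandas on the driver
result = df.groupBy().avg("score").toPandas()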
Architecture • Python classes • Py4J • Daemon process
SparkR • R language APIs for Spark • Exposes Spark functionality through R, dplyr-like APIs • Runs as its own REPL (sparkR) or as an R package loaded in IDEs like RStudio: library(SparkR) sparkR.session()
Architecture • Native R classes and methods • RBackend • Scala wrapper/helper (eg. ML Pipeline) www.slideshare.net/SparkSummit/07-venkataraman-sun
High Performance • JVM processing, full access to DAG capabilities and Catalyst optimizer, predicate pushdown, code generation, etc. databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
ML/DL on Spark
End-to-End Data-Machine Learning Pipeline • Ingestion • Data preprocessing, cleaning, ETL • Machine Learning – models, predictions • Serving
Why in Spark? (diagram: Ingest, ETL, Machine Learning, Analytics)
Why in Spark • Existing workloads (eg. ETL) • Single application – Single language • Sampling, aggregation • Scale
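For example, the sampling/aggregation step ahead of model fitting might look like the following sketch; the data and the label column are synthetic stand-ins:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000000).withColumn("label", (F.col("id") % 3).cast("int"))

# down-sample, then aggregate to check class balance before fitting
sampled = df.sample(False, 0.1, seed=42)
sampled.groupBy("label").count().show()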
Spark in ML Architecture 1 (diagram: Ingest, ETL, Machine Learning, Serve)
Spark in ML Architecture 2 (diagram: Ingest, ETL, Machine Learning with external Packages, Model, Evaluate)
Spark in ML Architecture 3 (diagram: Ingest, ETL, Machine Learning, Model, Serve)
PySpark for Data Science
Spark ML Pipeline • Inspired by scikit-learn • Pre-processing, feature extraction, model fitting, validation stages • Transformer • Estimator • Evaluator • Cross-validation/hyperparameter tuning
Spark ML Pipeline (diagram: a DataFrame flows through Transformer and Estimator stages; Transformers handle feature engineering, the Estimator handles modeling)
Models
• Classification: Logistic regression (binomial and multinomial), Decision tree classifier, Random forest classifier, Gradient-boosted tree classifier, Multilayer perceptron classifier, One-vs-Rest classifier (a.k.a. One-vs-All), Naive Bayes
• Clustering: K-means, Latent Dirichlet allocation (LDA), Bisecting k-means, Gaussian Mixture Model (GMM)
• Collaborative Filtering: Alternating Least Squares (ALS)
• Regression: Linear regression, Generalized linear regression, Decision tree regression, Random forest regression, Gradient-boosted tree regression, Survival regression, Isotonic regression
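Every model above follows the same Estimator/Transformer contract; for instance, a clustering model from the list, as a sketch assuming an active SparkSession named spark and a tiny synthetic dataset:

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

kmeans = KMeans(k=2, seed=1)       # Estimator
model = kmeans.fit(data)           # fitting returns a Model, which is a Transformer
model.transform(data).show()       # adds a "prediction" column
print(model.clusterCenters())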
PySpark ML Pipeline Model
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)
prediction = model.transform(test)
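The cross-validation/hyperparameter tuning stage mentioned earlier wraps the same pipeline; a sketch continuing the example above (grid values are illustrative):

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

paramGrid = (ParamGridBuilder()
             .addGrid(hashingTF.numFeatures, [1000, 10000])
             .addGrid(lr.regParam, [0.1, 0.01])
             .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

cvModel = cv.fit(training)               # fits every parameter combination
prediction = cvModel.transform(test)     # uses the best model found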
Large Scale, Distributed • 2012 paper “Large Scale Distributed Deep Networks” Jeff Dean et al • Model parallelism • Data parallelism https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf
Spark as a Scheduler • Run external, often native packages • Data-parallel tasks • Distributed execution (diagram: external data)
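One common "Spark as a scheduler" pattern is to distribute a list of inputs and shell out to a native tool from each task; a sketch under assumptions: the mytool binary, its --input flag, and the paths are hypothetical, and sc is an existing SparkContext:

import subprocess

def run_native_tool(paths):
    # runs once per partition, on the executor's machine
    for p in paths:
        out = subprocess.check_output(["mytool", "--input", p])   # hypothetical binary
        yield (p, out.decode("utf-8"))

paths = ["/data/part-%04d" % i for i in range(100)]               # hypothetical inputs
results = sc.parallelize(paths, numSlices=20) \
            .mapPartitions(run_native_tool) \
            .collect()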
UDF • Register a function to use in SQL:
from pyspark.sql.types import IntegerType
spark.udf.register("stringLengthInt", lambda x: len(x), IntegerType())
spark.sql("SELECT stringLengthInt('test')")
• Function on an RDD:
rdd.map(lambda l: str(l.e)[:7])
rdd.mapPartitions(lambda x: [(1,), (2,), (3,)])
PySpark UDF (diagram: the Executor ships row batches to a Python worker, the UDF is applied row by row, and result rows are returned)
Considerations • Data transfer overhead (serialization/deserialization, network) • Native language (boxing, interpreted) • Lambda – one row at a time
PySpark Vectorized UDF • Apache Arrow – in-memory columnar – Row batch – Zero copy – Future: IPC • SPARK-21190
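SPARK-21190 later shipped as pandas_udf in Spark 2.3; a minimal sketch of that API, which receives whole Arrow batches as pandas.Series instead of single rows (assumes an active SparkSession named spark and pyarrow installed):

from pyspark.sql.functions import pandas_udf, col

@pandas_udf("double")          # scalar vectorized UDF
def plus_one(v):
    # v is a pandas.Series covering a whole batch, not a single row
    return v + 1

df = spark.range(0, 10).select(col("id").cast("double").alias("x"))
df.select(plus_one("x").alias("x_plus_one")).show()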
spark-sklearn • Train and evaluate multiple scikit-learn models – Single node parallel -> distributed • Dataframes -> numpy ndarrays or sparse matrices • GridSearchCV, gapply
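A sketch of the distributed GridSearchCV from the spark-sklearn package; it is meant as a drop-in for scikit-learn's class except that the constructor also takes the SparkContext (treat the exact signature as an assumption and check the package docs):

from sklearn import datasets, svm
from spark_sklearn import GridSearchCV   # distributed counterpart of sklearn's GridSearchCV

digits = datasets.load_digits()
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01]}

# each parameter combination is trained as a separate Spark task
gs = GridSearchCV(sc, svm.SVC(), param_grid)
gs.fit(digits.data, digits.target)
print(gs.best_params_)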
CaffeOnSpark • GPU • RDMA to sync DL models • MPI Allreduce http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop
CaffeOnSpark
def get_predictions(sqlContext, images, model, imagenet, lstmnet, vocab):
    rdd = images.mapPartitions(lambda im: predict_caption(im, model, imagenet, lstmnet, vocab))
    ...
    return sqlContext.createDataFrame(rdd, schema).select("result.id", "result.prediction")

def predict_caption(list_of_images, model, imagenet, lstmnet, vocab):
    out_iterator = []
    ce = CaptionExperiment(str(model), str(imagenet), str(lstmnet), str(vocab))
    for image in list_of_images:
        out_iterator.append(ce.getCaption(image))
    return iter(out_iterator)
https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/python/examples/ImageCaption.py
BigDL https://www.slideshare.net/SparkSummit/bigdl-a-distributed-deep-learning-library-on-spark-spark-summit-east-talk-by-yiheng-wang
BigDL • Distribute model training – P2P All Reduce – block manager as parameter server • Integration with Intel's Math Kernel Library (MKL) • Model Snapshot • Load Caffe/Torch Model
BigDL https://www.slideshare.net/SparkSummit/bigdl-a-distributed-deep-learning-library-on-spark-spark-summit-east-talk-by-yiheng-wang
BigDL - CNN MNIST
def build_model(class_num):
    model = Sequential()
    # Create a LeNet model
    model.add(Reshape([1, 28, 28]))
    model.add(SpatialConvolution(1, 6, 5, 5).set_name('conv1'))
    model.add(Tanh())
    model.add(SpatialMaxPooling(2, 2, 2, 2).set_name('pool1'))
    model.add(Tanh())
    model.add(SpatialConvolution(6, 12, 5, 5).set_name('conv2'))
    model.add(SpatialMaxPooling(2, 2, 2, 2).set_name('pool2'))
    model.add(Reshape([12 * 4 * 4]))
    model.add(Linear(12 * 4 * 4, 100).set_name('fc1'))
    model.add(Tanh())
    model.add(Linear(100, class_num).set_name('score'))
    model.add(LogSoftMax())
    return model

lenet_model = build_model(10)
state = {"learningRate": 0.4, "learningRateDecay": 0.0002}
optimizer = Optimizer(
    model=lenet_model,
    training_rdd=train_data,
    criterion=ClassNLLCriterion(),
    optim_method="SGD",
    state=state,
    end_trigger=MaxEpoch(20),
    batch_size=2048)
trained_model = optimizer.optimize()
predictions = trained_model.predict(test_data)
https://github.com/intel-analytics/BigDL-Tutorials/blob/master/notebooks/neural_networks/cnn.ipynb
mmlspark • Spark ML Pipeline model + rich data types: images, text • Deep Learning with CNTK – Train on a GPU edge node • Transfer Learning
# Initialize CNTKModel and define input and output columns
cntkModel = CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(modelFile)
# Score the dataset with the internal Spark pipeline
scoredImages = cntkModel.transform(imagesWithLabels)
tensorframes • DataFrame -> TensorFlow • Hyperparameter tuning • Apply trained model at scale • Row or block operations • JVM to C++ (bypassing Python)
tensorframes
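A small sketch adapted from the tensorframes README: a TensorFlow graph is applied block-wise to a DataFrame column (assumes the tensorframes package, a TensorFlow 1.x-style graph API, and an active SparkSession named spark):

import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

df = spark.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    # placeholder bound to column "x"; the graph is evaluated on whole row blocks
    x = tfs.block(df, "x")
    z = tf.add(x, 3, name="z")     # the output column takes the tensor's name, "z"
    df2 = tfs.map_blocks(z, df)

df2.show()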
spark-deep-learning • Spark ML Pipeline model + complex data: image – Transformer • DL featurization • Transfer Learning • Model as SQL UDF • Later : hyperparameter tuning
spark-deep-learning
predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3")
predictions_df = predictor.transform(df)

featurizer = DeepImageFeaturizer(modelName="InceptionV3")
p = Pipeline(stages=[featurizer, lr])

sparkdl.registerKerasUDF("img_classify", "/mymodels/dogmodel.h5")
SELECT image, img_classify(image) label FROM images
Ecosystem • DeepDist • CaffeOnSpark, TensorFlowOnSpark • Elephas (Keras) • Apache SystemML • Apache Hivemall (incubating) • Apache MxNet (incubating)
SparkR for Data Science
Spark ML Pipeline • Pre-processing, feature extraction, model fitting, validation stages • Transformer • Estimator • Cross-validation/hyperparameter tuning
SparkR API for ML Pipeline
spark.lda(data = text, k = 20, maxIter = 25, optimizer = "em")
(diagram: the R API call crosses into the JVM and builds an ML Pipeline: RegexTokenizer, StopWordsRemover, CountVectorizer, LDA)
Model Operations • summary - print a summary of the fitted model • predict - make predictions on new data • write.ml/read.ml - save/load fitted models (slight layout difference: pipeline model plus R metadata)
RFormula • Specify modeling in symbolic form y ~ f0 + f1 response y is modeled linearly by f0 and f1 • Support a subset of R formula operators ~ , . , : , + , - • Implemented as feature transformer in core Spark, available to Scala/Java, Python • String label column is indexed • String term columns are one-hot encoded
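Because RFormula is a core feature transformer, the same symbolic form is also usable from PySpark; a short sketch with illustrative data, assuming an active SparkSession named spark:

from pyspark.ml.feature import RFormula

df = spark.createDataFrame(
    [(7, "US", 18, 1.0), (8, "CA", 12, 0.0), (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])

formula = RFormula(formula="clicked ~ country + hour",
                   featuresCol="features", labelCol="label")

# string terms are one-hot encoded; numeric terms pass through
formula.fit(df).transform(df).select("features", "label").show()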
Generalized Linear Model
# R-like
glm(Sepal_Length ~ Sepal_Width + Species, gaussianDF, family = "gaussian")
spark.glm(binomialDF, Species ~ Sepal_Length + Sepal_Width, family = "binomial")
• for "binomial", the prediction is output as a string label
Multilayer Perceptron Model spark.mlp(df, label ~ features, blockSize = 128, layers = c(4, 5, 4, 3), solver = "l-bfgs", maxIter = 100, tol = 0.5, stepSize = 1)
Random Forest
spark.randomForest(df, Employed ~ ., type = "regression", maxDepth = 5, maxBins = 16)
spark.randomForest(df, Species ~ Petal_Length + Petal_Width, "classification", numTrees = 30)
• for "classification", the string label is indexed and the predicted label is mapped back to a string
Native-R UDF • User-Defined Functions - custom transformation • Apply by Partition • Apply by Group (diagram: data.frame in, UDF, data.frame out)
Parallel Processing By Partition (diagram: each partition is handed to an R worker process as a data.frame, the UDF runs on it, and a data.frame is returned)
UDF: Apply by Partition • Similar to R apply • Function to process each partition of a DataFrame • Mapping of Spark/R data types dapply(carsSubDF, function(x) { x <- cbind(x, x$mpg * 1.61) }, schema)
UDF: Apply by Partition + Collect • No schema out <- dapplyCollect( carsSubDF, function(x) { x <- cbind(x, "kmpg" = x$mpg*1.61) })
Example - UDF
results <- dapplyCollect(train, function(x) {
  model <- randomForest::randomForest(
    as.factor(dep_delayed_15min) ~ Distance + night + early,
    data = x, importance = TRUE, ntree = 20)
  predictions <- predict(model, t)
  data.frame(UniqueCarrier = t$UniqueCarrier, delayed = predictions)
})
Closure capture: "t" is serialized and broadcast; the package is accessed via "randomForest::" at each invocation
UDF: Apply by Group • By grouping columns gapply(carsDF, "cyl", function(key, x) { y <- data.frame(key, max(x$mpg)) }, schema)
UDF: Apply by Group + Collect • No Schema out <- gapplyCollect(carsDF, "cyl", function(key, x) { y <- data.frame(key, max(x$mpg)) names(y) <- c("cyl", "max_mpg") y })
UDF Considerations • No support for nested structures as columns • Scaling up / data skew • Data variety • Performance costs • Serialization/deserialization, data transfer • Package management
UDF: lapply • Like R lapply or doParallel • Good for “embarrassingly parallel” tasks • Such as hyperparameter tuning
UDF: lapply • Take a native R list, distribute it • Run the UDF in parallel (diagram: each list element goes through the UDF and can return *anything*; results come back as a list)
UDF: parallel distributed processing • Output is a list - needs to fit in memory at the driver
costs <- exp(seq(from = log(1), to = log(1000), length.out = 5))
train <- function(cost) {
  model <- e1071::svm(Species ~ ., iris, cost = cost)
  summary(model)
}
summaries <- spark.lapply(costs, train)
UDF: model training tensorflow
train <- function(step) {
  library(tensorflow)
  sess <- tf$InteractiveSession()
  ...
  result <- sess$run(list(merged, train_step), feed_dict = feed_dict(TRUE))
  summary <- result[[1]]
  train_writer$add_summary(summary, step)
  step
}
spark.lapply(1:20000, train)
SparkR as a Package (target: 2.2) • Goal: simple one-line installation of SparkR from CRAN: install.packages("SparkR") • Spark JAR downloaded from the official release and cached automatically, or manually with install.spark(), since Spark 2.0 • R vignettes • Community can write packages that depend on the SparkR package, eg. SparkRext • Advanced Spark JVM interop APIs: sparkR.newJObject, sparkR.callJMethod, sparkR.callJStatic
Ecosystem • RStudio sparklyr • RevoScaleR/RxSpark, R Server • H2O R • Apache SystemML (R-like API) • STC R4ML • Apache MxNet (incubating)
Recap • Let Spark take care of things or call into packages • Be aware of your partitions! • Many efforts to make that seamless and efficient
Thank You. https://github.com/felixcheung linkedin: http://linkd.in/1OeZDb7
