Hadoop and Spark for the SAS Developer Richard Williamson | @superhadooper 10 June 2015
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience
AGENDA
• My Background
• Overview: SAS vs. Spark
• Spark DataFrame vs. SAS Dataset
• Spark SQL vs. SAS Proc SQL
• Spark MLlib vs. SAS Stats
• Spark Streaming
• Questions?
OVERVIEW: SAS vs. Spark

SAS
• SAS is the largest market-share holder in "advanced analytics," with 36.2% of the market as of 2012.*

Spark
• Launched in U.C. Berkeley's AMPLab in 2009, Apache Spark has caught on like wildfire during the last year and a half. Spark had more than 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among big data open source projects globally.**

* http://en.wikipedia.org/wiki/SAS_%28software%29
** http://techcrunch.com/2015/03/19/on-the-growth-of-apache-spark
OVERVIEW: SAS vs. Spark

SAS
• Basic programming model consists of the SAS Data Step and SAS Procedures
• SAS Datasets move data between processing steps

Spark
• Native language is Scala, which allows generic data types and a flexible programming model (Java and Python are also supported)
• RDDs (and now DataFrames) move distributed datasets between processing steps
OVERVIEW: SAS vs. Spark

SAS Code Snippet (http://support.sas.com/kb/24/595.html)

data old;
  input state $ accttot;
  datalines;
ca 7000
ca 6500
ca 5800
nc 4800
nc 3640
sc 3520
va 4490
va 8700
va 2850
va 1111
;

Spark Code Snippet

import sqlContext.implicits._

case class OLD(state: String, accttot: Int)
val oldList = List(
  OLD("va",1111), OLD("ca",7000), OLD("ca",6500), OLD("ca",5800),
  OLD("nc",4800), OLD("nc",3640), OLD("sc",3520), OLD("va",4490),
  OLD("va",8700), OLD("va",2850)
)
OVERVIEW: SAS vs. Spark

SAS Code Snippet (http://support.sas.com/kb/24/595.html)

proc sort data=old;
  by state;
run;

data new;
  set old (drop=accttot);
  by state;
  if first.state then count=0;
  count+1;
  if last.state then output;
run;

proc freq;
  tables state / out=new(drop=percent);
run;

Spark Code Snippet

val oldRDD = sc.parallelize(oldList)
var oldDataFrame = oldRDD.toDF()
oldDataFrame = oldDataFrame.orderBy("state")

// aggregateByKey requires a pair RDD, so key each record by state first
oldRDD.map(o => (o.state, 1))
  .aggregateByKey(0)((buffer, value) => buffer + value, (b1, b2) => b1 + b2)
  .foreach(println)

val newDataFrame = oldDataFrame.groupBy("state").count()
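Outside of Spark and SAS, the count-by-group logic that both snippets implement can be sketched in a few lines of plain Python, using the same data as the slide's `old` dataset:

```python
# Plain-Python sketch of the count-by-group logic from the SAS and Spark
# snippets above; the records match the slide's `old` dataset.
from collections import defaultdict

old = [("va", 1111), ("ca", 7000), ("ca", 6500), ("ca", 5800),
       ("nc", 4800), ("nc", 3640), ("sc", 3520), ("va", 4490),
       ("va", 8700), ("va", 2850)]

counts = defaultdict(int)
for state, _accttot in old:
    counts[state] += 1          # same fold as aggregateByKey's (buffer + value)

print(sorted(counts.items()))   # [('ca', 3), ('nc', 2), ('sc', 1), ('va', 4)]
```

The per-state tallies mirror what `groupBy("state").count()` produces in the Spark snippet and what the `first.state`/`last.state` loop produces in the SAS data step.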
Spark DataFrame vs. SAS Dataset

Spark DataFrame — a distributed collection of data organized into named columns.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#rename-of-schemardd-to-dataframe

SAS Dataset — a SAS file stored in a SAS library, organized as a table of observations (rows) and variables (columns) that can be processed by SAS software.
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001005709.htm
Spark DataFrame vs. SAS Dataset

How does a Spark DataFrame differ from a SAS Dataset?
• Built from the ground up to be distributed and processed in parallel by multiple machines, whereas the SAS Dataset has non-distributed roots
• A logical entity that is not necessarily paired with a serialized on-disk version, whereas a SAS Dataset has an on-disk manifestation
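The "logical entity" point can be illustrated without Spark at all. A Python generator pipeline, like a DataFrame's deferred query plan, describes a computation without materializing anything until it is consumed (a loose analogy only, not Spark's actual machinery):

```python
# Loose analogy for a DataFrame's lazy, logical nature: nothing below
# touches the data until list() forces evaluation, much as a Spark
# DataFrame builds up a query plan that runs only when an action is called.
rows = [("ca", 7000), ("nc", 4800), ("va", 1111)]

pipeline = (state.upper() for state, acct in rows if acct > 2000)  # no work yet

result = list(pipeline)  # evaluation happens here, like df.collect()
print(result)            # ['CA', 'NC']
```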
Spark SQL vs. SAS Proc SQL

• My first reaction to Spark SQL was, "This looks like Proc SQL"
• SAS Proc SQL Simple Example:

libname Example 'c:\SASPROJECTS';

proc sql;
  create table newtable as
  select a.*, b.unique_consumer_id
  from Example.transactions as a, Example.consumer as b
  where a.ref_id=b.ref_id;
quit;
Spark SQL vs. SAS Proc SQL

• Spark SQL Simple Example:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val newtable = sqlContext.sql("""
  select a.*, b.unique_consumer_id
  from transactions as a, consumer as b
  where a.ref_id = b.ref_id""")
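The inner join both examples express can be sketched in plain Python; the tiny tables below are hypothetical stand-ins for the transactions and consumer datasets:

```python
# Plain-Python sketch of the ref_id inner join that both the Proc SQL and
# Spark SQL examples express; tiny hypothetical tables stand in for the
# transactions and consumer datasets.
transactions = [{"ref_id": 1, "amount": 50}, {"ref_id": 2, "amount": 75}]
consumer = [{"ref_id": 1, "unique_consumer_id": "c-100"},
            {"ref_id": 2, "unique_consumer_id": "c-200"}]

# Index one side by the join key -- roughly the hash join a SQL engine
# builds for an equality predicate like a.ref_id = b.ref_id.
by_ref = {c["ref_id"]: c["unique_consumer_id"] for c in consumer}

newtable = [{**t, "unique_consumer_id": by_ref[t["ref_id"]]}
            for t in transactions if t["ref_id"] in by_ref]
print(newtable)
```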
Spark MLlib vs. SAS Stats

• Spark MLlib
  • https://spark.apache.org/docs/latest/mllib-guide.html
  • Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction
• SAS Stats
  • Traditional add-on package to SAS for statistics
Spark MLlib Example Data Prep

case class Meetup(mdatehr: String, mdate: String, mhour: String)
val meetup5 = meetup4.map(p => Meetup(p._1, p._2, p._3))
meetup5.registerTempTable("meetup5")

val meetup6 = sqlContext.sql("select mdate, mhour, count(*) as rsvp_cnt from meetup5 where mdatehr >= '2015-02-15 02' group by mdatehr, mdate, mhour")
meetup6.registerTempTable("meetup6")
sqlContext.sql("cache table meetup6")

// meetup7 is derived from meetup6 in steps elided from the slide
val trainingData = meetup7.map { row =>
  val features = Array[Double](row(24).toString().toDouble, row(0).toString().toDouble, …
  LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))
}
Spark MLlib Example Regression Model

val trainingData = meetup7.map { row =>
  val features = Array[Double](1.0, row(0).toString().toDouble, row(1).toString().toDouble, …
  LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))
}
val model = new RidgeRegressionWithSGD().run(trainingData)

val scores = meetup7.map { row =>
  val features = Vectors.dense(Array[Double](1.0, row(0).toString().toDouble, … row(23).toString().toDouble))
  (row(25), row(26), row(27), model.predict(features))
}
scores.foreach(println)
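RidgeRegressionWithSGD minimizes squared error plus an L2 penalty on the weights. A minimal single-feature sketch of that objective with plain gradient descent on toy data (illustrating the math only, not the MLlib implementation):

```python
# Toy sketch of the ridge objective behind RidgeRegressionWithSGD:
# minimize sum((w*x - y)^2)/n + lam*w^2 by gradient descent.
# Single feature, no intercept; hypothetical data, not MLlib's code.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]           # roughly y = 2x
lam, lr, w = 0.01, 0.01, 0.0

for _ in range(2000):
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad += 2 * lam * w             # L2 (ridge) penalty term
    w -= lr * grad

print(round(w, 2))                  # converges near 2.0
```

The L2 term shrinks the weight slightly below the ordinary least-squares fit; with `lam = 0` this reduces to plain linear regression via SGD, which is the trade-off the ridge variant makes against overfitting.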
Spark Streaming

val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "meetupstream",
  Map("meetupstream" -> 10)).map(_._2)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

lines.foreachRDD(rdd => {
  val lines2 = sqlContext.jsonRDD(rdd)
  lines2.registerTempTable("lines2")
  val lines3 = sqlContext.sql("select event.event_id, event.event_name, event.event_url, event.time, guests, member.member_id, member.member_name, member.other_services.facebook.identifier as facebook_identifier, member.other_services.linkedin.identifier as linkedin_identifier, member.other_services.twitter.identifier as twitter_identifier, member.photo, mtime, response, rsvp_id, venue.lat, venue.lon, venue.venue_id, venue.venue_name, visibility from lines2")
  // PERFORM LOGIC HERE, LIKE STREAMING REGRESSION
})
ssc.start()
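Spark Streaming's model is micro-batching: the Seconds(10) StreamingContext above slices the Kafka stream into small RDDs and runs the same batch logic on each one. A stripped-down Python sketch of that loop, with in-memory (timestamp, payload) events standing in for Kafka messages:

```python
# Stripped-down sketch of micro-batching: slice a stream of (timestamp,
# payload) events into fixed-width windows and run the same batch logic
# on each, as the Seconds(10) StreamingContext does with Kafka messages.
BATCH_SECONDS = 10
events = [(1, "rsvp"), (4, "rsvp"), (12, "rsvp"), (13, "rsvp"), (25, "rsvp")]

batches = {}
for ts, payload in events:
    batches.setdefault(ts // BATCH_SECONDS, []).append(payload)

for window, batch in sorted(batches.items()):
    # per-batch logic would go here (e.g. a streaming regression update)
    print(f"window {window}: {len(batch)} events")
```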
Key Takeaways
• If you work with large data or compute-intensive advanced analytics and want a platform built from the ground up to run fast on distributed servers, try out Spark
• If you would like more control over your code than an added macro language provides, try out Spark
• If you want to better leverage data stored in Hadoop, try out Spark
• If you prefer an open source licensing model over a subscription model, try out Spark
@SVDataScience Richard Williamson richard@svds.com @superhadooper Yes, we’re hiring! info@svds.com


Editor's Notes

  • #5 Retailer Inventory Mgmt
  • #6 SHOW of HANDS for SAS vs Spark development SparkR
  • #9 A Spark Dataframe is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. A SAS Dataset also contains descriptor information such as the data types and lengths of the variables, as well as which engine was used to create the data.
  • #12 Mention addition of Windowing functions in 1.4 and possibly pivot/transpose