Hadoop and Spark for the SAS Developer Richard Williamson | @superhadooper 10 June 2015
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience
AGENDA
• My Background
• Overview: SAS vs. Spark
• Spark DataFrame vs. SAS Dataset
• Spark SQL vs. SAS Proc SQL
• Spark MLlib vs. SAS Stats
• Spark Streaming
• Questions?
OVERVIEW: SAS vs. Spark

SAS
• SAS is the largest market-share holder in "advanced analytics," with 36.2% of the market as of 2012.*

Spark
• Launched in U.C. Berkeley's AMPLab in 2009, Apache Spark has caught on like wildfire during the last year and a half. Spark had more than 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among big data open source projects globally.**

* http://en.wikipedia.org/wiki/SAS_%28software%29
** http://techcrunch.com/2015/03/19/on-the-growth-of-apache-spark
OVERVIEW: SAS vs. Spark

SAS
• Basic programming model consists of the SAS Data Step and SAS Procedures
• SAS Datasets move data between processing steps

Spark
• Native language is Scala, which allows generic data types and a flexible programming model (Java and Python are also supported)
• RDDs (and now DataFrames) move distributed datasets between processing steps
OVERVIEW: SAS vs. Spark

SAS Code Snippet (http://support.sas.com/kb/24/595.html)

data old;
  input state $ accttot;
  datalines;
ca 7000
ca 6500
ca 5800
nc 4800
nc 3640
sc 3520
va 4490
va 8700
va 2850
va 1111
;

Spark Code Snippet

import sqlContext.implicits._

case class OLD(state: String, accttot: Int)
val oldList = List(
  OLD("va",1111), OLD("ca",7000), OLD("ca",6500), OLD("ca",5800),
  OLD("nc",4800), OLD("nc",3640), OLD("sc",3520), OLD("va",4490),
  OLD("va",8700), OLD("va",2850)
)
OVERVIEW: SAS vs. Spark

SAS Code Snippet (http://support.sas.com/kb/24/595.html)

proc sort data=old;
  by state;
run;

data new;
  set old (drop=accttot);
  by state;
  if first.state then count=0;
  count+1;
  if last.state then output;
run;

proc freq;
  tables state / out=new(drop=percent);
run;

Spark Code Snippet

val oldRDD = sc.parallelize(oldList)
var oldDataFrame = oldRDD.toDF()
oldDataFrame = oldDataFrame.orderBy("state")

// aggregateByKey requires a pair RDD, so key each record by state first
oldRDD.map(o => (o.state, 1))
  .aggregateByKey(0)((buffer, value) => buffer + value, (b1, b2) => b1 + b2)
  .foreach(println)

val newDataFrame = oldDataFrame.groupBy("state").count()
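Outside of Spark and SAS, the count-by-group logic that both snippets implement can be sketched in a few lines of plain Python, using the same data as the slide's `old` dataset:

```python
# Plain-Python sketch of the count-by-group logic from the SAS and Spark
# snippets above; the records match the slide's `old` dataset.
from collections import defaultdict

old = [("va", 1111), ("ca", 7000), ("ca", 6500), ("ca", 5800),
       ("nc", 4800), ("nc", 3640), ("sc", 3520), ("va", 4490),
       ("va", 8700), ("va", 2850)]

counts = defaultdict(int)
for state, _accttot in old:
    counts[state] += 1          # same fold as aggregateByKey's (buffer + value)

print(sorted(counts.items()))   # [('ca', 3), ('nc', 2), ('sc', 1), ('va', 4)]
```

The per-state tallies mirror what `groupBy("state").count()` produces in the Spark snippet and what the `first.state`/`last.state` loop produces in the SAS data step.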
Spark DataFrame vs. SAS Dataset

Spark DataFrame — a distributed collection of data organized into named columns.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#rename-of-schemardd-to-dataframe

SAS Dataset — a SAS file stored in a SAS library, organized as a table of observations (rows) and variables (columns) that can be processed by SAS software.
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001005709.htm
Spark DataFrame vs. SAS Dataset

How does a Spark DataFrame differ from a SAS Dataset?
• Built from the ground up to be distributed and processed in parallel by multiple machines, whereas the SAS Dataset has non-distributed roots
• A logical entity that is not necessarily paired with a serialized on-disk version, whereas a SAS Dataset has an on-disk manifestation
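The "logical entity" point can be illustrated without Spark at all. A Python generator pipeline, like a DataFrame's deferred query plan, describes a computation without materializing anything until it is consumed (a loose analogy only, not Spark's actual machinery):

```python
# Loose analogy for a DataFrame's lazy, logical nature: nothing below
# touches the data until list() forces evaluation, much as a Spark
# DataFrame builds up a query plan that runs only when an action is called.
rows = [("ca", 7000), ("nc", 4800), ("va", 1111)]

pipeline = (state.upper() for state, acct in rows if acct > 2000)  # no work yet

result = list(pipeline)  # evaluation happens here, like df.collect()
print(result)            # ['CA', 'NC']
```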
Spark SQL vs. SAS Proc SQL

• My first reaction to Spark SQL was, "This looks like Proc SQL"
• SAS Proc SQL Simple Example:

libname Example 'c:\SASPROJECTS';

proc sql;
  create table newtable as
  select a.*, b.unique_consumer_id
  from Example.transactions as a, Example.consumer as b
  where a.ref_id=b.ref_id;
quit;
Spark SQL vs. SAS Proc SQL

• Spark SQL Simple Example:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val newtable = sqlContext.sql("""
  select a.*, b.unique_consumer_id
  from transactions as a, consumer as b
  where a.ref_id = b.ref_id""")
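The inner join both examples express can be sketched in plain Python; the tiny tables below are hypothetical stand-ins for the transactions and consumer datasets:

```python
# Plain-Python sketch of the ref_id inner join that both the Proc SQL and
# Spark SQL examples express; tiny hypothetical tables stand in for the
# transactions and consumer datasets.
transactions = [{"ref_id": 1, "amount": 50}, {"ref_id": 2, "amount": 75}]
consumer = [{"ref_id": 1, "unique_consumer_id": "c-100"},
            {"ref_id": 2, "unique_consumer_id": "c-200"}]

# Index one side by the join key -- roughly the hash join a SQL engine
# builds for an equality predicate like a.ref_id = b.ref_id.
by_ref = {c["ref_id"]: c["unique_consumer_id"] for c in consumer}

newtable = [{**t, "unique_consumer_id": by_ref[t["ref_id"]]}
            for t in transactions if t["ref_id"] in by_ref]
print(newtable)
```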
Spark MLlib vs. SAS Stats

• Spark MLlib
  • https://spark.apache.org/docs/latest/mllib-guide.html
  • Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction
• SAS Stats
  • Traditional add-on package to SAS for statistics
Spark MLlib Example Data Prep

case class Meetup(mdatehr: String, mdate: String, mhour: String)
val meetup5 = meetup4.map(p => Meetup(p._1, p._2, p._3))
meetup5.registerTempTable("meetup5")

val meetup6 = sqlContext.sql("select mdate, mhour, count(*) as rsvp_cnt from meetup5 where mdatehr >= '2015-02-15 02' group by mdatehr, mdate, mhour")
meetup6.registerTempTable("meetup6")
sqlContext.sql("cache table meetup6")

// meetup7 is derived from meetup6 in steps elided from the slide
val trainingData = meetup7.map { row =>
  val features = Array[Double](row(24).toString().toDouble, row(0).toString().toDouble, …
  LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))
}
Spark MLlib Example Regression Model

val trainingData = meetup7.map { row =>
  val features = Array[Double](1.0, row(0).toString().toDouble, row(1).toString().toDouble, …
  LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))
}
val model = new RidgeRegressionWithSGD().run(trainingData)

val scores = meetup7.map { row =>
  val features = Vectors.dense(Array[Double](1.0, row(0).toString().toDouble, … row(23).toString().toDouble))
  (row(25), row(26), row(27), model.predict(features))
}
scores.foreach(println)
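RidgeRegressionWithSGD minimizes squared error plus an L2 penalty on the weights. A minimal single-feature sketch of that objective with plain gradient descent on toy data (illustrating the math only, not the MLlib implementation):

```python
# Toy sketch of the ridge objective behind RidgeRegressionWithSGD:
# minimize sum((w*x - y)^2)/n + lam*w^2 by gradient descent.
# Single feature, no intercept; hypothetical data, not MLlib's code.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]           # roughly y = 2x
lam, lr, w = 0.01, 0.01, 0.0

for _ in range(2000):
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad += 2 * lam * w             # L2 (ridge) penalty term
    w -= lr * grad

print(round(w, 2))                  # converges near 2.0
```

The L2 term shrinks the weight slightly below the ordinary least-squares fit; with `lam = 0` this reduces to plain linear regression via SGD, which is the trade-off the ridge variant makes against overfitting.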
Spark Streaming

val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "meetupstream",
  Map("meetupstream" -> 10)).map(_._2)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

lines.foreachRDD(rdd => {
  val lines2 = sqlContext.jsonRDD(rdd)
  lines2.registerTempTable("lines2")
  val lines3 = sqlContext.sql("select event.event_id, event.event_name, event.event_url, event.time, guests, member.member_id, member.member_name, member.other_services.facebook.identifier as facebook_identifier, member.other_services.linkedin.identifier as linkedin_identifier, member.other_services.twitter.identifier as twitter_identifier, member.photo, mtime, response, rsvp_id, venue.lat, venue.lon, venue.venue_id, venue.venue_name, visibility from lines2")
  // PERFORM LOGIC HERE, LIKE STREAMING REGRESSION
})
ssc.start()
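Spark Streaming's model is micro-batching: the Seconds(10) StreamingContext above slices the Kafka stream into small RDDs and runs the same batch logic on each one. A stripped-down Python sketch of that loop, with in-memory (timestamp, payload) events standing in for Kafka messages:

```python
# Stripped-down sketch of micro-batching: slice a stream of (timestamp,
# payload) events into fixed-width windows and run the same batch logic
# on each, as the Seconds(10) StreamingContext does with Kafka messages.
BATCH_SECONDS = 10
events = [(1, "rsvp"), (4, "rsvp"), (12, "rsvp"), (13, "rsvp"), (25, "rsvp")]

batches = {}
for ts, payload in events:
    batches.setdefault(ts // BATCH_SECONDS, []).append(payload)

for window, batch in sorted(batches.items()):
    # per-batch logic would go here (e.g. a streaming regression update)
    print(f"window {window}: {len(batch)} events")
```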
Key Takeaways
• If you work with large data or compute-intensive advanced analytics and want a platform built from the ground up to run fast on distributed servers, try out Spark
• If you would like more control over your code than an added macro language provides, try out Spark
• If you want to better leverage data stored in Hadoop, try out Spark
• If you prefer an open source licensing model over a subscription model, try out Spark
@SVDataScience Richard Williamson richard@svds.com @superhadooper Yes, we’re hiring! info@svds.com


Editor's Notes

  • #5 Retailer Inventory Mgmt
  • #6 SHOW of HANDS for SAS vs Spark development SparkR
  • #9 A Spark Dataframe is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. A SAS Dataset also contains descriptor information such as the data types and lengths of the variables, as well as which engine was used to create the data.
  • #12 Mention addition of Windowing functions in 1.4 and possibly pivot/transpose