Functional Programming and Big Data Dan Marshall Big Data Applications Engineer June 11, 2015 © 2015 Wells Fargo Bank, N.A. All rights reserved. For public use.
11 Agenda  Functional vs Other Models  Functional Thinking and Big Data  Case Study
22 Functional Programming  It’s new and shiny! It’s the latest thing!  Not so much… – Mathematical foundations introduced with lambda calculus in the 1930s – Lisp language created at MIT in 1958 – Google based much of original Map/Reduce research on well-established Functional Programming concepts  “Pure” Functional Languages – Common Lisp, Haskell, F#, Erlang – Can be mathematically proven to be correct (as opposed to “Design Patterns”)  Hybrids – Scala, Clojure  Non-functional (mostly – with increasing FP capabilities) – C, C++, Java, Python This presentation focuses on real life application – not academic theory. “Think functionally” is most important to start.
33 Functional Programming  First-class functions – Functions are “first-class citizens” and can be used anywhere in the code – Function can be an argument to another function – or returned as a result from a function  Pure functions – No side effects – Facilitates parallel and/or concurrent operations and optimizations  Recursion – Iteration done through recursion (as opposed to looping)  Immutability – Avoid changing state and avoid mutable data  Declarative (vs Imperative) – What to do, not how to do it http://en.wikipedia.org/wiki/Functional_programming
44 Simple Scala Example
55 Simple Scala Example List of Hadoop Summit attendees Define method Function literal Pass function Pass alternate function
66 Python Example
77 Pure Functions Y = f(x) 1. The function always evaluates the same result value given the same argument value(s). The function result value cannot depend on any hidden information… 2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects… http://en.wikipedia.org/wiki/Pure_function  Parallelization – work in parallel, gather results  Referentially Transparent – results can be cached and/or reused safely  Lazy evaluation – Because of guarantee of no side effect, may be able to skip function
88 Recursion • Important concept in computer science – many nuances – big topic • For this discussion, think recursion vs iteration • Recursion in FP allows for: • “Iteration” without mutating • “Tail call” optimization to avoid blowing out the stack • Simpler code Bad: for (int x = 0 ; x < len(myList); x++) { myList2(x) = doSomething with myList… } Good: for (var x in myList): produce myList2 Better: myList2 = map(doSomething, myList)
99 Immutability • Once an object/structure is created it cannot be changed • Thread safety is no longer an issue • Mutations produce side effects – (some you don’t know about) • Code is more maintainable – it just does what it does for (int x = 0 ; x < len(myList); x++) { myList2(x) = doSomething with myList } Oops! – “myList” is altered by another thread during the loop Thought for the day: HDFS is broken – I can’t change a file! Hurry up and fix it! (Are you sure?)
1010 Data-Centric Applications • Functional Programming important in data- centric applications • Big Data tends to be highly data-centric
1111 Hadoop Map Reduce Mapper Mapper Mapper Mapper Mapper Mapper Mapper Mapper Mapper “Function” (jar file) HDFS Ship function to data list2 = map(function, list1) Shuffle/ Sort Reducer Reducer Reducer Reducer Reducer Reducer list3 = list2.reduceLeft(_ max _) FP & Big Data  Immutability  Recursion  Functions  Parallelism  No side effects
1212 <OPINION> Gen 1 Gen 2 Gen 3 Transform Map/Reduce (Java) Spark (Scala) Sparkling (Clojure) Data Flow Pig Cascading (Java) Scalding (Scala) Cascalog (Clojure) CEP Storm (Java) ScalaStorm (Scala) Storm (Clojure DSL) Messaging Kafka (Java) Kafka(Scala) Kafka (Clojure) </OPINION> Gravitate to same language Increasingly Functional Over Time Big Data Framework “disappears” into Language
1313 Case Study  Hadoop job to scan transactions within date range specified as argument to script  Output a record for each transaction within the range which has: – “closed_date” within the range specified – add “type” of “CLOSED” – “open_date” within the range specified – add “type” of “OPENED” – if one record has both dates within the range specified, output 2 records – 1 with each “type” Cust ID Open Date Close Date 101 2014/02/01 null 202 2014/01/12 2014/02/22 303 2014/02/02 2014/02/28 404 2014/01/04 2014/03/14 Cust ID Open Date Close Date Type 101 2014/02/01 null OPENED 202 2014/01/12 2014/02/22 CLOSED 303 2014/02/02 2014/02/28 OPENED 303 2014/02/02 2014/02/28 CLOSED Range specified: 2014/02/01 through 2014/02/28
1414 Case Study – Version 1  Bash script date arithmetic to create multiple FOR LOOPs – In LOOP executes “OPENED” Pig script – once per day in range – In LOOP # 2, executes “CLOSED” Pig script – once per day in range – So…for this example, a Pig script called 56 times  Pig script: – FILTER input BY (close or open) date for respective script – Call Java UDF for transformation – assigns “Type” depending on Pig script version  Outputs merged to same directory outside Pig  Results were 100% correct from business standpoint  To process a range of 1 month ~16 hours
1515 Case Study – Version 2  Bash script passes date range to Pig script, calls 1 script 1 time  Pig script: – FILTER input BY close or open date within range (either or both match) – Call Java UDF for transformation – assigns one or two “Type” depending on date values  Results compared to “Version 1” code – matched exactly….unless I ran late in the day?!  Side effect!  Results were 100% correct from business standpoint  To process a range of 1 month ~30 minutes  3% of Version 1 runtime
Functional Programming and Big Data Dan Marshall Big Data Applications Engineer June 11, 2015 © 2015 Wells Fargo Bank, N.A. All rights reserved. For public use.

Functional Programming and Big Data

  • 1.
    Functional Programming and BigData Dan Marshall Big Data Applications Engineer June 11, 2015 © 2015 Wells Fargo Bank, N.A. All rights reserved. For public use.
  • 2.
    11 Agenda  Functional vsOther Models  Functional Thinking and Big Data  Case Study
  • 3.
    22 Functional Programming  It’snew and shiny! It’s the latest thing!  Not so much… – Mathematical foundations introduced with lambda calculus in the 1930s – Lisp language created at MIT in 1958 – Google based much of original Map/Reduce research on well-established Functional Programming concepts  “Pure” Functional Languages – Common Lisp, Haskell, F#, Erlang – Can be mathematically proven to be correct (as opposed to “Design Patterns”)  Hybrids – Scala, Clojure  Non-functional (mostly – with increasing FP capabilities) – C, C++, Java, Python This presentation focuses on real life application – not academic theory. “Think functionally” is most important to start.
  • 4.
    33 Functional Programming  First-classfunctions – Functions are “first-class citizens” and can be used anywhere in the code – Function can be an argument to another function – or returned as a result from a function  Pure functions – No side effects – Facilitates parallel and/or concurrent operations and optimizations  Recursion – Iteration done through recursion (as opposed to looping)  Immutability – Avoid changing state and avoid mutable data  Declarative (vs Imperative) – What to do, not how to do it http://en.wikipedia.org/wiki/Functional_programming
  • 5.
  • 6.
    55 Simple Scala Example Listof Hadoop Summit attendees Define method Function literal Pass function Pass alternate function
  • 7.
  • 8.
    77 Pure Functions Y =f(x) 1. The function always evaluates the same result value given the same argument value(s). The function result value cannot depend on any hidden information… 2. Evaluation of the result does not cause any semantically observable side effect or output, such as mutation of mutable objects… http://en.wikipedia.org/wiki/Pure_function  Parallelization – work in parallel, gather results  Referentially Transparent – results can be cached and/or reused safely  Lazy evaluation – Because of guarantee of no side effect, may be able to skip function
  • 9.
    88 Recursion • Important conceptin computer science – many nuances – big topic • For this discussion, think recursion vs iteration • Recursion in FP allows for: • “Iteration” without mutating • “Tail call” optimization to avoid blowing out the stack • Simpler code Bad: for (int x = 0 ; x < len(myList); x++) { myList2(x) = doSomething with myList… } Good: for (var x in myList): produce myList2 Better: myList2 = map(doSomething, myList)
  • 10.
    99 Immutability • Once anobject/structure is created it cannot be changed • Thread safety is no longer an issue • Mutations produce side effects – (some you don’t know about) • Code is more maintainable – it just does what it does for (int x = 0 ; x < len(myList); x++) { myList2(x) = doSomething with myList } Oops! – “myList” is altered by another thread during the loop Thought for the day: HDFS is broken – I can’t change a file! Hurry up and fix it! (Are you sure?)
  • 11.
    1010 Data-Centric Applications • FunctionalProgramming important in data- centric applications • Big Data tends to be highly data-centric
  • 12.
    1111 Hadoop Map Reduce Mapper Mapper Mapper Mapper Mapper Mapper Mapper Mapper Mapper “Function” (jarfile) HDFS Ship function to data list2 = map(function, list1) Shuffle/ Sort Reducer Reducer Reducer Reducer Reducer Reducer list3 = list2.reduceLeft(_ max _) FP & Big Data  Immutability  Recursion  Functions  Parallelism  No side effects
  • 13.
    1212 <OPINION> Gen 1 Gen2 Gen 3 Transform Map/Reduce (Java) Spark (Scala) Sparkling (Clojure) Data Flow Pig Cascading (Java) Scalding (Scala) Cascalog (Clojure) CEP Storm (Java) ScalaStorm (Scala) Storm (Clojure DSL) Messaging Kafka (Java) Kafka(Scala) Kafka (Clojure) </OPINION> Gravitate to same language Increasingly Functional Over Time Big Data Framework “disappears” into Language
  • 14.
    1313 Case Study  Hadoopjob to scan transactions within date range specified as argument to script  Output a record for each transaction within the range which has: – “closed_date” within the range specified – add “type” of “CLOSED” – “open_date” within the range specified – add “type” of “OPENED” – if one record has both dates within the range specified, output 2 records – 1 with each “type” Cust ID Open Date Close Date 101 2014/02/01 null 202 2014/01/12 2014/02/22 303 2014/02/02 2014/02/28 404 2014/01/04 2014/03/14 Cust ID Open Date Close Date Type 101 2014/02/01 null OPENED 202 2014/01/12 2014/02/22 CLOSED 303 2014/02/02 2014/02/28 OPENED 303 2014/02/02 2014/02/28 CLOSED Range specified: 2014/02/01 through 2014/02/28
  • 15.
    1414 Case Study –Version 1  Bash script date arithmetic to create multiple FOR LOOPs – In LOOP executes “OPENED” Pig script – once per day in range – In LOOP # 2, executes “CLOSED” Pig script – once per day in range – So…for this example, a Pig script called 56 times  Pig script: – FILTER input BY (close or open) date for respective script – Call Java UDF for transformation – assigns “Type” depending on Pig script version  Outputs merged to same directory outside Pig  Results were 100% correct from business standpoint  To process a range of 1 month ~16 hours
  • 16.
    1515 Case Study –Version 2  Bash script passes date range to Pig script, calls 1 script 1 time  Pig script: – FILTER input BY close or open date within range (either or both match) – Call Java UDF for transformation – assigns one or two “Type” depending on date values  Results compared to “Version 1” code – matched exactly….unless I ran late in the day?!  Side effect!  Results were 100% correct from business standpoint  To process a range of 1 month ~30 minutes  3% of Version 1 runtime
  • 17.
    Functional Programming and BigData Dan Marshall Big Data Applications Engineer June 11, 2015 © 2015 Wells Fargo Bank, N.A. All rights reserved. For public use.