Visualizing big data in the browser using Spark
Hossein Falaki (@mhfalaki)
Spark Summit East – March 18, 2015
Exploratory Visualization
Put visualization back in the normal workflow of data analysis, regardless of data size.
“Critical part of data analysis” —William S. Cleveland
• Interactive
• Collaborative
• Reproducible
Expository Visualization
Communication is often the bottleneck in data science, and a graph is worth a thousand words.
• Control over details
• Shareable
Requirements
• Interactive
• Collaborative
• Shareable
• Reproducible
• Control over details
Meeting these requirements takes two ingredients: use visualization libraries, and use the browser.
Visualization as programming
• For complex tasks, point and click may not be enough
• Best expressed with a grammar (API)
• Scripts are reproducible
• Control over all details
• Data scientists are already familiar with these tools
D3.js, Three.js, matplotlib, ggplot, Bokeh, Vincent, …
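As a minimal sketch of "visualization as code," here is what a scripted plot looks like with matplotlib, one of the libraries listed above. The data values, labels, and output filename are made up for illustration.

```python
# A minimal "visualization as code" sketch with matplotlib.
# The data here is invented purely for illustration.
import matplotlib.pyplot as plt

hours = list(range(24))
requests = [120, 90, 80, 75, 70, 95, 140, 260, 410, 520, 560, 580,
            600, 590, 570, 540, 500, 520, 610, 650, 480, 330, 220, 160]

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(hours, requests, marker="o")
ax.set_xlabel("hour of day")
ax.set_ylabel("requests")
ax.set_title("Traffic by hour")
fig.savefig("traffic.svg")   # SVG/PNG output drops straight into the browser
```

Because the plot is a script rather than a sequence of clicks, it is reproducible, shareable, and gives control over every detail.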
Do it in the browser
• Output of these tools can be readily used on the web (PNG, SVG, Canvas, WebGL)
• No need to transfer data and results
• The browser is conducive to collaboration (e.g., notebooks)
• Separating data manipulation from rendering enables users to freely choose the best tool for each job
Challenges with big data visualization
1. Manipulating large data can take a long time
2. We have more data points than pixels
Apache Spark can help solve both problems.
Challenges
1. Manipulating large data can take a long time
> Memory
> CPU
Reducing latency: caching
Take advantage of the memory and storage hierarchy
• Serialized storage levels (for memory)
• Memory & GC tuning
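A minimal caching sketch in PySpark, assuming an existing SparkContext `sc`; the input path and parsing step are hypothetical placeholders.

```python
# Cache the working set so interactive queries hit memory instead of
# re-reading and re-parsing the input on every action.
from pyspark import StorageLevel

lines = sc.textFile("/data/events")             # hypothetical input path
events = lines.map(lambda l: l.split("\t"))     # placeholder parsing step

# In PySpark, cached partitions are always stored serialized (pickled);
# on the JVM side you would pick a serialized storage level such as
# MEMORY_ONLY_SER explicitly to trade CPU for a smaller memory footprint.
events.persist(StorageLevel.MEMORY_AND_DISK)

events.count()   # first action materializes the cache
events.count()   # subsequent queries return in interactive time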
Reducing latency: parallelism
Increase the number of CPUs
> Get more executors with Mesos or YARN
> Click a button to increase cluster size in DBC
• Control the level of parallelism for map and reduce tasks
• Configure Spark locality if needed
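Illustrative knobs for these settings in PySpark; the values and path below are placeholders, not tuning advice.

```python
# Controlling parallelism: configuration, input partitions, and per-operation
# reduce-side partition counts.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("viz-prep")
        .set("spark.default.parallelism", "200")   # default reduce-side parallelism
        .set("spark.locality.wait", "1s"))         # how long to wait for data-local tasks
sc = SparkContext(conf=conf)

# Map-side parallelism: request more input partitions (or repartition later).
lines = sc.textFile("/data/events", minPartitions=200)   # hypothetical path
pairs = lines.map(lambda l: (l.split("\t")[0], 1))

# Reduce-side parallelism can also be set per operation.
counts = pairs.reduceByKey(lambda a, b: a + b, numPartitions=200)
```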
Challenges
1. Manipulating large data can take a long time
2. We have more data points than possible pixels
> Summarize
> Model
> Sample
More data than pixels? Summarize
• Extensively used by BI tools
> Aggregation
> Pivoting
• Most data scientists’ nightly jobs summarize data
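A minimal aggregation sketch: reduce millions of raw rows to a handful of summary rows and only collect the summary. It assumes a SparkSession `spark` and a hypothetical "events" table with `country`, `device`, and `latency_ms` columns.

```python
# Aggregate in the cluster; only the small summary crosses the wire to the
# browser-side plot.
from pyspark.sql import functions as F

summary = (spark.table("events")
           .groupBy("country")
           .agg(F.count("*").alias("n"),
                F.avg("latency_ms").alias("avg_latency")))

rows = summary.orderBy(F.desc("n")).limit(20).collect()

# Pivoting follows the same pattern, e.g.:
# pivoted = spark.table("events").groupBy("country").pivot("device").count()
```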
More data than pixels? Model
MLlib supports a large (and growing) set of distributed algorithms:
• Clustering: k-means, GMM, LDA
• Classification and regression: linear models (LM), decision trees (DT), naive Bayes (NB)
• Dimensionality reduction: SVD, PCA
• Collaborative filtering: ALS
• Correlation, hypothesis testing
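As a sketch of modeling instead of plotting raw data: fit an MLlib k-means model and visualize the handful of cluster centers rather than millions of points. The input path and file format below are hypothetical.

```python
# Replace millions of raw points with k cluster centers, which are
# trivially small enough to plot in the browser.
from pyspark.mllib.clustering import KMeans

points = sc.textFile("/data/points.csv") \
           .map(lambda l: [float(v) for v in l.split(",")])   # hypothetical format

model = KMeans.train(points, k=10, maxIterations=20)

centers = model.clusterCenters   # 10 small vectors instead of the full dataset
```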
More data than pixels? Sample
Extensively used in statistics. Spark offers native support for:
• Approximate and exact sampling
• Approximate and exact stratified sampling
Approximate sampling is faster and is good enough in most cases.
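A minimal sampling sketch, assuming a SparkContext `sc`; the path, key field, and per-key fractions are made up for illustration.

```python
# Approximate, exact, and stratified sampling with the RDD API.
lines = sc.textFile("/data/events")                       # hypothetical path
pairs = lines.map(lambda l: (l.split("\t")[0], l))        # keyed by first field

# Approximate sampling: fast; result size is only approximately 1% of the data.
approx = lines.sample(withReplacement=False, fraction=0.01, seed=42)

# Exact-size sampling: returns exactly 10,000 elements to the driver.
exact = lines.takeSample(False, 10000, seed=42)

# Approximate stratified sampling: per-key fractions (every key present in the
# data must appear in `fractions`; these keys are invented).
fractions = {"US": 0.01, "DE": 0.05, "SG": 0.5}
strat = pairs.sampleByKey(withReplacement=False, fractions=fractions, seed=42)

# Exact stratified sampling is exposed as sampleByKeyExact on the Scala/Java
# API; it makes extra passes over the data and is correspondingly slower.
```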
Demo
Summary
Using Spark we can extend interactive visualization to large data.
Reduce interaction latency to seconds:
> Cache data in memory
> Increase parallelism
To visualize millions of points in the browser:
> Summarize
> Model
> Sample