Open Source Software for Data Scientists
Charlie Greenbacker, Director of Data Science
02 Apr 2014
Altamira Technologies Corporation 2014
Agenda
■  What is a Data Scientist?
■  Why use Open Source Software?
■  Survey of Open Source Software Tools:
    ¤  Statistical Analysis
    ¤  Data Mining
    ¤  Machine Learning
    ¤  Natural Language Processing
    ¤  Social Network Analysis
    ¤  Data Visualization
About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable
photo: Columbia Pictures
Best reason for not finishing PhD
@ExploreAltamira
What is a Data Scientist?
credit: Drew Conway (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
“A data scientist is someone who understands the domains of programming, machine learning, data mining, statistics, and hacking”
Paul Cooper, ITProPortal.com
http://www.itproportal.com/2014/02/11/how-to-pick-a-data-scientist-the-right-way/
[Diagram: Data Science. Labels: Computer Programming; Mathematics & Analytic Methodology; Distributed Computing & Big Data; Domain Knowledge & Communication Skills; Statistical Analysis, Data Mining, Machine Learning, Natural Language Processing, Social Network Analysis, Data Visualization, etc.]
Why use Open Source Software?
THERE ARE NO SILVER BULLETS.
photo: Karen (https://flic.kr/p/5njby2)
IF YOUR BOSS BUYS SOMETHING, YOU DAMN WELL BETTER USE IT.
photo: Paul Inkles (https://flic.kr/p/e2QMS5)
BUDGETS DON’T SCALE.
photo: Valugi (http://bit.ly/1jrvVBC)
Survey of OSS Tools
Statistical Analysis
■  Name: R
■  Creator: Gentleman, Ihaka, et al.
■  License: GPL Version 2
■  Website: r-project.org
■  Source: cran.us.r-project.org/src/base/
■  Features:
    ¤  Language & environment for statistical computing & viz
    ¤  Linear and nonlinear modeling, classical statistical tests, time-series analysis, graphical techniques, and more…
    ¤  5000+ packages available in CRAN repository
Data Mining
■  Name: Pandas
■  Creator: Wes McKinney, et al.
■  License: BSD 3-Clause License
■  Website: pandas.pydata.org
■  Source: github.com/pydata/pandas
■  Features:
    ¤  Data analysis workflow in Python
    ¤  DataFrame object for fast manipulation & indexing
    ¤  Tools for reading & writing data between formats
    ¤  Label-based slicing, indexing, and subsetting of data
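To make the Pandas features listed above concrete, here is a minimal sketch, not taken from the talk. The table contents and the variable names (`csv_text`, `tools`, and so on) are invented for illustration; only `read_csv`, `.loc`, boolean masks, and `to_json` are actual pandas calls.

```python
import io

import pandas as pd

# Hypothetical data: the column names and rows are invented for this sketch.
csv_text = """tool,category,language
R,Statistical Analysis,R
Pandas,Data Mining,Python
Impala,Data Mining,C++
NetworkX,Social Network Analysis,Python
"""

# Read CSV text into a DataFrame and use the "tool" column as the row labels.
tools = pd.read_csv(io.StringIO(csv_text), index_col="tool")

# Label-based indexing: pick one cell by row label and column label.
pandas_language = tools.loc["Pandas", "language"]

# Label-based slicing: unlike positional slices, .loc includes both endpoints.
first_three = tools.loc["R":"Impala"]

# Subsetting with a boolean mask.
python_tools = tools[tools["language"] == "Python"]

# Writing to another format: serialize the subset as a JSON list of records.
python_json = python_tools.reset_index().to_json(orient="records")
```

The same DataFrame could be written back out with `to_csv`, which is roughly what "reading & writing data between formats" amounts to in day-to-day use.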
Data Mining
■  Name: Impala
■  Creator: Cloudera
■  License: Apache License 2.0
■  Website: impala.io
■  Source: github.com/cloudera/impala
■  Features:
    ¤  MPP query engine implemented on Hadoop
    ¤  Low latency, high concurrency SQL & BI queries
    ¤  Same interfaces as Apache Hive, but ~24x faster
    ¤  Written in C++; does not use MapReduce
Machine Learning
■  Name: Mahout
■  Creator: ASF
■  License: Apache License 2.0
■  Website: mahout.apache.org
■  Source: svn.apache.org/viewvc/mahout
■  Features:
    ¤  Distributed/scalable ML library for Hadoop
    ¤  Classification, Clustering, Collaborative filtering
    ¤  Logistic regression, naïve Bayes, random forest, neural networks, HMM, k-means, SVD, PCA, ALS, LDA, etc.
Machine Learning
■  Name: Scikit-learn
■  Creator: Cournapeau, et al.
■  License: BSD 3-Clause License
■  Website: scikit-learn.org
■  Source: github.com/scikit-learn/scikit-learn
■  Features:
    ¤  ML library for Python built on NumPy, SciPy, matplotlib
    ¤  Support for classification, clustering, dimensionality reduction, regression, model selection, preprocessing
    ¤  SVM, k-NN, PCA, NNMF, crossval, feature extraction, ...
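As a rough illustration of how several of the Scikit-learn pieces named above (PCA, k-NN, cross-validation) fit together, the sketch below chains them on the bundled iris dataset. This is an assumed example, not one from the slides, and it uses the current `sklearn.model_selection` module, which postdates the talk; the parameter values are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Small example dataset that ships with scikit-learn: 150 flowers, 4 features.
X, y = load_iris(return_X_y=True)

# Dimensionality reduction followed by classification, as a single estimator.
# n_components=2 and n_neighbors=5 are illustrative values, not tuned ones.
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validation: returns one accuracy score per held-out fold.
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
```

Wrapping both steps in a pipeline means the PCA projection is refit on each training fold, so the held-out fold never influences the transformation.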
Machine Learning + NLP
■  Name: Mallet
■  Creator: UMass (McCallum, et al.)
■  License: Common Public License 1.0
■  Website: mallet.cs.umass.edu
■  Source: hg-iesl.cs.umass.edu/hg/mallet
■  Features:
    ¤  Java-based “Machine Learning for Language Toolkit”
    ¤  Document classification, clustering, topic modeling, information extraction & sequence tagging, etc.
    ¤  Efficient implementation of LDA for topic modeling
Natural Language Processing
■  Name: NLTK
■  Creator: Bird, Loper, et al.
■  License: Apache License 2.0
■  Website: nltk.org
■  Source: github.com/nltk/nltk
■  Features:
    ¤  Natural Language Toolkit for Python
    ¤  Built-in support for dozens of corpora & trained models
    ¤  Libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning
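Two of the NLTK capabilities listed above, tokenization and stemming, can be tried without downloading any corpora or models. The sketch below is an assumed example with an invented sentence; it sticks to `RegexpTokenizer` and `PorterStemmer` because neither needs extra data files.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical input sentence, invented for this sketch.
sentence = "Data scientists are parsing and tagging texts"

# Tokenization: split on runs of word characters (a deliberately simple rule).
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(sentence)

# Stemming: strip common English suffixes with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]
```

Tagging and parsing work along the same lines but depend on trained models, which NLTK fetches separately through `nltk.download`.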
Natural Language Processing
■  Name: Stanford CoreNLP
■  Creator: Stanford NLP Group
■  License: GPL Version 2
■  Website: nlp.stanford.edu/software/corenlp.shtml
■  Source: github.com/stanfordnlp/CoreNLP
■  Features:
    ¤  Suite of high-quality, Java-based NLP tools
    ¤  Includes POS tagger, named entity recognizer, parser, coreference resolution, sentiment analysis, SUTime, etc.
    ¤  Includes models for English, Chinese, Arabic, German
NLP + Geospatial Analysis
■  Name: CLAVIN
■  Creator: Berico Technologies
■  License: Apache License 2.0
■  Website: clavin.io
■  Source: github.com/Berico-Technologies/CLAVIN
■  Features:
    ¤  Extracts location names from text, resolves to gazetteer
    ¤  Employs context-based geospatial entity resolution
    ¤  ~75% accuracy, processes 1M documents per hour
    ¤  Built on Hadoop, CoreNLP, OpenNLP, GeoNames.org
Social Network Analysis
■  Name: NetworkX
■  Creator: Los Alamos National Lab
■  License: BSD 3-Clause License
■  Website: networkx.github.io
■  Source: github.com/networkx/networkx
■  Features:
    ¤  Python structures for graphs, digraphs, & multigraphs
    ¤  Support for creating, manipulating, & analyzing the structure, dynamics, & functions of complex networks
    ¤  Provides standard graph algorithms & analysis metrics
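The NetworkX features above can be illustrated with a toy social network. This is a minimal sketch under stated assumptions: the people and ties are invented, and the choice of metrics (degree, betweenness, shortest path) is just one plausible reading of "standard graph algorithms & analysis metrics".

```python
import networkx as nx

# Hypothetical friendship network; the names and ties are invented.
G = nx.Graph()
G.add_edges_from([
    ("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"),  # a closed triangle
    ("Cat", "Dan"),                                   # Cat links the triangle to Dan
    ("Dan", "Eve"),                                   # Eve is reachable only via Dan
])

# Degree: number of direct ties per person.
degree = dict(G.degree())

# Betweenness centrality: share of shortest paths that pass through each node
# (normalized by default).
betweenness = nx.betweenness_centrality(G)
most_central = max(betweenness, key=betweenness.get)

# A standard graph algorithm: shortest path between two people.
path = nx.shortest_path(G, "Ann", "Eve")
```

Swapping `nx.Graph` for `nx.DiGraph` or `nx.MultiGraph` gives the directed and multi-edge structures mentioned in the first feature bullet.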
Social Network Analysis
■  Name: Gephi
■  Creator: UTC France
■  License: GPL Version 3
■  Website: gephi.org
■  Source: github.com/gephi/gephi
■  Features:
    ¤  Network analysis and visualization package for Java
    ¤  Dynamic network analysis with temporal filtering
    ¤  Metrics include: community detection, betweenness, closeness, clustering coefficient, PageRank, etc.
Data Visualization
■  Name: D3.js
■  Creator: Mike Bostock
■  License: BSD 3-Clause License
■  Website: d3js.org
■  Source: github.com/mbostock/d3
■  Features:
    ¤  JavaScript library based on HTML, SVG, and CSS
    ¤  Binds data to DOM & enables transformations
    ¤  ~200 examples, including: force-directed graphs, choropleths, treemaps, dendrograms, animations, etc.
Fusion, Analysis, and Visualization
■  Name: Lumify
■  Creator: Altamira
■  License: Apache License 2.0
■  Website: lumify.io
■  Source: github.com/altamiracorp/lumify
■  Features:
    ¤  Built on Hadoop, Storm, Accumulo, Elasticsearch, etc.
    ¤  Integrates structured data, text, images, video
    ¤  Cell-level security & access controls
    ¤  Live, shared collaborative workspaces
Final Thought…
Save your $$$ for:
¨  People
    ¤  salaries, training, etc.
¨  Resources
    ¤  hardware, AWS, etc.
¨  Proprietary software
    ¤  if no viable OSS alternative exists
photo: Brett Weinstein (http://bit.ly/1dHXvqJ)
Springer’s FINAL THOUGHT
open source software for data scientists oss4ds.com
Charlie Greenbacker | @greenbacker oss4ds.com

Open Source Software for Data Scientists -- Great Wide Open 2014
