Array computing and the evolution of SciPy, NumPy, and PyData

Array Computing and the Evolution of SciPy, NumPy, and PyData
Travis E. Oliphant, PhD
Distinguished Lecture, Columbia University, February 13, 2020
travis@quansight.com, travis@openteams.com, @teoliphant

Published: February 3, 2020 Project Started: 1998 Patience and Persistence and Grit

SciPy, NumPy, and PyData Time-Line (figure; years marked: 1991, 1998, 2001, 2003, 2005, 2006, 2008, 2009, 2010, 2012, 2014, 2015, 2016, 2018)

Started my career in computational science Satellites Measure Backscatter Computer Algorithms Produce Estimate of Earth Features • Wind Speed • Ice Cover • Vegetation • (and more)

More Science led to Python Raja Muthupillai Armando Manduca Richard Ehman 1997 Jim Greenleaf

First Project (1998 — ) Started as Multipack in 1998 and became SciPy in 2001 with the help of other colleagues 115 releases, 815 contributors Used by: 156,525

SciPy: “Distribution of Python Numerical Tools masquerading as one Library”
Name         Description
cluster      KMeans and Vector Quantization
fftpack      Discrete Fourier Transform
integrate    Numerical Integration
interpolate  Interpolation routines
io           Data Input and Output
linalg       Fast Linear algebra
misc         Utilities
ndimage      N-dimensional Image processing
odr          Orthogonal Distance Regression
optimize     Constrained and Unconstrained Optimization
signal       Signal Processing Tools
sparse       Sparse Matrices and Algebra
spatial      Spatial Data Structures and Algorithms
special      Special functions (e.g. Bessel)
stats        Statistical Functions and Distributions
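To make the "many tools in one library" point concrete, here is a minimal sketch that touches two of the subpackages listed above, integrate and optimize. The integrand and the equation being solved are arbitrary illustrations chosen for this note, not examples from the talk.

```python
import numpy as np
from scipy import integrate, optimize

# scipy.integrate: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error_estimate = integrate.quad(np.sin, 0.0, np.pi)

# scipy.optimize: solve x**2 - 2 = 0 starting from the guess x = 1.
root = optimize.fsolve(lambda x: x**2 - 2.0, x0=1.0)[0]

print(area, root)
```

Each subpackage has its own conventions for inputs and return values, which is part of why the library reads like a distribution of separate tools.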

Professor at BYU Scanning Impedance Imaging

My Open Source addiction continued… Gave up my chance at a tenured academic position in 2005-2006 to bring together the diverging array community in Python and bring Numeric and Numarray together. 166 releases, 866 contributors Used by: 314,759

NumPy: an Array Extension of Python • Data: the array object – slicing and shaping – data-type map to bytes (dtype) • Fast Math (ufuncs): – vectorization – broadcasting – aggregations
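The terms on this slide can be illustrated with a short sketch. The array contents below are arbitrary values chosen for this note; only standard NumPy calls are used.

```python
import numpy as np

# The array object: one block of memory plus a dtype that maps bytes to values.
a = np.arange(12, dtype=np.int32).reshape(3, 4)   # 12 items * 4 bytes = 48 bytes

# Slicing and shaping return views on the same memory.
first_row = a[0]        # row 0
last_col = a[:, -1]     # last column
flat = a.reshape(-1)    # same 12 values seen as 1-D

# ufuncs apply element by element without a Python loop (vectorization).
roots = np.sqrt(a)

# Broadcasting: the (4,) vector of column means is stretched across the (3, 4) array.
centered = a - a.mean(axis=0)

# Aggregations reduce along an axis (or over everything).
total = a.sum()
row_max = a.max(axis=1)
```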

Brief History of NumPy
Person                                      Package        Year
Jim Fulton                                  Matrix Object  1994
Jim Hugunin                                 Numeric        1995
Perry Greenfield, Rick White, Todd Miller   Numarray       2001
Travis Oliphant                             NumPy          2005

NumPy was created to unify array objects in Python and unify PyData community Numeric Numarray NumPy I started this unification project and ended up sacrificing my tenure at a University to write and release NumPy.

My little “side projects” became my life

Making “Array Oriented Programming” Popular renamed ~20 million (Ana)conda users spun-out

Past 5 years have seen a resurgence of array-oriented computing because of… Machine Learning and AI

Java JavaScript Python Google Search Trends Jun 2019

NumPy Tensorflow Scikit Learn PyTorch NumPy Pandas

Python and in particular PyData keeps Growing

Python’s Scientific Ecosystem Bokeh Jake Vanderplas PyCon 2017 Keynote

Not all open-source is the same! (Governance models)
Community-Driven Open Source Software (CDOSS)
• Anyone can become the leader.
• Multiple stakeholders.
• Can look at community size for health.
• Users become contributors more often.
• Examples: Jupyter, NumPy, SciPy, Pandas
Company-Backed Open Source Software (CBOSS)
• Need to work at a company to be the leader.
• Many users, fewer developers.
• Need to understand incentive of company to understand health.
• Examples: Tensorflow, PyTorch, Conda
Both can be valuable, but have different implications!

Huge Impact (from diverse efforts of 1000s) LIGO: Gravitational Waves Higgs Boson Discovery Black Hole Imaging

Example — Amazon Photo Automatic Facial recognition User feedback on face names updates model

Neural network with several layers trained with ~130,000 images. Matched trained dermatologists with 91% area under sensitivity-specificity curve. Keys: • Access to Data • Access to Software • Access to Compute

Python has taken over! Thanks to 1000s of my “closest” friends who worked on all the libraries We won! (sort of)

Downloads: 49 Million; Estimated Cost: $7.57 Million; Contributors: 866; Estimated Effort: 76 person-years; 3 Current Maintainers
Downloads: 27.7 Million; Estimated Cost: $7 Million; Contributors: 1,666; Estimated Effort: 70 person-years; 3 Current Maintainers
Downloads: 13.8 Million; Estimated Cost: $6.63 Million; Contributors: 860; Estimated Effort: 64 person-years; 2 Current Maintainers
Development began in 2003. Development began in 2005. Development began in 2008.
The original developers were not paid to work on or improve these libraries!

OSS Sustainability • Developers get “burned-out” when many people use their tools but there is no money to maintain or improve them. • Developers can live unbalanced lives. • Multi-billion dollar companies are benefiting from volunteer labor and not giving back. • Foundational libraries are not maintained and key insights from creators don’t get back into the code.

For example: Here was my list for NumPy in 2012
• NDArray improvements
– Indexes (esp. for Structured arrays)
– SQL front-end
– Multi-level, hierarchical labels
– selection via mappings (labeled arrays)
– Memory spaces (array made up of regions)
– Distributed arrays (global array)
– Compressed arrays
– Standard distributed persistence
– fancy indexing as view and optimizations
– streaming arrays
• Dtype improvements
– Enumerated types (including dynamic enumeration)
– Derived fields
– Specification as a class (or JSON)
– Pointer dtype (i.e. C++ object, or varchar)
– Finishing datetime
– Missing data with both bit-patterns and mask
– Parameterized field names
• Ufunc improvements
– Generalized ufuncs support more than just contiguous arrays
– Specification of ufuncs in Python
– Move most dtype “array functions” to ufuncs
– Unify error-handling for all computations
– Allow lazy-evaluation and remote computation: streaming and generator data
– Structured and string dtype ufuncs
– Multi-core and GPU optimized ufuncs
– Group-by reduction

Multiple other unrealized epiphanies…
• In 2014, I finally realized how I should have built dtypes: inheriting from a new “meta-type” so that all NumPy “dtypes” are actually real Python types. This would have eliminated the need for the “ugly” array-scalars (which are semantically necessary in the current system).
• NumPy should have a smaller interface API that other array libraries could implement, instead of the entire API becoming a de facto array API.
• GPU and parallel-executing ufuncs should be built in.
• Apply-by and reduce-by should be NumPy functions.
I’ve never received budget to work on NumPy or SciPy (until this year with a CZI grant from Facebook). Part of this is because I pursued other entrepreneurial mechanisms to generate resources, but part of this is because granting mechanisms are not set up to “maintain” community-driven open-source software.

Major Conclusions: 1. Python is the “Lingua Franca” for technical computing and machine learning / AI. 2. Python reached this status because it embraced array-oriented computing (NumPy and Pandas). 3. “Emergent” community-driven open source has a sustainability problem. We have (basically) realized the ultimate goal we had when we started SciPy in 1999! But we are still searching for the means to sustain it.

What is array-oriented computing • Organize data together logically (and in memory) • Operate on “chunks” at a time with high-level operations: (map, join, reduce, transform, apply, filter)
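A minimal sketch of the "operate on chunks" idea, using three of the high-level operations named above (map, filter, reduce). The temperature readings are made-up sample values for illustration only.

```python
import numpy as np

# Data organized together: one contiguous array of readings (hypothetical values).
temps_c = np.array([12.5, 18.0, 21.5, 9.0, 30.0])

# map: transform every element with one expression.
temps_f = temps_c * 9.0 / 5.0 + 32.0

# filter: select a chunk with a boolean mask.
warm = temps_f[temps_f > 60.0]

# reduce: collapse the chunk to a single value.
mean_warm = warm.mean()
```

No element is visited by an explicit Python loop; each line describes what happens to the whole chunk.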

Memory using Object-oriented Object Attr1 Attr2 Attr3 Object Attr1 Attr2 Attr3 Object Attr1 Attr2 Attr3 Object Attr1 Attr2 Attr3 Object Attr1 Attr2 Attr3 Object Attr1 Attr2 Attr3

Array-oriented (Table) approach Attr1 Attr2 Attr3 Object1 Object2 Object3 Object4 Object5 Object6
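The two memory pictures can be sketched side by side. The Particle class and its field names are hypothetical, introduced only for this illustration; the structured array is one way (an assumption of this sketch, not necessarily what the slide depicts) to hold the same six records as a table in a single buffer.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Particle:  # hypothetical record type with three attributes
    x: float
    y: float
    mass: float


# Object-oriented: six separate Python objects, each allocated on its own.
objects = [Particle(float(i), 2.0 * i, 1.0 + i) for i in range(6)]
total_mass_objects = sum(p.mass for p in objects)

# Array-oriented: one buffer of six 24-byte records; each column is a strided view.
table = np.zeros(6, dtype=[("x", "f8"), ("y", "f8"), ("mass", "f8")])
table["x"] = np.arange(6)
table["y"] = 2.0 * np.arange(6)
table["mass"] = 1.0 + np.arange(6)
total_mass_table = table["mass"].sum()
```

Selecting a column such as table["mass"] does not copy data; it steps through the buffer 24 bytes at a time.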

Benefits of Array-oriented • Many technical problems are naturally array-oriented (easy to vectorize) • Algorithms can be expressed at a high-level • These algorithms can be parallelized more simply (quite often much information is lost in the translation to typical “compiled” languages) • Array-oriented algorithms map to modern hardware caches and pipelines. • Software stack now starting to re-focus with ML frameworks emerging. • There is a reason Fortran remains popular.

NumPy Examples 2d array 3d array [439 472 477] [217 205 261 222 245 238] 9.98330639789 2.96677717122

NumPy Slicing (Selection)
The array a (6 x 6):
 0  1  2  3  4  5
10 11 12 13 14 15
20 21 22 23 24 25
30 31 32 33 34 35
40 41 42 43 44 45
50 51 52 53 54 55
>>> a[0,3:5]
array([3, 4])
>>> a[4:,4:]
array([[44, 45],
       [54, 55]])
>>> a[:,2]
array([2,12,22,32,42,52])
>>> a[2::2,::2]
array([[20, 22, 24],
       [40, 42, 44]])
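The slide shows the 6 x 6 array only as a picture. One way to build an array with the same values (this construction is an assumption; the talk does not show how a was created) and reproduce the four selections:

```python
import numpy as np

# Entry (i, j) holds 10*i + j, matching the grid on the slide.
a = np.arange(6)[:, None] * 10 + np.arange(6)

print(a[0, 3:5])      # row 0, columns 3 and 4
print(a[4:, 4:])      # bottom-right 2 x 2 corner
print(a[:, 2])        # all of column 2
print(a[2::2, ::2])   # rows 2 and 4, every other column
```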

Quick History
1966: APL, Ken Iverson (IBM)
1984: APL2, IBM
1990: J, Ken Iverson
1993: K -> Q, Arthur Whitney (used by KDB)
2019: (new version of K), Arthur Whitney
1996: Numeric (Python), Jim Hugunin
2006: NumPy, Travis Oliphant
2012: Numba, Siu Kwan Lam
Lineage: APL, J, K, Matlab, Numeric, NumPy
life←{↑1 ⍵∨.∧3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵}

Putting Science back in Comp Sci • Much of the software stack is for systems programming --- C++, Java, .NET, ObjC, web • This has been great for desktop computing but terrible for science: - Complex numbers? - Vectorized primitives? - Multidimensional arrays? • Array-oriented programming was supplanted by Object-oriented programming • Software stack for scientists was not as helpful as it should be • Fortran is still where many scientists ended up • Past 5 years this is changing with emergence of Python, Jupyter, Pandas, PyTorch (we still have a long way to go).

Array-Oriented Computing Example 1: Fibonacci Numbers
f_n = f_{n-1} + f_{n-2}, with f_0 = 0 and f_1 = 1
f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, . . .

Common Python approaches Recursive Iterative Algorithm matters!!
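The code for this slide did not survive extraction. A minimal sketch of what the two approaches typically look like, with function names chosen for this note (fib_recursive and fib_iterative are not from the talk):

```python
def fib_recursive(n):
    """Direct translation of the recurrence; the number of calls grows exponentially."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Carry the last two values forward; takes n steps."""
    previous, current = 0, 1
    for _ in range(n):
        previous, current = current, previous + current
    return previous


print([fib_iterative(i) for i in range(10)])
```

Both return the same numbers, but the recursive version recomputes the same subproblems over and over, which is the sense in which the algorithm matters.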

Array-oriented approaches Using LFilter Using Formula
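The code for this slide is also missing from the extraction. The sketch below is one plausible reading of the two labels: "LFilter" is taken to mean scipy.signal.lfilter, treating the recurrence as a linear filter driven by a unit impulse, and "Formula" is taken to mean the closed-form (Binet) expression. The function names are hypothetical, and both functions assume n >= 1.

```python
import numpy as np
from scipy.signal import lfilter


def fib_lfilter(n):
    """First n Fibonacci numbers as the impulse response of y[k] = y[k-1] + y[k-2] + x[k-1]."""
    impulse = np.zeros(n)
    impulse[0] = 1.0
    b = [0.0, 1.0]          # numerator: delays the impulse by one sample
    a = [1.0, -1.0, -1.0]   # denominator: encodes y[k] - y[k-1] - y[k-2]
    return np.rint(lfilter(b, a, impulse)).astype(np.int64)


def fib_formula(n):
    """First n Fibonacci numbers from the closed form (phi**k - psi**k) / sqrt(5)."""
    k = np.arange(n, dtype=np.float64)
    sqrt5 = np.sqrt(5.0)
    phi = (1.0 + sqrt5) / 2.0
    psi = (1.0 - sqrt5) / 2.0
    return np.rint((phi**k - psi**k) / sqrt5).astype(np.int64)


print(fib_lfilter(10))
print(fib_formula(10))
```

Both work in double precision, so they are only trustworthy while the values stay well inside the range where floating point represents integers exactly.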

Conway’s Game of Life • A dead cell with exactly 3 live neighbors will come to life • A live cell with 2 or 3 neighbors will survive • With too few or too many neighbors, the cell dies

Conway’s Game of Life APL NumPy Initialization Update Step life←{↑1 ⍵∨.∧3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵}
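The NumPy code from this slide is not in the extracted text. The sketch below is a reconstruction in the spirit of the APL one-liner, not the talk's own code: it sums the nine shifted copies of the board (the cell plus its eight neighbors, with wrap-around edges) and keeps a cell when that sum is 3, or when it is 4 and the cell is already alive. The name life_step and the 5 x 5 blinker are choices made for this note.

```python
import numpy as np


def life_step(board):
    """One generation on a wrap-around grid of 0s and 1s."""
    cells = np.asarray(board, dtype=np.int64)
    # Sum of the cell itself and its eight neighbors, via nine shifted copies.
    total = sum(
        np.roll(np.roll(cells, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
    )
    # total == 3: birth (dead + 3 neighbors) or survival (alive + 2 neighbors).
    # total == 4 and alive: survival with 3 neighbors.
    alive = (total == 3) | ((cells == 1) & (total == 4))
    return alive.astype(np.int64)


# Initialization: a horizontal "blinker" in the middle of a 5 x 5 board.
board = np.zeros((5, 5), dtype=np.int64)
board[2, 1:4] = 1

# Update step: the blinker flips to vertical, then back again.
after_one = life_step(board)
after_two = life_step(after_one)
```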

Zen of NumPy • strided is better than scattered • contiguous is better than strided • descriptive is better than imperative • array-oriented is better than object-oriented • broadcasting is a great idea • vectorized is better than an explicit loop • unless it’s too complicated or uses too much memory, then use Numba • think in higher dimensions Inspired by Tim Peters and “import this”
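The first two lines of the Zen refer to memory layout. A small sketch of the difference between a contiguous array and a strided view, using an arbitrary 3 x 4 array chosen for this note:

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Contiguous: rows sit back to back; 32 bytes to the next row, 8 to the next item.
row_major_strides = a.strides

# Strided: a column is a view that jumps 32 bytes between elements.
col = a[:, 1]
col_is_view = np.shares_memory(col, a)
col_is_contiguous = col.flags["C_CONTIGUOUS"]

# Packing the column into its own buffer makes it contiguous again (a copy).
packed = np.ascontiguousarray(col)
```

A strided view still describes its elements with a fixed step, which is what lets NumPy loop over it in compiled code; a Python list of separately allocated objects has no such regular layout.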

What is good about NumPy? • Array-oriented • Extensive Dtype System (including structures) • C-API • Simple to understand data-structure • Memory mapping • Syntax support from Python • Large community of users • Broadcasting • Easy to interface C/C++/Fortran code

What is wrong with NumPy • Dtype system is difficult to extend • Immediate mode creates huge temporaries • “Almost” an in-memory database comparable to SQLite (missing indexes) • Integration with sparse arrays • Lots of un-optimized parts • Minimal support for multi-core / GPU • Code-base is organic and hard to extend • Tied to CPython run-time (doesn’t work on other Python implementations)
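The "huge temporaries" point can be seen in a single expression. In the sketch below (array sizes and values are arbitrary), the first form allocates a full-size intermediate for a * b before adding c; the second form uses the out= argument of the ufuncs to reuse one buffer. This is a common workaround, not a fix for the underlying design.

```python
import numpy as np

n = 100_000
a = np.ones(n)
b = np.full(n, 2.0)
c = np.full(n, 3.0)

# Immediate mode: a * b is materialized as a temporary array, then + c
# allocates a second array for the final result.
result = a * b + c

# Manual buffer reuse: one allocation, both operations write into it.
out = np.empty_like(a)
np.multiply(a, b, out=out)
np.add(out, c, out=out)
```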

Python Origins
Version  Date
0.9.0    Feb. 1991
0.9.4    Dec. 1991
0.9.6    Apr. 1992
0.9.8    Jan. 1993
1.0.0    Jan. 1994
1.2      Apr. 1995
1.4      Oct. 1996
1.5.2    Apr. 1999

How I got involved… Getting data into memory — fast! http://www.python.org/doc/essays/refcnt/ Reference Counting Essay May 1998 Guido van Rossum TableIO April 1998 Michael A. Miller NumPyIO June 1998

How SciPy started… Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999. In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.
Package                                      Date
Gaussian quadrature                          5 Jan 1999
cephes 1.0                                   30 Jan 1999
sigtools 0.40                                23 Feb 1999
Numeric docs                                 March 1999
cephes 1.1                                   9 Mar 1999
multipack 0.3                                13 Apr 1999
Helper routines                              14 Apr 1999
multipack 0.6 (leastsq, ode, fsolve, quad)   29 Apr 1999
sparse plan described                        30 May 1999
multipack 0.7                                14 Jun 1999
SparsePy 0.1                                 5 Nov 1999
cephes 1.2 (vectorize)                       29 Dec 1999

Joined with others… Started as Multipack in 1998 and became SciPy in 2001 with the help of other colleagues 115 releases, 815 contributors Used by: 156,525

Don’t underestimate the importance of Team! Anaconda success also depended on going from individual to a team >700 contributors

Other People Matter Know your model is incomplete: • see people as “ends” not your “means” • Believe in, love, and trust other people. The Social Brain Hypothesis and Human Evolution, Robin I. M. Dunbar Use your brain to adapt to other people — this is why your brain is so big! Hypothesis: You carry and update “models of people" in your head. From very detailed to approximate. Dunbar numbers!

Keep Open Mind: Be open to critique. dtype vs. ctypes: the PEP 3118 debate over how to describe memory. Current me disagrees with past me! I am glad there were others in the debate.

Return good for evil Hard because of our brains!

Explosion of ML Frameworks and libraries TVM/NNVM
https://github.com/josephmisiti/awesome-machine-learning#python-general-purpose
http://deeplearning.net/software_links/
http://scikit-learn.org/stable/related_projects.html

We have a “divided” community again! Numeric Numarray NumPy

Examples of packages being built on differing standards FastAI skorch Pyro Eduard anyrl Braid PyMC4 Horovod MLFlow But note

Unification Efforts Train the Model Deploy the Model Platform1 Platform 2 Deploy the Model Platform 3

NNVM / TVM — Ambitious Plan at UW

What is next? What am I working on for the next 20 years…

Technology and Economic problems 1. General interoperability: low-level libraries that reduce silos of data and analysis 2. Better High-level APIs (more interfaces in Python supported by multiple implementations) 3. Data Management, in particular Data Catalogues 4. Fixing Python’s Extension problem (the ecosystem helped Python grow but is also an anchor to its progress) 5. How to connect the trillions of dollars of market capital to the innovation available in global, emergent, open-source communities.

High Level APIs for Arrays (Tensors), DataFrames, and DataTypes LABS

The extensions are an anchor to Python runtime progress! CPython C-API

What will work! • Create a statically typed subset of Python that is then used to extend Python — EPython • Port NumPy, SciPy, Scikits to EPython (borrow heavily from Cython ideas but use mypy-style typing instead of new syntax).
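The slide does not show EPython code, and nothing below should be read as its actual design. As a purely illustrative sketch of the annotation style being referred to, here is ordinary Python with mypy-style type hints, the kind of information a statically typed subset could in principle compile from without introducing new syntax. The function dot is a hypothetical example written for this note.

```python
from typing import List


def dot(x: List[float], y: List[float]) -> float:
    """Inner product written so that every name has a declared static type."""
    total: float = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

The annotations are standard Python, so the same source runs unchanged under the regular interpreter, which is the contrast with approaches that add their own declaration syntax.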

LABS Sustaining the Future
Open-source innovation and maintenance around the entire data-science and AI workflow.
• NumPy ecosystem maintenance (PyData Core Team)
• Improve connection of NumPy to ML Frameworks
• GPU Support for NumPy Ecosystem
• Improve foundations of Array computing
• JupyterLab and JupyterHub
• Data Catalog standards
• Packaging (conda-forge, PyPA, etc.)
PySparse: sparse n-d arrays
Ibis: Pandas-like front-end to SQL
uarray: unified array interface for SciPy refactor
xnd: re-factored NumPy (low-level cross-language libraries for N-D (tensor) computing)
Collaborating with NumFOCUS!
Bokeh Adapted from Jake Vanderplas PyCon 2017 Keynote

Build and Connect Companies and Communities to Solve Challenging Problems with Data. Enables me to keep working on array-computing problems *and* the meta-problem of open-source funding.

Three Activities with One Mission
Services: Complete open-source service consulting in the PyData / NumFOCUS ecosystem, including data-science and ML. We provide part-time CTO work, custom software, staff augmentation, support, training, staffing, and mentoring.
Open Source Lab: Open Source Research Lab supporting the NumFOCUS and PyData Community. Hiring developers, evangelists, tech writers, designers, and product managers for open-source projects.
Venture Fund: Early stage funding to companies that provide return to investors and support open source ecosystems with industry disrupting products and services.

Some of the projects we support Sparse Fast Foundational ND-Array (Tensor) object for Python Extensive Library of Functions for NumPy GPU-enabled Compiler for NumPy/Python Parallel and Scaled Pandas and NumPy DataFrames for general data-manipulation and statistics and Notebook environments for rapid development and data analysis Desktop IDE for data-science and ML Rapid development of Dashboards for Python/PyData ecosystem. Easy and fast web-based interactive plots using Python. Turn even very large datasets into images, accurately. General Sparse Arrays for Python Cross-language libraries for array computing General and powerful symbolic mathematical library Very popular and powerful machine learning library

An early stage venture capital firm investing in startups that build on open-source technology and support the communities they depend on (11 companies) supporting FairOSS $20m fund

Problem
Open Source Teams
• Burned out
• Underrepresented
• Underpaid
Organizations
• Disconnected from the Community
• Lack support and maintenance
There’s no easy way to connect the community with organizations

Open Source Marketplace
Managing Partners
• Provide Open Source Services
• Training / Support
• Feature development / fixes
Funding Partners
• Hire from the community
• Collectively fund
• Get support they need to build effectively on open-source.
Open-source Contributors create profiles for themselves and their projects and participate as actors in the market.

FairOSS
A Public Benefit Company (goal is growing amount of freely available software)
• Owned by open-source contributors (will be doing a public fund-raise later this year)
• Those share-holders govern the organization (elect the board).
• Board appoints management and decides what is “fair”
Holds Companies accountable
• Allows usage of its trademarks only for companies that contribute back “fairly”
• Think “Kosher” or “Organic labeled”
• Companies give back by equity, revenue, and “in-kind” agreements with FairOSS
FairOSS is custodian of Revenue and Equity Agreements
• Equity agreements mean that FairOSS holds shares, options, or warrants of the company (most companies are missing open-source community from their ‘cap-table’)
• Revenue agreements mean that companies pay FairOSS a portion of their revenue.
• FairOSS distributes almost all of the proceeds from these agreements to the open-source communities.
If successful, this would make open source investable and make available >$23,000,000,000,000 (trillion) of investment capital to open-source communities.

You can really change the world… With Open Source Communities… Let’s do more of that!
