Wes McKinney introduced pandas, a Python data analysis library built on NumPy. Pandas provides data structures and tools for cleaning, manipulating, and working with relational and time-series data. Key features include DataFrame for 2D data, hierarchical indexing, merging and joining data, and grouping and aggregating data. Pandas is used heavily in financial applications and has over 1500 unit tests, ensuring stability and reliability. Future goals include better time series handling and integration with other Python data science packages.
Me • Recovering mathematician • 3 years in the quant finance industry • Last 2: statistics + freelance + open source • My new company: Lambda Foundry • High productivity data analysis and research tools for quant finance
In the works • Python for Data Analysis: Agile Tools for Real World Data, by Wes McKinney • Pragmatic intro to scientific Python • pandas • Case studies • ETA: Late 2012
pandas? • http://pandas.pydata.org • Rich relational data tool built on top of NumPy • Like R’s data.frame on steroids • Excellent performance • Easy-to-use, highly consistent API • A foundation for data analysis in Python
pandas • In heavy production use in the financial industry, among others • Generally much better performance than other open source alternatives (e.g. R) • Hope: basis for the “next generation” statistical computing and analysis environment
Simplifying data wrangling • Data munging / preparation / cleaning / integration is slow, error prone, and time consuming • Everyone already <3’s Python for data wrangling: pandas takes it to the next level
Battle tested • >98% line coverage as measured by coverage.py • v0.3.0 (2/19/2011): 533 test functions • v0.7.3dev (3/27/2012): >1500 test functions
IPython • Simply put: one of the hottest Python projects out there • Tab completion, introspection, interactive debugger, command history • Designed to enhance your productivity in every way. I can’t live without it • IPython HTML notebook is #winning
Series • Subclass of numpy.ndarray • Data: any type • Index labels need not be ordered • Duplicates are possible (but result in reduced functionality)
[Diagram: an index column with labels A, B, C, D, E next to a values column holding 5, 6, 12, -5, 6.7]
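A minimal sketch of the Series properties listed above, using the labels and values from the slide's diagram. It assumes a current pandas release; the `unordered` and `dup` Series are made up for illustration.

```python
import pandas as pd

# Values and index labels as shown in the slide's diagram.
s = pd.Series([5, 6, 12, -5, 6.7], index=["A", "B", "C", "D", "E"])

# Label-based lookup.
print(s["C"])

# Index labels need not be ordered (hypothetical labels).
unordered = pd.Series([1, 2, 3], index=["z", "a", "m"])
print(unordered["a"])

# Duplicate labels are allowed; a lookup then returns every match
# as a Series instead of a single scalar.
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])
print(dup["a"])
```

With duplicate labels, `dup["a"]` is no longer a scalar, which is one sense in which duplicates reduce functionality.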
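DataFrame • NumPy array-like • Each column can have a different type • Row and column index • Size mutable: insert and delete columns
[Diagram: a DataFrame with row index A, B, C, D, E and columns foo, bar, baz, qux holding integers (0, 4, 8, -12, 16), strings (x, y, z, w, a), floats with a missing value (2.7, 6, 10, NA, 18), and booleans (True, True, False, False, False)]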
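As a rough illustration of the bullets above, the sketch below builds a frame from the diagram's values. The pairing of column names (foo, bar, baz, qux) with value columns is an assumption read off the flattened diagram, and `double_foo` is a hypothetical column added only to show insertion.

```python
import numpy as np
import pandas as pd

# Column-name-to-values pairing is assumed from the slide's diagram.
df = pd.DataFrame(
    {
        "foo": [0, 4, 8, -12, 16],
        "bar": ["x", "y", "z", "w", "a"],
        "baz": [2.7, 6, 10, np.nan, 18],
        "qux": [True, True, False, False, False],
    },
    index=["A", "B", "C", "D", "E"],
)

# Each column keeps its own dtype.
print(df.dtypes)

# Size mutable: insert a column (hypothetical name), then delete one.
df["double_foo"] = df["foo"] * 2
del df["qux"]
print(df.columns.tolist())
```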
DataFrame
In [10]: tips[:10]
Out[10]:
    total_bill   tip     sex smoker  day    time  size
1        16.99  1.01  Female     No  Sun  Dinner     2
2        10.34  1.66    Male     No  Sun  Dinner     3
3        21.01  3.50    Male     No  Sun  Dinner     3
4        23.68  3.31    Male     No  Sun  Dinner     2
5        24.59  3.61  Female     No  Sun  Dinner     4
6        25.29  4.71    Male     No  Sun  Dinner     4
7         8.77  2.00    Male     No  Sun  Dinner     2
8        26.88  3.12    Male     No  Sun  Dinner     4
9        15.04  1.96    Male     No  Sun  Dinner     2
10       14.78  3.23    Male     No  Sun  Dinner     2
DataFrame • Axis indexing enables rich data alignment, joins / merges, reshaping, selection, etc.
day              Fri    Sat    Sun   Thur
sex    smoker
Female No      3.125  2.725  3.329  2.460
       Yes     2.683  2.869  3.500  2.990
Male   No      2.500  3.257  3.115  2.942
       Yes     2.741  2.879  3.521  3.058
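A table shaped like the one above (two row levels, one column level) can be produced with `pivot_table`. The sketch below is illustrative only: the six rows are made-up data, not the tips dataset, and the choice of `mean` as the aggregation is an assumption about how the slide's table was computed.

```python
import pandas as pd

# Hypothetical data with the same column names as the tips example.
tips = pd.DataFrame(
    {
        "sex": ["Female", "Female", "Male", "Male", "Male", "Female"],
        "smoker": ["No", "Yes", "No", "No", "Yes", "No"],
        "day": ["Sun", "Sat", "Sun", "Sun", "Sat", "Sat"],
        "tip": [1.0, 2.0, 3.0, 5.0, 4.0, 2.5],
    }
)

# Rows are indexed by (sex, smoker), columns by day; each cell is a mean tip.
table = tips.pivot_table(
    values="tip", index=["sex", "smoker"], columns="day", aggfunc="mean"
)
print(table)
```

Combinations with no observations (for example Female / Yes / Sun here) show up as missing values.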
Axis indexing, the special pandas-flavored sauce • Enables “alignment-free” programming • Prevents major source of data munging frustration and errors • Fast data selection • Powerful way of describing reshape / join / merge / pivot-table operations
Data alignment • Binary operations are joins!
    B 1       A 0       A NA
    C 2   +   B 1   =   B 2
    D 3       C 2       C 4
    E 4       D 3       D 6
                        E NA
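The addition in the diagram can be reproduced directly. This sketch assumes a current pandas release; the `fill_value` variant at the end is an extra not shown on the slide.

```python
import pandas as pd

# The two operands from the diagram, with partially overlapping labels.
left = pd.Series([1, 2, 3, 4], index=["B", "C", "D", "E"])
right = pd.Series([0, 1, 2, 3], index=["A", "B", "C", "D"])

# Addition aligns on labels first (an outer join of the two indexes);
# labels present on only one side produce a missing value.
total = left + right
print(total)

# Not on the slide: treat a label missing from one side as 0 instead.
filled = left.add(right, fill_value=0)
print(filled)
```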
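GroupBy
[Diagram: split-apply-combine. A table with keys A, B, C, A, B, C, A, B, C and values 0, 5, 10, 5, 10, 15, 10, 15, 20 is split by key into groups A (0, 5, 10), B (5, 10, 15), and C (10, 15, 20); sum is applied to each group; the results are combined into A 15, B 30, C 45]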
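The split-apply-combine diagram maps onto a one-line `groupby` call. The sketch below uses the diagram's keys and values; the column names `key` and `value` are assumptions, and the manual loop is there only to make the three steps visible.

```python
import pandas as pd

# Keys and values as read from the diagram; column names are hypothetical.
df = pd.DataFrame(
    {
        "key": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
        "value": [0, 5, 10, 5, 10, 15, 10, 15, 20],
    }
)

# Split by key, apply sum to each group, combine into one Series.
sums = df.groupby("key")["value"].sum()
print(sums)

# The same three steps written out by hand.
manual = pd.Series(
    {key: group["value"].sum() for key, group in df.groupby("key")}
)
print(manual)
```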
Hierarchical indexes • Semantics: a tuple at each tick • Enables easy group selection • Terminology: “multiple levels” • Natural part of GroupBy and reshape operations
[Diagram: a two-level index with outer labels A and B; inner labels 1, 2, 3 under A and 1, 2, 3, 4 under B]
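A small sketch of "a tuple at each tick", assuming the diagram shows outer labels A and B over inner labels 1 to 3 and 1 to 4. The level names `outer` and `inner` and the values are made up for illustration.

```python
import pandas as pd

# Each index entry is a tuple: (outer label, inner label).
index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("B", 3), ("B", 4)],
    names=["outer", "inner"],  # hypothetical level names
)
s = pd.Series(range(7), index=index)

# Group selection: everything under outer label "A".
print(s.loc["A"])

# A full tuple selects a single value.
print(s.loc[("B", 2)])

# Reshape: move the inner level into columns.
print(s.unstack())
```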
Hmm, pandas/src • ~6000 lines of mostly Cython code • Fast data algorithms that power the library • pandas in PyPy?
Ok, so why Python? • Look around you! • Build a superior data analysis and statistical computing environment • Build mission-critical, data-driven production systems
The pandas roadmap • Improved time series capabilities • Port GroupBy engine to NumPy only • Better integration with statsmodels and scikit-learn • R integration via rpy2
ggplot2 for Python • We need to build a better interface for creating statistical graphics in Python • Use pandas as the base layer! • Upcoming project from Peter Wang: bokeh
pandas for “Big Data” • Quite common to need to process larger-than-RAM data sets • Alternate DataFrame backends are the likely solution • Ripe for integration with MapReduce frameworks
Better time series • Integration of scikits.timeseries codebase • NumPy datetime64 dtype • Higher performance, less memory
Better time series • Fixed frequency handling • Time zones • Multiple time concepts • Intervals: 1984, or “1984 Q4” • Timestamps: moment in time, to micro- or nanosecond resolution
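To make the interval / timestamp distinction concrete, here is a sketch using `Period` and `Timestamp` as they exist in later pandas releases. The slide describes a roadmap, so this is not the API as of the talk, and the specific dates are made up.

```python
import pandas as pd

# Intervals: a span of time such as a year or a quarter.
year = pd.Period("1984")
quarter = pd.Period("1984Q4")
print(year.start_time, quarter.start_time, quarter.end_time)

# Timestamps: a single moment, here with microsecond resolution.
ts = pd.Timestamp("1984-11-15 09:30:00.000001")
print(quarter.start_time <= ts <= quarter.end_time)

# Time zones: attach a zone to a naive timestamp.
utc = ts.tz_localize("UTC")
print(utc)

# Fixed frequency: a regular daily range.
days = pd.date_range("2012-03-27", periods=3, freq="D")
print(days)
```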