Simple big data, in Python Gaël Varoquaux
Simple big data, in Python Gaël Varoquaux This is a lie!
Please allow me to introduce myself I’m a man of wealth and taste I’ve been around for a long, long year Physicist gone bad Neuroscience, Machine learning Worked in a software startup Enthought: scientific computing consulting in Python Coder (done my share of mistakes) Mayavi, scikit-learn, joblib... SciPy community Chair of the SciPy and EuroSciPy conferences Researcher (PI) at INRIA G Varoquaux 2
1 Machine learning in 2 words 2 Scikit-learn 3 Big data on a budget 4 The bigger picture: a community G Varoquaux 3
1 Machine learning in 2 words G Varoquaux 4
1 A historical perspective Artificial Intelligence The 80s Building decision rules Eatable? Mobile? Tall? G Varoquaux 5
1 A historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations G Varoquaux 5
1 A historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations Statistical learning 2000s Model the noise in the observations G Varoquaux 5
1 A historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations Statistical learning 2000s Model the noise in the observations Big data today Many observations, simple rules G Varoquaux 5
1 A historical perspective Artificial Intelligence The 80s Building decision rules Machine learning The 90s Learn these from observations Statistical learning 2000s Model the noise in the observations Big data today Many observations, simple rules “Big data isn’t actually interesting without machine learning” Steve Jurvetson, VC, Silicon Valley G Varoquaux 5
1 Machine learning Example: face recognition Andrew Bill Charles Dave G Varoquaux 6
1 Machine learning Example: face recognition Andrew Bill Charles Dave ?G Varoquaux 6
1 A simple method 1 Store all the known (noisy) images and the names that go with them. 2 From a new (noisy) image, find the stored image that is most similar. “Nearest neighbor” method G Varoquaux 7
1 A simple method 1 Store all the known (noisy) images and the names that go with them. 2 From a new (noisy) image, find the stored image that is most similar. “Nearest neighbor” method How many errors on already-known images? ... 0: no errors Test data = Train data G Varoquaux 7
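A minimal sketch of this nearest-neighbor approach with scikit-learn; the digits dataset is only an illustrative stand-in for the face images:

from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the noisy face images: small labeled images of digits
digits = load_digits()
X, y = digits.data, digits.target

# 1. Store all known images together with their labels
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# 2. For a new image, predict the label of the most similar stored image
print(knn.predict(X[:5]))

# Scoring on images the model has already seen is misleadingly perfect
print(knn.score(X, y))   # 1.0: test data = train data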
1 1st problem: noise Signal unrelated to the prediction problem [Figure: prediction accuracy as a function of noise level] G Varoquaux 8
1 2nd problem: number of variables Finding a needle in a haystack [Figure: prediction accuracy as a function of the useful fraction of the frame] G Varoquaux 9
1 Machine learning Example: face recognition Andrew Bill Charles Dave ? Learning from numerical descriptors Difficulties: i) noise, ii) number of features “supervised” task: known labels “unsupervised” task: unknown labels G Varoquaux 10
1 Machine learning: regression A single descriptor: one dimension x y G Varoquaux 11
1 Machine learning: regression A single descriptor: one dimension x y x y Which model to prefer? G Varoquaux 11
1 Machine learning: regression A single descriptor: one dimension x y x y Problem of “over-fitting” Minimizing error is not always the best strategy (learning noise) Test data = train data G Varoquaux 11
1 Machine learning: regression A single descriptor: one dimension x y x y Prefer simple models = concept of “regularization” Balance the number of parameters to learn with the amount of data G Varoquaux 11
1 Machine learning: regression A single descriptor: one dimension x y Two descriptors: 2 dimensions X_1 X_2 y More parameters G Varoquaux 11
1 Machine learning: regression A single descriptor: one dimension x y Two descriptors: 2 dimensions X_1 X_2 y More parameters ⇒ need more data “curse of dimensionality” G Varoquaux 11
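A hedged illustration of over-fitting versus regularization on a 1D regression; the polynomial degree, the Ridge penalty and the synthetic sine data are arbitrary choices for the sketch:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, size=30))
X = x[:, np.newaxis]
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=30)      # noisy observations

# A very flexible model: degree-15 polynomial, no penalty on the parameters
flexible = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X, y)

# The same features, regularized: an l2 penalty keeps the parameters small
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=0.01)).fit(X, y)

# Fresh data from the same underlying curve: over-fitting typically shows up
# as a large gap between the training score and the score on new data
x_new = np.linspace(0, 1, 500)
y_new = np.sin(2 * np.pi * x_new)
for name, model in [("flexible", flexible), ("regularized", regularized)]:
    print(name, model.score(X, y), model.score(x_new[:, np.newaxis], y_new))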
1 Supervised learning: classification Predicting categories, e.g. numbers X2 X1 G Varoquaux 12
1 Unsupervised learning Stock market structure G Varoquaux 13
1 Unsupervised learning Stock market structure Unlabeled data more common than labeled data G Varoquaux 13
1 Recommender systems G Varoquaux 14
1 Recommender systems Andrew Bill Charles Dave Edie Little overlap between users G Varoquaux 14
1 Machine learning Challenges: statistical and computational G Varoquaux 15
1 Petty day-to-day technicalities Buggy code Slow code Lead data scientist leaves New intern to train I don’t understand the code I wrote a year ago G Varoquaux 16
1 Petty day-to-day technicalities Buggy code Slow code Lead data scientist leaves New intern to train I don’t understand the code I wrote a year ago An in-house data science squad Difficulties Recruitment Limited resources (people & hardware) Risks Bus factor Technical debt G Varoquaux 16
1 Petty day-to-day technicalities Buggy code Slow code Lead data scientist leaves New intern to train I don’t understand the code I wrote a year ago An in-house data science squad Difficulties Recruitment Limited resources (people & hardware) Risks Bus factor Technical debt We need big data (and machine learning) on a tight budget G Varoquaux 16
2 Scikit-learn Machine learning without learning the machinery © Theodore W. Gray G Varoquaux 17
2 My stack Python, what else? Interactive language Easy to read / write General purpose G Varoquaux 18
2 My stack Python, what else? The scientific Python stack numpy arrays pandas ... It’s about plugging things together G Varoquaux 18
2 scikit-learn vision Machine learning for all No specific application domain No requirements in machine learning High-quality software library Interfaces designed for users Community-driven development BSD licensed, very diverse contributors http://scikit-learn.org G Varoquaux 19
2 A Python library A library, not a program More expressive and flexible Easy to include in an ecosystem As easy as py:
from sklearn import svm
classifier = svm.SVC()
classifier.fit(X_train, Y_train)
Y_test = classifier.predict(X_test)
G Varoquaux 20
2 Very rich feature set Supervised learning Decision trees (Random Forest, Boosted Trees) Linear models SVM Unsupervised learning Clustering Dictionary learning Outlier detection Model selection Built-in cross-validation Parameter optimization G Varoquaux 21
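To make the model-selection bullets concrete, a small sketch of cross-validation and grid search; the dataset, the SVM parameters and the grid are illustrative choices, not from the slides:

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, cross_val_score
# (in older scikit-learn versions these live in sklearn.grid_search / sklearn.cross_validation)

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Built-in cross-validation: estimate generalization error without a held-out test set
print(cross_val_score(svm.SVC(gamma=0.001), X, y, cv=5))

# Parameter optimization: exhaustive search over a small grid, scored by cross-validation
search = GridSearchCV(svm.SVC(), {'C': [0.1, 1, 10], 'gamma': [1e-4, 1e-3]}, cv=5)
search.fit(X, y)
print(search.best_params_)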
2 Computational performance (times in seconds, lower is better)
              scikit-learn    mlpy   pybrain   pymvpa     mdp   shogun
SVM                    5.2     9.47      17.5    11.52   40.48     5.63
LARS                   1.17   105.3         -    37.35       -        -
Elastic Net            0.52    73.7         -     1.44       -        -
kNN                    0.57    1.41         -     0.56    0.58     1.36
PCA                    0.18       -         -     8.93    0.47     0.33
k-Means                1.34    0.79         ∞        -   35.75     0.68
Algorithmic optimizations Minimizing data copies G Varoquaux 22
2 Computational performance (times in seconds, lower is better)
              scikit-learn    mlpy   pybrain   pymvpa     mdp   shogun
SVM                    5.2     9.47      17.5    11.52   40.48     5.63
LARS                   1.17   105.3         -    37.35       -        -
Elastic Net            0.52    73.7         -     1.44       -        -
kNN                    0.57    1.41         -     0.56    0.58     1.36
PCA                    0.18       -         -     8.93    0.47     0.33
k-Means                1.34    0.79         ∞        -   35.75     0.68
Algorithmic optimizations Minimizing data copies
[Figure (Gilles Louppe): Random Forest fit time in seconds. Scikit-Learn-RF 203.01, Scikit-Learn-ETs 211.53, OpenCV-RF 4464.65, OpenCV-ETs 3342.83, OK3-RF 1518.14, OK3-ETs 1711.94, Weka-RF 1027.91, R-RF 13427.06, Orange-RF 10941.72. Implementations: Scikit-Learn (Python, Cython), OpenCV (C++), OK3 (C), Weka (Java), randomForest (R, Fortran), Orange (Python).]
G Varoquaux 22
3 Big data on a budget G Varoquaux 23
3 Big(gish) data on a budget “Big data”: Petabytes... Distributed storage Computing cluster Mere mortals: Gigabytes... Python programming Off-the-shelf computers G Varoquaux 23
3 Big(gish) data on a budget “Big data”: Petabytes... Distributed storage Computing cluster Mere mortals: Gigabytes... Python programming Off-the-shelf computers Simple data processing patterns G Varoquaux 23
“Big data”, but big how? 2 scenarios: Many observations (samples), e.g. Twitter Many descriptors per observation (features), e.g. brain scans G Varoquaux 24
3 On-line algorithms Process the data one sample at a time Compute the mean of a gazillion numbers Hard? G Varoquaux 25
3 On-line algorithms Process the data one sample at a time Compute the mean of a gazillion numbers Hard? No: just do a running mean G Varoquaux 25
3 On-line algorithms Converges to expectations Mini-batch = a bunch of observations, for vectorization Example: K-Means clustering
X = np.random.normal(size=(10000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2)                           # 11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)  # 0.62 s
G Varoquaux 25
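For the running mean mentioned above, a minimal sketch: each sample updates the estimate in place, so only the current sample needs to be in memory.

def running_mean(stream):
    """On-line mean: a single pass over the data, O(1) memory."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n      # pull the estimate towards each new sample
    return mean

# Works on any iterable, e.g. a generator streaming values from disk
print(running_mean(range(10 ** 6)))   # 499999.5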
3 On-the-fly data reduction Big data is often I/O bound Layer memory access: CPU caches, RAM, local disks, distant storage Less data also means less work G Varoquaux 26
3 On-the-fly data reduction Dropping data 1 loop: take a random fraction of the data 2 run algorithm on that fraction 3 aggregate results across sub-samplings Looks like bagging: bootstrap aggregation Exploits redundancy across observations Run the loop in parallel G Varoquaux 26
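A sketch of this drop-and-aggregate pattern; the SGD classifier, subsample size and synthetic data are placeholders for illustration:

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(100000, 50))
y = X[:, 0] + 0.5 * rng.normal(size=100000) > 0          # synthetic labels

models = []
for seed in range(10):
    # 1. take a random fraction of the data
    idx = np.random.RandomState(seed).choice(len(X), size=10000, replace=False)
    # 2. run the algorithm on that fraction (each iteration is independent,
    #    so the loop can be run in parallel)
    models.append(SGDClassifier(random_state=seed).fit(X[idx], y[idx]))

# 3. aggregate results across sub-samplings, here by majority vote
votes = np.mean([model.predict(X[:5]) for model in models], axis=0)
print(votes > 0.5)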
3 On-the-fly data reduction Random projections (will average features) sklearn.random_projection: random linear combinations of the features Fast clustering of features sklearn.cluster.WardAgglomeration: on images, a super-pixel strategy Hashing, when observations have varying size (e.g. words) sklearn.feature_extraction.text.HashingVectorizer: stateless, can be used in parallel G Varoquaux 26
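A hedged sketch of two of these reducers in use; the dimensions and the example strings are arbitrary:

import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.feature_extraction.text import HashingVectorizer

# Random projection: replace 10 000 features by 300 random linear combinations
X = np.random.normal(size=(500, 10000))
X_small = SparseRandomProjection(n_components=300).fit_transform(X)
print(X_small.shape)        # (500, 300); pairwise distances approximately preserved

# Hashing: variable-size inputs (e.g. text) mapped to a fixed-size sparse vector,
# with no fitted state, so it can be applied to chunks in parallel
vectorizer = HashingVectorizer(n_features=2 ** 18)
counts = vectorizer.transform(["big data on a budget", "simple big data in Python"])
print(counts.shape)         # (2, 262144), sparse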
3 On-the-fly data reduction Example: randomized SVD Random projection sklearn.utils.extmath.randomized_svd
X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
G Varoquaux 26
3 Data-parallel computing I tackle only embarrassingly parallel problems Life is too short to worry about deadlocks Stratification to follow the statistical dependencies and the data storage structure Batch size scaled by the relevant cache pool - Too fine ⇒ overhead - Too coarse ⇒ memory shortage joblib.Parallel G Varoquaux 27
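A minimal joblib.Parallel sketch for such an embarrassingly parallel loop; the per-task function and the number of jobs are placeholders:

import numpy as np
from joblib import Parallel, delayed

def work_on_chunk(seed):
    # placeholder for "run the algorithm on one chunk / sub-sample of the data"
    rng = np.random.RandomState(seed)
    return rng.normal(size=100000).mean()

# Each task is independent (embarrassingly parallel): no shared state, no deadlocks
results = Parallel(n_jobs=4)(delayed(work_on_chunk)(seed) for seed in range(8))
print(results)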
3 Caching Minimizing data-access latency Never computing the same thing twice joblib.Memory.cache G Varoquaux 28
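The joblib.Memory.cache decorator in action, as a sketch; the cache directory and the SVD stand-in computation are arbitrary choices:

import numpy as np
from joblib import Memory

memory = Memory('/tmp/joblib_cache', verbose=0)   # the directory argument name differs across joblib versions

@memory.cache
def slow_transform(X):
    # stand-in for an expensive, deterministic computation
    return np.linalg.svd(X, full_matrices=False)

X = np.random.normal(size=(2000, 200))
slow_transform(X)   # computed, result written to disk
slow_transform(X)   # same input: loaded back from the cache, not recomputed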
3 Fast access to data Stored representation consistent with access patterns Compression to limit bandwidth usage - CPUs are faster than data access Data access speed is often more of a limitation than raw processing power joblib.dump/joblib.load G Varoquaux 29
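A sketch of persisting a large array with joblib; the file path and compression level are arbitrary choices:

import numpy as np
import joblib

X = np.random.normal(size=(10000, 500))

# Persist large numpy arrays efficiently; compression trades CPU for I/O bandwidth
joblib.dump(X, '/tmp/X.joblib', compress=3)

# Read it back (uncompressed files can also be memory-mapped with mmap_mode='r')
X_back = joblib.load('/tmp/X.joblib')
print(np.allclose(X, X_back))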
3 Biggish iron Our new box: 15 k€ 48 cores 384G RAM 70T storage (SSD cache on RAID controller) Gets our work done faster than our 800 CPU cluster It’s the access patterns! “Nobody ever got fired for using Hadoop on a cluster” A. Rowstron et al., HotCDP ’12 G Varoquaux 30
4 The bigger picture: a community Helping your future self G Varoquaux 31
4 Community-based development in scikit-learn Huge feature set: benefits of a large team Project growth: More than 200 contributors ∼ 12 core contributors 1 full-time INRIA programmer from the start Estimated cost of development: $6 million COCOMO model, http://www.ohloh.net/p/scikit-learn G Varoquaux 32
4 The economics of open source Code maintenance too expensive to be alone scikit-learn ∼ 300 email/month nipy ∼ 45 email/month joblib ∼ 45 email/month mayavi ∼ 30 email/month “Hey Gael, I take it you’re too busy. That’s okay, I spent a day trying to install XXX and I think I’ll succeed myself. Next time though please don’t ignore my emails, I really don’t like it. You can say, ‘sorry, I have no time to help you.’ Just don’t ignore.” G Varoquaux 33
4 The economics of open source Code maintenance too expensive to be alone scikit-learn ∼ 300 email/month nipy ∼ 45 email/month joblib ∼ 45 email/month mayavi ∼ 30 email/month Your “benefits” come from a fraction of the code Data loading? Maybe? Standard algorithms? Nah Share the common code... ...to avoid dying under code Code becomes less precious with time And somebody might contribute features G Varoquaux 33
4 Many eyes make code fast L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer G Varoquaux 34
4 Communities increase the knowledge pool Even if you don’t do software, you should worry about communities More experts in the package ⇒ Easier recruitment The future is open ⇒ Enhancements are possible Meetups and conferences are where new ideas are born G Varoquaux 35
4 6 steps to a community-driven project 1 Focus on quality 2 Build great docs and examples 3 Use GitHub 4 Limit the technicality of your codebase 5 Releasing and packaging matter 6 Focus on your contributors; give them credit and decision power http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire G Varoquaux 36
4 Core project contributors Credit: Fernando Perez, Gist 5843625 G Varoquaux 37
4 The tragedy of the commons Individuals, acting independently and rationally accord- ing to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource. Wikipedia Make it work, make it right, make it boring Core projects (boring) taken for granted ⇒ Hard to fund, less excitement They need citation, in papers & on corporate web pages G Varoquaux 38
Simple big data, in Python Beyond the lie Machine learning gives value to (big) data Python + scikit-learn = - from interactive data processing (IPython notebook) - to crazy big problems (Python + Spark) Big data will require you to understand the data-flow patterns (access, parallelism, statistics) The big data community addresses the human factor @GaelVaroquaux
