Python for brain mining: (neuro)science with state-of-the-art machine learning and data visualization — Gaël Varoquaux
1. Data-driven science: "brain mining"
2. Data mining in Python: Mayavi, scikit-learn, joblib
1 Brain mining: learning models of brain function
1 Imaging neuroscience
Brain images → models of brain function → cognitive tasks
Data-driven science: i ∂Ψ/∂t = HΨ
1 Brain functional data
Rich data: 50 000 voxels per frame
Complex underlying dynamics
Few observations: ∼ 100
Modern complex-system studies: from strong hypotheses to rich data
Drawing scientific conclusions? An ill-posed statistical problem
1 Statistics: the curse of dimensionality
y a function of x1; y a function of x1 and x2
More fit parameters? ⇒ need exponentially more data
y a function of 50 000 voxels?
Expert knowledge or machine learning to pick the right ones
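The last point can be made concrete with a small sketch: univariate feature selection keeps only the "voxels" that carry signal. The data, sizes, and variable names here are illustrative, not from the talk.

```python
# With far more features than samples, pick the relevant ones.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)
X = rng.randn(100, 1000)        # 100 observations, 1000 "voxels"
y = X[:, :5].sum(axis=1)        # only the first 5 voxels drive y

selector = SelectKBest(f_regression, k=5).fit(X, y)
kept = np.sort(selector.get_support(indices=True))
print(kept)                     # should recover mostly voxels 0..4
```

With 100 observations, fitting y against all 1000 voxels is hopeless; ranking features by a univariate score first makes the problem tractable.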
1 Brain reading
Predict from brain images the object viewed: correlation analysis on the observations
An inverse problem: inject a prior, i.e. regularize, on the spatial code
Sparse regression = compressive sensing
Total variation regression extracts brain regions [Michel, Trans Med Imag 2011]
Cast the problem as a prediction task: supervised learning
Prediction is a model-selection metric
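A sketch of the supervised-learning framing above: the regularization strength of a sparse decoder is chosen by cross-validation, then the model is judged by its predictive score on left-out data. Synthetic data stand in for brain images; names and sizes are illustrative.

```python
# "Prediction is a model-selection metric": pick the sparse
# decoder's penalty by cross-validation, score on held-out frames.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.randn(200, 500)                  # 200 frames, 500 "voxels"
w = np.zeros(500)
w[:20] = 1.0                             # sparse spatial code
y = X @ w + 0.1 * rng.randn(200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0)
decoder = LassoCV(cv=5).fit(X_train, y_train)
print(decoder.score(X_test, y_test))     # R² on left-out data
```

Because the true spatial code is sparse, the L1 penalty lets the decoder recover it from far fewer observations than voxels — the compressive-sensing point on the slide.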
1 On-going/spontaneous activity
95% of the activity is unrelated to the task
1 Learning regions from spontaneous activity
Time series → spatial maps
Multi-subject dictionary learning: sparsity + spatial continuity + spatial variability
⇒ individual maps + an atlas of functional regions [Varoquaux, Inf Proc Med Imag 2011]
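The multi-subject variant of [Varoquaux, Inf Proc Med Imag 2011] is not in scikit-learn; a plain single-subject DictionaryLearning gives the flavor of decomposing time series into a few sparse spatial maps. Shapes and parameters here are illustrative.

```python
# Decompose synthetic "time series × voxels" data into sparse maps.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.RandomState(0)
X = rng.randn(50, 200)                  # 50 time points, 200 voxels

dico = DictionaryLearning(n_components=5, alpha=1.0,
                          max_iter=10, random_state=0)
codes = dico.fit_transform(X)           # time courses, one per map
maps = dico.components_                 # sparse spatial maps
print(codes.shape, maps.shape)
```

The sparsity penalty alpha plays the role of the slide's sparsity prior; the spatial-continuity and inter-subject terms of the paper have no counterpart in this generic estimator.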
1 Graphical models: interactions between regions
Estimate the covariance structure: many parameters to learn
Regularize: conditional independence = sparsity on the inverse covariance [Varoquaux NIPS 2010]
Find structure via density estimation: unsupervised learning
Model selection: likelihood of new data
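A sketch of sparse inverse covariance estimation with the graphical lasso, where the penalty is selected by cross-validated likelihood on held-out data, as on the slide. The class name GraphicalLassoCV postdates this talk, and the data here are synthetic.

```python
# Zeros in the estimated precision (inverse covariance) matrix
# encode conditional independence between "regions".
import numpy as np
from sklearn.datasets import make_sparse_spd_matrix
from sklearn.covariance import GraphicalLassoCV

rng = np.random.RandomState(0)
precision = make_sparse_spd_matrix(10, alpha=0.9, random_state=0)
covariance = np.linalg.inv(precision)
X = rng.multivariate_normal(np.zeros(10), covariance, size=200)

model = GraphicalLassoCV().fit(X)
print(model.precision_.shape)           # estimated sparse precision
```

Cross-validating the likelihood of new data, rather than in-sample fit, is the unsupervised model-selection criterion named on the slide.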
2 My data-science software stack
Mayavi, scikit-learn, joblib
2 Mayavi: 3D data visualization
Requirements: large 3D data, interactive visualization, easy scripting
Solution: VTK (C++ data visualization) + UI (Traits) + a pylab-inspired API
Black-box solutions don't yield new intuitions
Limitations: hard to install, clunky & complex, C++ leaking through
Tragedy of the commons or niche product? 3D visualization doesn't pay in academia
2 scikit-learn: statistical learning
Vision
Address non-machine-learning experts
Simplify, but don't dumb down
Performance: be state of the art
Ease of installation
2 scikit-learn: statistical learning
Technical choices
Prefer Python or Cython; focus on readability
Documentation and examples are paramount
Little object-oriented design; opt for simplicity
Prefer algorithms to frameworks
Code quality: consistency and testing
2 scikit-learn: statistical learning
API: inputs are numpy arrays
Learn a model from the data: estimator.fit(X_train, y_train)
Predict using the learned model: estimator.predict(X_test)
Test goodness of fit: estimator.score(X_test, y_test)
Apply a change of representation: estimator.transform(X, y)
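The four API verbs above, end to end on a toy dataset. The imports use today's module layout, which differs slightly from the 2011-era one.

```python
# fit / predict / score on a train/test split; transform for a
# change of representation (here, PCA).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0)

estimator = SVC()
estimator.fit(X_train, y_train)        # learn a model from the data
y_pred = estimator.predict(X_test)     # predict with the learned model
acc = estimator.score(X_test, y_test)  # goodness of fit

pca = PCA(n_components=2).fit(X_train)
X_2d = pca.transform(X_test)           # change of representation
print(acc, X_2d.shape)
```

Every estimator exposing the same verbs is what lets models be swapped or chained without changing the surrounding code.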
2 scikit-learn: statistical learning
Computational performance ("-": not available):

              scikit-learn   mlpy    pybrain   pymvpa   mdp     shogun
  SVM         5.2            9.47    17.5      11.52    40.48   5.63
  LARS        1.17           105.3   -         37.35    -       -
  Elastic Net 0.52           73.7    -         1.44     -       -
  kNN         0.57           1.41    -         0.56     0.58    1.36
  PCA         0.18           -       -         8.93     0.47    0.33
  k-Means     1.34           0.79    ∞         -        35.75   0.68

Algorithms rather than low-level optimization: convex optimization + machine learning
Avoid memory copies
2 scikit-learn: statistical learning
Community
35 contributors since 2008, 103 GitHub forks
25 contributors in the latest release (3-month span)
Why this success? A trendy topic?
Low barrier to entry
A friendly and very skilled mailing list
Credit given to people
2 joblib: Python functions on steroids
We keep recomputing the same things:
nested loops with overlapping sub-problems, varying parameters, I/O
Standard solution: pipelines
Challenges: dependency modeling, parameter tracking
2 joblib: Python functions on steroids
Philosophy
Simple: don't change your code
Minimal: no dependencies
Performant: big data
Robust: never fail
joblib's solution = lazy recomputation:
take an MD5 hash of the function arguments, store outputs to disk
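A minimal sketch of the lazy-recomputation idea just described — hash the arguments, store outputs on disk, reload instead of recomputing. joblib's real implementation is far more robust; this toy version handles only picklable positional arguments.

```python
# Hash-keyed on-disk memoization: the core of lazy recomputation.
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()

def cache(func):
    def wrapper(*args):
        # Key the call by a hash of the function name and arguments.
        key = hashlib.md5(pickle.dumps((func.__name__, args))).hexdigest()
        path = os.path.join(CACHE_DIR, key + '.pkl')
        if os.path.exists(path):           # cache hit: no recomputation
            with open(path, 'rb') as f:
                return pickle.load(f)
        result = func(*args)
        with open(path, 'wb') as f:        # cache miss: store to disk
            pickle.dump(result, f)
        return result
    return wrapper

calls = []

@cache
def square(x):
    calls.append(x)
    return x * x

square(3)
square(3)          # second call is served from disk
print(calls)       # prints [3]: the function body ran only once
```

The decorator leaves the wrapped function's call signature untouched, which is the "don't change your code" point of the philosophy above.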
2 joblib
Lazy recomputing:

>>> from joblib import Memory
>>> mem = Memory(cachedir='/tmp/joblib')
>>> import numpy as np
>>> a = np.vander(np.arange(3))
>>> square = mem.cache(np.square)
>>> b = square(a)
[Memory] Calling square...
square(array([[0, 0, 1],
              [1, 1, 1],
              [4, 2, 1]]))
square - 0.0s
>>> c = square(a)
>>> # No recomputation
Conclusion
Data-driven science will need machine learning because of the curse of dimensionality
scikit-learn and joblib: focus on large-data performance and ease of use
Software and science cannot be developed separately