Introduction to NumPy for Machine Learning Programmers PyData Tokyo Meetup, April 3, 2015 @ Denso IT Laboratory Kimikazu Kato, Silver Egg Technology 1 / 26
Target Audience People who want to implement machine learning algorithms in Python. 2 / 26
Outline Preliminaries Basic usage of NumPy Indexing, Broadcasting Case study Reading source code of scikit-learn Conclusion 3 / 26
Who am I? Kimikazu Kato Chief Scientist at Silver Egg Technology Algorithm designer for a recommendation system Ph.D. in computer science (Master's degree in math) 4 / 26
Python is Very Slow!

Code in C:

#include <stdio.h>
int main() {
    int i; double s = 0;
    for (i = 1; i <= 100000000; i++) s += i;
    printf("%.0f\n", s);
}

Code in Python:

s = 0.
for i in xrange(1, 100000001):
    s += i
print s

Both of the codes compute the sum of integers from 1 to 100,000,000. Result of a benchmark in a certain environment: Above: 0.109 sec (compiled with -O3 option) Below: 8.657 sec (80+ times slower!!) 5 / 26
Better code:

import numpy as np
a = np.arange(1, 100000001)
print a.sum()

Now it takes 0.188 sec. (Measured by the "time" command in Linux, loading time included.) Still slower than C, but sufficiently fast as a scripting language. 6 / 26
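For readers who want to reproduce the comparison themselves, here is a minimal timing sketch of my own (not from the slides); it assumes Python 3 and uses timeit, so absolute numbers will differ from the figures above.

import timeit
import numpy as np

def loop_sum(n=100000000):
    # Plain Python loop: the slow version of the benchmark above
    s = 0.0
    for i in range(1, n + 1):   # xrange(...) on Python 2
        s += i
    return s

def numpy_sum(n=100000000):
    # Vectorized version: the loop runs in compiled code inside NumPy
    return np.arange(1, n + 1).sum()

print(timeit.timeit(numpy_sum, number=1))
print(timeit.timeit(loop_sum, number=1))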
Lessons Python is very slow when written badly. Translating C (or Java, C#, etc.) code directly into Python is often a bad idea. Python-friendly rewriting sometimes results in a drastic performance improvement. 7 / 26
Basic rules for better performance Avoid for-loops as far as possible. Utilize the libraries' capabilities instead. Forget about the cost of copying memory. A typical C programmer might care about it, but ... 8 / 26
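As a small illustration of these rules (a hedged sketch of my own, not from the original deck): the loop version below mirrors what a C programmer would write, while the NumPy version delegates the loop to the library and accepts a temporary copy.

import numpy as np

x = np.random.rand(1000000)

# For-loop style: a direct translation of C habits, slow in Python
total = 0.0
for v in x:
    total += v * v

# NumPy style: (x * x) allocates a temporary array, but that copy is far
# cheaper than running the loop in the interpreter
total_np = (x * x).sum()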
Basic techniques for NumPy Broadcasting Indexing 9 / 26
Broadcasting

>>> import numpy as np
>>> a = np.array([0, 1, 2, 3])
>>> a * 3
array([0, 3, 6, 9])
>>> np.exp(a)
array([ 1.        ,  2.71828183,  7.3890561 , 20.08553692])

exp is called a universal function. 10 / 26
Broadcasting (2D)

>>> import numpy as np
>>> a = np.arange(9).reshape((3, 3))
>>> b = np.array([1, 2, 3])
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> b
array([1, 2, 3])
>>> a * b
array([[ 0,  2,  6],
       [ 3,  8, 15],
       [ 6, 14, 24]])

11 / 26
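Broadcasting also aligns arrays of different shapes. For instance (my own continuation of the session above, not in the original slide), reshaping b into a column makes it broadcast down the rows instead of across the columns:

>>> a * b.reshape((3, 1))
array([[ 0,  1,  2],
       [ 6,  8, 10],
       [18, 21, 24]])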
Indexing

>>> import numpy as np
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> indices = np.arange(0, 10, 2)
>>> indices
array([0, 2, 4, 6, 8])
>>> a[indices] = 0
>>> a
array([0, 1, 0, 3, 0, 5, 0, 7, 0, 9])
>>> b = np.arange(100, 600, 100)
>>> b
array([100, 200, 300, 400, 500])
>>> a[indices] = b
>>> a
array([100,   1, 200,   3, 300,   5, 400,   7, 500,   9])

12 / 26
Boolean Indexing

>>> a = np.array([1, 2, 3])
>>> b = np.array([False, True, True])
>>> a[b]
array([2, 3])
>>> c = np.arange(-3, 4)
>>> c
array([-3, -2, -1,  0,  1,  2,  3])
>>> d = c > 0
>>> d
array([False, False, False, False,  True,  True,  True], dtype=bool)
>>> c[d]
array([1, 2, 3])
>>> c[c > 0]
array([1, 2, 3])
>>> c[np.logical_and(c >= 0, c % 2 == 0)]
array([0, 2])
>>> c[np.logical_or(c >= 0, c % 2 == 0)]
array([-2,  0,  1,  2,  3])

13 / 26
Cf. In Pandas

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
>>> df
          A         B         C
0  1.084117 -0.626930 -1.818375
1  1.717066  2.554761 -0.560069
2 -1.355434 -0.464632  0.322603
3  0.013824  0.298082 -1.405409
4  0.743068  0.292042 -1.002901

[5 rows x 3 columns]
>>> df[df.A > 0.5]
          A         B         C
0  1.084117 -0.626930 -1.818375
1  1.717066  2.554761 -0.560069
4  0.743068  0.292042 -1.002901

[3 rows x 3 columns]
>>> df[(df.A > 0.5) & (df.B > 0)]
          A         B         C
1  1.717066  2.554761 -0.560069
4  0.743068  0.292042 -1.002901

[2 rows x 3 columns]

14 / 26
Case Study 1: Ridge Regression (sklearn.linear_model.Ridge)

X, y: input and output of the training data
α: hyperparameter

Minimize over w:

min_w ∥y − Xw∥₂² + α∥w∥₂²

The optimum is given as:

w = (XᵀX + αI)⁻¹Xᵀy

The corresponding part of the code:

K = safe_sparse_dot(X, X.T, dense_output=True)
try:
    dual_coef = _solve_cholesky_kernel(K, y, alpha)
    coef = safe_sparse_dot(X.T, dual_coef, dense_output=True).T
except linalg.LinAlgError:

(sklearn/linear_model/ridge.py L338-343) 15 / 26
K = safe_sparse_dot(X, X.T, dense_output=True)
try:
    dual_coef = _solve_cholesky_kernel(K, y, alpha)
    coef = safe_sparse_dot(X.T, dual_coef, dense_output=True).T
except linalg.LinAlgError:

(sklearn/linear_model/ridge.py L338-343)

safe_sparse_dot is a wrapper function of dot which can be applied to sparse and dense matrices.
_solve_cholesky_kernel computes (K + αI)⁻¹y, where K = XXᵀ, so that

w = (XᵀX + αI)⁻¹Xᵀy = Xᵀ(K + αI)⁻¹y

16 / 26
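A minimal dense-only sketch of the same computation in plain NumPy (illustrative names of my own; scikit-learn's real code also handles sparse input, multiple targets, and per-target alphas):

import numpy as np

def ridge_dual(X, y, alpha):
    # w = X^T (X X^T + alpha I)^{-1} y  -- the form used in the code above
    n_samples = X.shape[0]
    K = np.dot(X, X.T)
    dual_coef = np.linalg.solve(K + alpha * np.eye(n_samples), y)
    return np.dot(X.T, dual_coef)

def ridge_primal(X, y, alpha):
    # w = (X^T X + alpha I)^{-1} X^T y  -- the textbook form; both agree
    n_features = X.shape[1]
    return np.linalg.solve(np.dot(X.T, X) + alpha * np.eye(n_features),
                           np.dot(X.T, y))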
Inside _solve_cholesky_kernel

K.flat[::n_samples + 1] += alpha[0]
try:
    dual_coef = linalg.solve(K, y, sym_pos=True,
                             overwrite_a=False)
except np.linalg.LinAlgError:

(sklearn/linear_model/ridge.py L138-146, comments omitted)

This computes (K + αI)⁻¹y. inv should not be used; solve is faster (general knowledge in numerical computation). flat??? 17 / 26
flat

class flatiter(builtins.object)
 |  Flat iterator object to iterate over arrays.
 |
 |  A flatiter iterator is returned by ``x.flat`` for any array x.
 |  It allows iterating over the array as if it were a 1-D array,
 |  either in a for-loop or by calling its next method.
 |
 |  Iteration is done in C-contiguous style, with the last index varying the
 |  fastest. The iterator can also be indexed using basic slicing or
 |  advanced indexing.
 |
 |  See Also
 |  --------
 |  ndarray.flat : Return a flat iterator over an array.
 |  ndarray.flatten : Returns a flattened copy of an array.
 |
 |  Notes
 |  -----
 |  A flatiter iterator cannot be constructed directly from Python code
 |  by calling the flatiter constructor.

In short, x.flat is a reference to the elements of the array x, and can be used like a one-dimensional array. 18 / 26
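A tiny interactive example of flat (mine, not from the docstring): indexing through it addresses the elements in C order, and assignment through it writes back into the original array.

>>> x = np.arange(6).reshape((2, 3))
>>> x.flat[4]          # element 4 in C order, i.e. x[1, 1]
4
>>> x.flat[::4] = 100  # slicing assignment modifies x itself
>>> x
array([[100,   1,   2],
       [  3, 100,   5]])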
K.flat[::n_samples + 1] += alpha[0]
try:
    dual_coef = linalg.solve(K, y, sym_pos=True,
                             overwrite_a=False)
except np.linalg.LinAlgError:

(sklearn/linear_model/ridge.py L138-146, comments omitted)

K.flat[::n_samples + 1] += alpha[0]

is equivalent to

K += alpha[0] * np.eye(n_samples)

(The size of K is n_samples × n_samples.) The upper one is an in-place version. 19 / 26
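A quick toy check of that equivalence (my own example): stepping through the flattened array with stride n_samples + 1 hits exactly the diagonal.

>>> n = 3
>>> K = np.zeros((n, n))
>>> K.flat[::n + 1] += 5.0   # same effect as K += 5.0 * np.eye(n), but in place
>>> K
array([[ 5.,  0.,  0.],
       [ 0.,  5.,  0.],
       [ 0.,  0.,  5.]])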
Case Study 2: NMF (sklearn.decomposition.nmf) NMF = Non-negative Matrix Factorization Successful in face part detection 20 / 26
Idea of NMF

Approximate the input matrix as a product of two smaller non-negative matrices:

X ≈ HWᵀ,  Wᵢⱼ ≥ 0, Hᵢⱼ ≥ 0

Notation

Parameter set: Θ = (W, H),  θᵢ: i-th element of Θ
Error function: f(Θ) = ∥X − HWᵀ∥_F² 21 / 26
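A toy sketch of this error function in NumPy (shapes and names are my own illustration, following the slide's notation rather than scikit-learn's code):

import numpy as np

rng = np.random.RandomState(0)
X = np.abs(rng.randn(6, 4))      # input matrix: 6 samples x 4 features
H = np.abs(rng.randn(6, 2))      # non-negative factor, 6 x 2
W = np.abs(rng.randn(4, 2))      # non-negative factor, 4 x 2

f = np.linalg.norm(X - np.dot(H, W.T)) ** 2   # squared Frobenius norm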
Algorithm of NMF

Projected gradient descent (Lin 2007):

Θ⁽ᵏ⁺¹⁾ = P[Θ⁽ᵏ⁾ − α∇f(Θ⁽ᵏ⁾)]

where P[x]ᵢ = max(0, xᵢ).

Convergence condition:

∥∇ᴾf(Θ⁽ᵏ⁾)∥ ≤ ε∥∇ᴾf(Θ⁽¹⁾)∥

where

∇ᴾf(Θ)ᵢ = ∇f(Θ)ᵢ           if θᵢ > 0
          min(0, ∇f(Θ)ᵢ)    if θᵢ = 0

(Note: θᵢ ≥ 0) 22 / 26
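One iteration of this scheme is easy to express with NumPy (a hedged sketch, assuming theta and grad hold Θ⁽ᵏ⁾ and ∇f(Θ⁽ᵏ⁾) as arrays; the step-size search of Lin's algorithm is omitted):

import numpy as np

def projected_gradient_step(theta, grad, alpha):
    # P[x]_i = max(0, x_i): project back onto the non-negative orthant
    return np.maximum(0.0, theta - alpha * grad)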
Computation of ∥∇ᴾf(Θ)∥, where

∇ᴾf(Θ)ᵢ = ∇f(Θ)ᵢ           if θᵢ > 0
          min(0, ∇f(Θ)ᵢ)    if θᵢ = 0

Code:

proj_norm = norm(np.r_[gradW[np.logical_or(gradW < 0, W > 0)],
                       gradH[np.logical_or(gradH < 0, H > 0)]])

(sklearn/decomposition/nmf.py L500-501)

norm: utility function of scikit-learn which computes the L2-norm
np.r_: concatenation of arrays 23 / 26
The definition

∇ᴾf(Θ)ᵢ = ∇f(Θ)ᵢ           if θᵢ > 0
          min(0, ∇f(Θ)ᵢ)    if θᵢ = 0

means

∇ᴾf(Θ)ᵢ = ∇f(Θ)ᵢ   if θᵢ > 0
          ∇f(Θ)ᵢ   if θᵢ = 0 and ∇f(Θ)ᵢ < 0
          0        otherwise

Code:

proj_norm = norm(np.r_[gradW[np.logical_or(gradW < 0, W > 0)],
                       gradH[np.logical_or(gradH < 0, H > 0)]])

(sklearn/decomposition/nmf.py L500-501)

gradW[np.logical_or(gradW < 0, W > 0)] means that an element is employed if ∇f(Θ)ᵢ < 0 or θᵢ > 0, and discarded otherwise. Only the non-zero elements of ∇ᴾf(Θ) remain after indexing, so the norm is unchanged. 24 / 26
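For comparison, here is a more literal (and less economical) way to compute the same norm, spelling out the definition element-wise with np.where; a sketch of my own, not the code scikit-learn uses:

import numpy as np

def projected_gradient(grad, theta):
    # grad_i where theta_i > 0, min(0, grad_i) where theta_i == 0
    return np.where(theta > 0, grad, np.minimum(grad, 0.0))

# The slide's proj_norm then equals
# np.sqrt((projected_gradient(gradW, W) ** 2).sum()
#         + (projected_gradient(gradH, H) ** 2).sum())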
Conclusion Avoid for-loops; use NumPy/SciPy's capabilities. Mathematical derivation is important. You can learn a lot from the source code of scikit-learn. 25 / 26
References

Official scikit-learn documentation

For beginners of NumPy/SciPy:
Gabriele Lanaro, "Python High Performance Programming," Packt Publishing, 2013.
Stéfan van der Walt, "NumPy Medkit"
Python Scientific Lecture Notes

Algorithm of NMF:
C.-J. Lin, "Projected gradient methods for non-negative matrix factorization," Neural Computation 19, 2007. 26 / 26
