Introduction to NumPy for Machine Learning Programmers PyData Tokyo Meetup, April 3, 2015 @ Denso IT Laboratory Kimikazu Kato, Silver Egg Technology 1 / 26
Target Audience People who want to implement machine learning algorithms in Python. 2 / 26
Outline Preliminaries Basic usage of NumPy Indexing, Broadcasting Case study Reading source code of scikit-learn Conclusion 3 / 26
Who am I? Kimikazu Kato Chief Scientist at Silver Egg Technology Algorithm designer for a recommendation system Ph.D. in computer science (Master's degree in math) 4 / 26
Python is Very Slow!

Code in C:

#include <stdio.h>
int main() {
    int i; double s = 0;
    for (i = 1; i <= 100000000; i++) s += i;
    printf("%.0f\n", s);
}

Code in Python:

s = 0.
for i in xrange(1, 100000001):
    s += i
print s

Both of the codes compute the sum of integers from 1 to 100,000,000. Result of a benchmark in a certain environment: Above: 0.109 sec (compiled with -O3 option) Below: 8.657 sec (80+ times slower!!) 5 / 26
Better code:

import numpy as np
a = np.arange(1, 100000001)
print a.sum()

Now it takes 0.188 sec. (Measured by the "time" command in Linux, loading time included.) Still slower than C, but sufficiently fast as a scripting language. 6 / 26
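For readers who want to reproduce the comparison themselves, here is a minimal timing sketch of my own (not from the slides); it assumes Python 3 and uses timeit, so absolute numbers will differ from the figures above.

import timeit
import numpy as np

def loop_sum(n=100000000):
    # Plain Python loop: the slow version of the benchmark above
    s = 0.0
    for i in range(1, n + 1):   # xrange(...) on Python 2
        s += i
    return s

def numpy_sum(n=100000000):
    # Vectorized version: the loop runs in compiled code inside NumPy
    return np.arange(1, n + 1).sum()

print(timeit.timeit(numpy_sum, number=1))
print(timeit.timeit(loop_sum, number=1))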
Lessons Python is very slow when written badly. Translating C (or Java, C#, etc.) code directly into Python is often a bad idea. Python-friendly rewriting sometimes results in a drastic performance improvement. 7 / 26
Basic rules for better performance Avoid for-loops as far as possible. Utilize the libraries' capabilities instead. Forget about the cost of copying memory. A typical C programmer might care about it, but ... 8 / 26
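As a small illustration of these rules (a hedged sketch of my own, not from the original deck): the loop version below mirrors what a C programmer would write, while the NumPy version delegates the loop to the library and accepts a temporary copy.

import numpy as np

x = np.random.rand(1000000)

# For-loop style: a direct translation of C habits, slow in Python
total = 0.0
for v in x:
    total += v * v

# NumPy style: (x * x) allocates a temporary array, but that copy is far
# cheaper than running the loop in the interpreter
total_np = (x * x).sum()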
Basic techniques for NumPy Broadcasting Indexing 9 / 26
Broadcasting

>>> import numpy as np
>>> a = np.array([0, 1, 2, 3])
>>> a * 3
array([0, 3, 6, 9])
>>> np.exp(a)
array([ 1.        ,  2.71828183,  7.3890561 , 20.08553692])

exp is called a universal function. 10 / 26
Broadcasting (2D)

>>> import numpy as np
>>> a = np.arange(9).reshape((3, 3))
>>> b = np.array([1, 2, 3])
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> b
array([1, 2, 3])
>>> a * b
array([[ 0,  2,  6],
       [ 3,  8, 15],
       [ 6, 14, 24]])

11 / 26
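Broadcasting also aligns arrays of different shapes. For instance (my own continuation of the session above, not in the original slide), reshaping b into a column makes it broadcast down the rows instead of across the columns:

>>> a * b.reshape((3, 1))
array([[ 0,  1,  2],
       [ 6,  8, 10],
       [18, 21, 24]])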
Indexing

>>> import numpy as np
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> indices = np.arange(0, 10, 2)
>>> indices
array([0, 2, 4, 6, 8])
>>> a[indices] = 0
>>> a
array([0, 1, 0, 3, 0, 5, 0, 7, 0, 9])
>>> b = np.arange(100, 600, 100)
>>> b
array([100, 200, 300, 400, 500])
>>> a[indices] = b
>>> a
array([100,   1, 200,   3, 300,   5, 400,   7, 500,   9])

12 / 26
Boolean Indexing

>>> a = np.array([1, 2, 3])
>>> b = np.array([False, True, True])
>>> a[b]
array([2, 3])
>>> c = np.arange(-3, 4)
>>> c
array([-3, -2, -1,  0,  1,  2,  3])
>>> d = c > 0
>>> d
array([False, False, False, False,  True,  True,  True], dtype=bool)
>>> c[d]
array([1, 2, 3])
>>> c[c > 0]
array([1, 2, 3])
>>> c[np.logical_and(c >= 0, c % 2 == 0)]
array([0, 2])
>>> c[np.logical_or(c >= 0, c % 2 == 0)]
array([-2,  0,  1,  2,  3])

13 / 26
Cf. In Pandas

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
>>> df
          A         B         C
0  1.084117 -0.626930 -1.818375
1  1.717066  2.554761 -0.560069
2 -1.355434 -0.464632  0.322603
3  0.013824  0.298082 -1.405409
4  0.743068  0.292042 -1.002901

[5 rows x 3 columns]
>>> df[df.A > 0.5]
          A         B         C
0  1.084117 -0.626930 -1.818375
1  1.717066  2.554761 -0.560069
4  0.743068  0.292042 -1.002901

[3 rows x 3 columns]
>>> df[(df.A > 0.5) & (df.B > 0)]
          A         B         C
1  1.717066  2.554761 -0.560069
4  0.743068  0.292042 -1.002901

[2 rows x 3 columns]

14 / 26
Case Study 1: Ridge Regression (sklearn.linear_model.Ridge)

X, y: input and output of the training data
α: hyperparameter

Minimize over w:

min_w ∥y − Xw∥₂² + α∥w∥₂²

The optimum is given as:

w = (XᵀX + αI)⁻¹Xᵀy

The corresponding part of the code:

K = safe_sparse_dot(X, X.T, dense_output=True)
try:
    dual_coef = _solve_cholesky_kernel(K, y, alpha)
    coef = safe_sparse_dot(X.T, dual_coef, dense_output=True).T
except linalg.LinAlgError:

(sklearn/linear_model/ridge.py L338-343) 15 / 26
K = safe_sparse_dot(X, X.T, dense_output=True)
try:
    dual_coef = _solve_cholesky_kernel(K, y, alpha)
    coef = safe_sparse_dot(X.T, dual_coef, dense_output=True).T
except linalg.LinAlgError:

(sklearn/linear_model/ridge.py L338-343)

safe_sparse_dot is a wrapper function of dot which can be applied to sparse and dense matrices.
_solve_cholesky_kernel computes (K + αI)⁻¹y, where K = XXᵀ, so that

w = (XᵀX + αI)⁻¹Xᵀy = Xᵀ(K + αI)⁻¹y

16 / 26
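A minimal dense-only sketch of the same computation in plain NumPy (illustrative names of my own; scikit-learn's real code also handles sparse input, multiple targets, and per-target alphas):

import numpy as np

def ridge_dual(X, y, alpha):
    # w = X^T (X X^T + alpha I)^{-1} y  -- the form used in the code above
    n_samples = X.shape[0]
    K = np.dot(X, X.T)
    dual_coef = np.linalg.solve(K + alpha * np.eye(n_samples), y)
    return np.dot(X.T, dual_coef)

def ridge_primal(X, y, alpha):
    # w = (X^T X + alpha I)^{-1} X^T y  -- the textbook form; both agree
    n_features = X.shape[1]
    return np.linalg.solve(np.dot(X.T, X) + alpha * np.eye(n_features),
                           np.dot(X.T, y))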
Inside _solve_cholesky_kernel

K.flat[::n_samples + 1] += alpha[0]
try:
    dual_coef = linalg.solve(K, y, sym_pos=True,
                             overwrite_a=False)
except np.linalg.LinAlgError:

(sklearn/linear_model/ridge.py L138-146, comments omitted)

This computes (K + αI)⁻¹y. inv should not be used; solve is faster (general knowledge in numerical computation). flat??? 17 / 26
flat

class flatiter(builtins.object)
 |  Flat iterator object to iterate over arrays.
 |
 |  A flatiter iterator is returned by ``x.flat`` for any array x.
 |  It allows iterating over the array as if it were a 1-D array,
 |  either in a for-loop or by calling its next method.
 |
 |  Iteration is done in C-contiguous style, with the last index varying the
 |  fastest. The iterator can also be indexed using basic slicing or
 |  advanced indexing.
 |
 |  See Also
 |  --------
 |  ndarray.flat : Return a flat iterator over an array.
 |  ndarray.flatten : Returns a flattened copy of an array.
 |
 |  Notes
 |  -----
 |  A flatiter iterator cannot be constructed directly from Python code
 |  by calling the flatiter constructor.

In short, x.flat is a reference to the elements of the array x, and can be used like a one-dimensional array. 18 / 26
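A tiny interactive example of flat (mine, not from the docstring): indexing through it addresses the elements in C order, and assignment through it writes back into the original array.

>>> x = np.arange(6).reshape((2, 3))
>>> x.flat[4]          # element 4 in C order, i.e. x[1, 1]
4
>>> x.flat[::4] = 100  # slicing assignment modifies x itself
>>> x
array([[100,   1,   2],
       [  3, 100,   5]])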
K.flat[::n_samples + 1] += alpha[0]
try:
    dual_coef = linalg.solve(K, y, sym_pos=True,
                             overwrite_a=False)
except np.linalg.LinAlgError:

(sklearn/linear_model/ridge.py L138-146, comments omitted)

K.flat[::n_samples + 1] += alpha[0]

is equivalent to

K += alpha[0] * np.eye(n_samples)

(The size of K is n_samples × n_samples.) The upper one is an in-place version. 19 / 26
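A quick toy check of that equivalence (my own example): stepping through the flattened array with stride n_samples + 1 hits exactly the diagonal.

>>> n = 3
>>> K = np.zeros((n, n))
>>> K.flat[::n + 1] += 5.0   # same effect as K += 5.0 * np.eye(n), but in place
>>> K
array([[ 5.,  0.,  0.],
       [ 0.,  5.,  0.],
       [ 0.,  0.,  5.]])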
Case Study 2: NMF (sklearn.decomposition.nmf) NMF = Non-negative Matrix Factorization Successful in face part detection 20 / 26
Idea of NMF

Approximate the input matrix as a product of two smaller non-negative matrices:

X ≈ HWᵀ,  Wᵢⱼ ≥ 0, Hᵢⱼ ≥ 0

Notation

Parameter set: Θ = (W, H),  θᵢ: i-th element of Θ
Error function: f(Θ) = ∥X − HWᵀ∥_F² 21 / 26
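A toy sketch of this error function in NumPy (shapes and names are my own illustration, following the slide's notation rather than scikit-learn's code):

import numpy as np

rng = np.random.RandomState(0)
X = np.abs(rng.randn(6, 4))      # input matrix: 6 samples x 4 features
H = np.abs(rng.randn(6, 2))      # non-negative factor, 6 x 2
W = np.abs(rng.randn(4, 2))      # non-negative factor, 4 x 2

f = np.linalg.norm(X - np.dot(H, W.T)) ** 2   # squared Frobenius norm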
Algorithm of NMF

Projected gradient descent (Lin 2007):

Θ⁽ᵏ⁺¹⁾ = P[Θ⁽ᵏ⁾ − α∇f(Θ⁽ᵏ⁾)]

where P[x]ᵢ = max(0, xᵢ).

Convergence condition:

∥∇ᴾf(Θ⁽ᵏ⁾)∥ ≤ ε∥∇ᴾf(Θ⁽¹⁾)∥

where

∇ᴾf(Θ)ᵢ = ∇f(Θ)ᵢ           if θᵢ > 0
          min(0, ∇f(Θ)ᵢ)    if θᵢ = 0

(Note: θᵢ ≥ 0) 22 / 26
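One iteration of this scheme is easy to express with NumPy (a hedged sketch, assuming theta and grad hold Θ⁽ᵏ⁾ and ∇f(Θ⁽ᵏ⁾) as arrays; the step-size search of Lin's algorithm is omitted):

import numpy as np

def projected_gradient_step(theta, grad, alpha):
    # P[x]_i = max(0, x_i): project back onto the non-negative orthant
    return np.maximum(0.0, theta - alpha * grad)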
Computation of ∥∇ᴾf(Θ)∥, where

∇ᴾf(Θ)ᵢ = ∇f(Θ)ᵢ           if θᵢ > 0
          min(0, ∇f(Θ)ᵢ)    if θᵢ = 0

Code:

proj_norm = norm(np.r_[gradW[np.logical_or(gradW < 0, W > 0)],
                       gradH[np.logical_or(gradH < 0, H > 0)]])

(sklearn/decomposition/nmf.py L500-501)

norm: utility function of scikit-learn which computes the L2-norm
np.r_: concatenation of arrays 23 / 26
The definition

∇ᴾf(Θ)ᵢ = ∇f(Θ)ᵢ           if θᵢ > 0
          min(0, ∇f(Θ)ᵢ)    if θᵢ = 0

means

∇ᴾf(Θ)ᵢ = ∇f(Θ)ᵢ   if θᵢ > 0
          ∇f(Θ)ᵢ   if θᵢ = 0 and ∇f(Θ)ᵢ < 0
          0        otherwise

Code:

proj_norm = norm(np.r_[gradW[np.logical_or(gradW < 0, W > 0)],
                       gradH[np.logical_or(gradH < 0, H > 0)]])

(sklearn/decomposition/nmf.py L500-501)

gradW[np.logical_or(gradW < 0, W > 0)] means that an element is employed if ∇f(Θ)ᵢ < 0 or θᵢ > 0, and discarded otherwise. Only the non-zero elements of ∇ᴾf(Θ) remain after indexing, so the norm is unchanged. 24 / 26
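For comparison, here is a more literal (and less economical) way to compute the same norm, spelling out the definition element-wise with np.where; a sketch of my own, not the code scikit-learn uses:

import numpy as np

def projected_gradient(grad, theta):
    # grad_i where theta_i > 0, min(0, grad_i) where theta_i == 0
    return np.where(theta > 0, grad, np.minimum(grad, 0.0))

# The slide's proj_norm then equals
# np.sqrt((projected_gradient(gradW, W) ** 2).sum()
#         + (projected_gradient(gradH, H) ** 2).sum())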
Conclusion Avoid for-loops; use NumPy/SciPy's capabilities. Mathematical derivation is important. You can learn a lot from the source code of scikit-learn. 25 / 26
References

Official scikit-learn documentation

For beginners of NumPy/SciPy:
Gabriele Lanaro, "Python High Performance Programming," Packt Publishing, 2013.
Stéfan van der Walt, "NumPy Medkit"
Python Scientific Lecture Notes

Algorithm of NMF:
C.-J. Lin, "Projected gradient methods for non-negative matrix factorization," Neural Computation 19, 2007. 26 / 26
