Efficient Implementation of Self-Organizing Map for Sparse Input Data
Josué Melka and Jean-Jacques Mariage
1 Introduction
    SOM Description
    Standard vs. Batch Versions
    Motivation for Sparse SOM
2 Contributions
3 Experiments
Presentation: The SOM Network
Self-Organizing Map (Kohonen 1982):
- an artificial neural network trained by unsupervised competitive learning
- produces a low-dimensional map of the input space
Many applications: commonly used for data projection, clustering, etc.
The SOM Map
The map: units (= nodes/neurons) on a lattice.
Associated with each node:
- a weight vector
- a position in the map
[Illustration: http://www.lohninger.com]
Example: MNIST Handwritten Dataset
[Figures: random sampling vs. SOM map (12x16 units)]
The SOM Training Algorithm
Within the main loop:
1 compute the distance between the input and the weight vectors

    d_k(t) = \| x(t) - w_k(t) \|^2    (1)

2 find the node whose weight is closest to the input (the BMU)

    d_c(t) = \min_k d_k(t)    (2)

3 update the BMU and its neighbors to be closer to the input
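As a concrete illustration, a minimal NumPy sketch of steps 1-2 (the BMU search); the function name `find_bmu` and the dense weight matrix `W` are our own illustrative choices, not the presented implementation:

```python
import numpy as np

def find_bmu(W, x):
    """Best-matching unit for sample x, equations (1)-(2).

    W: (M, D) array of node weight vectors, x: (D,) input vector.
    """
    d = np.sum((W - x) ** 2, axis=1)  # d_k = ||x - w_k||^2 for every node k
    return int(np.argmin(d))          # c = argmin_k d_k
```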
Standard Algorithm: The Learning Rule
Update the weight vectors at each time step, for a random sample x:

    w_k(t+1) = w_k(t) + \alpha(t) h_{ck}(t) [x(t) - w_k(t)]    (3)

- α(t) is the decreasing learning rate
- h_ck(t) is the neighborhood function, for example the Gaussian:

    h_{ck}(t) = \exp\left( -\frac{\| r_k - r_c \|^2}{2 \sigma(t)^2} \right)

[Figure: Gaussian neighborhood]
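A hedged sketch of one on-line step of rule (3) with the Gaussian neighborhood; `positions` (the grid coordinates r_k), `alpha` and `sigma` are assumed names:

```python
import numpy as np

def standard_update(W, positions, x, bmu, alpha, sigma):
    """One step of equation (3) with a Gaussian neighborhood h_ck."""
    # h_ck = exp(-||r_k - r_c||^2 / (2 sigma^2)) for every node k
    sq_grid_dist = np.sum((positions - positions[bmu]) ** 2, axis=1)
    h = np.exp(-sq_grid_dist / (2.0 * sigma ** 2))
    # w_k <- w_k + alpha * h_ck * (x - w_k), applied to all nodes at once
    W += alpha * h[:, None] * (x - W)
```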
The Batch Algorithm
The BMUs and their neighbors are updated once at the end of each epoch, with the average of all the samples that trigger them:

    w_k(t_f) = \frac{\sum_{t'=t_0}^{t_f} h_{ck}(t') x(t')}{\sum_{t'=t_0}^{t_f} h_{ck}(t')}    (4)
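A minimal sketch of one epoch of rule (4), accumulating the neighborhood-weighted sums over all samples before a single update (all names illustrative):

```python
import numpy as np

def batch_epoch(W, positions, X, sigma):
    """One epoch of the batch rule, equation (4)."""
    num = np.zeros_like(W)                 # numerator:   sum_t h_ck(t) x(t)
    den = np.zeros(W.shape[0])             # denominator: sum_t h_ck(t)
    for x in X:                            # X: (N, D) training samples
        bmu = np.argmin(np.sum((W - x) ** 2, axis=1))
        sq_grid_dist = np.sum((positions - positions[bmu]) ** 2, axis=1)
        h = np.exp(-sq_grid_dist / (2.0 * sigma ** 2))
        num += h[:, None] * x
        den += h
    W[:] = num / den[:, None]              # updated once, at the end of the epoch
```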
Motivation: Reduce the Computing Time
Computing time depends on:
1 the T training iterations
2 the M nodes in the network
3 the D dimensions of the vectors
The issue: large sparse datasets
Many real-world datasets are sparse and high-dimensional, but existing SOM implementations cannot exploit sparseness to save time.
Dataset Examples: Dense vs. Sparse
- MNIST dataset (dim: 780, density: 19.22%)
- News-20 dataset (dim: 62061, density: 0.13%)
Overcoming the Dimensionality Problem
A popular option: reduce the model space, using less dimension-sensitive space-reduction techniques (such as SVD, Random Mapping, etc.).
But is this the only way?
Let f be the fraction of non-zero values, and d = D × f.
Can we reduce the SOM complexity from O(TMD) to O(TMd)?
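As a rough illustration with the News-20 figures from the previous slide (D = 62061, density 0.13%, so f ≈ 0.0013): d = D × f ≈ 81, i.e. a potential factor of roughly D/d ≈ 770 on the per-vector work if only the non-zero entries are touched.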
1 Introduction
2 Contributions
    Sparse Algorithms
    Parallelism
3 Experiments
Compressed Sparse Rows Format
[Illustration: https://op2.github.io/PyOP2/linear_algebra.html]
Is this sufficient to reduce the computing time?
CSR supports efficient linear algebra operations. But:
- node weights must be dense for training
- not all operations produce sparse output
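For readers unfamiliar with the format, a tiny SciPy example of what CSR actually stores (the matrix values are arbitrary):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Only the non-zero values are kept, with their column indices
# and the offset where each row starts.
X = csr_matrix(np.array([[0., 2., 0., 0.],
                         [1., 0., 0., 3.],
                         [0., 0., 0., 0.]]))
print(X.data)     # [2. 1. 3.]  -> non-zero values
print(X.indices)  # [1 0 3]     -> their column indices
print(X.indptr)   # [0 1 3 3]   -> row start offsets into data/indices
```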
Speedup the BMU Search
The Euclidean distance (1) is equivalent to:

    d_k(t) = \| x(t) \|^2 - 2 (w_k(t) \cdot x(t)) + \| w_k(t) \|^2    (5)

Batch version
This change suffices to make the sparse Batch SOM efficient, since \| w_k \|^2 can be computed once per epoch.

Standard version
By storing \| w_k(t) \|^2, we can compute \| w_k(t+1) \|^2 efficiently:

    \| w_k(t+1) \|^2 = (1 - \beta(t))^2 \| w_k(t) \|^2 + \beta(t)^2 \| x(t) \|^2 + 2 \beta(t) (1 - \beta(t)) (w_k(t) \cdot x(t))    (6)
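A hedged sketch of the BMU search based on equation (5) for a sparse sample; `sq_norms` holds the precomputed \|w_k\|^2 values (recomputed once per epoch in the batch case, maintained incrementally with (6) in the standard case), and all names are ours:

```python
import numpy as np

def sparse_bmu(W, sq_norms, x):
    """BMU search with equation (5): x is a 1 x D CSR row, W a dense (M, D) array.

    ||x||^2 is the same for every node, so it can be dropped from the argmin;
    the dot products only visit the d non-zero entries of x.
    """
    dots = np.asarray(x.dot(W.T)).ravel()        # w_k . x for all nodes, cost O(M * d)
    return int(np.argmin(sq_norms - 2.0 * dots))
```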
Sparse-Som: Speedup the Update Phase
We express the learning rule (3) as (Natarajan 1997):

    w_k(t+1) = (1 - \beta(t)) \left[ w_k(t) + \frac{\beta(t)}{1 - \beta(t)} x(t) \right]    (7)

Don't update entire weight vectors
We keep the scalar coefficient separately, so we only update the values affected by x(t).

Numerical stability
To avoid numerical stability issues, we use double-precision floating point and rescale the weights when needed.
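A minimal sketch of the idea behind (7): the decay factor (1 - β) is folded into a per-node scale, so only the non-zero coordinates of x are written (here β stands for α(t)h_ck(t), and all names are illustrative, not the actual Sparse-Som code):

```python
import numpy as np

def sparse_online_update(W, scale, x, k, beta, eps=1e-6):
    """Equation (7) for node k; the true weight vector is scale[k] * W[k].

    x is a 1 x D CSR row; only its non-zero columns are touched.
    """
    scale[k] *= (1.0 - beta)                        # decay of every coordinate, in O(1)
    W[k, x.indices] += (beta / scale[k]) * x.data   # add beta * x in the scaled basis
    if scale[k] < eps:                              # occasional rescale for stability
        W[k] *= scale[k]
        scale[k] = 1.0
```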
Parallel Approaches for SOM
- Specialized hardware (historical)
- Massively parallel computing
- Shared-memory multiprocessing
[Illustration: http://www.nersc.gov/users/computational-systems/cori]
How to Split the Computation
- Network partitioning divides the neurons
- Data partitioning dispatches the input data
What We Did
Sparse-Som: hard to parallelize
- not adapted to data partitioning
- too much latency with network partitioning
Sparse-BSom: much simpler
- adapted to both data and network partitioning
- less synchronization needed
Another issue specific to sparseness
Memory-access latency, because of the non-linear access pattern to the weight vectors.
Mitigation: improve processor cache locality when accessing the weight vectors in the inner loop.
1 Introduction
2 Contributions
3 Experiments
    Evaluation
    Speed benchmark
    Quality test
The Evaluation
We trained SOM networks on various datasets with the same parameters, running each test 5 times, then measured their performance (speed and quality).
Our speed baseline
Somoclu (Wittek et al. 2017) is a massively parallel batch SOM implementation, which uses the classical algorithm.
Datasets Characteristics

  dataset    classes  features  samples (train / test)  density % (train / test)
  rcv1          53     47236       15564 / 518571          0.14 /   0.14
  news20        20     62061       15933 /   3993          0.13 /   0.13
  sector       105     55197        6412 /   3207          0.29 /   0.30
  mnist         10       780       60000 /  10000         19.22 /  19.37
  usps          10       256        7291 /   2007        100.00 / 100.00
  letter        26        16       15000 /   5000        100.00 / 100.00
  protein        3       357       17766 /   6621         29.00 /  26.06
  dna            3       180        2000 /   1186         25.34 /  25.14
  satimage       6        36        4435 /   2000         98.99 /  98.96
Training Parameters
Each network:
- 30 × 40 units grid, rectangular lattice
- t_max = 10 × N_samples / K_epochs = 10
- α(t) = 1 − (t / t_max)
- Gaussian neighborhood, with a radius decreasing linearly from 15 to 0.5
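The listed schedules, written out as a small sketch (function and parameter names assumed):

```python
def learning_rate(t, t_max):
    """alpha(t) = 1 - t / t_max, decreasing linearly to 0."""
    return 1.0 - t / t_max

def radius(t, t_max, r_start=15.0, r_end=0.5):
    """Neighborhood radius decreasing linearly from 15 to 0.5."""
    return r_start + (r_end - r_start) * (t / t_max)
```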
Speed Benchmark
4 datasets used (2 sparse and 2 dense), to test both:
- serial mode (Sparse-Som vs. Sparse-BSom)
- parallel mode (Sparse-BSom vs. Somoclu)
Hardware and system specifications:
- Intel Xeon E5-4610, 4 sockets of 6 cores each, clocked at 2.4 GHz, 2 threads/core
- Linux Ubuntu 16.04 (64 bits), GCC 5.4
Results: Serial Performance
[Bar chart: elapsed time in seconds per dataset. Sparse-Som: usps 30, mnist 203, sector 94, news20 144, rcv1 123. Sparse-BSom: usps 50, mnist 302, sector 34, news20 44, rcv1 36.]
Results: Parallel Performance
[Line charts: elapsed time (seconds, log scale) vs. number of CPU cores (1 to 32), for Somoclu and Sparse-BSom, on sector, news20, mnist and usps]
Quality Evaluation: Methodology
Metrics used:
1 Average Quantization Error

    Q = \frac{1}{N} \sum_{i=1}^{N} \| x_i - w_c \|    (8)

2 Topological Error
3 Precision and Recall in classification
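For reference, a straightforward sketch of metric (8) (names illustrative; dense inputs assumed for brevity):

```python
import numpy as np

def avg_quantization_error(W, X):
    """Equation (8): mean distance of each sample to its BMU's weight vector."""
    total = 0.0
    for x in X:                                    # X: (N, D) samples
        d = np.sqrt(np.sum((W - x) ** 2, axis=1))  # ||x - w_k|| for all nodes
        total += d.min()                           # distance to the BMU
    return total / len(X)
```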
Results: Average Quantization Error

             Sparse-Som       Sparse-BSom
  rcv1       0.825 ± 0.001    0.816 ± 0.001
  news20     0.905 ± 0.000    0.901 ± 0.001
  sector     0.814 ± 0.001    0.772 ± 0.003
  mnist      4.400 ± 0.001    4.500 ± 0.008
  usps       3.333 ± 0.002    3.086 ± 0.006
  protein    2.451 ± 0.000    2.450 ± 0.001
  dna        4.452 ± 0.006    3.267 ± 0.042
  satimage   0.439 ± 0.001    0.377 ± 0.001
  letter     0.357 ± 0.001    0.345 ± 0.002
Results: Topological Error

             Sparse-Som       Sparse-BSom
  rcv1       0.248 ± 0.007    0.353 ± 0.010
  news20     0.456 ± 0.020    0.604 ± 0.014
  sector     0.212 ± 0.012    0.514 ± 0.017
  mnist      0.369 ± 0.005    0.268 ± 0.003
  usps       0.150 ± 0.006    0.281 ± 0.011
  protein    0.505 ± 0.006    0.448 ± 0.007
  dna        0.099 ± 0.004    0.278 ± 0.023
  satimage   0.103 ± 0.005    0.239 ± 0.015
  letter     0.160 ± 0.003    0.269 ± 0.008
Results: Prediction Evaluation (two rows per dataset)

             Sparse-Som                   Sparse-BSom
             precision     recall         precision     recall
  rcv1       79.2 ± 0.5    79.3 ± 0.6     81.3 ± 0.4    82.1 ± 0.3
             73.7 ± 0.4    70.6 ± 0.5     76.6 ± 0.4    72.6 ± 0.5
  news20     64.2 ± 0.5    62.8 ± 0.5     50.3 ± 0.9    49.6 ± 0.8
             60.0 ± 1.7    55.4 ± 1.3     47.8 ± 1.2    43.6 ± 1.2
  sector     77.2 ± 0.9    73.2 ± 0.9     58.4 ± 0.5    56.0 ± 1.0
             73.3 ± 0.8    61.3 ± 1.8     60.9 ± 1.3    44.8 ± 1.0
  mnist      93.5 ± 0.2    93.5 ± 0.2     91.5 ± 0.2    91.5 ± 0.2
             93.4 ± 0.2    93.4 ± 0.2     91.7 ± 0.2    91.7 ± 0.2
  usps       95.9 ± 0.2    95.9 ± 0.2     95.6 ± 0.2    95.6 ± 0.2
             91.4 ± 0.3    90.7 ± 0.3     92.4 ± 0.5    91.5 ± 0.4
  protein    56.7 ± 0.2    57.5 ± 0.2     56.7 ± 0.4    57.6 ± 0.3
             49.8 ± 0.7    51.2 ± 0.6     50.7 ± 0.7    52.1 ± 0.6
  dna        90.9 ± 0.6    90.8 ± 0.5     88.5 ± 0.6    88.5 ± 0.5
             77.7 ± 1.5    69.6 ± 2.1     81.9 ± 2.9    30.3 ± 1.7
  satimage   92.3 ± 0.4    92.4 ± 0.3     92.5 ± 0.4    92.6 ± 0.4
             87.6 ± 0.3    85.4 ± 0.4     88.7 ± 0.5    86.3 ± 0.5
  letter     83.8 ± 0.3    83.7 ± 0.3     81.9 ± 0.3    81.7 ± 0.4
             81.5 ± 0.5    81.1 ± 0.5     80.2 ± 0.3    79.8 ± 0.5
Summary: Main Benefits
Sparse-Som and Sparse-BSom run much faster than their classical “dense” counterparts on sparse data.
Advantages of each version:
- Sparse-Som: maps seem to have a better organization
- Sparse-BSom: highly parallelizable; more memory efficient (single-precision)
Thank you!
