Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Igor Sfiligoi, Daniel McDonald and Rob Knight
University of California San Diego

High level overview
• Microbiome studies recently moved from 100s to 10k-100k samples
  • But the tools were built with 1k samples in mind
• Many of the tools are written in Python
  • With NumPy used for compute-intensive portions
  • This talk focuses on pcoa and mantel, which are part of scikit-bio
• These codes work well at small scales
  • NumPy makes good use of the compute resources
  • But they are very slow at large(r) scales
  • Lack of memory locality due to multi-step logic
• Re-implemented in Cython, with emphasis on memory locality
  • Now scales well into the 100k-sample range

A word about microbiome studies
• Microbiome studies are composed of samples from the real world
  • Often collected from many different environments
• DNA sequencing technology is used to observe which microorganisms are present in a given sample
• Typical questions:
  • How similar are the samples to each other?
  • Can these similarities be explained by study variables? (e.g., salinity, pH, host-associated or not, etc.)
  • How correlated are distance matrices created by different methods?
• Samples from other studies are often compared against, too
  • This increases the total number of samples to consider

A word about the CPU memory subsystem
• CPUs compute several orders of magnitude faster than DRAM can serve data
  • An Intel Xeon Gold 6242 takes ~100 cycles to load from DRAM
• CPUs thus have internal caches to speed up access
  • But caches are limited in size, due to high cost
  • L1 cache: 4 cycles, 32 KiB per core
  • L2 cache: 14 cycles, 1 MiB per core
  • L3 cache: ~50 cycles, 22.5 MiB per chip
  • (the above is for that Xeon; other CPUs are similar)
• Minimizing off-cache access is thus essential
  • Not a new concept, but easy to overlook when datasets grow
  • (a small demonstration of the effect follows below)

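To make the cost of poor locality concrete, here is a minimal NumPy sketch (an illustration, not from the original slides): both loops below sum exactly the same data, but the column-wise sweep jumps across memory on every access and is typically several times slower than the row-wise sweep.

    import time
    import numpy as np

    n = 10_000
    mat = np.random.rand(n, n)   # ~800 MB of float64, far larger than any cache level

    def timed(fn):
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    # Row-wise sweep: consecutive elements are adjacent in memory (C order)
    t_rows = timed(lambda: np.array([mat[i, :].sum() for i in range(n)]))
    # Column-wise sweep: consecutive elements are n*8 bytes apart
    t_cols = timed(lambda: np.array([mat[:, j].sum() for j in range(n)]))
    print(f"row-wise: {t_rows:.2f}s   column-wise: {t_cols:.2f}s")
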
Principal Coordinates Analysis
• The pcoa function is used to perform a Principal Coordinates Analysis of a distance matrix
• Two-step process:
  • Centering of the matrix
  • Computation of approximate eigenvalues
• Surprisingly, the first step was the most expensive!
• (a conceptual sketch of the two steps follows below)

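For orientation, a conceptual sketch of the two PCoA steps is given below. This is a simplified version using an exact eigendecomposition, for illustration only; it is not scikit-bio's implementation, which relies on approximate eigensolvers for large inputs.

    import numpy as np

    def pcoa_sketch(dm, n_components=2):
        """Classical PCoA on a symmetric, hollow distance matrix dm (2D ndarray)."""
        # Step 1: center the squared distances (Gower centering)
        e = dm * dm / -2.0
        centered = e - e.mean(axis=1, keepdims=True) - e.mean(axis=0, keepdims=True) + e.mean()
        # Step 2: eigendecomposition (exact here, approximate in the real code)
        eigvals, eigvecs = np.linalg.eigh(centered)
        order = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues first
        coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
        return coords, eigvals[order]
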
Centering the matrix (original implementation)
• Original implementation is simple and elegant
  • But not memory efficient
• Multi-step process
  • Each arithmetic operation traverses the whole NumPy matrix
  • Reads 8 and writes 5 matrix-sized buffers

    import numpy as np

    def e_matrix(distance_matrix):
        return distance_matrix * distance_matrix / -2

    def f_matrix(E_matrix):
        row_means = E_matrix.mean(axis=1, keepdims=True)
        col_means = E_matrix.mean(axis=0, keepdims=True)
        matrix_mean = E_matrix.mean()
        return E_matrix - row_means - col_means + matrix_mean

    def center_distance_matrix(distance_matrix):
        # distance_matrix must be of type np.ndarray and
        # be a 2D floating point matrix
        return f_matrix(e_matrix(distance_matrix))

• OK for small matrices, but slow once the matrix cannot fit in cache

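How far out of cache are we? A back-of-the-envelope check (the L3 size is the one quoted earlier; the 8-byte element size assumes float64 buffers):

    n = 25_000
    bytes_per_element = 8                    # float64; halve for float32
    matrix_bytes = n * n * bytes_per_element
    l3_bytes = 22.5 * 2**20                  # 22.5 MiB L3 on the Xeon Gold 6242
    print(f"one matrix buffer: {matrix_bytes / 2**30:.1f} GiB, "
          f"~{matrix_bytes / l3_bytes:.0f}x the L3 cache")
    # one matrix buffer: 4.7 GiB, ~212x the L3 cache
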
Centering the matrix (0/2: the Python-level driver)
• To keep the active buffer in cache, must expose parallelism
  • Move compute to the inner loop
  • And use a tiling pattern
• Cannot do this in NumPy
  • Re-implemented in Cython

    def center_distance_matrix(mat, centered):
        row_means = np.empty(mat.shape[0])
        global_mean = e_matrix_means_cy(mat, centered, row_means)
        f_matrix_inplace_cy(row_means, global_mean, centered)

Centering the matrix (1/2: compute the means while creating the E_matrix)
• To keep the active buffer in cache, must expose parallelism
  • Move compute to the inner loop
  • And use a tiling pattern
• Re-implemented in Cython
  • Harder to read, but much faster

Original NumPy version (for comparison):

    def e_matrix(distance_matrix):
        return distance_matrix * distance_matrix / -2

    def f_matrix(E_matrix):
        row_means = E_matrix.mean(axis=1, keepdims=True)
        col_means = E_matrix.mean(axis=0, keepdims=True)
        ...

Cython version:

    def e_matrix_means_cy(float[:, :] mat, float[:, :] centered, float[:] row_means):
        n_samples = mat.shape[0]
        global_sum = 0.0
        for row in prange(n_samples, nogil=True):
            row_sum = 0.0
            for col in range(n_samples):
                val = mat[row, col]
                val2 = -0.5 * val * val
                centered[row, col] = val2
                row_sum = row_sum + val2
            global_sum += row_sum
            row_means[row] = row_sum / n_samples
        return (global_sum / n_samples) / n_samples

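Note that prange only runs in parallel when the extension module is compiled with OpenMP. A minimal build sketch follows; the file and module names are hypothetical and do not reflect scikit-bio's actual build configuration.

    # setup.py (sketch), assuming the Cython code above lives in _cutils.pyx
    from setuptools import setup, Extension
    from Cython.Build import cythonize

    ext = Extension(
        "_cutils",
        sources=["_cutils.pyx"],
        extra_compile_args=["-fopenmp"],   # gcc/clang flag; MSVC uses /openmp
        extra_link_args=["-fopenmp"],
    )
    setup(ext_modules=cythonize([ext], language_level=3))
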
Centering the matrix (2/2: extensively use tiling)
• To keep the active buffer in cache, must expose parallelism
  • Move compute to the inner loop
  • And use a tiling pattern
• Re-implemented in Cython
  • Harder to read, but much faster

Original NumPy version (for comparison):

    def f_matrix(E_matrix):
        row_means = E_matrix.mean(axis=1, keepdims=True)
        col_means = E_matrix.mean(axis=0, keepdims=True)
        matrix_mean = E_matrix.mean()
        return E_matrix - row_means - col_means + matrix_mean

Cython version:

    def f_matrix_inplace_cy(float[:] row_means, float global_mean, float[:, :] centered):
        n_samples = centered.shape[0]
        # use a tiled pattern to maximize locality of row_means,
        # since they are also column means
        for trow in prange(0, n_samples, 16, nogil=True):
            trow_max = min(trow + 16, n_samples)
            for tcol in range(0, n_samples, 16):
                tcol_max = min(tcol + 16, n_samples)
                for row in range(trow, trow_max, 1):
                    gr_mean = global_mean - row_means[row]
                    for col in range(tcol, tcol_max, 1):
                        centered[row, col] = centered[row, col] + (gr_mean - row_means[col])

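A quick way to sanity-check the rewrite is to compare it against the original NumPy path on a small random matrix. A sketch, assuming center_distance_matrix is the Cython-backed driver shown two slides back (float32 buffers, output preallocated by the caller):

    import numpy as np

    n = 2_000
    d = np.random.rand(n, n).astype(np.float32)
    d = (d + d.T) / 2                     # symmetric ...
    np.fill_diagonal(d, 0.0)              # ... and hollow, like a distance matrix

    expected = f_matrix(e_matrix(d.astype(np.float64)))   # original NumPy path
    centered = np.empty((n, n), dtype=np.float32)
    center_distance_matrix(d, centered)                   # Cython-backed path
    assert np.allclose(centered, expected, atol=1e-4)     # float32 vs float64 tolerance
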
Observed speedup (centering)
• Between 2x and 30x, depending on available HW
• Bigger improvement on larger problems in full-CPU mode
• Note: the original implementation cannot make effective use of multiple cores

    (runtime, lower is better)
    CPU, number of used cores, matrix size         Original   Latest
    AMD EPYC 7252,        1 core,   25k x 25k           4.1      2.1
    AMD EPYC 7252,        8 cores,  25k x 25k           4.0      0.3
    Intel Xeon Gold 6242, 1 core,   25k x 25k           7.7      2.3
    Intel Xeon Gold 6242, 16 cores, 25k x 25k           7.2      0.3
    AMD EPYC 7252,        1 core,   70k x 70k           100       30
    AMD EPYC 7252,        8 cores,  70k x 70k           125      4.2
    Intel Xeon Gold 6242, 1 core,   100k x 100k         255       78
    Intel Xeon Gold 6242, 16 cores, 100k x 100k         254      9.1

The Mantel test
• The mantel function computes the correlation between two distance matrices using the Mantel test
• The statistical significance is derived from a Monte-Carlo procedure, where the input matrices are permuted K times
  • Default K=999
• Implication:
  • The correlation function is typically computed ~1k times on a whole matrix
• Extremely slow if the matrix cannot fit in cache

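The code on the following slides works on the "condensed form" of a distance matrix, i.e. the flattened upper triangle. A tiny illustration using SciPy's squareform utility (scikit-bio exposes the same concept through DistanceMatrix.condensed_form):

    import numpy as np
    from scipy.spatial.distance import squareform

    dm = np.array([[0.0, 1.0, 2.0],
                   [1.0, 0.0, 3.0],
                   [2.0, 3.0, 0.0]])
    print(squareform(dm, checks=False))   # -> [1. 2. 3.], the upper triangle, row by row
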
Original mantel implementation
• Again, simple and elegant, but not memory efficient
• Note that it relies on external functions for its core compute

    from scipy.stats import pearsonr

    def _mantel_stats_pearson(x, y, permutations=999):
        # pearsonr operates on the flat representation
        # of the upper triangle of the symmetric matrix
        x_flat = x.condensed_form()
        y_flat = y.condensed_form()
        orig_stat = pearsonr(x_flat, y_flat)
        perm_gen = (pearsonr(x.permute(condensed=True), y_flat)
                    for _ in range(permutations))
        permuted_stats = np.fromiter(perm_gen, np.float, count=permutations)
        return [orig_stat, permuted_stats]

    def mantel(x, y, permutations=999):
        orig_stat, permuted_stats = _mantel_stats_pearson(x, y, permutations)
        count_better = (np.absolute(permuted_stats) >=
                        np.absolute(orig_stat)).sum()
        p_value = (count_better + 1) / (permutations + 1)
        return orig_stat, p_value

Original mantel implementation (cont.)
• Memory locality requires partial compute, so the external functions must be inlined and re-implemented, too
• The core of pearsonr, in simplified form (the _mantel_stats_pearson and mantel code is the same as on the previous slide):

    from numpy.linalg import norm

    def pearsonr(x_flat, y_flat):
        xm = x_flat - x_flat.mean()
        ym = y_flat - y_flat.mean()
        normxm = norm(xm)
        normym = norm(ym)
        xnorm = xm / normxm
        ynorm = ym / normym
        return np.dot(xnorm, ynorm)

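As a quick sanity check (not part of the original slides), the simplified pearsonr above matches the correlation coefficient returned by scipy.stats.pearsonr:

    import numpy as np
    import scipy.stats

    rng = np.random.default_rng(42)
    a, b = rng.random(1000), rng.random(1000)
    r_inlined = pearsonr(a, b)               # the simplified version above
    r_scipy = scipy.stats.pearsonr(a, b)[0]  # scipy returns (r, p-value)
    assert np.isclose(r_inlined, r_scipy)
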
Additional optimization opportunities
• With fewer black boxes, we can optimize more, too
• The 2nd parameter (y_flat) is always the same
• The mean and norm of a matrix do not change when the matrix is permuted
  • So ym, normym and the normalized ynorm can be computed once and reused across all permutations
  • (a small demonstration follows below)
• (the _mantel_stats_pearson, mantel and pearsonr code is the same as on the previous two slides)

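The reuse is valid because permuting a distance matrix only reorders its entries. A quick NumPy check, assuming permute means reordering rows and columns together (as DistanceMatrix.permute does):

    import numpy as np
    from numpy.linalg import norm
    from scipy.spatial.distance import squareform

    rng = np.random.default_rng(0)
    n = 200
    m = rng.random((n, n))
    dm = (m + m.T) / 2
    np.fill_diagonal(dm, 0.0)

    perm = rng.permutation(n)
    dm_perm = dm[np.ix_(perm, perm)]          # permute rows and columns together

    y, y_perm = squareform(dm, checks=False), squareform(dm_perm, checks=False)
    assert np.isclose(y.mean(), y_perm.mean())
    assert np.isclose(norm(y - y.mean()), norm(y_perm - y_perm.mean()))
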
Optimized mantel implementation
• Inlining makes it harder to read
• But not much longer
• The Cython kernel performs a permuted multi-matrix dot operation
• Anything that can be reused is pre-computed in the Python driver

Cython kernel:

    def mantel_perm_pearsonr_cy(float[:, :] x_data, Py_ssize_t[:, :] perm_order,
                                float xmean, float normxm,
                                float[:] ym_normalized, float[:] permuted_stats):
        perms_n = perm_order.shape[0]
        out_n = perm_order.shape[1]
        # normalize x on the fly: (x - xmean)/normxm == x*mul + add
        mul = 1.0 / normxm
        add = -xmean / normxm
        for p in prange(perms_n, nogil=True):
            my_ps = 0.0
            for row in range(out_n - 1):
                vrow = perm_order[p, row]
                # start of this row in the condensed (upper-triangle) representation
                idx = row * (out_n - 1) - ((row - 1) * row) // 2
                for icol in range(out_n - row - 1):
                    col = icol + row + 1
                    yval = ym_normalized[idx + icol]
                    xval = x_data[vrow, perm_order[p, col]] * mul + add
                    my_ps = yval * xval + my_ps
            permuted_stats[p] = my_ps

Python driver:

    from numpy.linalg import norm

    def _mantel(x, y, permutations):
        x_flat = x.condensed_form()
        y_flat = y.condensed_form()
        xmean = x_flat.mean()
        xm = x_flat - xmean
        ym = y_flat - y_flat.mean()
        normxm = norm(xm)
        normym = norm(ym)
        xnorm = xm / normxm
        ynorm = ym / normym
        orig_stat = np.dot(xnorm, ynorm)

        mat_n = x._data.shape[0]
        perm_order = np.empty([permutations, mat_n], dtype=np.intp)
        for row in range(permutations):
            perm_order[row, :] = np.random.permutation(mat_n)
        permuted_stats = np.empty([permutations])

        mantel_perm_pearsonr_cy(x._data, perm_order, xmean, normxm,
                                ynorm, permuted_stats)
        return [orig_stat, permuted_stats]

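The idx arithmetic in the kernel maps element (row, col) of the full matrix to its position in the condensed upper-triangle vector. A small check of that formula against SciPy's squareform ordering (illustrative only):

    import numpy as np
    from scipy.spatial.distance import squareform

    n = 6
    m = np.triu(np.arange(n * n, dtype=float).reshape(n, n), 1)
    m = m + m.T                                   # symmetric, hollow matrix
    condensed = squareform(m, checks=False)

    for row in range(n - 1):
        idx = row * (n - 1) - ((row - 1) * row) // 2
        for icol in range(n - row - 1):
            col = icol + row + 1
            assert condensed[idx + icol] == m[row, col]
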
Observed speedup (mantel)
• Up to 200x
• Bigger improvement on larger problems in full-CPU mode
• Game changing for large problems
  • We would not even have considered computing those before
• Again, the original implementation cannot make effective use of multiple cores

    (runtime, lower is better; '-' means not run with the original implementation)
    CPU, number of used cores, matrix size       Original   Latest
    AMD EPYC 7252,        1 core,   10k^2           2.19k      149
    AMD EPYC 7252,        8 cores,  10k^2           2.17k       24
    Intel Xeon Gold 6242, 1 core,   10k^2            4.2k      170
    Intel Xeon Gold 6242, 16 cores, 10k^2            4.2k       20
    AMD EPYC 7252,        1 core,   20k^2            8.9k      615
    AMD EPYC 7252,        8 cores,  20k^2            8.8k       97
    Intel Xeon Gold 6242, 1 core,   20k^2           17.9k      933
    Intel Xeon Gold 6242, 16 cores, 20k^2           17.5k      108
    AMD EPYC 7252,        8 cores,  70k^2               -    1.44k
    Intel Xeon Gold 6242, 16 cores, 100k^2              -     4.2k

Validation costs
• Even something as trivial as checking for symmetry can be expensive if one does not consider memory locality
  • The check is performed on each object instantiation!
• For a 100k-sample distance matrix, the check alone would take several minutes!

    (runtime, lower is better)
    CPU, number of used cores, matrix size       Original   Latest
    AMD EPYC 7252,        1 core,   25k^2             1.8      2.6
    AMD EPYC 7252,        8 cores,  25k^2             1.8      0.4
    Intel Xeon Gold 6242, 1 core,   25k^2             4.4      3.2
    Intel Xeon Gold 6242, 16 cores, 25k^2             4.4     0.24
    AMD EPYC 7252,        1 core,   70k^2              18       21
    AMD EPYC 7252,        8 cores,  70k^2              18      3.2
    Intel Xeon Gold 6242, 1 core,   100k^2            219       78
    Intel Xeon Gold 6242, 16 cores, 100k^2            212      5.4

Validation costs (cont.)
• The solution, again, was Cython with tiling

Original NumPy check:

    import numpy as np

    def is_symmetric_and_hollow(mat):
        # mat must be of type np.ndarray
        # and be a 2D floating point matrix
        not_sym = (mat.T != mat).any()
        not_hollow = (np.trace(mat) != 0)
        return [not not_sym, not not_hollow]

Tiled Cython check:

    import numpy as np
    from cython.parallel import prange

    def is_symmetric_and_hollow_cy(float[:, :] mat):
        in_n = mat.shape[0]
        is_sym = True
        is_hollow = True
        # use a tiled approach to maximize memory locality
        for trow in prange(0, in_n, 16, nogil=True):
            trow_max = min(trow + 16, in_n)
            for tcol in range(0, in_n, 16):
                tcol_max = min(tcol + 16, in_n)
                for row in range(trow, trow_max, 1):
                    for col in range(tcol, tcol_max, 1):
                        is_sym &= (mat[row, col] == mat[col, row])
                if trow == tcol:  # diagonal block
                    for col in range(tcol, tcol_max, 1):
                        is_hollow &= (mat[col, col] == 0)
        return [(is_sym == True), (is_hollow == True)]

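A short usage sketch (the Cython version expects a 2-D float32 buffer; function names as on this slide):

    import numpy as np

    m = np.zeros((4, 4), dtype=np.float32)
    m[0, 1] = m[1, 0] = 2.5                 # symmetric off-diagonal entry
    print(is_symmetric_and_hollow_cy(m))    # -> [True, True]
    m[2, 2] = 1.0                           # a non-zero diagonal breaks hollowness
    print(is_symmetric_and_hollow_cy(m))    # -> [True, False]
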
Conclusions
• We re-implemented two often-used functions, pcoa and mantel, plus the related object-validation function
  • Now up to 100x faster
• The root cause was over-reliance on standard libraries
  • In particular Python's NumPy
  • The code was simple and elegant, but not memory efficient
• The new implementation uses lower-level constructs
  • Used Cython to push most of the compute into the inner loop
• This will greatly help the microbiome community
  • The improvements have been accepted into the scikit-bio library code base

Acknowledgments
• This work was partially funded by the US National Science Foundation (NSF) under grants DBI-2038509, OAC-1541349 and OAC-1826967.
