chrisarg

The Quest for Performance Part II: Perl vs Python


Having run a toy performance example, we will now digress somewhat and contrast the performance against
a few Python implementations. First, let's set the stage for the calculations and provide command-line
capabilities to the Python script.

import argparse
import time
import math
import numpy as np
import os
from numba import njit
from joblib import Parallel, delayed

parser = argparse.ArgumentParser()
parser.add_argument("--workers", type=int, default=8)
parser.add_argument("--arraysize", type=int, default=100_000_000)
args = parser.parse_args()

# Set the number of threads to 1 for different libraries
print("=" * 80)
print(
    f"\nStarting the benchmark for {args.arraysize} elements "
    f"using {args.workers} threads/workers\n"
)

# Generate the data structures for the benchmark
array0 = [np.random.rand() for _ in range(args.arraysize)]
array1 = array0.copy()
array2 = array0.copy()
array_in_np = np.array(array1)
array_in_np_copy = array_in_np.copy()
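Assuming the setup above is saved as, say, benchmark.py (the filename is my own choice), the script can be invoked with its two options; both default to 8 workers and 100 million elements:

python benchmark.py --workers 8 --arraysize 100000000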

And here are our contestants:

  • Base Python
for i in range(len(array0)):
    array0[i] = math.cos(math.sin(math.sqrt(array0[i])))
  • Numpy (Single threaded)
np.sqrt(array_in_np, out=array_in_np)
np.sin(array_in_np, out=array_in_np)
np.cos(array_in_np, out=array_in_np)
  • Joblib (note that this example is not truly in-place, as I have not been able to make it run using the out arguments)
def compute_inplace_with_joblib(chunk):
    # parallel function for joblib
    return np.cos(np.sin(np.sqrt(chunk)))

chunks = np.array_split(array1, args.workers)  # Split the array into chunks
numresults = Parallel(n_jobs=args.workers)(
    delayed(compute_inplace_with_joblib)(chunk) for chunk in chunks
)  # Process each chunk in a separate worker
array1 = np.concatenate(numresults)  # Concatenate the results
  • Numba
@njit
def compute_inplace_with_numba(array):
    # njit will compile this function to machine code
    np.sqrt(array, array)
    np.sin(array, array)
    np.cos(array, array)

compute_inplace_with_numba(array_in_np_copy)
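The timing harness itself is not shown above; a minimal wall-clock sketch of the kind that produces the output below (the helper name timethis is my own, and it assumes np and the arrays from the setup block are in scope):

import time

def timethis(label, func, *args):
    # Run the contestant once and report wall-clock time
    start = time.time()
    func(*args)
    print(f"In place in ({label}): {time.time() - start:.2f} seconds")

def numpy_inplace(arr):
    # The single-threaded Numpy contestant from above
    np.sqrt(arr, out=arr)
    np.sin(arr, out=arr)
    np.cos(arr, out=arr)

timethis("Python Numpy", numpy_inplace, array_in_np)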

And here are the timing results:

In place in (base Python): 11.42 seconds
In place in (Python Joblib): 4.59 seconds
In place in (Python Numba): 2.62 seconds
In place in (Python Numpy): 0.92 seconds

Numba is surprisingly slower!? Could it be due to the overhead of compilation, as pointed out by mohawk2 in an IRC exchange about this issue?
To test this, we should call compute_inplace_with_numba once before we execute the benchmark. Doing so shows that Numba is now faster than Numpy.
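A minimal sketch of such a warm-up (using a small throwaway array is my own illustration; any float64 array triggers the one-time compilation for that signature):

# Warm-up: the first call makes Numba JIT-compile the function;
# subsequent calls on float64 arrays reuse the cached machine code
compute_inplace_with_numba(np.ones(16))

Warming up on a tiny array keeps the benchmark data untouched while still paying the compilation cost up front.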

In place in (base Python): 11.89 seconds
In place in (Python Joblib): 4.42 seconds
In place in (Python Numpy): 0.93 seconds
In place in (Python Numba): 0.49 seconds

Finally, I decided to take base R for a ride with the same example:

n <- 50000000
x <- runif(n)
start_time <- Sys.time()
result <- cos(sin(sqrt(x)))
end_time <- Sys.time()
# Calculate the time taken
time_taken <- end_time - start_time
# Print the time taken
print(sprintf("Time in base R: %.2f seconds", time_taken))

which yielded the following timing result:

Time in base R: 1.30 seconds 

Compared to the Perl results, we note the following about this example:

  • In-place operations in base Python were ~3.5x slower than Perl
  • Single-threaded PDL and numpy gave nearly identical results, followed closely by base R
  • Failure to account for the compilation overhead of Numba yields the false impression that it is slower than Numpy. When the compilation overhead is accounted for, Numba is ~2x faster than Numpy
  • Parallelization with Joblib did improve upon base Python, but was still inferior to the single-threaded Perl implementation
  • Multi-threaded PDL (and OpenMP) crushed (not crashed!) every other implementation in all languages

Hopefully this post provides some food for thought about the language to use for your next data/compute-intensive operation. The next part in this series will look into the same example using arrays in C. This final installment will (hopefully) provide some insights about the impact of memory locality and the overhead incurred by using dynamically typed languages.
