29 June 2016 Java Performance Tweaks How I brought TopicViewer’s analysis runtime from 2:10 down to 18 seconds Repo: https://github.com/jimbethancourt/topic-viewer
1 Use a Faster Library 1 Avoid the Unnecessary 3 Looping 9 Parallelization 13 Finally 17 Contents
2 Use a Faster Library • Swapping out off-the-shelf components is the first step you’ll want to take and will likely provide the most immediate gain. • Replacing Colt with ParallelColt was relatively straightforward • No migration guides, but the fact that both libraries were open source and classes had relatively similar names helped. • Reduced the runtime from 2:10 to 0:55 Migrating from Colt to ParallelColt
3 Use a Faster Library 1 Avoid the Unnecessary 3 Looping 9 Parallelization 13 Finally 17
4 Avoid the Unnecessary Don’t use Objects when primitives will provide the same functionality Old clustersTree.findSet(new Vertex(i)).index == rootIndex New clustersTree .findSet(i).index == rootIndex Avoid Unnecessary Object Creation
5 Avoid the Unnecessary Use Object arrays instead of Maps when you primarily read values and can index via numbers: In DisjointTree Original Map<Vertex, Vertex> vertexMapping = new HashMap<Vertex, Vertex>(); findSet(Vertex v) { Vertex mapped = vertexMapping.get(v); …} New Vertex[] vertexMapping = new Vertex[numDocuments]; findSet (int v) { vertexMapping[v.index] = v; …} Brought runtime down to 42 seconds (if I remember correctly) between using arrays and avoiding object creation. Use Object Arrays instead of Maps
6 Avoid the Unnecessary • Avoid Unnecessary calls when there are collections that already perform the operations for you • Constantly calling collection.contains() added up – I was surprised • Avoid complex, object-heavy equals() operations when possible • Splitting work into producer + consumer and leveraging natural properties of collections cut 15 seconds off of runtime • See updateCorrelationMatrix() method and Pair.equals() Avoid Unnecessary Operations
7 Avoid the Unnecessary • Use method parameter values when looping heavily • Method parameters live on the processor stack, not the heap • As a result, accessing them is much faster • Shaved off a second or two Original: private int[] getLeastDissimilarPair(DoubleMatrix2D correlationMatrix, boolean force) { for (int i = 0; i < this.numDocuments; i++) for (int j = 0; j < this.numDocuments; j++) New: private int[] getLeastDissimilarPair(DoubleMatrix2D correlationMatrix, int numDocuments, boolean force) { for (int i = 0; i < numDocuments; i++) for (int j = 0; j < numDocuments; j++) Use Method Parameter Values
8 Avoid the Unnecessary Original: for (int i = 0; i < this.numDocuments; i++) for (int j = 0; j < this.numDocuments; j++) { double similarity = correlationMatrix.get(i, j); if (i < j && similarity != Double.NEGATIVE_INFINITY && similarity > bestSimilarity) New: for (int i = 0; i < numDocuments; i++) for (int j = 0; j < numDocuments; j++) { if (i < j) { double similarity = correlationMatrix.get(i, j); if (similarity != Double.NEGATIVE_INFINITY && similarity > bestSimilarity) Make Absolutely Sure you Need I/O
9 Use a Faster Library 1 Avoid the Unnecessary 3 Looping 9 Parallelization 13 Finally 17
10 Looping for (int i = 0; i < numDocuments; i++) for (int j = 0; j < numDocuments; j++) if (i < j) Is faster than for (int j = 0; j < numDocuments; j++) for (int i = 0; i < j; i++) Likely faster due to reuse of register values and JITting Don’t Get Fancy With Inner Loops
11 Looping It was cheaper to re-run the operation instead of caching in a map: private Set<Integer> getClusterSet(int rootIndex, int numDocuments) { Set<Integer> clusterSet = new LinkedHashSet<>(); for (int i = 0; i < numDocuments; i++) if (this.clustersTree.findSet(i).index == rootIndex) clusterSet.add(i); return clusterSet; } Don't Cache Where it Won't Help
12 Looping https://dzone.com/articles/java-collection-performance was pure gold! Use the Fastest Collection for your Dominant Operation
13 Use a Faster Library 1 Avoid the Unnecessary 3 Looping 9 Parallelization 13 Finally 17
14 Parallelization What do you do when you need to parallelize something like this? for (int i = 0; i < correlationMatrix2D.rows()-1; i++) {…} for (int i = 0; i < correlationMatrix2D.rows()-1; i++) { rowsForUpdating.add(i); } rowsForUpdating.parallelStream().forEach(i -> {…}); Shaved 10 seconds off of runtime. If possible, populate rowsForUpdating only once if quantity is fixed Need to Parallelize an Indexed For Loop?
15 Parallelization When organizing parallelization of processing, perform the parallelization at the highest / outermost level possible. This will reduce the cost of context-switching for the CPU Nesting parallelization was detrimental to runtime when I attempted to do so. Processing time took longer and CPU utilization was higher. Parallelize (only) at the Highest Level Possible
16 Parallelization There is no ConcurrentHashSet data structure in java.util.concurrent  Leverage the fact that ConcurrentHashMap’s key is a HashSet: ConcurrentHashMap<Pair, Integer> calculatedClusters = new ConcurrentHashMap<>(); … calculatedClusters.put(newPair, 0); A ConcurrentHashMap has 16 or more segments and each can be written to at the same time. Use a ConcurrentHashMap for the Key
17 • When performing a large volume of writes, use a non-blocking data structure. • br.ufmg.aserg.topicviewer.util.Double2DMatrix is pretty amazing once I realize how it worked • Each matrix row has its own channel • Each cell in the row has a calculated offset • As a result, there is no contention Non-Blocking Data Structures for Heavy I/O
18 Use a Faster Library 1 Avoid the Unnecessary 3 Looping 9 Parallelization 13 Finally 17
19 Finally -XX:+UseG1GC Saved 2 seconds of runtime! Use the G1 Garbage Collector
20 Finally Java Mission Control is your new best friend -XX:+UnlockCommercialFeatures -XX:+FlightRecorder Profile Profile Profile
21 • 18 seconds • Over 8X performance improvement Final Runtime

Java Performance Tweaks

  • 1.
    29 June 2016 JavaPerformance Tweaks How I brought TopicViewer’s analysis runtime from 2:10 down to 18 seconds Repo: https://github.com/jimbethancourt/topic-viewer
  • 2.
    1 Use a FasterLibrary 1 Avoid the Unnecessary 3 Looping 9 Parallelization 13 Finally 17 Contents
  • 3.
    2 Use a FasterLibrary • Swapping out off-the-shelf components is the first step you’ll want to take and will likely provide the most immediate gain. • Replacing Colt with ParallelColt was relatively straightforward • No migration guides, but the fact that both libraries were open source and classes had relatively similar names helped. • Reduced the runtime from 2:10 to 0:55 Migrating from Colt to ParallelColt
  • 4.
    3 Use a FasterLibrary 1 Avoid the Unnecessary 3 Looping 9 Parallelization 13 Finally 17
  • 5.
    4 Avoid the Unnecessary Don’tuse Objects when primitives will provide the same functionality Old clustersTree.findSet(new Vertex(i)).index == rootIndex New clustersTree .findSet(i).index == rootIndex Avoid Unnecessary Object Creation
  • 6.
    5 Avoid the Unnecessary UseObject arrays instead of Maps when you primarily read values and can index via numbers: In DisjointTree Original Map<Vertex, Vertex> vertexMapping = new HashMap<Vertex, Vertex>(); findSet(Vertex v) { Vertex mapped = vertexMapping.get(v); …} New Vertex[] vertexMapping = new Vertex[numDocuments]; findSet (int v) { vertexMapping[v.index] = v; …} Brought runtime down to 42 seconds (if I remember correctly) between using arrays and avoiding object creation. Use Object Arrays instead of Maps
  • 7.
    6 Avoid the Unnecessary •Avoid Unnecessary calls when there are collections that already perform the operations for you • Constantly calling collection.contains() added up – I was surprised • Avoid complex, object-heavy equals() operations when possible • Splitting work into producer + consumer and leveraging natural properties of collections cut 15 seconds off of runtime • See updateCorrelationMatrix() method and Pair.equals() Avoid Unnecessary Operations
  • 8.
    7 Avoid the Unnecessary •Use method parameter values when looping heavily • Method parameters live on the processor stack, not the heap • As a result, accessing them is much faster • Shaved off a second or two Original: private int[] getLeastDissimilarPair(DoubleMatrix2D correlationMatrix, boolean force) { for (int i = 0; i < this.numDocuments; i++) for (int j = 0; j < this.numDocuments; j++) New: private int[] getLeastDissimilarPair(DoubleMatrix2D correlationMatrix, int numDocuments, boolean force) { for (int i = 0; i < numDocuments; i++) for (int j = 0; j < numDocuments; j++) Use Method Parameter Values
  • 9.
    8 Avoid the Unnecessary Original: for(int i = 0; i < this.numDocuments; i++) for (int j = 0; j < this.numDocuments; j++) { double similarity = correlationMatrix.get(i, j); if (i < j && similarity != Double.NEGATIVE_INFINITY && similarity > bestSimilarity) New: for (int i = 0; i < numDocuments; i++) for (int j = 0; j < numDocuments; j++) { if (i < j) { double similarity = correlationMatrix.get(i, j); if (similarity != Double.NEGATIVE_INFINITY && similarity > bestSimilarity) Make Absolutely Sure you Need I/O
  • 10.
    9 Use a FasterLibrary 1 Avoid the Unnecessary 3 Looping 9 Parallelization 13 Finally 17
  • 11.
    10 Looping for (int i= 0; i < numDocuments; i++) for (int j = 0; j < numDocuments; j++) if (i < j) Is faster than for (int j = 0; j < numDocuments; j++) for (int i = 0; i < j; i++) Likely faster due to reuse of register values and JITting Don’t Get Fancy With Inner Loops
  • 12.
    11 Looping It was cheaperto re-run the operation instead of caching in a map: private Set<Integer> getClusterSet(int rootIndex, int numDocuments) { Set<Integer> clusterSet = new LinkedHashSet<>(); for (int i = 0; i < numDocuments; i++) if (this.clustersTree.findSet(i).index == rootIndex) clusterSet.add(i); return clusterSet; } Don't Cache Where it Won't Help
  • 13.
    12 Looping https://dzone.com/articles/java-collection-performance was puregold! Use the Fastest Collection for your Dominant Operation
  • 14.
    13 Use a FasterLibrary 1 Avoid the Unnecessary 3 Looping 9 Parallelization 13 Finally 17
  • 15.
    14 Parallelization What do youdo when you need to parallelize something like this? for (int i = 0; i < correlationMatrix2D.rows()-1; i++) {…} for (int i = 0; i < correlationMatrix2D.rows()-1; i++) { rowsForUpdating.add(i); } rowsForUpdating.parallelStream().forEach(i -> {…}); Shaved 10 seconds off of runtime. If possible, populate rowsForUpdating only once if quantity is fixed Need to Parallelize an Indexed For Loop?
  • 16.
    15 Parallelization When organizing parallelizationof processing, perform the parallelization at the highest / outermost level possible. This will reduce the cost of context-switching for the CPU Nesting parallelization was detrimental to runtime when I attempted to do so. Processing time took longer and CPU utilization was higher. Parallelize (only) at the Highest Level Possible
  • 17.
    16 Parallelization There is noConcurrentHashSet data structure in java.util.concurrent  Leverage the fact that ConcurrentHashMap’s key is a HashSet: ConcurrentHashMap<Pair, Integer> calculatedClusters = new ConcurrentHashMap<>(); … calculatedClusters.put(newPair, 0); A ConcurrentHashMap has 16 or more segments and each can be written to at the same time. Use a ConcurrentHashMap for the Key
  • 18.
    17 • When performinga large volume of writes, use a non-blocking data structure. • br.ufmg.aserg.topicviewer.util.Double2DMatrix is pretty amazing once I realize how it worked • Each matrix row has its own channel • Each cell in the row has a calculated offset • As a result, there is no contention Non-Blocking Data Structures for Heavy I/O
  • 19.
    18 Use a FasterLibrary 1 Avoid the Unnecessary 3 Looping 9 Parallelization 13 Finally 17
  • 20.
    19 Finally -XX:+UseG1GC Saved 2 secondsof runtime! Use the G1 Garbage Collector
  • 21.
    20 Finally Java Mission Controlis your new best friend -XX:+UnlockCommercialFeatures -XX:+FlightRecorder Profile Profile Profile
  • 22.
    21 • 18 seconds •Over 8X performance improvement Final Runtime