
Kenta Takeuchi

Posted on • Edited on • Originally published at bmf-tech.com

Improving the performance of your code starting with Go

This article is from bmf-tech.com - Goで始めるコードのパフォーマンス改善 (Improving the performance of your code starting with Go).

Day 9 of Makuake Advent Calendar 2022!


When I decided to improve the performance of goblin, my homemade HTTP router, I worked on performance tuning in Go, so I will write about my approach and the efforts I put into practice.

Prerequisites

I'm sure more knowledge is needed for deeper tuning, but here is the bare minimum required.

  • Garbage Collection
    • A mechanism that automatically releases memory areas that were allocated by the program at runtime and are no longer needed.
  • Memory area
    • Text area
      • Area where the program converted into machine language (machine code) is stored
    • Stack area
      • Memory area allocated when the program runs
      • Holds data whose size is fixed at compile time
      • Automatically released (e.g., when a function finishes executing and its data is no longer needed)
      • Ex. arguments, return values, temporary variables, etc.
    • Heap area
      • Memory area allocated during program execution
      • Data whose size is dynamically determined
      • Subject to garbage collection
    • Static area
      • Memory area allocated when the program starts
      • Remains allocated until the program terminates
      • Ex. global variables, static variables, etc.
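To make the stack/heap distinction concrete, here is a minimal sketch of my own (not from the original article). Building it with `go build -gcflags='-m'` shows the compiler's escape-analysis decisions: a value that never outlives its function can stay on the stack, while a value whose address is returned must move to the heap.

```go
package main

import "fmt"

// stackOnly returns a copy of x; no reference outlives the call,
// so x can live on the stack.
func stackOnly() int {
	x := 42
	return x
}

// escapesToHeap returns a pointer to a local variable, so x must be
// moved to the heap (look for "moved to heap: x" in the -m output).
func escapesToHeap() *int {
	x := 42
	return &x
}

func main() {
	fmt.Println(stackOnly())      // prints 42
	fmt.Println(*escapesToHeap()) // prints 42
}
```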

Approach to Performance Improvement

The premise is that a genuine need for performance improvement exists (Is it worth sacrificing readability? Can we even determine that the application is the bottleneck?); here we proceed on the assumption that it does.

Some ways to improve the performance of code include the following:

  • Algorithm optimization 
  • Optimization of data structures
  • Use of cache
  • Application of parallel processing
  • Compile optimization

There are a number of things that come to mind, but before implementing improvements, measurement and analysis should be performed.
(In some cases the need for a quick performance fix takes priority over measuring first, but that depends on individual circumstances and is not discussed here.)

We will introduce some of the packages and tools for measurement and analysis in Go.

Benchmarks

Go includes Benchmarks in the standard testing package so you can benchmark your code.

You can run the command go test -bench=. -benchmem to obtain a benchmark.

```go
package main

import (
	"math/rand"
	"testing"
)

func BenchmarkRandIn(b *testing.B) {
	// b.N is adjusted automatically until the benchmark result is reliable
	for i := 0; i < b.N; i++ {
		rand.Int() // Function to be measured
	}
}
```

The output result will look like this.

```
goos: darwin
goarch: amd64
pkg: bmf-san/go-perfomance-tips
cpu: VirtualApple @ 2.50GHz
BenchmarkRandIn-8   87550500   13.53 ns/op   0 B/op   0 allocs/op
PASS
ok   bmf-san/go-perfomance-tips   1.381s
```

The benchmark results that can be read from this are as follows:

  • 87550500
    • Number of function executions
    • The higher the number of executions, the better the performance is considered
  • 13.53 ns/op
    • Time required per function execution
    • Less time is considered better performance
  • 0 B/op
    • Size of memory allocated per function execution
    • The smaller the number, the better the performance is considered to be
  • 0 allocs/op
    • Number of memory allocations made per function execution
    • The fewer the number of allocations, the better the performance

Go allows for easy benchmarking in this way.

For other Go benchmarking features, see the Benchmarks section of the testing package documentation.

The package benchstat is a good tool for comparing benchmark results, as it shows the percentage change between runs.

In bmf-san/goblin, which I maintain, benchstat is incorporated into CI so that results can be compared before and after a commit.

```
// This is an example where nothing has improved...
go test -bench . -benchmem -count 1 > new.out
benchstat old.out new.out
name           old time/op  new time/op  delta
Static1-36      248ns ± 0%   246ns ± 0%   ~   (p=1.000 n=1+1)
Static5-36      502ns ± 0%   495ns ± 0%   ~   (p=1.000 n=1+1)
Static10-36     874ns ± 0%   881ns ± 0%   ~   (p=1.000 n=1+1)
WildCard1-36   1.60µs ± 0%  1.50µs ± 0%   ~   (p=1.000 n=1+1)
WildCard5-36   4.49µs ± 0%  4.92µs ± 0%   ~   (p=1.000 n=1+1)
WildCard10-36  7.68µs ± 0%  7.58µs ± 0%   ~   (p=1.000 n=1+1)
Regexp1-36     1.38µs ± 0%  1.48µs ± 0%   ~   (p=1.000 n=1+1)
Regexp5-36     4.30µs ± 0%  4.11µs ± 0%   ~   (p=1.000 n=1+1)
Regexp10-36    7.66µs ± 0%  7.18µs ± 0%   ~   (p=1.000 n=1+1)
```

If you cannot allow any performance degradation at all, it may be a good idea to use this mechanism to make CI fail on a regression.

If you want to check the actual memory allocations behind such benchmark results, you can do so by building with a compiler flag.

```go
package main

import "fmt"

// Run build with go build -o example -gcflags '-m' gcflagsexample.go
func main() {
	a := "hello"
	b := "world"
	fmt.Println(a + b)
}
```

Running go build -o example -gcflags '-m' gcflagsexample.go produces the following output.

```
# command-line-arguments
./gcflagsexample.go:9:13: inlining call to fmt.Println
./gcflagsexample.go:9:13: ... argument does not escape
./gcflagsexample.go:9:16: a + b escapes to heap
./gcflagsexample.go:9:16: a + b escapes to heap
```

This is a simple, obvious example, but the method is useful for analysis in general: by identifying which values escape to the heap and reducing those allocations, memory usage can be improved.
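As an illustration of such a fix (function names are my own, not from goblin), a common change once `-gcflags '-m'` reveals an allocation in a hot path is to let the caller supply a reusable buffer instead of allocating a fresh slice on every call:

```go
package main

import "fmt"

// joinEscapes allocates a fresh slice on each call; the result escapes
// to the heap because it is returned.
func joinEscapes(a, b string) []byte {
	buf := make([]byte, 0, len(a)+len(b))
	buf = append(buf, a...)
	buf = append(buf, b...)
	return buf
}

// joinInto reuses a caller-provided buffer, so a hot loop can amortize
// the allocation across many calls.
func joinInto(dst []byte, a, b string) []byte {
	dst = dst[:0] // keep capacity, reset length
	dst = append(dst, a...)
	dst = append(dst, b...)
	return dst
}

func main() {
	buf := make([]byte, 0, 64)
	buf = joinInto(buf, "hello", "world")
	fmt.Println(string(buf)) // prints helloworld
}
```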

Profiling

Go has a tool called pprof to analyze where the bottlenecks are at the function level.

```go
package main

import (
	"sort"
	"testing"
)

func sortAlphabetically() {
	s := []string{"abc", "aba", "cba", "acb"}
	sort.Strings(s)
}

func BenchmarkSortAlphabetically(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sortAlphabetically()
	}
}
```

If you want to see the CPU profile, run the following.

go test -test.bench=BenchmarkSortAlphabetically -cpuprofile cpu.out && go tool pprof -http=":8888" cpu.out

cpu_profile

If you want to see the memory profile, run the following.

go test -test.bench=BenchmarkSortAlphabetically profilingexample_test.go -memprofile mem.out && go tool pprof -http=":8889" mem.out

memory_profile

Utilizing the UI of pprof makes it easier to identify where the bottleneck is in the process.
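Profiles can also be captured outside `go test`, using the standard runtime/pprof package. A minimal sketch (the file names and the `workload` function are my own arbitrary examples):

```go
package main

import (
	"os"
	"runtime/pprof"
)

// workload is a stand-in for the code you want to profile.
func workload() int {
	s := make([]int, 0)
	for i := 0; i < 1_000_000; i++ {
		s = append(s, i)
	}
	return len(s)
}

func main() {
	// CPU profile: start before the workload, stop after.
	cpuFile, err := os.Create("cpu.out")
	if err != nil {
		panic(err)
	}
	defer cpuFile.Close()
	if err := pprof.StartCPUProfile(cpuFile); err != nil {
		panic(err)
	}
	workload()
	pprof.StopCPUProfile()

	// Heap profile: a snapshot written after the workload runs.
	memFile, err := os.Create("mem.out")
	if err != nil {
		panic(err)
	}
	defer memFile.Close()
	if err := pprof.WriteHeapProfile(memFile); err != nil {
		panic(err)
	}
}
```

The resulting files can be inspected with the same `go tool pprof -http=...` commands shown above.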

Practice

Let me walk through an example of improving goblin, my homemade HTTP router.

The PR on the subject is here.
Reduce the memory allocation by refactoring explodePath method #68

goblin is an HTTP router based on a trie tree that works well with the net/http interface.

As for functionality, it has the minimum that is considered necessary for routing.
cf. goblin#features

Benchmarks

First, we run a benchmark test to measure performance.

```
go test -bench=. -cpu=1 -benchmem
```

The benchmark test has three patterns for each test case: static routing (e.g. /foo/bar), dynamic routing (e.g. /foo/:bar), and routing using regular expressions (e.g. /foo/:bar[^\d+$]).

The routing process involves

  1. putting data into the tree structure (≈ defining routes)
  2. searching for data in the tree structure (returning data based on the requested path)

In this test case, only the latter is measured.
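The two steps above can be sketched as a toy trie. This is only an illustration of the idea, not goblin's actual implementation; the `node` type and `string` handler stand-in are mine:

```go
package main

import "fmt"

// node is one trie node keyed by path segment.
type node struct {
	children map[string]*node
	handler  string // stand-in for an http.Handler
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert registers a handler under a sequence of path segments
// (step 1: defining a route).
func (n *node) insert(segments []string, handler string) {
	cur := n
	for _, seg := range segments {
		child, ok := cur.children[seg]
		if !ok {
			child = newNode()
			cur.children[seg] = child
		}
		cur = child
	}
	cur.handler = handler
}

// search walks the trie segment by segment and returns the handler,
// if one was registered (step 2: matching a request path).
func (n *node) search(segments []string) (string, bool) {
	cur := n
	for _, seg := range segments {
		child, ok := cur.children[seg]
		if !ok {
			return "", false
		}
		cur = child
	}
	return cur.handler, cur.handler != ""
}

func main() {
	root := newNode()
	root.insert([]string{"foo", "bar"}, "barHandler")
	h, ok := root.search([]string{"foo", "bar"})
	fmt.Println(h, ok) // prints barHandler true
}
```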

The output results are as follows:

```
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1       5072353   240.1 ns/op    128 B/op    4 allocs/op
BenchmarkStatic5       2491546   490.0 ns/op    384 B/op    6 allocs/op
BenchmarkStatic10      1653658   729.6 ns/op    720 B/op    7 allocs/op
BenchmarkWildCard1     1602606   747.3 ns/op    456 B/op    9 allocs/op
BenchmarkWildCard5      435784    2716 ns/op   1016 B/op   23 allocs/op
BenchmarkWildCard10     246729    5033 ns/op   1680 B/op   35 allocs/op
BenchmarkRegexp1       1647258   733.2 ns/op    456 B/op    9 allocs/op
BenchmarkRegexp5        456652    2641 ns/op   1016 B/op   23 allocs/op
BenchmarkRegexp10       251998    4780 ns/op   1680 B/op   35 allocs/op
PASS
ok   github.com/bmf-san/goblin   14.304s
```

Several trends can be read from the number of executions, the time per execution, the memory size per execution, and the number of memory allocations.

I am personally concerned that memory allocations occur even for static routing. (Benchmarks of other HTTP routers show 0 allocs.)

Profiling

Next, use pprof to obtain a profile.

This time, we focus only on memory to obtain a profile.

```
go test -bench . -memprofile mem.out && go tool pprof -http=":8889" mem.out
```

Graph output results:

pprof_graph

We can see that the process with the largest box (using the most memory) is explodePath.

Even if you look at the Top (list in order of longest execution time), explodePath is at the top of the list.

pprof_top

Flat is the time spent in the function itself, while Cum (cumulative) also includes the time spent in the functions it calls.

Furthermore, check Source to see where in the function the processing is actually heavy.

pprof_source

Since Search is the core process responsible for the router's matching, I expected the bottleneck to be there, but it turns out that a specific process inside it, explodePath, is the bottleneck.

Tuning

The process of explodePath is to split the received string by / and return it as a []string type.

```go
// explodePath splits the path by the delimiter and removes empty values.
func explodePath(path string) []string {
	s := strings.Split(path, pathDelimiter)
	var r []string
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}
```

The test code is also included so the specification is easy to understand.

```go
func TestExplodePath(t *testing.T) {
	cases := []struct {
		actual   []string
		expected []string
	}{
		{actual: explodePath(""), expected: nil},
		{actual: explodePath("/"), expected: nil},
		{actual: explodePath("//"), expected: nil},
		{actual: explodePath("///"), expected: nil},
		{actual: explodePath("/foo"), expected: []string{"foo"}},
		{actual: explodePath("/foo/bar"), expected: []string{"foo", "bar"}},
		{actual: explodePath("/foo/bar/baz"), expected: []string{"foo", "bar", "baz"}},
		{actual: explodePath("/foo/bar/baz/"), expected: []string{"foo", "bar", "baz"}},
	}
	for _, c := range cases {
		if !reflect.DeepEqual(c.actual, c.expected) {
			t.Errorf("actual:%v expected:%v", c.actual, c.expected)
		}
	}
}
```

Since the variable r of type []string is declared without a capacity, we can infer that its memory efficiency is likely to be poor.

The following is a benchmark, prepared for verification, that compares ways of appending to a slice, together with its results.

```go
package main

import "testing"

func BenchmarkSliceLen0Cap0(b *testing.B) {
	var s []int
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}

func BenchmarkSliceLenN(b *testing.B) {
	var s = make([]int, b.N)
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}

func BenchmarkSliceLen0CapN(b *testing.B) {
	var s = make([]int, 0, b.N)
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}
```
```
goos: darwin
goarch: amd64
pkg: example.com
cpu: VirtualApple @ 2.50GHz
BenchmarkSliceLen0Cap0   100000000   11.88 ns/op   45 B/op   0 allocs/op
BenchmarkSliceLenN        78467056   23.60 ns/op   65 B/op   0 allocs/op
BenchmarkSliceLen0CapN   616491007   5.057 ns/op    8 B/op   0 allocs/op
PASS
ok   example.com   6.898s
```

This result suggests that the code could be somewhat more efficient by specifying the capacity.

Therefore, modify explodePath as follows.

```go
func explodePath(path string) []string {
	s := strings.Split(path, "/")
	// var r []string
	r := make([]string, 0, strings.Count(path, "/")) // Specify capacity
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}
```

Going a little deeper, here is a further improved implementation.

```go
func explodePath(path string) []string {
	splitFn := func(c rune) bool {
		return c == '/'
	}
	return strings.FieldsFunc(path, splitFn)
}
```
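As a quick sanity check (my own example, not from the PR), strings.FieldsFunc covers the same edge cases as explodePath: delimiter-only inputs yield no fields and empty segments are dropped. One subtlety worth noting is that for no-field inputs it returns an empty slice rather than nil, which matters for reflect.DeepEqual-based tests like the one above.

```go
package main

import (
	"fmt"
	"strings"
)

// explodePath splits a path on '/' and drops empty segments,
// delegating the work to strings.FieldsFunc.
func explodePath(path string) []string {
	return strings.FieldsFunc(path, func(c rune) bool { return c == '/' })
}

func main() {
	fmt.Println(len(explodePath("///")))      // prints 0 (no fields)
	fmt.Println(explodePath("/foo/bar/baz/")) // prints [foo bar baz]
}
```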

We will compare benchmarks for three patterns: the original explodePath implementation, an implementation that reserves slice capacity, and an implementation using strings.FieldsFunc.

```go
package main

import (
	"strings"
	"testing"
)

func explodePath(path string) []string {
	s := strings.Split(path, "/")
	var r []string
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

func explodePathCap(path string) []string {
	s := strings.Split(path, "/")
	r := make([]string, 0, strings.Count(path, "/"))
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

func explodePathFieldsFunc(path string) []string {
	splitFn := func(c rune) bool {
		return c == '/'
	}
	return strings.FieldsFunc(path, splitFn)
}

func BenchmarkExplodePath(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePath(v)
		}
	}
	b.StopTimer()
}

func BenchmarkExplodePathCap(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePathCap(v)
		}
	}
	b.StopTimer()
}

func BenchmarkExplodePathFieldsFunc(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePathFieldsFunc(v)
		}
	}
	b.StopTimer()
}
```
```
goos: darwin
goarch: amd64
pkg: example.com
cpu: VirtualApple @ 2.50GHz
BenchmarkExplodePath             1690340   722.2 ns/op   432 B/op   12 allocs/op
BenchmarkExplodePathCap          1622161   729.5 ns/op   416 B/op   11 allocs/op
BenchmarkExplodePathFieldsFunc   4948364   239.5 ns/op    96 B/op    3 allocs/op
PASS
ok   example.com   5.685s
```

The implementation using strings.FieldsFunc shows the best performance, so it is adopted.

Measuring Effectiveness

Check the results after improving the implementation of explodePath.

Benchmarks

```
# Before
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1       5072353   240.1 ns/op    128 B/op    4 allocs/op
BenchmarkStatic5       2491546   490.0 ns/op    384 B/op    6 allocs/op
BenchmarkStatic10      1653658   729.6 ns/op    720 B/op    7 allocs/op
BenchmarkWildCard1     1602606   747.3 ns/op    456 B/op    9 allocs/op
BenchmarkWildCard5      435784    2716 ns/op   1016 B/op   23 allocs/op
BenchmarkWildCard10     246729    5033 ns/op   1680 B/op   35 allocs/op
BenchmarkRegexp1       1647258   733.2 ns/op    456 B/op    9 allocs/op
BenchmarkRegexp5        456652    2641 ns/op   1016 B/op   23 allocs/op
BenchmarkRegexp10       251998    4780 ns/op   1680 B/op   35 allocs/op
PASS
ok   github.com/bmf-san/goblin   14.304s

# After
go test -bench=. -cpu=1 -benchmem -count=1
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1      10310658   117.7 ns/op     32 B/op    1 allocs/op
BenchmarkStatic5       4774347   258.1 ns/op     96 B/op    1 allocs/op
BenchmarkStatic10      2816960   435.8 ns/op    176 B/op    1 allocs/op
BenchmarkWildCard1     1867770   653.4 ns/op    384 B/op    6 allocs/op
BenchmarkWildCard5      496778    2484 ns/op    928 B/op   13 allocs/op
BenchmarkWildCard10     258508    4538 ns/op   1560 B/op   19 allocs/op
BenchmarkRegexp1       1978704   608.4 ns/op    384 B/op    6 allocs/op
BenchmarkRegexp5        519240    2394 ns/op    928 B/op   13 allocs/op
BenchmarkRegexp10       280741    4309 ns/op   1560 B/op   19 allocs/op
PASS
ok   github.com/bmf-san/goblin   13.666s
```

Comparing before and after improvements, we can say that the overall trend is toward improvement.

Profiling

pprof's graph.

pprof_graph_after

pprof's top.

pprof_top_after

You can see that the bottleneck has moved to strings.FieldsFunc, which is called from explodePath.

Further Improvements

Here is the tag released after making other improvements to goblin.
6.0.0

Unfortunately, I made no significant improvements to the data structure or algorithm itself, so there are no remarkable gains.

I have a feeling that the data structure and algorithm in use now still leave room for improvement. (I have seen other routers use more sophisticated trees, so I suppose that is true...)

This is somewhat off topic, but I created a benchmarker to compare goblin with other routers and see if I can get hints for improvement.

bmf-san/go-router-benchmark

It was interesting to compare them and see how much the results varied.

I would like to improve goblin by studying other router implementations and learning about advanced tree structures, which I had failed to do before.

Summary

  • Go makes benchmarking and profiling easy.
  • Don't guess, measure!
  • It is hard to get big results with small improvements (that's true)

