
Kenta Takeuchi

Posted on • Edited on • Originally published at bmf-tech.com

Improving the performance of your code starting with Go

This article is from bmf-tech.com - Goで始めるコードのパフォーマンス改善 (Improving the performance of your code starting with Go).

Day 9 of Makuake Advent Calendar 2022!


When I decided to improve the performance of goblin, my homemade HTTP router, I worked on performance tuning in Go, so I will write about my approach and the efforts I put into practice.

Prerequisites

I'm sure more knowledge is needed for deeper tuning, but here is the bare minimum required.

  • Garbage Collection
    • A mechanism that automatically releases memory areas that were allocated by the program at runtime and are no longer needed.
  • Memory area
    • Text area
      • Area where the program converted into machine language (machine code) is stored
    • Stack area
      • Memory area allocated when the program runs
      • Holds data whose size is fixed at compile time
      • Automatically released (e.g., when a function finishes executing and its data is no longer needed)
      • Ex. arguments, return values, temporary variables, etc.
    • Heap area
      • Memory area allocated during program execution
      • Data whose size is dynamically determined
      • Subject to garbage collection
    • Static area
      • Memory area allocated when the program starts
      • Remains allocated until the program terminates
      • Ex. global variables, static variables, etc.
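To make the stack/heap distinction concrete, here is a minimal sketch of my own (not from the original article). Building it with `go build -gcflags='-m'` shows the compiler's escape-analysis decisions: a value that never outlives its function can stay on the stack, while a value whose address is returned must move to the heap.

```go
package main

import "fmt"

// stackOnly returns a copy of x; no reference outlives the call,
// so x can live on the stack.
func stackOnly() int {
	x := 42
	return x
}

// escapesToHeap returns a pointer to a local variable, so x must be
// moved to the heap (look for "moved to heap: x" in the -m output).
func escapesToHeap() *int {
	x := 42
	return &x
}

func main() {
	fmt.Println(stackOnly())      // prints 42
	fmt.Println(*escapesToHeap()) // prints 42
}
```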

Approach to Performance Improvement

The premise is that a genuine need for performance improvement exists (Is it worth sacrificing readability? Can we even determine that the application is the bottleneck?); here we proceed on the assumption that it does.

Some ways to improve the performance of code include the following:

  • Algorithm optimization 
  • Optimization of data structures
  • Use of cache
  • Application of parallel processing
  • Compile optimization

There are a number of things that come to mind, but before implementing improvements, measurement and analysis should be performed.
(In some cases the need for a quick performance fix takes priority over measuring first, but that depends on individual circumstances and is not discussed here.)

We will introduce some of the packages and tools for measurement and analysis in Go.

Benchmarks

Go includes Benchmarks in the standard testing package so you can benchmark your code.

You can run the command go test -bench=. -benchmem to obtain a benchmark.

```go
package main

import (
	"math/rand"
	"testing"
)

func BenchmarkRandIn(b *testing.B) {
	// b.N is adjusted automatically until the benchmark result is reliable
	for i := 0; i < b.N; i++ {
		rand.Int() // Function to be measured
	}
}
```

The output result will look like this.

```
goos: darwin
goarch: amd64
pkg: bmf-san/go-perfomance-tips
cpu: VirtualApple @ 2.50GHz
BenchmarkRandIn-8   87550500   13.53 ns/op   0 B/op   0 allocs/op
PASS
ok   bmf-san/go-perfomance-tips   1.381s
```

The benchmark results that can be read from this are as follows:

  • 87550500
    • Number of function executions
    • The higher the number of executions, the better the performance is considered
  • 13.53 ns/op
    • Time required per function execution
    • Less time is considered better performance
  • 0 B/op
    • Size of memory allocated per function execution
    • The smaller the number, the better the performance is considered to be
  • 0 allocs/op
    • Number of memory allocations made per function execution
    • The fewer the number of allocations, the better the performance

Go allows for easy benchmarking in this way.

For other Go benchmarking features, see the Benchmarks section of the testing package documentation.

The package benchstat is a good tool for comparing benchmark results, as it shows the percentage change between runs.

In bmf-san/goblin, which I maintain, benchstat is incorporated into CI so that results can be compared before and after a commit.

```
// This is an example where nothing has improved...
go test -bench . -benchmem -count 1 > new.out
benchstat old.out new.out
name           old time/op  new time/op  delta
Static1-36      248ns ± 0%   246ns ± 0%   ~   (p=1.000 n=1+1)
Static5-36      502ns ± 0%   495ns ± 0%   ~   (p=1.000 n=1+1)
Static10-36     874ns ± 0%   881ns ± 0%   ~   (p=1.000 n=1+1)
WildCard1-36   1.60µs ± 0%  1.50µs ± 0%   ~   (p=1.000 n=1+1)
WildCard5-36   4.49µs ± 0%  4.92µs ± 0%   ~   (p=1.000 n=1+1)
WildCard10-36  7.68µs ± 0%  7.58µs ± 0%   ~   (p=1.000 n=1+1)
Regexp1-36     1.38µs ± 0%  1.48µs ± 0%   ~   (p=1.000 n=1+1)
Regexp5-36     4.30µs ± 0%  4.11µs ± 0%   ~   (p=1.000 n=1+1)
Regexp10-36    7.66µs ± 0%  7.18µs ± 0%   ~   (p=1.000 n=1+1)
```

If you cannot allow any performance degradation at all, it may be a good idea to use this mechanism to make CI fail on a regression.

If you want to check the actual memory allocations behind such benchmark results, you can do so by building with a compiler flag.

```go
package main

import "fmt"

// Run build with go build -o example -gcflags '-m' gcflagsexample.go
func main() {
	a := "hello"
	b := "world"
	fmt.Println(a + b)
}
```

Running go build -o example -gcflags '-m' gcflagsexample.go produces the following output.

```
# command-line-arguments
./gcflagsexample.go:9:13: inlining call to fmt.Println
./gcflagsexample.go:9:13: ... argument does not escape
./gcflagsexample.go:9:16: a + b escapes to heap
./gcflagsexample.go:9:16: a + b escapes to heap
```

This is a simple, obvious example, but the method is useful for analysis in general: by identifying which values escape to the heap and reducing those allocations, memory usage can be improved.
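As an illustration of such a fix (function names are my own, not from goblin), a common change once `-gcflags '-m'` reveals an allocation in a hot path is to let the caller supply a reusable buffer instead of allocating a fresh slice on every call:

```go
package main

import "fmt"

// joinEscapes allocates a fresh slice on each call; the result escapes
// to the heap because it is returned.
func joinEscapes(a, b string) []byte {
	buf := make([]byte, 0, len(a)+len(b))
	buf = append(buf, a...)
	buf = append(buf, b...)
	return buf
}

// joinInto reuses a caller-provided buffer, so a hot loop can amortize
// the allocation across many calls.
func joinInto(dst []byte, a, b string) []byte {
	dst = dst[:0] // keep capacity, reset length
	dst = append(dst, a...)
	dst = append(dst, b...)
	return dst
}

func main() {
	buf := make([]byte, 0, 64)
	buf = joinInto(buf, "hello", "world")
	fmt.Println(string(buf)) // prints helloworld
}
```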

Profiling

Go has a tool called pprof to analyze where the bottlenecks are at the function level.

```go
package main

import (
	"sort"
	"testing"
)

func sortAlphabetically() {
	s := []string{"abc", "aba", "cba", "acb"}
	sort.Strings(s)
}

func BenchmarkSortAlphabetically(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sortAlphabetically()
	}
}
```

If you want to see the CPU profile, run the following.

go test -test.bench=BenchmarkSortAlphabetically -cpuprofile cpu.out && go tool pprof -http=":8888" cpu.out

cpu_profile

If you want to see the memory profile, run the following.

go test -test.bench=BenchmarkSortAlphabetically profilingexample_test.go -memprofile mem.out && go tool pprof -http=":8889" mem.out

memory_profile

Utilizing the UI of pprof makes it easier to identify where the bottleneck is in the process.
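Profiles can also be captured outside `go test`, using the standard runtime/pprof package. A minimal sketch (the file names and the `workload` function are my own arbitrary examples):

```go
package main

import (
	"os"
	"runtime/pprof"
)

// workload is a stand-in for the code you want to profile.
func workload() int {
	s := make([]int, 0)
	for i := 0; i < 1_000_000; i++ {
		s = append(s, i)
	}
	return len(s)
}

func main() {
	// CPU profile: start before the workload, stop after.
	cpuFile, err := os.Create("cpu.out")
	if err != nil {
		panic(err)
	}
	defer cpuFile.Close()
	if err := pprof.StartCPUProfile(cpuFile); err != nil {
		panic(err)
	}
	workload()
	pprof.StopCPUProfile()

	// Heap profile: a snapshot written after the workload runs.
	memFile, err := os.Create("mem.out")
	if err != nil {
		panic(err)
	}
	defer memFile.Close()
	if err := pprof.WriteHeapProfile(memFile); err != nil {
		panic(err)
	}
}
```

The resulting files can be inspected with the same `go tool pprof -http=...` commands shown above.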

Practice

Let me walk through an example of improving goblin, my homemade HTTP router.

The PR on the subject is here.
Reduce the memory allocation by refactoring explodePath method #68

goblin is an HTTP router based on a trie tree that works well with the net/http interface.

As for functionality, it has the minimum that is considered necessary for routing.
cf. goblin#features

Benchmarks

First, we run a benchmark test to measure performance.

```
go test -bench=. -cpu=1 -benchmem
```

The benchmark test has three patterns for each test case: static routing (e.g. /foo/bar), dynamic routing (e.g. /foo/:bar), and routing using regular expressions (e.g. /foo/:bar[^\d+$]).

The routing process involves

  1. putting data into the tree structure (≈ defining routes)
  2. searching for data in the tree structure (returning data based on the requested path)

In this test case, only the latter is measured.
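The two steps above can be sketched as a toy trie. This is only an illustration of the idea, not goblin's actual implementation; the `node` type and `string` handler stand-in are mine:

```go
package main

import "fmt"

// node is one trie node keyed by path segment.
type node struct {
	children map[string]*node
	handler  string // stand-in for an http.Handler
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert registers a handler under a sequence of path segments
// (step 1: defining a route).
func (n *node) insert(segments []string, handler string) {
	cur := n
	for _, seg := range segments {
		child, ok := cur.children[seg]
		if !ok {
			child = newNode()
			cur.children[seg] = child
		}
		cur = child
	}
	cur.handler = handler
}

// search walks the trie segment by segment and returns the handler,
// if one was registered (step 2: matching a request path).
func (n *node) search(segments []string) (string, bool) {
	cur := n
	for _, seg := range segments {
		child, ok := cur.children[seg]
		if !ok {
			return "", false
		}
		cur = child
	}
	return cur.handler, cur.handler != ""
}

func main() {
	root := newNode()
	root.insert([]string{"foo", "bar"}, "barHandler")
	h, ok := root.search([]string{"foo", "bar"})
	fmt.Println(h, ok) // prints barHandler true
}
```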

The output results are as follows:

```
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1       5072353   240.1 ns/op    128 B/op    4 allocs/op
BenchmarkStatic5       2491546   490.0 ns/op    384 B/op    6 allocs/op
BenchmarkStatic10      1653658   729.6 ns/op    720 B/op    7 allocs/op
BenchmarkWildCard1     1602606   747.3 ns/op    456 B/op    9 allocs/op
BenchmarkWildCard5      435784    2716 ns/op   1016 B/op   23 allocs/op
BenchmarkWildCard10     246729    5033 ns/op   1680 B/op   35 allocs/op
BenchmarkRegexp1       1647258   733.2 ns/op    456 B/op    9 allocs/op
BenchmarkRegexp5        456652    2641 ns/op   1016 B/op   23 allocs/op
BenchmarkRegexp10       251998    4780 ns/op   1680 B/op   35 allocs/op
PASS
ok   github.com/bmf-san/goblin   14.304s
```

Several trends can be read from the number of executions, the time per execution, the memory size per execution, and the number of memory allocations.

I am personally concerned that memory allocations occur even for static routing. (Benchmarks of other HTTP routers show 0 allocs.)

Profiling

Next, use pprof to obtain a profile.

This time, we focus only on memory to obtain a profile.

```
go test -bench . -memprofile mem.out && go tool pprof -http=":8889" mem.out
```

Graph output results:

pprof_graph

We can see that the process with the largest box (using the most memory) is explodePath.

Even if you look at the Top (list in order of longest execution time), explodePath is at the top of the list.

pprof_top

Flat is the time spent in the function itself, while Cum (cumulative) also includes the time spent in the functions it calls.

Furthermore, check Source to see where in the function the processing is actually heavy.

pprof_source

Since Search is the core process responsible for the router's matching, I expected the bottleneck to be there, but it turns out that a specific process inside it, explodePath, is the bottleneck.

Tuning

The process of explodePath is to split the received string by / and return it as a []string type.

```go
// explodePath splits the path by the delimiter and removes empty values.
func explodePath(path string) []string {
	s := strings.Split(path, pathDelimiter)
	var r []string
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}
```

The test code is also included so the specification is easy to understand.

```go
func TestExplodePath(t *testing.T) {
	cases := []struct {
		actual   []string
		expected []string
	}{
		{actual: explodePath(""), expected: nil},
		{actual: explodePath("/"), expected: nil},
		{actual: explodePath("//"), expected: nil},
		{actual: explodePath("///"), expected: nil},
		{actual: explodePath("/foo"), expected: []string{"foo"}},
		{actual: explodePath("/foo/bar"), expected: []string{"foo", "bar"}},
		{actual: explodePath("/foo/bar/baz"), expected: []string{"foo", "bar", "baz"}},
		{actual: explodePath("/foo/bar/baz/"), expected: []string{"foo", "bar", "baz"}},
	}
	for _, c := range cases {
		if !reflect.DeepEqual(c.actual, c.expected) {
			t.Errorf("actual:%v expected:%v", c.actual, c.expected)
		}
	}
}
```

Since the variable r of type []string is declared without a capacity, we can infer that its memory efficiency is likely to be poor.

The following is a benchmark, prepared for verification, that compares ways of appending to a slice, together with its results.

```go
package main

import "testing"

func BenchmarkSliceLen0Cap0(b *testing.B) {
	var s []int
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}

func BenchmarkSliceLenN(b *testing.B) {
	var s = make([]int, b.N)
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}

func BenchmarkSliceLen0CapN(b *testing.B) {
	var s = make([]int, 0, b.N)
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}
```
```
goos: darwin
goarch: amd64
pkg: example.com
cpu: VirtualApple @ 2.50GHz
BenchmarkSliceLen0Cap0   100000000   11.88 ns/op   45 B/op   0 allocs/op
BenchmarkSliceLenN        78467056   23.60 ns/op   65 B/op   0 allocs/op
BenchmarkSliceLen0CapN   616491007   5.057 ns/op    8 B/op   0 allocs/op
PASS
ok   example.com   6.898s
```

This result suggests that the code could be somewhat more efficient by specifying the capacity.

Therefore, modify explodePath as follows.

```go
func explodePath(path string) []string {
	s := strings.Split(path, "/")
	// var r []string
	r := make([]string, 0, strings.Count(path, "/")) // Specify capacity
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}
```

Going a little deeper, here is a further improved implementation.

```go
func explodePath(path string) []string {
	splitFn := func(c rune) bool {
		return c == '/'
	}
	return strings.FieldsFunc(path, splitFn)
}
```
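As a quick sanity check (my own example, not from the PR), strings.FieldsFunc covers the same edge cases as explodePath: delimiter-only inputs yield no fields and empty segments are dropped. One subtlety worth noting is that for no-field inputs it returns an empty slice rather than nil, which matters for reflect.DeepEqual-based tests like the one above.

```go
package main

import (
	"fmt"
	"strings"
)

// explodePath splits a path on '/' and drops empty segments,
// delegating the work to strings.FieldsFunc.
func explodePath(path string) []string {
	return strings.FieldsFunc(path, func(c rune) bool { return c == '/' })
}

func main() {
	fmt.Println(len(explodePath("///")))      // prints 0 (no fields)
	fmt.Println(explodePath("/foo/bar/baz/")) // prints [foo bar baz]
}
```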

We will compare benchmarks for three patterns: the original explodePath implementation, an implementation that reserves slice capacity, and an implementation using strings.FieldsFunc.

```go
package main

import (
	"strings"
	"testing"
)

func explodePath(path string) []string {
	s := strings.Split(path, "/")
	var r []string
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

func explodePathCap(path string) []string {
	s := strings.Split(path, "/")
	r := make([]string, 0, strings.Count(path, "/"))
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

func explodePathFieldsFunc(path string) []string {
	splitFn := func(c rune) bool {
		return c == '/'
	}
	return strings.FieldsFunc(path, splitFn)
}

func BenchmarkExplodePath(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePath(v)
		}
	}
	b.StopTimer()
}

func BenchmarkExplodePathCap(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePathCap(v)
		}
	}
	b.StopTimer()
}

func BenchmarkExplodePathFieldsFunc(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePathFieldsFunc(v)
		}
	}
	b.StopTimer()
}
```
```
goos: darwin
goarch: amd64
pkg: example.com
cpu: VirtualApple @ 2.50GHz
BenchmarkExplodePath             1690340   722.2 ns/op   432 B/op   12 allocs/op
BenchmarkExplodePathCap          1622161   729.5 ns/op   416 B/op   11 allocs/op
BenchmarkExplodePathFieldsFunc   4948364   239.5 ns/op    96 B/op    3 allocs/op
PASS
ok   example.com   5.685s
```

The implementation using strings.FieldsFunc shows the best performance, so it is adopted.

Measuring Effectiveness

Check the results after improving the implementation of explodePath.

Benchmarks

```
# Before
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1       5072353   240.1 ns/op    128 B/op    4 allocs/op
BenchmarkStatic5       2491546   490.0 ns/op    384 B/op    6 allocs/op
BenchmarkStatic10      1653658   729.6 ns/op    720 B/op    7 allocs/op
BenchmarkWildCard1     1602606   747.3 ns/op    456 B/op    9 allocs/op
BenchmarkWildCard5      435784    2716 ns/op   1016 B/op   23 allocs/op
BenchmarkWildCard10     246729    5033 ns/op   1680 B/op   35 allocs/op
BenchmarkRegexp1       1647258   733.2 ns/op    456 B/op    9 allocs/op
BenchmarkRegexp5        456652    2641 ns/op   1016 B/op   23 allocs/op
BenchmarkRegexp10       251998    4780 ns/op   1680 B/op   35 allocs/op
PASS
ok   github.com/bmf-san/goblin   14.304s

# After
go test -bench=. -cpu=1 -benchmem -count=1
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1      10310658   117.7 ns/op     32 B/op    1 allocs/op
BenchmarkStatic5       4774347   258.1 ns/op     96 B/op    1 allocs/op
BenchmarkStatic10      2816960   435.8 ns/op    176 B/op    1 allocs/op
BenchmarkWildCard1     1867770   653.4 ns/op    384 B/op    6 allocs/op
BenchmarkWildCard5      496778    2484 ns/op    928 B/op   13 allocs/op
BenchmarkWildCard10     258508    4538 ns/op   1560 B/op   19 allocs/op
BenchmarkRegexp1       1978704   608.4 ns/op    384 B/op    6 allocs/op
BenchmarkRegexp5        519240    2394 ns/op    928 B/op   13 allocs/op
BenchmarkRegexp10       280741    4309 ns/op   1560 B/op   19 allocs/op
PASS
ok   github.com/bmf-san/goblin   13.666s
```

Comparing before and after improvements, we can say that the overall trend is toward improvement.

Profiling

pprof's graph.

pprof_graph_after

pprof's top.

pprof_top_after

You can see that the bottleneck has moved to strings.FieldsFunc, which is called from explodePath.

Further Improvements

Here is the tag released after making other improvements to goblin.
6.0.0

Unfortunately, I made no significant improvements to the data structure or algorithm itself, so there are no remarkable gains.

I have a feeling that the data structure and algorithm in use now still leave room for improvement. (I have seen other routers use more sophisticated trees, so I suppose that is true...)

This is somewhat off topic, but I created a benchmarker to compare goblin with other routers and see if I can get hints for improvement.

bmf-san/go-router-benchmark

It was interesting to compare them and see how much the results varied.

I would like to improve goblin by studying other router implementations and learning about advanced tree structures, which I had failed to do before.

Summary

  • Go makes benchmarking and profiling easy.
  • Don't guess, measure!
  • It is hard to get big results with small improvements (that's true)

