Example of CPU cache false sharing in Go.
A simple example in which two integer variables are incremented concurrently.
The baseline version suffers from false sharing because both values share the same cache line:
type IntVars struct {
	i1 int64
	i2 int64
}
The optimized version eliminates false sharing by inserting a cache line of padding (cpu.CacheLinePad from golang.org/x/sys/cpu) between the two fields:
type IntVars struct {
	i1 int64
	_  cpu.CacheLinePad // padding
	i2 int64
}
Reproducing results
- Install benchstat:
go install golang.org/x/perf/cmd/benchstat@latest
- Run benchmarks for plain increments (a++):
▶ make bench
name old time/op new time/op delta
Increment1Value-2 1.66ns ± 8% 1.71ns ± 7% ~ (p=0.421 n=5+5)
Increment2ValuesInParallel-2 2.34ns ± 5% 1.59ns ± 3% -32.23% (p=0.008 n=5+5)
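The parallel case above can be sketched as two goroutines, each doing plain increments on its own field (function and type names here are illustrative, not the repo's actual benchmark code). With the unpadded layout the two cores keep invalidating each other's copy of the shared cache line; with padding they proceed independently. The result is identical either way, since each goroutine owns its field; only the time/op differs:

```go
package main

import (
	"fmt"
	"sync"
)

// PaddedIntVars mirrors the optimized layout; [64]byte stands in for
// cpu.CacheLinePad (assumes 64-byte cache lines).
type PaddedIntVars struct {
	i1 int64
	_  [64]byte
	i2 int64
}

// incrementInParallel runs two goroutines, each performing n plain (a++)
// increments on its own field. No race: the fields are disjoint.
func incrementInParallel(v *PaddedIntVars, n int) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			v.i1++
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			v.i2++
		}
	}()
	wg.Wait()
}

func main() {
	var v PaddedIntVars
	incrementInParallel(&v, 1_000_000)
	fmt.Println(v.i1, v.i2) // 1000000 1000000
}
```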
- Run benchmarks for atomic increments (atomic.AddInt64(addr, 1)):
▶ make bench-atomic
name old time/op new time/op delta
Increment1Value-2 5.65ns ± 5% 5.85ns ± 6% ~ (p=0.310 n=5+5)
Increment2ValuesInParallel-2 41.6ns ±10% 5.4ns ± 8% -87.12% (p=0.008 n=5+5)
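The atomic case makes the contention far more visible (41.6ns vs 5.4ns above) because every atomic.AddInt64 must gain exclusive ownership of the cache line before it can complete. A sketch of that workload, again with a [64]byte filler standing in for cpu.CacheLinePad and illustrative names:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// PaddedIntVars mirrors the optimized layout (assumes 64-byte cache lines).
type PaddedIntVars struct {
	i1 int64
	_  [64]byte
	i2 int64
}

// atomicIncrementInParallel runs two goroutines, each performing n atomic
// increments on its own field. Without padding, the two atomics fight over
// the same cache line on every operation.
func atomicIncrementInParallel(v *PaddedIntVars, n int) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddInt64(&v.i1, 1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < n; i++ {
			atomic.AddInt64(&v.i2, 1)
		}
	}()
	wg.Wait()
}

func main() {
	var v PaddedIntVars
	atomicIncrementInParallel(&v, 1_000_000)
	fmt.Println(v.i1, v.i2) // 1000000 1000000
}
```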
CPU cache miss stats
On Linux, one can measure L1 data-cache load misses with perf to demonstrate false sharing.
- Build executables for both original and optimized versions:
▶ make build
GOOS=linux GOARCH=amd64 go build -o test
GOOS=linux GOARCH=amd64 go build -tags padded -o test-padded
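The `-tags padded` build above suggests the two layouts are selected with Go build constraints. A sketch of what the padded variant's file might look like (the file name and the exact split between files are assumptions, not taken from the repo):

```go
// intvars_padded.go — compiled only when building with: go build -tags padded
//go:build padded

package main

import "golang.org/x/sys/cpu"

// IntVars with a cache line of padding so i1 and i2 never share a line.
type IntVars struct {
	i1 int64
	_  cpu.CacheLinePad // padding
	i2 int64
}
```

A sibling file carrying `//go:build !padded` would then define the unpadded baseline layout under the same type name.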
- Run perf for both executables and compare numbers:
▶ perf stat -B -e L1-dcache-load-misses ./test
Performance counter stats for './test':
8,954,010 L1-dcache-load-misses
▶ perf stat -B -e L1-dcache-load-misses ./test-padded
Performance counter stats for './test-padded':
204,287 L1-dcache-load-misses