In this array access microbenchmark, Go runs about 4x slower than the same program compiled with GCC. What causes this? Bounds checks on slice accesses and the Go compiler's less aggressive optimization are the usual suspects, while garbage collection is unlikely to matter here because nothing is allocated inside the timed loops. As the answer below explains, the most visible difference in the generated code is that GCC autovectorizes the inner loop and the Go compiler does not.
I wrote this microbenchmark to better understand Go's performance characteristics, so that I can make informed choices about when to use it.
From a performance overhead perspective, I thought this would be the ideal scenario for Go:
Nonetheless, I saw a 4x speed difference relative to gcc -O3 on amd64. Why is that?
(Timed with the shell's time command. Each run takes a few seconds, so startup time can be ignored.)
package main

import "fmt"

func main() {
	fmt.Println("started")

	var n int32 = 1024 * 32
	a := make([]int32, n, n)
	b := make([]int32, n, n)
	var it, i, j int32

	// Fill the inputs: a[i] = i, b[i] = -i.
	for i = 0; i < n; i++ {
		a[i] = i
		b[i] = -i
	}

	var r int32 = 10
	var sum int32 = 0
	// Timed part: r passes over all n*n pairs.
	for it = 0; it < r; it++ {
		for i = 0; i < n; i++ {
			for j = 0; j < n; j++ {
				sum += (a[i] + b[j]) * (it + 1)
			}
		}
	}

	fmt.Printf("n = %d, r = %d, sum = %d\n", n, r, sum)
}
C version:
#include <stdint.h> /* int32_t */
#include <stdio.h>
#include <stdlib.h>

int main() {
    printf("started\n");

    int32_t n = 1024 * 32;
    int32_t* a = malloc(sizeof(int32_t) * n);
    int32_t* b = malloc(sizeof(int32_t) * n);

    /* Fill the inputs: a[i] = i, b[i] = -i. */
    for(int32_t i = 0; i < n; ++i) {
        a[i] = i;
        b[i] = -i;
    }

    int32_t r = 10;
    int32_t sum = 0;
    /* Timed part: r passes over all n*n pairs. */
    for(int32_t it = 0; it < r; ++it) {
        for(int32_t i = 0; i < n; ++i) {
            for(int32_t j = 0; j < n; ++j) {
                sum += (a[i] + b[j]) * (it + 1);
            }
        }
    }

    printf("n = %d, r = %d, sum = %d\n", n, r, sum);
    free(a);
    free(b);
}
Update:
Using range loops, as recommended, made Go 2x faster. -march=native made C 2x faster in my tests. (And -mno-sse gives a compilation error, apparently incompatible with -O3.)
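For reference, the range-based variant mentioned in the update might look like the sketch below. The function names sumIndexed and sumRange are hypothetical and do not come from the original post, and the demo uses a smaller n than the benchmark so it finishes instantly. The speedup is presumably because ranging over a slice yields each element without a separately checked index, but this sketch only shows that both forms compute the same value, not how fast they run.

```go
package main

import "fmt"

// sumIndexed mirrors the benchmark's inner loops: explicit int32
// indices into both slices. (Hypothetical helper name.)
func sumIndexed(a, b []int32, mul int32) int32 {
	var sum int32
	for i := int32(0); i < int32(len(a)); i++ {
		for j := int32(0); j < int32(len(b)); j++ {
			sum += (a[i] + b[j]) * mul
		}
	}
	return sum
}

// sumRange computes the same value with range loops, so no explicit
// index expression appears in the hot loop. (Hypothetical helper name.)
func sumRange(a, b []int32, mul int32) int32 {
	var sum int32
	for _, x := range a {
		for _, y := range b {
			sum += (x + y) * mul
		}
	}
	return sum
}

func main() {
	const n = 1024 // assumption: smaller than the benchmark's 1024 * 32
	a := make([]int32, n)
	b := make([]int32, n)
	for i := range a {
		a[i] = int32(i)
		b[i] = -int32(i)
	}
	// Both functions add the same terms in the same order, so the
	// wrapped int32 results are identical.
	fmt.Println(sumIndexed(a, b, 1) == sumRange(a, b, 1)) // prints true
}
```

To compare timings, the same two functions could be called with the benchmark's original n and timed from the shell as in the question.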
Looking at the assembler output of the C program and the Go program, at least with the versions of Go and GCC I am using (1.19.6 and 12.2.0 respectively), the most immediate and obvious difference is that GCC automatically vectorizes the C program, while the Go compiler does not seem able to do so.
This also nicely explains why you would see exactly a fourfold performance difference: GCC uses SSE rather than AVX when not targeting a specific architecture, and a 128-bit SSE register is four times the width of a 32-bit scalar operation, so each instruction processes four int32 values at once. In fact, adding -march=native gave me a further 2x performance improvement, because it made GCC emit AVX code on my CPU.
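The lane arithmetic can be illustrated in plain Go. This is only a sketch of what vectorization does conceptually, not real SIMD: the Go compiler still emits scalar instructions for it, and the names innerScalar and innerFourLanes are hypothetical. It works because Go's int32 arithmetic wraps on overflow, so the additions can be regrouped into four independent accumulators without changing the result.

```go
package main

import "fmt"

// innerScalar is the benchmark's inner loop for one fixed a[i]:
// a single accumulator, one element of b per step.
func innerScalar(ai int32, b []int32, mul int32) int32 {
	var sum int32
	for _, y := range b {
		sum += (ai + y) * mul
	}
	return sum
}

// innerFourLanes handles four elements of b per step with four
// independent accumulators, mimicking the four 32-bit lanes of a
// 128-bit register. (Illustration only; not actual SIMD code.)
func innerFourLanes(ai int32, b []int32, mul int32) int32 {
	var lane [4]int32
	j := 0
	for ; j+4 <= len(b); j += 4 {
		lane[0] += (ai + b[j]) * mul
		lane[1] += (ai + b[j+1]) * mul
		lane[2] += (ai + b[j+2]) * mul
		lane[3] += (ai + b[j+3]) * mul
	}
	// Horizontal add of the lanes, then a scalar tail for leftovers.
	sum := lane[0] + lane[1] + lane[2] + lane[3]
	for ; j < len(b); j++ {
		sum += (ai + b[j]) * mul
	}
	return sum
}

func main() {
	b := []int32{1, 2, 3, 4, 5, 6, 7}
	same := innerScalar(10, b, 3) == innerFourLanes(10, b, 3)
	// int32 lanes per 128-bit and per 256-bit register, then the check.
	fmt.Println(128/32, 256/32, same) // prints 4 8 true
}
```

With 256-bit registers the same idea gives eight lanes per instruction, which matches the additional 2x seen with -march=native.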
I'm not familiar enough with Go to tell you whether the Go compiler is intrinsically unable to autovectorize, or whether it is just this particular program that trips it up for some reason, but that seems to be the root cause.