What causes Go's 4x performance loss on this array access microbenchmark (relative to GCC)?

WBOY

Released: 2024-02-10 08:51:09

In this array access microbenchmark, Go runs about 4x slower than the equivalent C program compiled with GCC. What causes this? The usual suspects are bounds checks on slice accesses, a less aggressive optimizer in the Go compiler, and garbage collection. In this benchmark nothing is allocated inside the loop and the indices are plainly within bounds, so garbage collection and bounds checks are unlikely to account for the whole gap. The answer below attributes the difference mainly to GCC auto-vectorizing the inner loop, which the Go compiler does not appear to do.

Question content

I wrote this microbenchmark to better understand Go's performance characteristics, so that I can make informed choices about when to use it.

From a performance overhead perspective, I thought this would be the ideal scenario for Go:

No allocation/release inside the loop
Array accesses are obviously within bounds (so the bounds checks can be removed)

Nonetheless, I see a 4x speed difference relative to gcc -O3 on amd64. Why is that?

(Timed with the shell. Each run takes a few seconds, so startup time can be ignored.)

package main

import "fmt"

func main() {
    fmt.Println("started")

    var n int32 = 1024 * 32

    a := make([]int32, n, n)
    b := make([]int32, n, n)

    var it, i, j int32

    for i = 0; i < n; i++ {
        a[i] =  i
        b[i] = -i
    }

    var r int32 = 10
    var sum int32 = 0

    for it = 0; it < r; it++ {
        for i = 0; i < n; i++ {
            for j = 0; j < n; j++ {
                sum += (a[i] + b[j]) * (it + 1)
            }
        }
    }
    fmt.Printf("n = %d, r = %d, sum = %d\n", n, r, sum)
}


C version:

#include <stdint.h>  /* int32_t */
#include <stdio.h>
#include <stdlib.h>


int main() {
    printf("started\n");

    int32_t n = 1024 * 32;

    int32_t* a = malloc(sizeof(int32_t) * n);
    int32_t* b = malloc(sizeof(int32_t) * n);

    for(int32_t i = 0; i < n; ++i) {
        a[i] =  i;
        b[i] = -i;
    }

    int32_t r = 10;
    int32_t sum = 0;

    for(int32_t it = 0; it < r; ++it) {
        for(int32_t i = 0; i < n; ++i) {
            for(int32_t j = 0; j < n; ++j) {
                sum += (a[i] + b[j]) * (it + 1);
            }
        }
    }
    printf("n = %d, r = %d, sum = %d\n", n, r, sum);

    free(a);
    free(b);
}


Update:

Using range, as recommended, made the Go version 2x faster.
On the other hand, -march=native made the C version 2x faster in my tests. (And -mno-sse gives a compilation error, apparently incompatible with -O3.)
gccgo looks equivalent to gcc here (and doesn't require range).

Solution

Looking at the assembler output of the C program and the Go program, at least with the versions of Go and GCC I am using (1.19.6 and 12.2.0 respectively), the most immediate and obvious difference is that GCC auto-vectorizes the C program, while the Go compiler does not seem able to.

This also neatly explains the roughly 4x difference you see: when it is not targeting a specific architecture, GCC uses SSE rather than AVX, and a 128-bit SSE register holds four 32-bit values, so each vector instruction does the work of four scalar ones. Indeed, adding -march=native gave me a further 2x improvement, because it made GCC emit AVX code on my CPU.

I'm not familiar enough with Go to say whether the Go compiler is inherently unable to auto-vectorize, or whether something about this particular program makes it fail to do so, but this seems to be the root cause.
