Notes on GPU Mode Lecture 1

DDD

Published: 2024-11-17 19:21:02

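Profilers

Computer performance is a trade-off between time and memory. Because compute devices are expensive, time is usually the first thing to care about.

Why use a profiler?

CUDA is asynchronous, so the Python time module cannot measure GPU work reliably: a call returns once the kernel has been queued, not when it has finished
Profilers are more powerful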
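A minimal sketch of why wall-clock timing can mislead on a GPU, assuming PyTorch is installed. The helper name wall_time_ms and the matrix size are illustrative choices, not taken from the lecture. Without a CUDA device the script falls back to the CPU, where the two measurements should be close to each other.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)

def wall_time_ms(fn, sync):
    """Time one call with time.perf_counter, optionally waiting for the GPU."""
    if sync and device == "cuda":
        torch.cuda.synchronize()  # drain work queued earlier
    t0 = time.perf_counter()
    fn()
    if sync and device == "cuda":
        torch.cuda.synchronize()  # wait until the kernel has really finished
    return (time.perf_counter() - t0) * 1e3

# Warm up so one-time setup costs are not part of either measurement
for _ in range(3):
    x @ x

naive_ms = wall_time_ms(lambda: x @ x, sync=False)   # mostly launch overhead on a GPU
synced_ms = wall_time_ms(lambda: x @ x, sync=True)   # includes the actual execution
print(f"without synchronize: {naive_ms:.3f} ms, with synchronize: {synced_ms:.3f} ms")
```

On a GPU the unsynchronized figure tends to be much smaller, because the call returns before the kernel has run.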

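Tools

There are three profilers:

autograd profiler: numerical
PyTorch profiler: visual
NVIDIA Nsight Compute

The autograd profiler approach shown in the lecture uses torch.cuda.Event() to measure elapsed GPU time.

The PyTorch profiler analyzes performance through profile(), a context manager in torch.profiler.
You can export the result as a .json file and load it in chrome://tracing/ to visualize it.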
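The workflow above might look like the following sketch. The tensor size, the loop count and the output path are assumptions made for illustration; export_chrome_trace writes the .json file that chrome://tracing/ can load. CUDA activity is recorded only when a GPU is present.

```python
import os
import tempfile
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

x = torch.randn(1000, 1000, device=device)

# Everything executed inside the context manager is recorded
with profile(activities=activities) as prof:
    for _ in range(3):
        torch.square(x)

# Write a trace file that can be opened in chrome://tracing/
trace_path = os.path.join(tempfile.gettempdir(), "square_trace.json")
prof.export_chrome_trace(trace_path)

# The same data is also available as a text table
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```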

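Demo

The lecture provides a simple program that shows how to use the autograd profiler to analyze the performance of three ways of squaring a tensor:

With torch.square()
With the ** operator
With the * operator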

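import torch

# The notes do not show these two helpers; the definitions below are assumed
def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

def time_pytorch_function(func, input):
    # CUDA is asynchronous, so the Python time module cannot be used here
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warm up so one-time setup costs are not measured
    for _ in range(5):
        func(input)

    start.record()
    func(input)
    end.record()
    # Wait for the GPU to finish before reading the elapsed time
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds

# The input size is an assumption; the notes do not show how b is created
b = torch.randn(10000, 10000, device='cuda')

time_pytorch_function(torch.square, b)
time_pytorch_function(square_2, b)
time_pytorch_function(square_3, b)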

The results below were obtained on an NVIDIA T4 GPU.

Profiling torch.square:
Self CPU time total: 10.577ms
Self CUDA time total: 3.266ms

Profiling a * a:
Self CPU time total: 5.417ms
Self CUDA time total: 3.276ms

Profiling a ** 2:
Self CPU time total: 6.183ms
Self CUDA time total: 3.274ms

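The results show that:

The CUDA time is lower than the CPU time in every case.
The * operator runs aten::mul rather than aten::pow, and the former is faster. This may be because multiplication is used far more often than pow, so many developers have spent time optimizing it.
The performance difference on CUDA is tiny. In terms of CPU time, torch.square is the slowest operation.
aten::square is a call to aten::pow.
All three methods launch a kernel named native::vectorized_elementwise_kernel.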
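Operator-level observations like these can be checked with the autograd profiler's summary table. The sketch below is an illustration under assumptions: the tensor size and the candidates dictionary are hypothetical, and use_device="cuda" is the argument name in recent PyTorch releases (older ones used use_cuda=True). It falls back to CPU-only profiling when no GPU is available, in which case the CUDA kernel name will not appear.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Request CUDA timings only when a GPU is present
prof_kwargs = {"use_device": "cuda"} if device == "cuda" else {}

a = torch.randn(1000, 1000, device=device)

# Three ways to square a tensor, keyed by a display name
candidates = {
    "torch.square": lambda t: torch.square(t),
    "a * a": lambda t: t * t,
    "a ** 2": lambda t: t ** 2,
}

tables = {}
for name, fn in candidates.items():
    fn(a)  # warm-up call, not profiled
    with torch.autograd.profiler.profile(**prof_kwargs) as prof:
        fn(a)
    # One row per operator, sorted by total CPU time
    tables[name] = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
    print(f"Profiling {name}:")
    print(tables[name])
```

The operator names in the first column show which aten operations each variant dispatches to.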

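Integrating CUDA kernels in PyTorch

There are several ways to do this:

Use load_inline from torch.utils.cpp_extension
Use Numba, a compiler that turns decorated Python functions into machine code that runs on the CPU and the GPU
Use Triton

With load_inline from torch.utils.cpp_extension, a CUDA kernel can be loaded as a PyTorch extension by calling load_inline(name, cpp_sources, cuda_sources, functions, with_cuda, build_directory).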

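import os
import torch
from torch.utils.cpp_extension import load_inline

# cpp_source and cuda_source are strings holding the C++ declaration and the
# CUDA kernel; the notes do not show them, so they must be defined beforehand.

# Create the build directory up front, assuming load_inline does not create it
os.makedirs('./load_inline_cuda', exist_ok=True)

square_matrix_extension = load_inline(
    name='square_matrix_extension',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square_matrix'],
    with_cuda=True,
    extra_cuda_cflags=["-O2"],
    build_directory='./load_inline_cuda',
    # extra_cuda_cflags=['--expt-relaxed-constexpr']
)

# Square a small matrix on the GPU with the compiled extension
a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device='cuda')
print(square_matrix_extension.square_matrix(a))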
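The load_inline call expects cpp_source and cuda_source to be defined already, and the notes do not include them. The strings below are a sketch of what they could contain for an element-wise square, not the lecture's exact source: the kernel name square_matrix_kernel and the 16x16 block size are assumptions. Only the strings are built here; compiling them needs a CUDA toolchain and a GPU.

```python
# CUDA side: one thread per matrix element, plus a host wrapper that launches the kernel
cuda_source = r"""
__global__ void square_matrix_kernel(const float* matrix, float* result,
                                     int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against threads that fall outside the matrix
    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    const auto height = matrix.size(0);
    const auto width = matrix.size(1);

    auto result = torch::empty_like(matrix);

    // Round the grid size up so every element is covered
    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);

    return result;
}
"""

# C++ side: only the declaration, so that a Python binding can be generated for it
cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"
```

The name listed in functions=['square_matrix'] has to match the function declared in cpp_source.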

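Hands-on practice

Using the autograd profiler on the mean operation

When using the autograd profiler, remember to:

Warm up the GPU before recording so that it reaches a steady state
Average over multiple runs to get more reliable results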
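The two reminders above can be combined in a small timing helper. This is a sketch under assumptions: the name time_mean_ms, the warm-up count and the number of runs are illustrative, and the CPU branch exists only so the script also runs on a machine without a GPU.

```python
import time
import torch

def time_mean_ms(x, warmup=5, runs=20):
    """Return the average time of torch.mean(x) in milliseconds over `runs` calls."""
    # Warm up so the device reaches a steady state before measuring
    for _ in range(warmup):
        torch.mean(x)

    if x.is_cuda:
        # CUDA events are recorded on the GPU timeline, so they capture kernel time
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(runs):
            torch.mean(x)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / runs

    # CPU fallback: calls are synchronous, so a wall clock is sufficient
    t0 = time.perf_counter()
    for _ in range(runs):
        torch.mean(x)
    return (time.perf_counter() - t0) * 1e3 / runs

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1000, 1000, device=device)
avg_ms = time_mean_ms(x)
print(f"torch.mean on {device}: {avg_ms:.4f} ms per call")
```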

Using the PyTorch profiler on the mean operation
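One way to do this is sketched below, assuming a 1000x1000 input and ten profiled calls (both arbitrary choices). record_shapes=True together with group_by_input_shape=True groups the table rows by the shapes each operator received. CUDA activity is recorded only when a GPU is present.

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

x = torch.randn(1000, 1000, device=device)

# Warm up outside the profiled region
for _ in range(5):
    torch.mean(x)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        torch.mean(x)

# Aggregate by operator and input shape, sorted by total CPU time
mean_table = prof.key_averages(group_by_input_shape=True).table(
    sort_by="cpu_time_total", row_limit=10
)
print(mean_table)
```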

Implementing Triton code for torch.mean()
