Notes on GPU-Mode lecture 1

DDD

Published: 2024-11-17 19:21:02

Profiler

Computer performance is a trade-off between time and memory. Because compute devices are expensive, time is usually the first thing to optimize.

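Why use a profiler

CUDA is asynchronous, so Python's time module cannot be used to time GPU work reliably
Profilers give far more detail, such as per-operator CPU and CUDA time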
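
The first point can be seen with a small experiment. The sketch below is an illustration, not code from the lecture: the helper name wall_clock_ms and the matrix size are assumptions. It times the same GPU work with the host clock, once without and once with torch.cuda.synchronize().

```python
import time

import torch


def wall_clock_ms(fn, synchronize=False):
    """Time fn() with the host clock, optionally waiting for the GPU before and after."""
    if synchronize:
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    if synchronize:
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3


if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")  # arbitrary size, chosen for illustration

    def work():
        return x @ x

    work()  # warm-up call
    torch.cuda.synchronize()
    # Without a sync the call returns as soon as the kernel is queued.
    print(f"no sync  : {wall_clock_ms(work):.3f} ms")
    print(f"with sync: {wall_clock_ms(work, synchronize=True):.3f} ms")
else:
    print("No CUDA device found; the asynchrony pitfall only shows up on a GPU.")
```

On a GPU the unsynchronized number is usually much smaller, because it mostly reflects the cost of launching the kernel rather than running it.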

Tools

There are three profilers:

autograd profiler: numerical
PyTorch profiler: visual
NVIDIA Nsight Compute

The autograd profiler uses torch.cuda.Event() to measure performance.

The PyTorch profiler uses profile(), the context manager in torch.profiler, to analyze performance.
The result can be exported as a .json file and loaded into chrome://tracing/ for visualization.
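
A minimal sketch of that workflow follows. The tensor size, the loop count and the output file name are assumptions made for illustration; the code falls back to CPU-only profiling when no GPU is present.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile

# Profile CPU activity always, and CUDA activity when a GPU is present.
activities = [ProfilerActivity.CPU]
device = "cpu"
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    device = "cuda"

x = torch.randn(1000, 1000, device=device)  # illustrative size

with profile(activities=activities) as prof:
    for _ in range(3):
        torch.square(x)

# Numeric summary in the terminal.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))

# JSON trace that can be loaded in chrome://tracing/ (hypothetical file name).
trace_path = os.path.join(tempfile.gettempdir(), "square_trace.json")
prof.export_chrome_trace(trace_path)
print("trace written to", trace_path)
```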

Demo

The lecture provides a simple program showing how to use the autograd profiler to analyze the performance of three ways of squaring a tensor:

with torch.square()
with the ** operator
with the * operator

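import torch

def time_pytorch_function(func, input):
    # CUDA is asynchronous, so Python's time module cannot time it reliably.
    # CUDA events are recorded on the GPU stream instead.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warm up so that one-time setup costs are not measured
    for _ in range(5):
        func(input)

    start.record()
    func(input)
    end.record()
    # Wait for the GPU to finish before reading the timer
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds

# The two alternative squaring functions being compared
def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

# Assumed input: any large CUDA tensor works, this size is illustrative
b = torch.randn(10000, 10000).cuda()

time_pytorch_function(torch.square, b)
time_pytorch_function(square_2, b)
time_pytorch_function(square_3, b)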

The results below were obtained on an NVIDIA T4 GPU.

Profiling torch.square:
Self CPU time total: 10.577ms
Self CUDA time total: 3.266ms

Profiling a * a:
Self CPU time total: 5.417ms
Self CUDA time total: 3.276ms

Profiling a ** 2:
Self CPU time total: 6.183ms
Self CUDA time total: 3.274ms

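The results show:

In every case the CUDA time (about 3.27 ms) is lower than the CPU time.
The * operator dispatches to aten::mul (multiplication) rather than aten::pow, and multiplication is the faster of the two. This is probably because multiplication is used far more often than pow, so more developer time has gone into optimizing it.
The performance difference on CUDA is minimal. torch.square is the slowest of the three in terms of CPU time
aten::square is a call to aten::pow
All three methods launched a CUDA kernel called native::vectorized_elementwise_kernel<4...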
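
The operator names in these observations can be read straight from the profiler. The sketch below assumes torch.autograd.profiler.profile as the entry point; the helper name dispatched_ops and the small CPU tensor are illustrative choices, since only the dispatch path matters here, not the timings.

```python
import torch


def dispatched_ops(fn, x):
    """Return the set of operator names the autograd profiler records for fn(x)."""
    with torch.autograd.profiler.profile() as prof:
        fn(x)
    return {evt.key for evt in prof.key_averages()}


x = torch.randn(256, 256)  # small CPU tensor, enough to inspect operator names

variants = {
    "torch.square": torch.square,
    "a * a": lambda a: a * a,
    "a ** 2": lambda a: a ** 2,
}

ops = {name: dispatched_ops(fn, x) for name, fn in variants.items()}
for name, seen in ops.items():
    aten_ops = sorted(op for op in seen if op.startswith("aten::"))
    print(f"{name:13s} -> {aten_ops}")
```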

Integrating CUDA kernels in PyTorch

There are several ways to do this:

Use load_inline from torch.utils.cpp_extension
Use Numba, a compiler that compiles decorated Python functions into machine code that runs on both CPU and GPU
Use Triton

With load_inline from torch.utils.cpp_extension, a CUDA kernel can be loaded as a PyTorch extension by calling load_inline(name, cpp_sources, cuda_sources, functions, with_cuda, build_directory).

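import os

import torch
from torch.utils.cpp_extension import load_inline

# cpp_source and cuda_source are strings holding the C++ declaration and the
# CUDA implementation of square_matrix; they must be defined before this call.

# load_inline writes its build files here, so make sure the directory exists
os.makedirs('./load_inline_cuda', exist_ok=True)

square_matrix_extension = load_inline(
    name='square_matrix_extension',
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=['square_matrix'],
    with_cuda=True,
    extra_cuda_cflags=["-O2"],
    build_directory='./load_inline_cuda',
    # extra_cuda_cflags=['--expt-relaxed-constexpr']
)

a = torch.tensor([[1., 2., 3.], [4., 5., 6.]], device='cuda')
print(square_matrix_extension.square_matrix(a))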
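
The load_inline call above expects two strings, cpp_source and cuda_source, which these notes do not show. The following is one plausible pair, written as a sketch under stated assumptions: an elementwise square of a contiguous 2-D float32 CUDA tensor, with a 16x16 thread block chosen arbitrarily.

```python
# C++ declaration; load_inline generates the Python binding for every
# name listed in `functions`.
cpp_source = "torch::Tensor square_matrix(torch::Tensor matrix);"

# CUDA implementation (assumed for illustration, not quoted from the notes).
cuda_source = r"""
__global__ void square_matrix_kernel(const float* matrix, float* result,
                                     int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard against threads that fall outside the matrix.
    if (row < height && col < width) {
        int idx = row * width + col;
        result[idx] = matrix[idx] * matrix[idx];
    }
}

torch::Tensor square_matrix(torch::Tensor matrix) {
    // Assumes a contiguous 2-D float32 tensor that lives on the GPU.
    const int height = matrix.size(0);
    const int width = matrix.size(1);
    auto result = torch::empty_like(matrix);

    dim3 threads_per_block(16, 16);
    dim3 number_of_blocks((width + threads_per_block.x - 1) / threads_per_block.x,
                          (height + threads_per_block.y - 1) / threads_per_block.y);

    square_matrix_kernel<<<number_of_blocks, threads_per_block>>>(
        matrix.data_ptr<float>(), result.data_ptr<float>(), width, height);
    return result;
}
"""
```

Compiling these strings requires a CUDA toolkit (nvcc) on the machine, so the first call to load_inline can take a while.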

Hands-on

Use the autograd profiler on the mean operation

When using the autograd profiler, remember to:

Warm up the GPU before recording so that it reaches a steady state
Average over multiple runs for more reliable results
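
Both points can be combined in a short script. This is a sketch rather than the lecture's code: the tensor size, the warm-up count and n_runs are assumptions, and it falls back to the CPU when no GPU is available.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2048, 2048, device=device)  # illustrative size

# Warm up outside the profiled region so one-time costs are excluded.
for _ in range(5):
    torch.mean(x)
if device == "cuda":
    torch.cuda.synchronize()

# use_cuda=True adds CUDA timings; newer PyTorch releases may prefer
# use_device="cuda" instead.
profiler_kwargs = {"use_cuda": True} if device == "cuda" else {}

n_runs = 20
with torch.autograd.profiler.profile(**profiler_kwargs) as prof:
    for _ in range(n_runs):
        torch.mean(x)

# key_averages() groups the events by operator, so the table reports both
# the number of calls and the average time per call.
averages = prof.key_averages()
print(averages.table(sort_by="cpu_time_total", row_limit=5))

mean_calls = next(evt.count for evt in averages if evt.key == "aten::mean")
print("aten::mean calls recorded:", mean_calls)
```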

Use the PyTorch profiler on the mean operation
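
One way to do this is to let torch.profiler handle the warm-up itself through a schedule. The sketch below assumes a schedule of one idle step, one warm-up step and three recorded steps; the tensor size and the callback name on_trace_ready are illustrative.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

x = torch.randn(2048, 2048, device=device)  # illustrative size
seen_ops = []


def on_trace_ready(prof):
    # Called once the active steps of a cycle have been recorded.
    seen_ops.extend(evt.key for evt in prof.key_averages())
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))


with profile(
    activities=activities,
    # Skip 1 step, warm up for 1 step, then record 3 steps, once.
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=on_trace_ready,
) as prof:
    for _ in range(5):
        torch.mean(x)
        prof.step()  # tell the profiler that one iteration has finished
```

Inside the callback, prof.export_chrome_trace() could be called as well to save the recorded steps for chrome://tracing/.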

Implementing torch.mean() in Triton
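
The notes do not include the kernel itself, so the following is only a sketch of one possible approach: a row-wise mean over a 2-D tensor, with one Triton program per row. The names row_mean_kernel and triton_row_mean and the block size of 1024 are assumptions. It needs a CUDA GPU and the triton package.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def row_mean_kernel(x_ptr, out_ptr, n_cols, row_stride, BLOCK_SIZE: tl.constexpr):
    # Each program reduces one row.
    row = tl.program_id(axis=0)
    row_start = x_ptr + row * row_stride
    acc = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
    # Walk along the row in chunks of BLOCK_SIZE elements.
    for start in range(0, n_cols, BLOCK_SIZE):
        cols = start + tl.arange(0, BLOCK_SIZE)
        mask = cols < n_cols  # ignore positions past the end of the row
        vals = tl.load(row_start + cols, mask=mask, other=0.0)
        acc += vals.to(tl.float32)
    tl.store(out_ptr + row, tl.sum(acc, axis=0) / n_cols)


def triton_row_mean(x):
    """Mean over the last dimension of a 2-D CUDA tensor, returned as float32."""
    assert x.is_cuda and x.dim() == 2
    x = x.contiguous()
    n_rows, n_cols = x.shape
    out = torch.empty(n_rows, device=x.device, dtype=torch.float32)
    row_mean_kernel[(n_rows,)](x, out, n_cols, x.stride(0), BLOCK_SIZE=1024)
    return out


if torch.cuda.is_available():
    x = torch.randn(64, 5000, device="cuda")
    print(torch.allclose(triton_row_mean(x), x.mean(dim=1), atol=1e-5))
```

A full-tensor mean like torch.mean(x) with no dim argument would need a second reduction over the per-row results, which this sketch leaves out.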
