Profiler
Computer performance comes down to trade-offs between time and memory. Since computing hardware is far more expensive than memory, time is usually the resource worth optimizing first.
Why use a profiler?
- CUDA is asynchronous, so GPU operations cannot be timed accurately with the Python time module
- Profilers are far more powerful
Tools
There are three profilers:
- autograd profiler: numerical
- PyTorch profiler: visual
- NVIDIA Nsight Compute
The autograd profiler uses torch.cuda.Event() to measure performance.
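The Event-based timing pattern can be sketched as follows. This is a minimal helper of my own (the name `time_gpu` and the warmup/rep counts are assumptions, not from the lecture); it falls back to returning None when no GPU is present.

```python
import torch

def time_gpu(fn, warmup=3, reps=10):
    """Time a CUDA op with torch.cuda.Event; returns avg ms per call, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):          # warm up so the GPU reaches a steady state
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(reps):
        fn()
    end.record()
    torch.cuda.synchronize()         # wait until the recorded events complete
    return start.elapsed_time(end) / reps
```

Because CUDA launches are asynchronous, the explicit synchronize() calls are what make the measurement meaningful.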
The PyTorch profiler uses the profile() context manager from torch.profiler to analyze performance.
You can export the result as a .json file and upload it to chrome://tracing/ to visualize it.
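A minimal sketch of that workflow, profiling a matrix multiply on CPU and exporting a Chrome trace (the filename "trace.json" is my own choice):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(256, 256)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    x @ x
# open chrome://tracing/ in a browser and load this file to visualize the timeline
prof.export_chrome_trace("trace.json")
```

On a GPU machine, add ProfilerActivity.CUDA to activities to capture kernel timings as well.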
Demo
The course provides a simple program showing how to use the autograd profiler to compare three ways of squaring a tensor:
- by torch.square()
- by ** operator
- by * operator
The results below were collected on an NVIDIA T4 GPU.
It turns out:
- CUDA operations are faster than CPU operations.
- The * operator dispatches to an aten::multiply operation rather than aten::pow, and the former is faster, probably because multiplication is used far more often than exponentiation, so more developer effort has gone into optimizing it.
- The performance difference on CUDA is minimal; torch.square is the slowest when considering CPU time.
- aten::square is a call to aten::pow.
- All three methods launch the same CUDA kernel, native::vectorized_elementwise_kernel<4, at...
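The demo can be reproduced with a sketch like the one below (the function name `profile_squares` and the tensor size are my own; pass device="cuda" on a GPU machine to see the kernel launches):

```python
import torch

def profile_squares(device="cpu"):
    """Profile three ways to square a tensor with the autograd profiler."""
    x = torch.randn(1000, 1000, device=device)
    results = {}
    for label, fn in [("torch.square", torch.square),
                      ("** operator", lambda t: t ** 2),
                      ("* operator", lambda t: t * t)]:
        with torch.autograd.profiler.profile(use_cuda=(device == "cuda")) as prof:
            fn(x)
        # summary table of aggregated operator timings
        results[label] = prof.key_averages().table(sort_by="self_cpu_time_total")
    return results

tables = profile_squares()
```

In the tables you can see the dispatch described above: ** shows up as aten::pow while * shows up as aten::mul.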
Integrating CUDA kernels in PyTorch
There are a couple of ways to do that:
- use load_inline from torch.utils.cpp_extension
- use Numba, a compiler that translates decorated Python functions into machine code for both CPU and GPU
- use Triton
We can use load_inline(name, cpp_sources, cuda_sources, functions, with_cuda, build_directory) from torch.utils.cpp_extension to compile and load a CUDA kernel as a PyTorch extension.
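A minimal CPU-only sketch of the load_inline workflow (the extension name, the C++ function add_one, and the try/except fallback are my own; a real CUDA kernel would additionally pass cuda_sources and with_cuda=True):

```python
import torch
from torch.utils.cpp_extension import load_inline

# load_inline automatically includes torch/extension.h for cpp_sources
cpp_source = """
torch::Tensor add_one(torch::Tensor x) {
    return x + 1;
}
"""

def build_add_one():
    """Compile the inline extension; returns the module, or None if no compiler is available."""
    try:
        return load_inline(
            name="add_one_ext",
            cpp_sources=cpp_source,
            functions=["add_one"],   # functions to expose to Python
            with_cuda=False,
        )
    except Exception:
        return None
```

The first call compiles and caches the extension in a build directory, so subsequent loads are fast.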
Hands-on
Use autograd profiler on mean operation
When using autograd profiler, remember:
- Warm up the GPU before recording so that it enters a steady state
- Average multiple runs for more reliable results
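The checklist above can be sketched as follows. This is my own arrangement (the function name `profile_mean` and the warmup/rep counts are assumptions); it runs on CPU and profiles the GPU when device="cuda" is passed:

```python
import torch

def profile_mean(device="cpu", warmup=3, reps=10):
    """Autograd-profile torch.mean with warmup and averaged runs."""
    x = torch.randn(1 << 20, device=device)
    for _ in range(warmup):          # warm up before recording
        x.mean()
    if device == "cuda":
        torch.cuda.synchronize()
    with torch.autograd.profiler.profile(use_cuda=(device == "cuda")) as prof:
        for _ in range(reps):        # average multiple runs for stable numbers
            x.mean()
    return prof.key_averages()       # timings averaged across the repeated calls
```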
Use PyTorch profiler on mean operation
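With the PyTorch profiler, warmup can be expressed declaratively via a schedule instead of manual loops. A sketch (the wait/warmup/active step counts are my own choices; CPU shown):

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity

x = torch.randn(1 << 20)
with profile(activities=[ProfilerActivity.CPU],
             schedule=schedule(wait=1, warmup=2, active=3)) as prof:
    for _ in range(6):               # wait + warmup + active = 6 steps
        x.mean()
        prof.step()                  # advance the profiler schedule each iteration
summary = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
```

Only the "active" steps are recorded, so the warmup iterations never pollute the measurements.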
Implementing Triton code for torch.mean()
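One way to implement the reduction is a block-wise sum with an atomic accumulate, divided by the element count. This sketch is my own (the kernel name, the atomic-add strategy, and the block size are assumptions, not the lecture's solution); it assumes a 1-D float tensor and falls back to torch.mean without Triton or a GPU:

```python
import torch

def triton_mean(x: torch.Tensor) -> torch.Tensor:
    """Mean of a 1-D float tensor via a Triton reduction; falls back off-GPU."""
    try:
        import triton
        import triton.language as tl
    except ImportError:
        return x.mean()              # fallback when triton is unavailable
    if not x.is_cuda:
        return x.mean()              # triton kernels require a CUDA device

    @triton.jit
    def sum_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n              # guard the final partial block
        vals = tl.load(x_ptr + offs, mask=mask, other=0.0)
        # each program sums its block, then atomically accumulates the partial sum
        tl.atomic_add(out_ptr, tl.sum(vals, axis=0))

    n = x.numel()
    out = torch.zeros(1, device=x.device, dtype=torch.float32)
    BLOCK = 1024
    grid = (triton.cdiv(n, BLOCK),)
    sum_kernel[grid](x, out, n, BLOCK=BLOCK)
    return out / n
```

A two-stage tree reduction would avoid the atomic contention for very large inputs; the atomic version is simply the shortest correct starting point.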
Reference
- gpu-mode lectures - Github
- Event - PyTorch
- PyTorch Profiler
- NVIDIA Nsight Compute
- torch.utils.cpp_extension.load_inline
- Triton