For deep learning and neural network workloads, a GPU is a better choice than a CPU: when it comes to neural networks, even a relatively low-end GPU will outperform a CPU.
Deep learning is a compute-intensive field, and to a large extent the choice of GPU determines the deep learning experience.
The problem is that choosing the right GPU is itself a headache.
How do you avoid the pitfalls and make a cost-effective choice?
Tim Dettmers, a well-known blogger who received PhD offers from Stanford, UCL, CMU, NYU, and UW and is currently pursuing a PhD at the University of Washington, has written a roughly 10,000-word article on what kind of GPU deep learning needs. Drawing on his own experience, he closes with GPU recommendations for the DL field.
Tim Dettmers's research focuses on representation learning and hardware optimization for deep learning. The website he created is also well known in the deep learning and computer hardware communities.
The GPUs Tim Dettmers recommends in this article are all from NVIDIA. He evidently also considers AMD not worth mentioning when it comes to machine learning.
The link to the original article is given below.
Original link: https://timdettmers.com/2023/01/16/which-gpu-for-deep-learning/#GPU_Deep_Learning_Performance_per_Dollar
Compared with the NVIDIA Turing architecture RTX 20 series, the new NVIDIA Ampere architecture RTX 30 series has additional advantages such as sparse network training and inference. Other features, such as the new data types, are better viewed as ease-of-use features: they provide the same performance improvement as Turing does, but without requiring any extra programming.
The Ada RTX 40 series has even more advancements, such as the Tensor Memory Accelerator (TMA) and 8-bit floating point operations (FP8). The RTX 40 series has power and temperature issues similar to those of the RTX 30. The melted power connector cables seen on the RTX 40 can be easily avoided by connecting the power cable correctly.
Ampere supports automatic multiplication of fine-grained structured sparse matrices at dense speeds. How is this done? Take a weight matrix and cut it into pieces of 4 elements. Now imagine that 2 of these 4 elements are zero. Figure 1 shows what this looks like.
Figure 1: Structures supported by the sparse matrix multiplication function in Ampere architecture GPU
When you multiply this sparse weight matrix with some dense input, Ampere's sparse matrix tensor core feature automatically compresses the sparse matrix into a dense representation that is half the size, as shown in Figure 2.
After compression, the densely compressed matrix tiles are fed into the tensor core, which computes matrix multiplications twice the usual size. This effectively yields 2x speedup because the bandwidth requirements are halved during matrix multiplication in shared memory.
Figure 2: Sparse matrices are compressed into dense representation before matrix multiplication.
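The 2-out-of-4 compression described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `compress_2_of_4` and `sparse_dot` are hypothetical, storing the in-group positions next to the kept values is an assumption of this sketch, and the real hardware performs the equivalent work inside the tensor cores rather than in software.

```python
def compress_2_of_4(row):
    """Compress a row in which every group of 4 holds at most 2 non-zeros.

    Returns (values, indices): the 2 kept values per group and their
    positions (0-3) inside the group, so values is half as long as row.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        nonzero = [i for i, w in enumerate(group) if w != 0]
        if len(nonzero) > 2:
            raise ValueError("group has more than 2 non-zero weights")
        # Pad with zero-valued positions so each group keeps exactly 2 entries.
        for i in range(4):
            if len(nonzero) == 2:
                break
            if i not in nonzero:
                nonzero.append(i)
        nonzero.sort()
        values.extend(group[i] for i in nonzero)
        indices.extend(nonzero)
    return values, indices


def sparse_dot(values, indices, dense):
    """Dot product of the compressed row with a dense vector."""
    total = 0.0
    for k, (v, i) in enumerate(zip(values, indices)):
        group_start = (k // 2) * 4  # two kept entries per group of four
        total += v * dense[group_start + i]
    return total


weights = [0.0, 1.5, 0.0, -2.0, 3.0, 0.0, 0.0, 0.5]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
values, indices = compress_2_of_4(weights)
dense_result = sum(w * xi for w, xi in zip(weights, x))
sparse_result = sparse_dot(values, indices, x)
```

The compressed row holds half as many values as the original, and the dot product computed from it matches the dense one, which is the property the halved bandwidth requirement relies on.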
I work on sparse network training in my research, and I also wrote a blog post about sparse training. One criticism of my work was: "You reduce the FLOPS required by the network, but don't produce a speedup because GPUs can't do fast sparse matrix multiplications".
With the addition of the Tensor Cores' sparse matrix multiplication capability, my algorithm, or other sparse training algorithms, can now actually provide up to a 2x speedup during training.
The sparse training algorithm I developed has three stages: (1) Determine the importance of each layer. (2) Remove the least important weights. (3) Regrow new weights in proportion to the importance of each layer.
While this feature is still experimental and training sparse networks is not yet common, having this feature on your GPU means you are ready for the future of sparse training.
In my work, I have previously shown that new data types can improve stability during low-precision backpropagation.
Figure 4: Low-precision deep learning 8-bit data types. Deep learning training benefits from highly specialized data types.
The ordinary FP16 data type only supports numbers in the range [-65,504, 65,504]. If your gradient slips past this range, it will explode into NaN values.
To prevent this in FP16 training, we usually do loss scaling, that is, multiply the loss by a small number before backpropagation to prevent this gradient explosion.
The Brain Float 16 format (BF16) uses more bits for the exponent, so its range of possible numbers is the same as that of FP32. BF16 has less precision, that is, fewer significant digits, but gradient precision is not that important for learning.
So with BF16 you no longer need to do any loss scaling, and you don't need to worry about gradients exploding quickly. We should therefore see more stable training with the BF16 format, at the cost of a slight loss of precision.
What does this mean for you? With BF16 precision, training is likely to be more stable than with FP16 precision while providing the same speedup. With TF32 precision, you get stability close to FP32 and speedups close to FP16.
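The range and precision trade-off between FP16 and BF16 can be checked with a small standard-library sketch. It assumes Python's `struct` half-precision format (`"e"`) as a stand-in for FP16, and it simulates BF16 by keeping only the top 16 bits of the FP32 encoding, which truncates instead of rounding; the helper names `fits_in_fp16` and `to_bf16` are hypothetical.

```python
import struct


def fits_in_fp16(x):
    """Return True if x can be stored as a finite IEEE half-precision value."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


def to_bf16(x):
    """Simulate BF16 by keeping the top 16 bits of the FP32 encoding.

    Truncation (not rounding) is a simplification made for this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


gradient = 1.0e5                      # a value outside the FP16 range
in_fp16 = fits_in_fp16(gradient)      # False: FP16 cannot represent it
in_bf16 = to_bf16(gradient)           # finite, but coarsely rounded down
coarse = to_bf16(1.001)               # 1.0: the small difference is lost
```

The large value survives in the simulated BF16 format with only a few significant digits left, while the small offset in `1.001` disappears, which matches the description of BF16 as trading precision for FP32-like range.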
The good thing is that to use these data types, you only need to replace FP32 with TF32 and FP16 with BF16--no code changes required.
But in general, these new data types can be considered lazy data types, in the sense that you could have gotten all of their benefits with the old data types through some extra programming effort (proper loss scaling, initialization, normalization, using Apex).
Thus, these data types do not provide speedups, but rather make low precision easier to use in training.
Fan Design and GPU Temperature
If your GPU gets hotter than 80°C, it will throttle itself, reducing its computing speed and power. One solution to this problem is to use PCIe extenders to create space between the GPUs.
Spreading the GPUs out with PCIe extenders is very effective for cooling, and other PhD students at the University of Washington and I have used this setup with great success. It doesn't look pretty, but it keeps your GPUs cool!
The following system has been running for 4 years without any problems. This can also be used if you don't have enough space to fit all the GPUs in the PCIe slots.
Figure 5: A 4-GPU system with PCIe extenders looks like a mess, but it cools very effectively.
It is possible to set a power limit on your GPU. For example, you can programmatically set the RTX 3090's power limit to 300W instead of its standard 350W. In a 4-GPU system, this amounts to a saving of 200W, which may be just enough to make a 4x RTX 3090 system feasible with a 1600W PSU.
This also helps keep the GPUs cool. Setting a power limit therefore addresses both main problems of a 4x RTX 3080 or 4x RTX 3090 setup at once: cooling and power. For a 4x setup you still need GPUs with effective cooling fans, but the power limit resolves the PSU issue.
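The power arithmetic above can be written out as a short sketch. The wattages and PSU size are the ones quoted in the text; the function name `gpu_power_budget` is hypothetical, and the headroom figures ignore everything except the GPUs.

```python
def gpu_power_budget(num_gpus, watts_per_gpu):
    """Total power drawn by the GPUs alone, in watts."""
    return num_gpus * watts_per_gpu


NUM_GPUS = 4
DEFAULT_LIMIT_W = 350   # RTX 3090 standard power limit (from the text)
REDUCED_LIMIT_W = 300   # programmatically lowered limit (from the text)
PSU_W = 1600            # power supply size mentioned in the text

default_total = gpu_power_budget(NUM_GPUS, DEFAULT_LIMIT_W)
reduced_total = gpu_power_budget(NUM_GPUS, REDUCED_LIMIT_W)
saving = default_total - reduced_total

# Whatever is left must cover the CPU, drives, fans, and load spikes.
headroom_default = PSU_W - default_total
headroom_reduced = PSU_W - reduced_total
```

With the limit lowered, the headroom left for the rest of the system doubles from 200W to 400W. On NVIDIA cards the limit itself is typically applied with the power-limit option of `nvidia-smi`, which usually requires administrator rights.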
Figure 6: Reducing the power limit has a slight cooling effect. Lowering the power limit of the RTX 2080 Ti by 50-60W results in slightly lower temperatures and quieter fan operation
You may ask, "Won't this slow down the GPU?" Yes, it will, but the question is by how much.
I benchmarked the 4x RTX 2080 Ti system shown in Figure 5 at different power limits. I benchmarked the time for 500 mini-batches of BERT Large during inference (excluding softmax layer). Choosing BERT Large inference puts the greatest pressure on the GPU.
Figure 7: Measured speed drop at a given power limit on RTX 2080 Ti
We can see that setting a power limit does not seriously affect performance. Lowering the power limit by 50W reduces performance by only 7%.
There is a misconception that RTX 4090 power cables catch fire because they are bent too sharply. In fact, only 0.1% of users are affected, and the main cause is that the cable is not plugged in correctly.
Therefore, it is completely safe to use the RTX 4090 if you follow the installation instructions below.
1. If you are using an old cable or an old GPU, make sure the contacts are free of debris/dust.
2. Push the power connector into the socket until you hear a click - this is the most important part.
3. Test the fit by wiggling the cable from left to right. The cable should not move.
4. Visually check the connection: there should be no gap between the cable and the socket.
Support for 8-bit floating point (FP8) is a huge advantage for the RTX 40 series and H100 GPUs.
With 8-bit inputs, you can load the data for matrix multiplication twice as fast and store twice as many matrix elements in the cache. In the Ada and Hopper architectures the cache is very large, and now with FP8 tensor cores you get 0.66 PFLOPS of compute from the RTX 4090.
This is higher than the entire computing power of the world's fastest supercomputer in 2007. Four RTX 4090s with FP8 compute are comparable to the world's fastest supercomputer in 2010.
The best 8-bit baseline fails to deliver good zero-shot performance. The method I developed, LLM.int8(), can perform Int8 matrix multiplication with the same results as the 16-bit baseline.
But Int8 is already supported by RTX 30/A100/Ampere generation GPUs. Why is FP8 another big upgrade in the RTX 40? The FP8 data type is much more stable than the Int8 data type, and it is easy to use in functions such as layer normalization or non-linear functions, which is difficult to do with an integer data type.
This will make its use in training and inference very simple and straightforward. I think this will make FP8 training and inference relatively commonplace in a few months.
One of the main results of this paper concerns float versus integer data types. Bit for bit, the FP4 data type retains more information than the Int4 data type, and thereby improves the average LLM zero-shot accuracy across the 4 tasks.
Let's take a look at the raw performance ranking of GPUs and see which one comes out on top.
We can see a huge gap between the 8-bit performance of the H100 GPU and older cards optimized for 16-bit performance.
The above figure shows the raw relative performance of the GPUs. For example, for 8-bit inference, the performance of the RTX 4090 is approximately 0.33 times that of the H100 SXM.
In other words, the H100 SXM is three times faster at 8-bit inference than the RTX 4090.
For this data, he did not model 8-bit compute for older GPUs.
The reason is that 8-bit inference and training are much more efficient on Ada/Hopper GPUs, where the Tensor Memory Accelerator (TMA) saves a lot of registers, which are very precious in 8-bit matrix multiplication.
Ada/Hopper also has FP8 support, which makes 8-bit training in particular more efficient. On Hopper/Ada, 8-bit training performance is likely to be 3-4 times that of 16-bit training.
For older GPUs, Int8 inference performance is close to 16-bit inference performance.
But here is the problem: the performance is great, but I can't afford it...
For those on a limited budget, the following chart shows his performance-per-dollar ranking, based on the price and performance of each GPU, which reflects how cost-effective each GPU is.
Selecting a GPU that completes deep learning tasks and meets the budget can be divided into the following steps:
We can see that the RTX 4070 Ti is the most cost-effective for 8-bit and 16-bit inference, while the RTX 3080 is the most cost-effective for 16-bit training.
Although these GPUs are the most cost-effective, memory is their weak point: 10GB and 12GB may not meet every need.
Even so, they may be ideal GPUs for newcomers to deep learning.
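The selection logic discussed here, first rule out cards with too little memory and then rank the rest by performance per dollar, can be sketched as follows. The `Gpu` class, the `rank_by_value` function, and every number in `candidates` are placeholders invented for illustration; they are not figures from the article's charts.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    relative_perf: float   # performance relative to some reference card
    price_usd: float
    memory_gb: int


def rank_by_value(gpus, min_memory_gb):
    """Drop cards with too little memory, then sort by performance per dollar."""
    usable = [g for g in gpus if g.memory_gb >= min_memory_gb]
    return sorted(usable, key=lambda g: g.relative_perf / g.price_usd, reverse=True)


# Placeholder cards with made-up numbers, NOT data from the article.
candidates = [
    Gpu("card_a", relative_perf=1.00, price_usd=2000.0, memory_gb=24),
    Gpu("card_b", relative_perf=0.60, price_usd=800.0, memory_gb=12),
    Gpu("card_c", relative_perf=0.50, price_usd=700.0, memory_gb=10),
]

best_overall = rank_by_value(candidates, min_memory_gb=0)[0].name
best_with_16gb = rank_by_value(candidates, min_memory_gb=16)[0].name
```

In this toy data the cheapest-per-unit-performance card wins when memory is ignored, but a 16GB requirement leaves only the larger card, which is the trade-off the text points out for the 10GB and 12GB models.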
Some of these GPUs are great for Kaggle competitions. To do well in Kaggle competitions, working method is more important than model size, so many smaller GPUs are well suited.
Kaggle is known as the world's largest community platform for data scientists; it brings together many experts and is also very friendly to beginners.
The best GPU for academic research and server operations seems to be the A6000 Ada GPU.
At the same time, the H100 SXM is also very cost-effective, with large memory and strong performance.
Speaking from personal experience, if I were to build a small cluster for a corporate/academic lab, I would recommend 66-80% A6000 GPUs and 20-33% H100 SXM GPUs.
Having said all that, we finally come to the GPU recommendation section.
Tim Dettmers specially created a "GPU purchase flow chart". If you have enough budget, you can go for a higher configuration. If you don't have enough budget, please refer to the cost-effective choice.
The first thing to emphasize here is: no matter which GPU you choose, first make sure that its memory can meet your needs. To do this, you have to ask yourself a few questions:
What do I want to do with the GPU? Will it be used to take part in Kaggle competitions, learn deep learning, do CV/NLP research, or work on small projects?
If you have enough budget, you can check out the benchmarks above and choose the best GPU for you.
You can also estimate the GPU memory you need by running your problem on vast.ai or Lambda Cloud for a while to see whether it meets your needs.
If you only need a GPU occasionally (for a few hours every few days) and don't need to download and process large datasets, vast.ai or Lambda Cloud will also work well.
However, if the GPU is used every day for a month or more at a high usage rate (12 hours a day), cloud GPUs are usually not a good choice.
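The cloud-versus-own-hardware trade-off can be made concrete with a rough break-even calculation. This is a sketch under loud assumptions: the purchase price and hourly cloud rate below are hypothetical, the function name `breakeven_days` is invented for this example, and electricity, resale value, and the rest of the machine are ignored.

```python
def breakeven_days(purchase_price, cloud_price_per_hour, hours_per_day):
    """Days of use after which buying costs less than renting.

    Ignores electricity, resale value, and the rest of the machine.
    """
    daily_cloud_cost = cloud_price_per_hour * hours_per_day
    return purchase_price / daily_cloud_cost


# Hypothetical prices for illustration only.
PURCHASE_PRICE = 1600.0       # USD
CLOUD_PRICE_PER_HOUR = 0.5    # USD

heavy_use = breakeven_days(PURCHASE_PRICE, CLOUD_PRICE_PER_HOUR, hours_per_day=12)
light_use = breakeven_days(PURCHASE_PRICE, CLOUD_PRICE_PER_HOUR, hours_per_day=1)
```

Under these assumed prices, 12 hours of daily use pays off the card within a year, while an hour a day would take many years, which is why occasional users are pointed to the cloud and heavy users are not.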