NVIDIA GPUs are powered by massive arrays of arithmetic circuits, which enable unprecedented acceleration of AI, high-performance computing, and computer graphics. Improving the design of these arithmetic circuits is therefore critical to improving GPU performance and efficiency. What if AI could learn to design these circuits? In a recent NVIDIA paper, "PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning," researchers demonstrated that AI can not only design these circuits from scratch, but that the AI-designed circuits are also smaller and faster than those produced by state-of-the-art electronic design automation (EDA) tools.
Paper address: https://arxiv.org/pdf/2205.07000.pdf
The latest NVIDIA Hopper GPU architecture contains nearly 13,000 instances of AI-designed circuits.
Figure 1: The 64b adder circuit designed by PrefixRL (left) is 25% smaller than the circuit designed by a state-of-the-art EDA tool (right).
Circuit Design Overview
Arithmetic circuits in computer chips are built from networks of logic gates (such as NAND, NOR, and XOR) and wires. An ideal circuit should be small, fast, and low in power consumption.
In this NVIDIA study, the researchers focused on circuit area and delay. They found that power consumption correlates closely with area for the circuits of interest. Circuit area and delay are often competing properties, so the goal is to find the Pareto frontier of designs that trade these properties off effectively. In short, the researchers want the minimum-area circuit at each delay.
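To make the Pareto frontier concrete, here is a minimal Python sketch that filters a list of candidate designs down to the ones no other design beats on both area and delay. The `(area, delay)` tuple layout and the sample numbers are assumptions for illustration only and are not taken from the paper.

```python
from typing import List, Tuple

Design = Tuple[float, float]  # (area, delay); lower is better for both


def pareto_frontier(designs: List[Design]) -> List[Design]:
    """Return the designs that no other design beats on both area and delay."""
    frontier: List[Design] = []
    best_area = float("inf")
    # Sweep in order of increasing delay (ties broken by area) and keep a
    # design only if it is strictly smaller than everything faster than it.
    for area, delay in sorted(set(designs), key=lambda d: (d[1], d[0])):
        if area < best_area:
            frontier.append((area, delay))
            best_area = area
    return frontier


if __name__ == "__main__":
    # Made-up candidates, purely illustrative.
    candidates = [(100.0, 1.0), (80.0, 1.2), (90.0, 1.2), (85.0, 1.5), (60.0, 2.0)]
    print(pareto_frontier(candidates))
```

In this example, `(90.0, 1.2)` and `(85.0, 1.5)` are dropped because `(80.0, 1.2)` is at least as fast and smaller than both.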
In PrefixRL, the researchers therefore focus on a popular class of arithmetic circuits called parallel prefix circuits. Various important circuits in the GPU, such as adders, incrementors, and encoders, are prefix circuits, and they can be defined at a higher level as prefix graphs.
The question then is: can an AI agent design good prefix graphs? The state space of all prefix graphs is very large, O(2^n^n), and cannot be explored by brute force. Figure 2 below shows one iteration of PrefixRL on a 4b circuit instance.
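As background on why an adder counts as a prefix circuit: its carries can be computed by applying one associative operator to per-bit (generate, propagate) pairs, and the order in which those pairs are combined is what a prefix graph describes. The sketch below uses the classic Kogge-Stone combining order purely as an illustration. It is a textbook regular structure, not one of the graphs found by PrefixRL, and the function names are hypothetical.

```python
from typing import List, Tuple

GP = Tuple[int, int]  # (generate, propagate) bits for a span of bit positions


def combine(high: GP, low: GP) -> GP:
    """Associative prefix operator: merge a span with the adjacent lower span."""
    g_hi, p_hi = high
    g_lo, p_lo = low
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def prefix_add(a: int, b: int, n: int) -> int:
    """Add two n-bit integers using a Kogge-Stone style parallel prefix of carries."""
    p_bits = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]
    spans: List[GP] = [(((a >> i) & 1) & ((b >> i) & 1), p_bits[i]) for i in range(n)]
    dist = 1
    while dist < n:
        # Each position i >= dist merges with the span that ends dist bits below it.
        spans = [combine(spans[i], spans[i - dist]) if i >= dist else spans[i]
                 for i in range(n)]
        dist *= 2
    # spans[i][0] is now the carry out of bit i; the carry into bit 0 is 0.
    carries = [0] + [g for g, _ in spans]
    total = 0
    for i in range(n):
        total |= (p_bits[i] ^ carries[i]) << i
    total |= carries[n] << n  # final carry-out
    return total


if __name__ == "__main__":
    print(prefix_add(11, 6, 4))
```

Different prefix graphs compute the same carries with different numbers of `combine` nodes and different depths, which is the area and delay trade-off the agent explores.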
The researchers used a circuit generator to convert the prefix graph into a circuit with wires and logic gates. These generated circuits are then optimized by a physical synthesis tool, which applies physical synthesis optimizations such as gate sizing, duplication, and buffer insertion.
Because of these physical synthesis optimizations, the final circuit properties (delay, area, and power) do not translate directly from the properties of the original prefix graph (such as level and node count). This is why the AI agent learns to design prefix graphs but optimizes the properties of the final circuit generated from them.
The researchers treat arithmetic circuit design as a reinforcement learning (RL) task in which an agent is trained to optimize the area and delay properties of arithmetic circuits. For prefix circuits, they designed an environment in which the RL agent can add or remove a node in the prefix graph; a circuit is then generated from the updated graph, optimized by physical synthesis, and measured for area and delay.
In the animation in the original article (an interactive image), the RL agent builds the prefix graph step by step by adding or removing nodes. At each step, the agent receives the improvement in circuit area and delay as its reward.
The researchers use the Q-learning algorithm to train the circuit-designing agent. As shown in Figure 3 below, they decompose the prefix graph into a grid representation in which each element of the grid maps uniquely to a prefix node. This grid representation is used for both the input and the output of the Q-network: each element of the input grid indicates whether the corresponding node is present, and each element of the output grid gives the Q-values for adding or removing that node.
The researchers use a fully convolutional neural network architecture because both the input and the output of the Q-learning agent are grid representations. The agent predicts separate Q-values for area and delay because the rewards for area and delay are separately observable during training.
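The grid encoding can be sketched in a few lines of Python. The indexing convention here is an assumption made for illustration: a node is written as `(msb, lsb)` and stored at row `msb`, column `lsb`, and the example graph is the simple serial (ripple-carry) structure that contains only input nodes `(i, i)` and output nodes `(i, 0)`. The function names are hypothetical.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]  # (msb, lsb): the node covers bit positions msb..lsb


def to_grid(nodes: Set[Node], n: int) -> List[List[int]]:
    """Encode a prefix graph as an n x n 0/1 grid; cell [msb][lsb] marks a present node."""
    grid = [[0] * n for _ in range(n)]
    for msb, lsb in nodes:
        if not (0 <= lsb <= msb < n):
            raise ValueError(f"invalid node ({msb}, {lsb}) for width {n}")
        grid[msb][lsb] = 1
    return grid


def ripple_carry_nodes(n: int) -> Set[Node]:
    """Serial prefix graph: only the input nodes (i, i) and output nodes (i, 0)."""
    return {(i, i) for i in range(n)} | {(i, 0) for i in range(n)}


if __name__ == "__main__":
    for row in to_grid(ripple_carry_nodes(4), 4):
        print(row)
```

Under this convention only the lower triangle of the grid is ever used, and an add or remove action corresponds to flipping a single cell.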
Figure 3: 4b prefix graph representation (left) and fully convolutional Q-learning agent architecture (right).
PrefixRL is computationally demanding: the physical simulation required 256 CPUs for each GPU, and training the 64b case took over 32,000 GPU hours. For this kind of industrial-scale reinforcement learning, NVIDIA developed Raptor, an in-house distributed reinforcement learning platform that takes full advantage of NVIDIA hardware (Figure 4 below).
Raptor has features that improve the scalability and speed of model training, such as job scheduling, custom networking, and GPU-aware data structures. In the context of PrefixRL, Raptor enables hybrid allocation of jobs across CPUs, GPUs, and Spot instances. Networking in this reinforcement learning application is diverse and benefits from Raptor's support for different kinds of transfers, such as NCCL for sending model parameters to the actors.
Finally, Raptor provides GPU-aware data structures, such as a replay buffer with a multi-threaded service that receives experience from multiple workers, batches the data in parallel, and preloads it onto the GPU.
Figure 4 below shows that the PrefixRL framework supports concurrent training and data collection, and uses NCCL to efficiently send the latest parameters to the actors.
Figure 4: The researchers use Raptor for decoupled, parallel training and reward calculation to overcome circuit synthesis latency.
The researchers use a trade-off weight w (in the range [0,1]) to combine the area and delay objectives. They train multiple agents with different weights to obtain a Pareto frontier that balances the trade-off between area and delay.
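One simple way to picture the weight w is as a linear blend of the two per-objective Q-values. The sketch below is an assumption-laden illustration: the article does not state which objective w multiplies, so the convention here (w on area, 1 - w on delay), the action tuple format, and the Q-values in the usage example are all hypothetical.

```python
from typing import Dict, Tuple

Action = Tuple[str, int, int]        # ("add" | "remove", msb, lsb)
QPair = Tuple[float, float]          # (Q-value for area, Q-value for delay)


def scalarize(q_area: float, q_delay: float, w: float) -> float:
    """Blend the two objectives; here w = 1 weighs only area, w = 0 only delay."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * q_area + (1.0 - w) * q_delay


def best_action(q_values: Dict[Action, QPair], w: float) -> Action:
    """Pick the action whose blended Q-value is highest for this weight."""
    return max(q_values, key=lambda a: scalarize(q_values[a][0], q_values[a][1], w))


if __name__ == "__main__":
    # Made-up Q-values: one action helps delay, the other helps area.
    q = {("add", 2, 1): (-0.5, 0.75), ("remove", 3, 2): (0.5, -0.25)}
    for w in (0.0, 0.25, 1.0):
        print(w, best_action(q, w))
```

Agents trained with different values of w prefer different actions in the same state, which is how a sweep over w can populate different points along the area and delay trade-off.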
The physical synthesis optimizations in the RL environment can themselves generate a variety of solutions that trade off area and delay. The researchers drive the physical synthesis tool with the same trade-off weight that is used to train a particular agent.
Performing physical synthesis optimization inside the reward-calculation loop has a clear advantage: the agent learns to optimize the properties of the final circuit directly.
However, physical synthesis is a slow process (about 35 seconds for a 64b adder), which can significantly slow down RL training and exploration.
The researchers decouple reward calculation from state updates, because the agent needs only the current prefix graph state to take an action, not circuit synthesis results or previous rewards. Thanks to Raptor, they can offload the lengthy reward calculation to a pool of CPU workers that perform physical synthesis in parallel, while the actor agents step through the environment without waiting.
When the CPU workers return the rewards, the transitions can be inserted into the replay buffer. Synthesis rewards are cached to avoid redundant computation when a state is encountered again.
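The decoupling described above can be sketched with Python's standard `concurrent.futures` module. Everything named here is hypothetical: `fake_synthesis` is a stand-in with no physical meaning, `RewardService` and `collect` are illustrative names, and a thread pool stands in for the pool of CPU workers. The sketch assumes a single actor thread calls `submit`, and it is not how Raptor is implemented.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, FrozenSet, List, Tuple

State = FrozenSet[Tuple[int, int]]   # prefix graph as a set of (msb, lsb) nodes
Metrics = Tuple[float, float]        # (area, delay)
Transition = Tuple[State, State, Tuple[float, float]]


def fake_synthesis(state: State) -> Metrics:
    """Stand-in for the slow physical synthesis run; NOT a real area/delay model."""
    return (float(len(state)), float(10 - len(state)))


class RewardService:
    """Runs synthesis in background workers and caches results per state."""

    def __init__(self, workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._cache: Dict[State, Future] = {}
        self.synthesis_runs = 0

    def submit(self, state: State) -> Future:
        """Return a future for the state's metrics, reusing earlier work if any."""
        if state not in self._cache:
            self.synthesis_runs += 1
            self._cache[state] = self._pool.submit(fake_synthesis, state)
        return self._cache[state]

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def collect(steps: List[Tuple[State, State]], service: RewardService) -> List[Transition]:
    """Queue synthesis for every step first, then fill the replay buffer."""
    # The actor only queues work here; it never blocks on synthesis.
    pending = [(s, s2, service.submit(s), service.submit(s2)) for s, s2 in steps]
    replay_buffer: List[Transition] = []
    for s, s2, before, after in pending:
        (a1, d1), (a2, d2) = before.result(), after.result()
        # Reward is the improvement (reduction) in area and in delay.
        replay_buffer.append((s, s2, (a1 - a2, d1 - d2)))
    return replay_buffer


if __name__ == "__main__":
    s0: State = frozenset({(0, 0), (1, 1), (1, 0)})
    s1: State = s0 | {(2, 2)}
    service = RewardService()
    print(collect([(s0, s1), (s1, s0)], service), service.synthesis_runs)
    service.shutdown()
```

The two steps in the usage example touch only two distinct states, so the cache limits the stand-in synthesis to two runs even though four results are requested.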
Figure 5 below shows the area and delay of 64b adder circuits designed with PrefixRL, alongside the Pareto-dominated adder circuits from a state-of-the-art EDA tool.
The best PrefixRL adders achieve 25% less area than the EDA tool's adders at the same delay. The prefix graphs that map to Pareto-optimal adder circuits after physical synthesis optimization have irregular structures.
Figure 5: Arithmetic circuits designed by PrefixRL are smaller and faster than circuits designed by a state-of-the-art EDA tool. (Left) circuit architectures; (right) the corresponding 64b adder circuit property plots.
As far as we know, this is the first method to use a deep reinforcement learning agent to design arithmetic circuits. NVIDIA hopes this work can serve as a blueprint for applying AI to real-world circuit design problems: constructing action spaces, state representations, and RL agent models, optimizing for multiple competing objectives, and overcoming slow reward calculation.