Table of Contents
Fully convolutional Q-learning agent
Raptor for distributed training
Reward Calculation
Results and Outlook
Home Technology peripherals AI NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient

NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient

Apr 12, 2023 pm 09:13 PM
gpu ai Nvidia

NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient

A massive array of arithmetic circuits powers NVIDIA GPUs to enable unprecedented acceleration of AI, high-performance computing, and computer graphics. Therefore, improving the design of these arithmetic circuits is critical to improving GPU performance and efficiency. What if AI learned to design these circuits? In a recent NVIDIA paper, "PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning," researchers demonstrated that AI can not only design these circuits from scratch, but also that AI-designed circuits are better than those designed by state-of-the-art electronic design automation (EDA) tools. Circuits are smaller and faster.

NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient

##Paper address: https://arxiv.org/pdf/2205.07000.pdf

The latest NVIDIA Hopper GPU architecture has nearly 13,000 AI-designed circuit examples. Figure 1 below: The 64b adder circuit designed by PrefixRL AI on the left is 25% smaller than the circuit designed by the most advanced EDA tool on the right of Figure 1.

NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient

Circuit Design Overview

Arithmetic circuits in computer chips are composed of networks of logic gates such as NAND, NOR and XOR) and wires. An ideal circuit should have the following attributes:

  • Small: smaller area, more circuits can be packaged on the chip;
  • Fast: lower latency, improved chip performance;
  • lower power consumption.

In this NVIDIA study, researchers focused on circuit area and latency. They found that power consumption was closely related to the area of ​​the circuit of interest. Circuit area and delay are often competing properties, so it is desirable to find a Pareto frontier for a design that effectively trades off these properties. In short, the researchers hope that the circuit area is minimized at each delay.
Therefore, in PrefixRL, researchers focus on a popular class of arithmetic circuits—parallel prefix circuits. Various important circuits in the GPU such as accelerators, increments, and encoders are prefix circuits, and they can be designated as prefix graphs at a higher level.

Then the question is: Can AI agents design good prefix maps? The state space of all prefix graphs is very large O(2^n^n) and cannot be explored using brute force methods. Figure 2 below shows an iteration of PrefixRL with a 4b circuit instance.

NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient

The researchers used Circuit Generator to convert the prefix diagram into a circuit with wires and logic gates. Next, these generated circuits are optimized through a physical synthesis tool that uses physical synthesis optimizations such as gate size, duplication, and buffer insertion.

Due to these physical synthesis optimizations, the final circuit properties (delay, area, and power) are not directly converted from the original prefix graph properties (such as levels and node count). This is why the AI ​​agent learns to design prefix graphs but optimizes the properties of the final circuit generated from the prefix graphs.

Researchers treat arithmetic circuit design as a reinforcement learning (RL) task, in which an agent is trained to optimize the arithmetic circuit Area and delay properties. For the prefix circuit, they designed an environment where the RL agent can add or remove nodes in the prefix graph, and then perform the following steps:

  1. The prefix graph is normalized to always Maintain correct prefix sum calculations;
  2. Generate circuits from normalized prefix graphs;
  3. Use physical synthesis tools to perform physical synthesis optimization of circuits ;
  4. Measure the area and delay characteristics of the circuit.

In the following animation, the RL agent builds the prefix graph step by step by adding or deleting nodes. At each step, the agent is rewarded with improvements in circuit area and latency.

NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient

#The original image is an interactive version.

Fully convolutional Q-learning agent

The researchers use the Q-learning (Q-learning) algorithm to train the circuit design of the agent. As shown in Figure 3 below, they decompose the prefix graph into a grid representation, where each element in the grid is uniquely mapped to a prefix node. This grid represents the inputs and outputs used for the Q-network. Each element in the input grid represents whether the node exists or not. Each element in the output grid represents the Q-value of adding or removing a node.

The researcher uses a fully convolutional neural network architecture because the input and output of the Q learning agent are grid representations. The agent predicts Q-values ​​for the area and delay attributes separately because the rewards for area and delay are separately observable during training.

NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient

Figure 3: 4b prefix graph representation (left) and fully convolutional Q-learning agent architecture (right).

Raptor for distributed training

PrefixRL requires a lot of calculations. In the physics simulation, each GPU requires 256 CPUs, and training 64b tasks requires Over 32,000 GPU hours. This time, NVIDIA has developed an internal distributed reinforcement learning platform, Raptor, which takes full advantage of NVIDIA's hardware advantages and can perform this kind of industrial-level reinforcement learning (Figure 4 below).

Raptor has features that improve the scalability and speed of training models, such as job scheduling, custom networks, and GPU-aware data structures. In the context of PrefixRL, Raptor enables hybrid allocation across CPUs, GPUs, and Spot Instances. The networks in this reinforcement learning application are diverse and benefit from the following:

  • Raptor switches between NCCLs for peer-to-peer transfer of models Parameters are transferred directly from the learner GPU to the inference GPU;
  • Redis is used for asynchronous and smaller messages such as rewards or statistics;
  • For JIT compiled RPC, used to handle high-volume and low-latency requests, such as uploading experience data.

Finally, Raptor provides GPU-aware data structures such as replay buffers with multi-threaded services to receive experiences from multiple workers, batch data in parallel and Preload it on the GPU.

Figure 4 below shows that the PrefixRL framework supports concurrent training and data collection, and utilizes NCCL to efficiently send the latest parameters to participants (actors in the figure below).

NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient

Figure 4: Researchers use Raptor for decoupled parallel training and reward calculation to overcome circuit synthesis delays.

Reward Calculation

The researchers use a trade-off weight w (range is [0,1]) to combine the area and delay goals. They train various agents with different weights to obtain the Pareto frontier, thereby balancing the area, delay trade-off.

Physically synthesized optimization in a RL environment can generate a variety of solutions that trade off area and latency. Researchers drive physical synthesis tools using the same trade-off weights used to train specific agents.

Performing physics-synthesized optimization within a loop of reward calculations has the following advantages:

  • RL agents learn to directly optimize the final circuit properties of target technology nodes and libraries ;
  • RL agent includes the peripheral logic of the target algorithm circuit during the physical synthesis process, thereby jointly optimizing the performance of the target algorithm circuit and its peripheral logic.

However, doing physical synthesis is a slow process (~35 seconds for 64b adder), which can significantly slow down RL training and exploration.

The researchers decouple reward calculation from state updates because the agent only needs the current prefix graph state to take action, without circuit synthesis or previous rewards. Thanks to Raptor, they can offload lengthy reward calculations to a pool of CPU workers to perform physics synthesis in parallel, while actor agents can execute in the environment without waiting.

When the CPU worker returns the reward, the transformation can be embedded in the replay buffer. Comprehensive rewards are cached to avoid redundant calculations when a state is encountered again.

Results and Outlook

Figure 5 below shows the area and delay of a 64b adder circuit designed using PrefixRL and the Pareto-dominated adder circuit from the most advanced EDA tools.

The best PrefixRL adders achieve 25% less area than EDA tool adders at the same latency. These prefix graphs mapped to Pareto optimal adder circuits after physical synthesis optimization have irregular structures.

NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient

Figure 5: Arithmetic circuits designed by PrefixRL are smaller than circuits designed by state-of-the-art EDA tools and faster.

(left) circuit architecture; (right) corresponding 64b adder circuit characteristics diagram

As far as we know, this is the first method to use deep reinforcement learning agents to design arithmetic circuits. NVIDIA envisions a blueprint for applying AI to real-world circuit design problems, building action spaces, state representations, RL agent models, optimizing against multiple competing goals, and overcoming slow reward calculations.

The above is the detailed content of NVIDIA uses AI to design GPU arithmetic circuits, which reduce the area by 25% compared to the most advanced EDA, making it faster and more efficient. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

What does cross-chain transaction mean? What are the cross-chain transactions? What does cross-chain transaction mean? What are the cross-chain transactions? Apr 21, 2025 pm 11:39 PM

Exchanges that support cross-chain transactions: 1. Binance, 2. Uniswap, 3. SushiSwap, 4. Curve Finance, 5. Thorchain, 6. 1inch Exchange, 7. DLN Trade, these platforms support multi-chain asset transactions through various technologies.

WorldCoin (WLD) price forecast 2025-2031: Will WLD reach USD 4 by 2031? WorldCoin (WLD) price forecast 2025-2031: Will WLD reach USD 4 by 2031? Apr 21, 2025 pm 02:42 PM

WorldCoin (WLD) stands out in the cryptocurrency market with its unique biometric verification and privacy protection mechanisms, attracting the attention of many investors. WLD has performed outstandingly among altcoins with its innovative technologies, especially in combination with OpenAI artificial intelligence technology. But how will the digital assets behave in the next few years? Let's predict the future price of WLD together. The 2025 WLD price forecast is expected to achieve significant growth in WLD in 2025. Market analysis shows that the average WLD price may reach $1.31, with a maximum of $1.36. However, in a bear market, the price may fall to around $0.55. This growth expectation is mainly due to WorldCoin2.

Web3 trading platform ranking_Web3 global exchanges top ten summary Web3 trading platform ranking_Web3 global exchanges top ten summary Apr 21, 2025 am 10:45 AM

Binance is the overlord of the global digital asset trading ecosystem, and its characteristics include: 1. The average daily trading volume exceeds $150 billion, supports 500 trading pairs, covering 98% of mainstream currencies; 2. The innovation matrix covers the derivatives market, Web3 layout and education system; 3. The technical advantages are millisecond matching engines, with peak processing volumes of 1.4 million transactions per second; 4. Compliance progress holds 15-country licenses and establishes compliant entities in Europe and the United States.

Why is the rise or fall of virtual currency prices? Why is the rise or fall of virtual currency prices? Why is the rise or fall of virtual currency prices? Why is the rise or fall of virtual currency prices? Apr 21, 2025 am 08:57 AM

Factors of rising virtual currency prices include: 1. Increased market demand, 2. Decreased supply, 3. Stimulated positive news, 4. Optimistic market sentiment, 5. Macroeconomic environment; Decline factors include: 1. Decreased market demand, 2. Increased supply, 3. Strike of negative news, 4. Pessimistic market sentiment, 5. Macroeconomic environment.

How to win KERNEL airdrop rewards on Binance Full process strategy How to win KERNEL airdrop rewards on Binance Full process strategy Apr 21, 2025 pm 01:03 PM

In the bustling world of cryptocurrencies, new opportunities always emerge. At present, KernelDAO (KERNEL) airdrop activity is attracting much attention and attracting the attention of many investors. So, what is the origin of this project? What benefits can BNB Holder get from it? Don't worry, the following will reveal it one by one for you.

Ranking of leveraged exchanges in the currency circle The latest recommendations of the top ten leveraged exchanges in the currency circle Ranking of leveraged exchanges in the currency circle The latest recommendations of the top ten leveraged exchanges in the currency circle Apr 21, 2025 pm 11:24 PM

The platforms that have outstanding performance in leveraged trading, security and user experience in 2025 are: 1. OKX, suitable for high-frequency traders, providing up to 100 times leverage; 2. Binance, suitable for multi-currency traders around the world, providing 125 times high leverage; 3. Gate.io, suitable for professional derivatives players, providing 100 times leverage; 4. Bitget, suitable for novices and social traders, providing up to 100 times leverage; 5. Kraken, suitable for steady investors, providing 5 times leverage; 6. Bybit, suitable for altcoin explorers, providing 20 times leverage; 7. KuCoin, suitable for low-cost traders, providing 10 times leverage; 8. Bitfinex, suitable for senior play

Aavenomics is a recommendation to modify the AAVE protocol token and introduce token repurchase, which has reached the quorum number of people. Aavenomics is a recommendation to modify the AAVE protocol token and introduce token repurchase, which has reached the quorum number of people. Apr 21, 2025 pm 06:24 PM

Aavenomics is a proposal to modify the AAVE protocol token and introduce token repos, which has implemented a quorum for AAVEDAO. Marc Zeller, founder of the AAVE Project Chain (ACI), announced this on X, noting that it marks a new era for the agreement. Marc Zeller, founder of the AAVE Chain Initiative (ACI), announced on X that the Aavenomics proposal includes modifying the AAVE protocol token and introducing token repos, has achieved a quorum for AAVEDAO. According to Zeller, this marks a new era for the agreement. AaveDao members voted overwhelmingly to support the proposal, which was 100 per week on Wednesday

'Black Monday Sell' is a tough day for the cryptocurrency industry 'Black Monday Sell' is a tough day for the cryptocurrency industry Apr 21, 2025 pm 02:48 PM

The plunge in the cryptocurrency market has caused panic among investors, and Dogecoin (Doge) has become one of the hardest hit areas. Its price fell sharply, and the total value lock-in of decentralized finance (DeFi) (TVL) also saw a significant decline. The selling wave of "Black Monday" swept the cryptocurrency market, and Dogecoin was the first to be hit. Its DeFiTVL fell to 2023 levels, and the currency price fell 23.78% in the past month. Dogecoin's DeFiTVL fell to a low of $2.72 million, mainly due to a 26.37% decline in the SOSO value index. Other major DeFi platforms, such as the boring Dao and Thorchain, TVL also dropped by 24.04% and 20, respectively.

See all articles