


Nvidia releases TensorRT-LLM open source software to improve AI model performance on high-end GPU chips
Nvidia recently announced a new open source software suite called TensorRT-LLM, which expands the tools for optimizing large language models on Nvidia GPUs and pushes the limits of AI inference performance after deployment.
Generative AI large language models have become popular thanks to their impressive capabilities. They expand what artificial intelligence can do and are used widely across industries: users can get information by talking to chatbots, summarize large documents, write software code, and discover new ways to understand information.
Ian Buck, vice president of hyperscale and high-performance computing at Nvidia, said: "Large language model inference is becoming increasingly difficult. Models naturally get smarter and larger as their complexity grows, but once a model scales beyond a single GPU and has to run across multiple GPUs, it becomes a big problem."
In artificial intelligence, inference is the process by which a model handles new, previously unseen data, for example to summarize text, generate code, offer suggestions, or answer questions. It is the workhorse of large language models.
As the model ecosystem expands rapidly, models are growing larger and gaining richer functionality. This also means a model can become too large to run on a single GPU and must be split across several, leaving developers and engineers to manually distribute and coordinate the workload to get responses in real time. TensorRT-LLM addresses this by implementing "tensor parallelism", enabling large-scale, efficient inference across multiple GPUs.
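TensorRT-LLM's actual implementation lives in its CUDA kernels and runtime, but the core idea of tensor parallelism can be illustrated with a few lines of plain Python and NumPy: each GPU holds a column slice of a weight matrix, computes a partial result, and the slices are gathered back into the full output. The sketch below is purely conceptual and is not TensorRT-LLM code.

```python
import numpy as np

def tensor_parallel_linear(x, weight, num_gpus):
    """Toy illustration of tensor (column) parallelism for one linear layer.

    Each 'GPU' owns a column slice of the weight matrix, computes its partial
    output independently, and the slices are concatenated at the end
    (in a real system this is an all-gather across devices).
    """
    # Split the weight matrix column-wise, one shard per GPU.
    shards = np.split(weight, num_gpus, axis=1)

    # Each device multiplies the same activations by its own shard.
    partial_outputs = [x @ shard for shard in shards]

    # Concatenate the partial results to recover the full output.
    return np.concatenate(partial_outputs, axis=-1)

# The sharded computation matches the single-device result.
x = np.random.randn(4, 512)      # batch of activations
w = np.random.randn(512, 2048)   # full weight matrix
assert np.allclose(tensor_parallel_linear(x, w, num_gpus=4), x @ w)
```

The benefit is that no single device ever needs to hold the full weight matrix, which is exactly what makes models that exceed one GPU's memory servable at all.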
In addition, because large language models vary widely, Nvidia has optimized kernels for today's mainstream models. The software suite includes fully optimized, ready-to-run versions of popular large language models, including Meta Platforms' Llama 2, OpenAI's GPT-2 and GPT-3, Falcon, MosaicML's MPT, and BLOOM.
"On-the-fly batching" mechanism to deal with dynamic workloads
By their nature, large language model workloads can be highly dynamic: workload requirements and task mixes change over time, and a single model may simultaneously serve as a question-answering chatbot while also summarizing both long and short documents. As a result, output sizes can differ by orders of magnitude.
To cope with these varied workloads, TensorRT-LLM introduces a mechanism called "in-flight batching", an optimized scheduling process that breaks text generation into multiple fragments that can be moved in and out of the GPU, so an entire batch does not have to finish before new work starts.
Previously, a large request, such as summarizing a very long document, would force everything behind it to wait until that job finished before the queue could move forward.
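This scheduling pattern is often described as continuous batching: work is scheduled one generation step at a time, so finished requests leave the batch and queued requests join immediately instead of waiting for the whole batch to drain. The toy scheduler below sketches the idea in plain Python; it illustrates the pattern only and is not TensorRT-LLM's runtime.

```python
from collections import deque

def run_inflight_batching(requests, max_batch_size):
    """Toy step-level scheduler illustrating in-flight (continuous) batching.

    `requests` is a list of (name, tokens_to_generate). At every step the
    scheduler tops the batch up from the queue, generates one token for each
    active request, and retires requests as soon as they finish -- so a long
    summarization job no longer blocks short chat replies queued behind it.
    """
    queue = deque(requests)
    active = {}  # request name -> tokens still to generate
    step = 0
    while queue or active:
        # Admit waiting requests as soon as batch slots free up.
        while queue and len(active) < max_batch_size:
            name, remaining = queue.popleft()
            active[name] = remaining
        # One decoding step: every active request produces one token.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                print(f"step {step:3d}: {name} finished")
                del active[name]
        step += 1

run_inflight_batching(
    [("summarize-long-doc", 400), ("chat-1", 20), ("chat-2", 15), ("chat-3", 30)],
    max_batch_size=2,
)
```

Running the sketch shows the short chat requests completing long before the summarization job, even though they arrived later, which is the behavior the article describes.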
Nvidia has been working with many vendors to optimize TensorRT-LLM, including Meta, Cohere, Grammarly, Databricks, and Tabnine. With their help, Nvidia continues to refine the functionality and tooling in the suite, including an open source Python API for defining, optimizing, and customizing new large language model architectures.
For example, when MosaicML integrated TensorRT-LLM into its existing software stack, it added extra functionality on top of it. Naveen Rao, vice president of engineering at Databricks, said the process was straightforward:
"TensorRT-LLM is easy to use and feature-rich, including token streaming, in-flight batching, paged attention, and quantization, and it is efficient. It delivers leading performance for serving large language models on Nvidia GPUs and allows us to pass the cost savings back to our customers."
Nvidia says TensorRT-LLM and the benefits it brings, including in-flight batching, more than double article-summarization inference performance on the Nvidia H100. In a test summarizing CNN/Daily Mail articles with the GPT-J-6B model, an H100 alone was four times faster than an A100, and with TensorRT-LLM optimizations enabled it was eight times faster.
TensorRT-LLM gives developers and engineers a deep learning compiler, optimized large language model kernels, pre- and post-processing, multi-GPU/multi-node communication, and a simple open source Python API, so they can quickly optimize and run large language model inference in production. As large language models continue to reshape the data center, enterprise demand for higher performance means developers need, more than ever, tools that give them the functionality and access to deliver better-performing results.
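For context, a hedged sketch of what serving a model through that Python API can look like is shown below. The names used here (`tensorrt_llm.LLM`, `SamplingParams`, the `tensor_parallel_size` argument, and the model ID) are assumptions based on later TensorRT-LLM releases and may differ from the early-access version the article describes; treat it as an illustration, not a definitive recipe.

```python
# Assumed high-level TensorRT-LLM Python API; names may vary by release.
from tensorrt_llm import LLM, SamplingParams

# Load/build an optimized engine for a Llama 2 chat model, split across
# two GPUs via tensor parallelism (hypothetical model ID and arguments).
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=2)

sampling = SamplingParams(max_tokens=128, temperature=0.7)

# The runtime batches these prompts in flight and returns the generations.
for output in llm.generate(
    ["Summarize the article in two sentences.",
     "Write a haiku about GPUs."],
    sampling,
):
    print(output.outputs[0].text)
```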
The TensorRT-LLM software suite is now available in early access to developers in the Nvidia Developer Program, and next month it will be integrated into the NeMo framework within Nvidia AI Enterprise, the company's end-to-end software platform for production AI.

