
Summarizing 374 related works, Tao Dacheng's team, together with the University of Hong Kong and UMD, released the latest review of LLM knowledge distillation

Mar 18, 2024, 07:49 PM
industry, large language model, knowledge distillation

Large language models (LLMs) have developed rapidly over the past two years, producing phenomenal models and products such as GPT-4, Gemini, and Claude, most of which are closed source. A large gap remains between the open-source LLMs currently accessible to the research community and these closed-source models. Narrowing that gap by improving the capabilities of open-source LLMs and other small models has therefore become a research hotspot in the field.

The powerful capabilities of LLMs, especially closed-source ones, allow researchers and industrial practitioners to leverage the outputs and knowledge of these large models when training their own. This process is essentially knowledge distillation (KD): knowledge is distilled from a teacher model (such as GPT-4) into a smaller student model (such as Llama), significantly improving the student's capabilities. Knowledge distillation of large language models is thus ubiquitous, offering researchers a cost-effective and efficient way to train and improve their own models.

So how does current work use closed-source LLMs for knowledge distillation and data acquisition? How can this knowledge be trained into small models efficiently? What powerful skills can student models acquire from teacher models? And how does LLM knowledge distillation perform in industries with domain-specific characteristics? These questions deserve in-depth study.

In 2020, Tao Dacheng's team published "Knowledge Distillation: A Survey", which comprehensively explored knowledge distillation in deep learning, mainly for model compression and acceleration. With the rise of large language models, the applications of knowledge distillation have kept expanding: it can not only improve the performance of small models but also enable model self-improvement.

In early 2024, Tao Dacheng's team, in collaboration with the University of Hong Kong and the University of Maryland, published the survey "A Survey on Knowledge Distillation of Large Language Models", which summarizes 374 related works and discusses how to acquire knowledge from large language models, how to train smaller models with it, and the role of knowledge distillation in model compression and self-training. The survey also covers the distillation of LLM skills and distillation for vertical domains, helping researchers fully understand how to train and improve their own models.


  • Paper title: A Survey on Knowledge Distillation of Large Language Models

  • Paper link: https://arxiv.org/abs/2402.13116

  • Project link: https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs

Overall Framework

The overall framework of large language model knowledge distillation is summarized as follows:

[Figure: overall framework of knowledge distillation of large language models]

First, following the process of knowledge distillation of large language models, this survey decomposes knowledge distillation into two steps:

1. Knowledge Elicitation: how to obtain knowledge from the teacher model. The process mainly includes:

a) First, build instructions that specify the skill or vertical domain to be distilled from the teacher model.

b) Then, use seed knowledge (such as an existing dataset) as input to drive the teacher model to generate responses, thereby eliciting the corresponding knowledge.

c) Knowledge elicitation relies on several specific techniques: labeling, expansion, data curation (synthesis), feature extraction, feedback, and self-knowledge.

2. Distillation Algorithms: how to inject the acquired knowledge into the student model. The specific algorithms in this category include supervised fine-tuning, divergence and similarity, reinforcement learning (i.e., reinforcement learning from AI feedback, RLAIF), and ranking optimization.

Based on this process, the survey organizes related work along three dimensions: knowledge distillation algorithms, skill distillation, and vertical domain distillation, where the latter two build on the distillation algorithms. The details of this taxonomy and a summary of the corresponding related work are shown in the figure below.

[Figure: taxonomy of knowledge distillation of LLMs with related work]

Knowledge Distillation Algorithms

Knowledge Elicitation

According to how knowledge is acquired from the teacher model, this survey divides elicitation techniques into labeling, expansion, data curation (synthesis), feature extraction, feedback, and self-knowledge. Examples of each method are shown in the figure below:

[Figure: examples of knowledge elicitation methods]

Labeling: knowledge labeling means that, guided by instructions or demonstrations, the teacher LLM labels given inputs, which serve as seed knowledge, with corresponding outputs. For example, the seed knowledge may be the inputs of an existing dataset, and the teacher model labels each input with a chain-of-thought output.
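To make this concrete, here is a minimal sketch of labeling, assuming the OpenAI Python client as the interface to a closed-source teacher; the seed question, prompt wording, and model name are illustrative rather than taken from the survey:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Seed knowledge: raw inputs from some existing dataset (example made up here).
seed_inputs = [
    "If a train travels 60 km in 45 minutes, what is its average speed in km/h?",
]

labeled = []
for question in seed_inputs:
    response = client.chat.completions.create(
        model="gpt-4",  # the teacher model
        messages=[
            {"role": "system",
             "content": "Solve the problem step by step, then state the final answer."},
            {"role": "user", "content": question},
        ],
    )
    # The teacher's chain-of-thought answer becomes the training label.
    labeled.append({"input": question,
                    "output": response.choices[0].message.content})
```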

Expansion: the key feature of this technique is using the in-context learning ability of LLMs to generate data similar to the provided seed examples. Its advantage is that more diverse and extensive datasets can be generated from a handful of examples; however, as the generated data grows, data homogeneity can become a problem.
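A sketch of expansion under the same assumptions, reusing the `client` and `labeled` examples from the snippet above; the prompt wording is an illustration. In practice, generated examples are typically deduplicated and filtered before being added to the pool:

```python
# Place seed examples in context and ask the teacher for a new, similar one.
few_shot = "\n\n".join(
    f"Instruction: {ex['input']}\nResponse: {ex['output']}" for ex in labeled
)
prompt = (
    "Below are examples of instructions with responses:\n\n"
    f"{few_shot}\n\n"
    "Write one new instruction in the same style, followed by its response."
)
expanded = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
```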

Data curation (synthesis): the distinctive feature of data curation is that it synthesizes data from scratch. It uses large amounts of meta-information (such as topics, knowledge documents, and raw data) as diverse seed knowledge to obtain large-scale, high-quality datasets from teacher LLMs.
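A sketch of curation under the same assumptions, where the meta-information is a topic list (invented here for illustration) and the teacher synthesizes examples from scratch:

```python
# Meta-information seeds the teacher instead of existing input-output pairs.
topics = ["photosynthesis", "binary search", "supply and demand"]

synthetic = []
for topic in topics:
    text = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Write one challenging exam question about {topic}, "
                       "followed by a detailed model answer.",
        }],
    ).choices[0].message.content
    synthetic.append({"topic": topic, "text": text})
```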

Feature extraction: the typical way to obtain feature knowledge is to feed input-output sequences into the teacher LLM and extract its internal representations (such as logits and hidden states). This method is mainly applicable to open-source LLMs and is often used for model compression.
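A minimal sketch of feature extraction with the Hugging Face transformers library, assuming an open-source teacher (the checkpoint name is only an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # any open-source teacher
tokenizer = AutoTokenizer.from_pretrained(name)
teacher = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

inputs = tokenizer("The capital of France is Paris.", return_tensors="pt")
with torch.no_grad():
    out = teacher(**inputs)

soft_labels = out.logits          # token-level distributions for the student
features = out.hidden_states[-1]  # last-layer hidden states for alignment
```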

Feedback: feedback knowledge typically comes from the teacher evaluating the student's outputs, for example providing preferences, assessments, or corrective information, thereby guiding the student to generate better outputs.
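A sketch of eliciting preference feedback, reusing the client above: the teacher compares two student outputs and the verdict becomes a preference pair (the judging prompt is an assumption):

```python
def teacher_preference(question: str, answer_a: str, answer_b: str) -> dict:
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\nAnswer A: {answer_a}\n\n"
                f"Answer B: {answer_b}\n\n"
                "Which answer is better? Reply with exactly 'A' or 'B'."
            ),
        }],
    ).choices[0].message.content.strip()
    if verdict.startswith("A"):
        return {"chosen": answer_a, "rejected": answer_b}
    return {"chosen": answer_b, "rejected": answer_a}
```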

Self-knowledge: knowledge can also be elicited from the student itself; this is called self-knowledge. In this case, the same model acts as both teacher and student, iteratively improving itself by distilling and refining its own previously generated outputs. This approach works well for open-source LLMs.
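One round of such a loop might look like the sketch below; `generate` and `score` stand for the model's own sampling and self-critique calls and are hypothetical helpers, not an API from the survey:

```python
def self_distill_round(generate, score, prompts, n_samples=4):
    kept = []
    for p in prompts:
        candidates = [generate(p) for _ in range(n_samples)]  # sample answers
        ratings = [score(p, c) for c in candidates]           # model judges itself
        best = candidates[ratings.index(max(ratings))]
        kept.append({"input": p, "output": best})
    return kept  # fine-tune the same model on `kept`, then repeat
```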

Summary: expansion remains widely used, while data curation has gradually become mainstream because it can generate large amounts of high-quality data. Feedback provides knowledge that helps student models improve their alignment. Feature extraction and self-knowledge have grown popular as open-source large models come to serve as teachers: feature extraction helps compress open-source models, while self-knowledge continuously improves them. Importantly, these methods can be combined effectively, and researchers can explore different combinations to elicit more useful knowledge.

Distillation Algorithms

Once knowledge has been acquired, it must be distilled into the student model. Distillation algorithms include supervised fine-tuning, divergence and similarity, reinforcement learning, and ranking optimization. Examples are shown in the figure below:

[Figure: examples of distillation algorithms]

Supervised fine-tuning: supervised fine-tuning (SFT) trains the student model by maximizing the likelihood of sequences generated by the teacher model, so that the student learns to imitate the teacher. This is currently the most commonly used technique in knowledge distillation of LLMs.
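A minimal PyTorch sketch of the SFT objective on teacher-generated sequences, assuming a Hugging Face-style causal LM as the student (padding and prompt masking are omitted for brevity):

```python
import torch.nn.functional as F

def sft_loss(student, input_ids, attention_mask):
    # `input_ids` hold teacher-generated sequences; the student learns to
    # predict each next token, i.e., to maximize the sequence likelihood.
    logits = student(input_ids=input_ids,
                     attention_mask=attention_mask).logits
    shift_logits = logits[:, :-1, :]  # prediction for position t+1
    shift_labels = input_ids[:, 1:]   # the actual next tokens
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```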

Divergence and similarity: these algorithms use the teacher model's internal knowledge as the supervision signal for training the student, and are applicable to open-source teacher models. Divergence-based methods align probability distributions, while similarity-based methods align hidden states.
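A sketch of the two signals, assuming teacher and student logits and hidden states are already aligned over the same tokens:

```python
import torch.nn.functional as F

def divergence_loss(student_logits, teacher_logits, tau=2.0):
    # KL divergence between softened token distributions.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * tau * tau

def similarity_loss(student_hidden, teacher_hidden):
    # Pull the student's hidden states toward the teacher's; assumes matching
    # dimensions (otherwise a learned projection is inserted first).
    return 1 - F.cosine_similarity(student_hidden,
                                   teacher_hidden, dim=-1).mean()
```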

Reinforcement learning: this algorithm uses the teacher's feedback knowledge to train the student model, i.e., RLAIF. It has two main parts: (1) training a reward model for the student on teacher-generated feedback data, and (2) optimizing the student by maximizing the expected reward under the trained reward model. The teacher can also serve directly as the reward model.
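A sketch of step (1), fitting a reward model on teacher preference pairs with the standard pairwise (Bradley-Terry) loss; the `reward_model` interface is assumed. Step (2) would then optimize the student against this reward with a policy-gradient method such as PPO:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # `reward_model` maps a token sequence to a scalar score (assumed interface).
    r_chosen = reward_model(chosen_ids)
    r_rejected = reward_model(rejected_ids)
    # Encourage r(chosen) > r(rejected) on teacher-labeled preference pairs.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```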

Ranking optimization: ranking optimization also injects preference knowledge into the student model. Its advantages are stability and computational efficiency; classic algorithms include DPO and RRHF.
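For concreteness, a sketch of the DPO loss; the inputs are summed log-probabilities of whole responses under the student (policy) and a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Reward margin implied by log-prob ratios against the reference model;
    # no explicit reward model or online sampling is needed.
    margin = (policy_chosen_logp - policy_rejected_logp) \
             - (ref_chosen_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()
```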

Skill Distillation

Large language models have many excellent capabilities. Through knowledge distillation, instructions steer the teacher into generating knowledge that encodes a particular skill, and the student is then trained on it so that it acquires the corresponding ability. These capabilities mainly include context following (e.g., instruction following), alignment, agent abilities, natural language processing (NLP) tasks, and multi-modality.

The table below summarizes classic works in skill distillation, listing for each the skills involved, the seed knowledge, the teacher and student models, the knowledge elicitation method, and the distillation algorithm.

[Table: summary of classic skill distillation works]

Vertical Domain Distillation

Beyond general-purpose large language models, a great deal of work now trains large language models for vertical domains, which helps the research community and industry apply and deploy them. Although large language models (such as GPT-4) have limited knowledge in vertical domains, they can still supply some domain knowledge and capabilities, or augment existing domain datasets. The domains covered mainly include (1) law, (2) healthcare, (3) finance, and (4) science, among others. The taxonomy and related work for this part are shown below:

[Figure: taxonomy of vertical domain distillation with related work]

Future Directions

This survey also discusses open problems in knowledge distillation of large language models and potential future research directions, mainly including:

  • Data selection: How to automatically select data to achieve better distillation results?

  • Multi-teacher distillation: exploring how to distill knowledge from multiple different teacher models into a single student model.

  • Richer knowledge in the teacher model: richer forms of knowledge in the teacher model could be explored, including feedback and feature knowledge, as well as combinations of multiple knowledge elicitation methods.

  • Overcoming catastrophic forgetting during distillation: effectively preserving the original model's capabilities during knowledge distillation or transfer remains a challenge.

  • Trustworthy knowledge distillation: current KD focuses mainly on distilling various skills and pays relatively little attention to the trustworthiness of large models.

  • Weak-to-strong distillation: OpenAI has proposed the concept of "weak-to-strong generalization", which calls for innovative technical strategies that let weaker models effectively guide the learning of stronger models.

  • Self-alignment (self-distillation): instructions can be designed so that the student model autonomously improves and aligns its own generated content by producing feedback, critiques, and explanations.

Conclusion

This survey comprehensively and systematically summarizes how the knowledge of large language models can be used to improve student models such as open-source LLMs, including the recently popular self-distillation techniques. It divides knowledge distillation into two steps, knowledge elicitation and distillation algorithms, and further summarizes skill distillation and vertical domain distillation. Finally, it discusses future directions for distilling large language models, hoping to push the boundaries of LLM knowledge distillation and yield large language models that are more accessible, efficient, effective, and trustworthy.
