Nvidia はプルーニングと蒸留を試しています。Llama 3.1 8B のパラメータを半分にして、同じサイズでより良いパフォーマンスを実現しています。-AI-php.cn

ホームページ

テクノロジー周辺機器

Nvidia はプルーニングと蒸留を試しています。Llama 3.1 8B のパラメータを半分にして、同じサイズでより良いパフォーマンスを実現しています。

PHPz

Aug 16, 2024 pm 04:42 PM

エヌビディアプロジェクト

小型モデルの台頭。

先月、Meta は Llama 3.1 シリーズのモデルをリリースしました。これには、Meta のこれまでで最大のモデルである 405B と 2 つの小型モデルが含まれます。パラメータの量はそれぞれ 700 億と 80 億です。

Llama 3.1 は、オープンソースの新時代の到来を告げるものと考えられています。ただし、新世代モデルはパフォーマンスが強力ですが、導入時には依然として大量のコンピューティングリソースが必要です。

したがって、業界では別の傾向が現れています。それは、多くの言語タスクで十分なパフォーマンスを発揮し、導入が非常に安価な小規模言語モデル (SLM) を開発することです。

最近、NVIDIA の研究により、構造化された重み枝刈りと知識の蒸留を組み合わせることで、最初は大きなモデルから徐々に小さな言語モデルを取得できることが示されました。 #🎜🎜 ##### 🎜🎜 ## 🎜🎜 ## 🎜🎜 ## 🎜🎜 ## 🎜🎜#、Meta のチーフ AI サイエンティストである Jann LECun 氏もこの研究を賞賛しました。

NVIDIA 研究チームは、枝刈りと蒸留を経て、Llama 3.1 8B を Llama-3.1-Minitron 4B に改良し、オープンソースにしました。これは、Llama 3.1 オープンソースシリーズにおける Nvidia の最初のリリースです。

Llama-3.1-Minitron 4B は、Minitron 4B、Phi-2 2.7B、Gemma2 2.6B、Qwen2-1.5B など、同様のサイズの最先端のオープンソースモデルよりも優れたパフォーマンスを発揮します。

この研究の関連論文は先月早くも発表されました。

紙のリンク: https://www.arxiv.org/pdf/2407.14679

#🎜 🎜#

論文タイトル: 剪定と知識の蒸留によるコンパクト言語モデル

剪定と蒸留
枝刈りを行うと、モデルがより小さくスリムになります。これは、レイヤーを削除する (深さ枝刈り) か、ニューロンとアテンションヘッドを削除してチャネルを埋め込む (幅枝刈り) ことで実現できます。通常、プルーニングには、精度を回復するためのある程度の再トレーニングが伴います。
モデルの蒸留は、大規模で複雑なモデル (教師モデルと呼ばれることが多い) から、より小さく単純な学生モデルに知識を伝達するための手法です。目標は、元のより大きなモデルの予測能力の多くを維持しながら、より高速に実行し、リソースの消費を少なくする、より効率的なモデルを作成することです。

主な蒸留方法は 2 つあります: SDG ファインチューニングと古典的な知識の蒸留これら 2 つの蒸留方法は補完的です。この記事では、古典的な知識の蒸留方法に焦点を当てます。

NVIDIA では、枝刈りと古典的な知識の抽出を組み合わせた方法を使用して大規模なモデルを構築しています。次の図は、単一モデルの枝刈りおよび抽出のプロセス (上) とモデルの枝刈りおよび抽出のチェーンを示しています (下)。）。具体的なプロセスは次のとおりです:

1. NVIDIA は 15B モデルから開始し、各コンポーネント (レイヤー、ニューロン、ヘッド、エンベディングチャネル) の重要性を評価し、モデルをソートおよびプルーニングして作成します。目標サイズに達しました: 8B モデル。

2 次に、元のモデルを教師、枝刈りしたモデルを生徒として、モデル蒸留を使用して軽い再トレーニングを実行しました。

3. トレーニング後、小さいモデル (8B) を開始点として取り、それを枝刈りしてより小さい 4B モデルに蒸留します。＃🎜🎜## ##

注意すべき点は、モデルを枝刈りする前に、モデルのどの部分が重要であるかを理解する必要があるということです。 NVIDIA は、1024 サンプルの小さなキャリブレーションデータセットを使用して、関連するすべての次元 (深度、ニューロン、ヘッド、埋め込みチャネル) の情報を同時に計算する、アクティベーションベースの純粋な重要性評価戦略を提案しています。必要なのは順方向伝播のみです。このアプローチは、勾配情報に依存しバックプロパゲーションを必要とする戦略よりもシンプルでコスト効率が高くなります。

枝刈り中、特定の軸または軸の組み合わせについて枝刈りと重要度推定を繰り返し交互に行うことができます。実証研究では、単一の重要度推定値を使用するだけで十分であり、反復推定では追加の利点がもたらされないことが示されています。

古典知識の蒸留を用いた再トレーニング

Figure 2 below shows the distillation process, where the N-layer student model (the pruned model) is distilled from the M-layer teacher model (the original unpruned model). The student model is learned by minimizing a combination of embedding output loss, logit loss, and Transformer encoder-specific losses mapped to student blocks S and teacher blocks T. Figure 2: Distillation training loss.

Best practices for pruning and distillationNVIDIA pruning and knowledge distillation based on compact language model Based on extensive ablation research, I summarized my learning results into the following structured compression best practices.

The first is to adjust the size.

To train a set of LLMs, the largest one is trained first, and then iteratively pruned and distilled to obtain smaller LLMs.

If a multi-stage training strategy is used to train the largest model, it is best to prune and retrain the model obtained in the last stage of training.

Prune the available source model closest to the target size.
The second is pruning.
Prioritize width pruning over depth pruning, which works well for models under 15B parameter size.

Use single-shot importance estimation since there is no benefit from iterative importance estimation.

The third is to retrain.
Only use distillation loss for retraining instead of regular training.

Use logit, intermediate state and embedding distillation when depth is significantly reduced.

Use logit-only distillation when depth does not decrease significantly.
Meta recently launched features The powerful Llama 3.1 family of open source models is comparable to closed source models in many benchmarks. Llama 3.1's parameters range from a massive 405B to 70B and 8B.

With the experience of Nemotron distillation, NVIDIA set out to distill the Llama 3.1 8B model into a smaller and more efficient 4B model, taking the following measures:

#🎜 🎜# teacher fine-tuning

Depth-only pruning

Width-only pruning
Accuracy Benchmark
Performance Benchmark
# 🎜🎜#teacher fine-tuning
In order to correct the distribution bias of the original data set on which the model training is based, NVIDIA first trained the unpruned 8B model on their data set (94B tokens) Fine-tuned. Experiments show that if distribution bias is not corrected, the teacher model provides suboptimal guidance for the dataset when distilling.

In order to reduce from 8B to 4B, NVIDIA pruned 16 layers (50%). They first evaluate the importance of each layer or group of consecutive sub-layers by removing them from the model and observe an increase in LM loss or a decrease in accuracy in downstream tasks.

Figure 5 below shows the LM loss values on the validation set after removing 1, 2, 8 or 16 layers. For example, the red plot for layer 16 indicates the LM loss that occurs if the first 16 layers are removed. Layer 17 indicates that LM loss also occurs if the first layer is retained and layers 2 to 17 are deleted. Nvidia observes: The starting and ending layers are the most important.

Figure 5: The importance of depth-only pruning the middle layer.

However, NVIDIA observes that this LM loss is not necessarily directly related to downstream performance. 英伟达玩转剪枝、蒸馏：把Llama 3.1 8B参数减半，性能同尺寸更强

Figure 6 below shows the Winogrande accuracy of each pruned model, which shows that it is best to delete the 16th to 31st layers, where the 31st layer is the penultimate layer, 5 of the pruned model -shot accuracy is significantly higher than random accuracy (0.5). Nvidia adopted this insight and removed layers 16 through 31.

Accuracy.

Width-only pruning

NVIDIA prunes embedding (hidden) and MLP along the width axis Intermediate dimensions to compress Llama 3.1 8B. Specifically, they use the previously described activation-based strategy to compute importance scores for each attention head, embedding channel, and MLP hidden dimension. After importance estimation, NVIDIA chose

to prune the MLP middle dimension from 14336 to 9216.

Prune hidden size from 4096 to 3072.

Retrain attention to the number of heads and the number of layers.

It is worth mentioning that after single-sample pruning, the LM loss of width pruning is higher than that of depth pruning. However, after a brief retraining period, the trend reversed.

Accuracy Benchmark

NVIDIA distilled the model using the following parameters

Peak learning rate = 1e-4
Minimum learning rate = 1e-5
40 steps linear warm-up
Cosine Decay Plan
Global batch size = 1152

Table 1 below shows the Llama-3.1-Minitron 4B model variants (width pruning and depth pruning) similar to the original Llama 3.1 8B model, others Performance comparison of large and small models on benchmarks across multiple domains. Overall, NVIDIA once again confirmed the effectiveness of a wide pruning strategy compared to deep pruning that follows best practices.

^{Table 1: Accuracy comparison of Minitron 4B base model compared to base models of similar scale.}

To verify whether the distilled model can become a powerful instruction model, NVIDIA used NeMo-Aligner to fine-tune the Llama-3.1-Minitron 4B model.

They used Nemotron-4 340B training data and evaluated on IFEval, MT-Bench, ChatRAG-Bench and Berkeley Function Calling Leaderboard (BFCL) to test instruction following, role-playing, RAG and function calling capabilities. Finally, it was confirmed that the Llama-3.1-Minitron 4B model can be a reliable instruction model, outperforming other baseline SLMs.

^{Table 2: Accuracy comparison of aligned Minitron 4B base model with similarly sized aligned models.}

Performance Benchmarks

NVIDIA optimized the Llama 3.1 8B and Llama-3.1-Minitron 4B models using NVIDIA TensorRT-LLM, an open source toolkit for optimizing LLM inference.

The next two figures show the throughput requests per second of different models at FP8 and FP16 precision under different use cases, expressed as the input sequence length/output sequence length (ISL/OSL) combination of the batch size of 32 for the 8B model and The batch size of the 4B model is an input sequence length/output sequence length (ISL/OSL) combination of 64, thanks to the smaller weights allowing a larger batch size on an NVIDIA H100 80GB GPU.

The Llama-3.1-Minitron-4B-Depth-Base variant is the fastest, with an average throughput of about 2.7 times that of Llama 3.1 8B, while the Llama-3.1-Minitron-4B-Width-Base variant has an average throughput of Approximately 1.8 times that of Llama 3.1 8B. Deployment in FP8 also improves the performance of all three models by approximately 1.3x compared to BF16.

^{80GB GPU.}

Conclusion

Pruning and classical knowledge refining is a very cost-effective method to progressively obtain LLMs of smaller sizes, achieving higher accuracy than training from scratch in all domains . This is a more efficient and data-efficient approach than fine-tuning on synthetic data or pre-training from scratch.

Llama-3.1-Minitron 4B is NVIDIA’s first attempt at using the state-of-the-art open source Llama 3.1 series. To use the SDG fine-tuning of Llama-3.1 with NVIDIA NeMo, see the /sdg-law-title-generation section on GitHub.

For more information, please see the following resources:

https://arxiv.org/abs/2407.14679
https://github.com/NVlabs/Minitron
https:// huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base
https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Depth-Base

^{Reference links:}

^{https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b -model/}

以上がNvidia はプルーニングと蒸留を試しています。Llama 3.1 8B のパラメータを半分にして、同じサイズでより良いパフォーマンスを実現しています。の詳細内容です。詳細については、PHP 中国語 Web サイトの他の関連記事を参照してください。

このウェブサイトの声明

この記事の内容はネチズンが自主的に寄稿したものであり、著作権は原著者に帰属します。このサイトは、それに相当する法的責任を負いません。盗作または侵害の疑いのあるコンテンツを見つけた場合は、admin@php.cn までご連絡ください。

ホットAIツール

Undresser.AI Undress

リアルなヌード写真を作成する AI 搭載アプリ

AI Clothes Remover

写真から衣服を削除するオンライン AI ツール。

Undress AI Tool

脱衣画像を無料で

Clothoff.io

AI衣類リムーバー

Video Face Swap

完全無料の AI 顔交換ツールを使用して、あらゆるビデオの顔を簡単に交換できます。

ホットツール

メモ帳++7.3.1

使いやすく無料のコードエディター

SublimeText3 中国語版

中国語版、とても使いやすい

ゼンドスタジオ 13.0.1

強力な PHP 統合開発環境

ドリームウィーバー CS6

ビジュアル Web 開発ツール

SublimeText3 Mac版

神レベルのコード編集ソフト（SublimeText3）

ホットトピック

Java チュートリアル

1672

CakePHP チュートリアル

1428

Laravel チュートリアル

1332

PHP チュートリアル

1277

C# チュートリアル

1257

Related knowledge

ControlNet の作者がまたヒット作を出しました!写真から絵画を生成し、2 日間で 1.4,000 個のスターを獲得する全プロセス Jul 17, 2024 am 01:56 AM

これも Tusheng のビデオですが、PaintsUndo は別の道を歩んでいます。 ControlNet 作者 LvminZhang が再び生き始めました!今回は絵画の分野を目指します。新しいプロジェクト PaintsUndo は、開始されて間もなく 1.4kstar を獲得しました (まだ異常なほど上昇しています)。プロジェクトアドレス: https://github.com/lllyasviel/Paints-UNDO このプロジェクトを通じて、ユーザーが静止画像を入力すると、PaintsUndo が線画から完成品までのペイントプロセス全体のビデオを自動的に生成するのに役立ちます。。描画プロセス中の線の変化は驚くべきもので、最終的なビデオ結果は元の画像と非常によく似ています。完成した描画を見てみましょう。

NVIDIA 対話モデル ChatQA はバージョン 2.0 に進化し、コンテキストの長さは 128K と記載されています Jul 26, 2024 am 08:40 AM

オープンな LLM コミュニティは百花繚乱の時代です Llama-3-70B-Instruct、QWen2-72B-Instruct、Nemotron-4-340B-Instruct、Mixtral-8x22BInstruct-v0.1 などがご覧いただけます。優秀なパフォーマーモデル。しかし、GPT-4-Turboに代表される独自の大型モデルと比較すると、オープンモデルには依然として多くの分野で大きなギャップがあります。一般的なモデルに加えて、プログラミングと数学用の DeepSeek-Coder-V2 や視覚言語タスク用の InternVL など、主要な領域に特化したいくつかのオープンモデルが開発されています。

オープンソース AI ソフトウェアエンジニアのリストのトップに立つ UIUC のエージェントレスソリューションは、SWE ベンチの実際のプログラミングの問題を簡単に解決します Jul 17, 2024 pm 10:02 PM

AIxivコラムは、当サイトが学術的・技術的な内容を掲載するコラムです。過去数年間で、このサイトの AIxiv コラムには 2,000 件を超えるレポートが寄せられ、世界中の主要な大学や企業のトップ研究室がカバーされ、学術交流と普及を効果的に促進しています。共有したい優れた作品がある場合は、お気軽に寄稿するか、報告のために当社までご連絡ください。提出電子メール: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com この論文の著者は全員、イリノイ大学アーバナシャンペーン校 (UIUC) の Zhang Lingming 教師のチームのメンバーです。博士課程4年、研究者

arXiv 論文は「弾幕」として投稿可能、スタンフォード alphaXiv ディスカッションプラットフォームはオンライン、LeCun は気に入っています Aug 01, 2024 pm 05:18 PM

乾杯！紙面でのディスカッションが言葉だけになると、どんな感じになるでしょうか?最近、スタンフォード大学の学生が、arXiv 論文のオープンディスカッションフォーラムである alphaXiv を作成しました。このフォーラムでは、arXiv 論文に直接質問やコメントを投稿できます。 Web サイトのリンク: https://alphaxiv.org/ 実際、URL の arXiv を alphaXiv に変更するだけで、alphaXiv フォーラムの対応する論文を直接開くことができます。この Web サイトにアクセスする必要はありません。その中の段落を正確に見つけることができます。論文、文: 右側のディスカッションエリアでは、ユーザーは論文のアイデアや詳細について著者に尋ねる質問を投稿できます。たとえば、次のような論文の内容についてコメントすることもできます。

OpenAI Super Alignment チームの遺作: 2 つの大きなモデルがゲームをプレイし、出力がより理解しやすくなる Jul 19, 2024 am 01:29 AM

AIモデルによって与えられた答えがまったく理解できない場合、あなたはそれをあえて使用しますか?機械学習システムがより重要な分野で使用されるにつれて、なぜその出力を信頼できるのか、またどのような場合に信頼してはいけないのかを実証することがますます重要になっています。複雑なシステムの出力に対する信頼を得る方法の 1 つは、人間または他の信頼できるシステムが読み取れる、つまり、考えられるエラーが発生する可能性がある点まで完全に理解できる、その出力の解釈を生成することをシステムに要求することです。見つかった。たとえば、司法制度に対する信頼を築くために、裁判所に対し、決定を説明し裏付ける明確で読みやすい書面による意見を提供することを求めています。大規模な言語モデルの場合も、同様のアプローチを採用できます。ただし、このアプローチを採用する場合は、言語モデルが

リーマン予想の大きな進歩!陶哲軒氏はMITとオックスフォードの新しい論文を強く推薦し、37歳のフィールズ賞受賞者も参加した Aug 05, 2024 pm 03:32 PM

最近、2000年代の7大問題の一つとして知られるリーマン予想が新たなブレークスルーを達成した。リーマン予想は、数学における非常に重要な未解決の問題であり、素数の分布の正確な性質に関連しています (素数とは、1 とそれ自身でのみ割り切れる数であり、整数論において基本的な役割を果たします)。今日の数学文献には、リーマン予想 (またはその一般化された形式) の確立に基づいた 1,000 を超える数学的命題があります。言い換えれば、リーマン予想とその一般化された形式が証明されれば、これらの 1,000 を超える命題が定理として確立され、数学の分野に重大な影響を与えることになります。これらの命題の一部も有効性を失います。 MIT数学教授ラリー・ガスとオックスフォード大学から新たな進歩がもたらされる

LLM は時系列予測にはあまり適していません。推論機能も使用しません。 Jul 15, 2024 pm 03:59 PM

言語モデルは本当に時系列予測に使用できるのでしょうか?ベタリッジの見出しの法則 (疑問符で終わるニュース見出しは「いいえ」と答えることができます) によれば、答えは「いいえ」であるはずです。このような強力な LLM は時系列データを適切に処理できないという事実は真実のようです。時系列、つまり時系列とは、その名の通り、時間順に並べられた一連のデータ点のことを指します。時系列分析は、病気の蔓延予測、小売分析、ヘルスケア、金融などの多くの分野で重要です。時系列分析の分野では、多くの研究者が最近、大規模言語モデル (LLM) を使用して時系列の異常を分類、予測、検出する方法を研究しています。これらの論文では、テキスト内の逐次依存関係の処理に優れた言語モデルは時系列にも一般化できると想定しています。

最初の Mamba ベースの MLLM が登場しました!モデルの重み、トレーニングコードなどはすべてオープンソースです Jul 17, 2024 am 02:46 AM

AIxivコラムは、当サイトが学術的・技術的な内容を掲載するコラムです。過去数年間で、このサイトの AIxiv コラムには 2,000 件を超えるレポートが寄せられ、世界中の主要な大学や企業のトップ研究室がカバーされ、学術交流と普及を効果的に促進しています。共有したい優れた作品がある場合は、お気軽に寄稿するか、報告のために当社までご連絡ください。提出電子メール: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com。はじめに近年、さまざまな分野でマルチモーダル大規模言語モデル (MLLM) の適用が目覚ましい成功を収めています。ただし、多くの下流タスクの基本モデルとして、現在の MLLM はよく知られた Transformer ネットワークで構成されています。

See all articles