
Fine-Tuning an Open-Source LLM with Axolotl Using Direct Preference Optimization (DPO)


The emergence of large language models (LLMs) has opened up countless new opportunities for AI applications. If you have ever wanted to fine-tune your own model, this guide shows you how to do it easily and without writing any code. Using tools such as Axolotl and DPO, we will walk through the entire process step by step.

What is a large language model (LLM)?

A large language model (LLM) is a powerful AI model trained on massive amounts of text data (trillions of tokens) to predict the next token in a sequence. This has only become possible in the last 2-3 years, as advances in GPU computing have made it feasible to train models of this size in a matter of weeks.

You have probably already interacted with LLMs through products such as ChatGPT or Claude, and experienced first-hand their ability to understand and generate human-like responses.

Why fine-tune an LLM?

Couldn't we just use GPT-4o for everything? While it is the most powerful model available at the time of writing, it is not always the most practical option. Fine-tuning a smaller model (in the 3 to 14 billion parameter range) can achieve comparable results at a small fraction of the cost. Additionally, fine-tuning allows you to own your intellectual property and reduces your reliance on third parties.

Understanding base models, instruct models and chat models

Before diving into fine-tuning, it is important to understand the different types of LLMs that are available:

  • Base models: These models are pre-trained on large amounts of unstructured text, such as books or internet data. While they have an innate understanding of language, they are not optimized for following instructions or answering questions and will often produce incoherent output. Base models exist to serve as a starting point for developing more specialized models.
  • Instruct models: Built on top of base models, instruct models are fine-tuned using structured data such as prompt-response pairs. They are designed to follow specific instructions or answer questions.
  • Chat models: Also built on top of base models, but unlike instruct models, chat models are trained on conversational data, which enables them to hold back-and-forth conversations.

What are reinforcement learning and DPO?

Reinforcement learning (RL) is a technique in which a model learns by receiving feedback on its actions. It is applied to an instruct or chat model to further improve the quality of its outputs. RL is typically not applied on top of a base model, because it uses a much lower learning rate, which is not enough to bring about significant changes.

DPO is a form of RL in which the model is trained using pairs of good and bad answers to the same prompt or conversation. By being presented with these pairs, the model learns to favor the good examples and avoid the bad ones.
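For readers who want the underlying math, the DPO objective from the original paper (Rafailov et al., 2023) can be written as:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]

Here y_w is the chosen (good) answer, y_l is the rejected (bad) answer, \pi_\theta is the model being trained, \pi_{\mathrm{ref}} is a frozen copy of the starting model, \sigma is the sigmoid function, and \beta controls how far the model may drift from the reference. Intuitively, the loss pushes up the probability of the chosen answer relative to the reference model while pushing down the probability of the rejected one.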

When to use DPO

DPO is especially useful when you want to adjust the style or behavior of your model, for example:

  • Style adjustments: modify the length, level of detail, or degree of confidence the model expresses in its responses.
  • Safety measures: train the model to decline answering prompts that may be unsafe or inappropriate.

However, DPO is not suitable for teaching the model new knowledge or facts. For that purpose, supervised fine-tuning (SFT) or retrieval-augmented generation (RAG) are better choices.

Create a DPO dataset

In a production setting, DPO datasets are usually generated from user feedback, for example:

  • User feedback: implementing a thumbs-up/thumbs-down mechanism on responses.
  • Comparative choices: presenting users with two different outputs and asking them to choose the better one.

If you lack user data, you can also create a synthetic dataset by leveraging larger, more capable LLMs. For example, you could generate the bad answers with a smaller model and then use GPT-4o to correct them, as in the sketch below.
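As a minimal sketch of that idea (the model names, prompt, and output file below are illustrative assumptions, not part of the original article), you could collect a small model's answer as the rejected example and GPT-4o's answer as the chosen one:

# Sketch: build synthetic preference pairs. Requires `pip install transformers openai`
# and an OPENAI_API_KEY in the environment. Model names and paths are placeholders.
import json
from transformers import pipeline
from openai import OpenAI

prompts = ["Explain what Direct Preference Optimization is in one paragraph."]

weak_model = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")  # hypothetical small model
client = OpenAI()

with open("synthetic_dpo_pairs.jsonl", "w") as f:
    for prompt in prompts:
        # The small model's (likely weaker) completion becomes the "rejected" answer.
        rejected = weak_model(prompt, max_new_tokens=200, do_sample=True)[0]["generated_text"]
        # GPT-4o's answer becomes the "chosen" answer.
        chosen = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        f.write(json.dumps({"prompt": prompt, "chosen": chosen, "rejected": rejected}) + "\n")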

For simplicity, we will use a ready-made dataset from HuggingFace: olivermolenschot/alpaca_messages_dpo_test. If you inspect the dataset, you will notice that it contains prompts with chosen and rejected answers (the good and bad examples). This data was created synthetically using GPT-3.5-turbo and GPT-4.
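If you want to take a quick look at the data yourself, a short script like the one below will print one record (the split and column names are whatever the dataset actually exposes, so inspect rather than assume):

# Quick look at the dataset's structure.
from datasets import load_dataset

ds = load_dataset("olivermolenschot/alpaca_messages_dpo_test")
print(ds)                      # lists the available splits and their columns
first_split = next(iter(ds))   # take whichever split comes first
print(ds[first_split][0])      # one record with its chosen and rejected answers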

As a rule of thumb, you need at least 500 to 1,000 data pairs to train effectively without overfitting. The largest DPO datasets contain up to 15,000-20,000 pairs.

Use Axolotl to fine-tune the Qwen2.5 3B Instruct model

We will use Axolotl to fine-tune the Qwen2.5 3B Instruct model, which is currently ranked number one on the OpenLLM Leaderboard for its size class. With Axolotl, you can fine-tune a model without writing any code: all that is required is a YAML configuration file. Here is the config.yml we will use:

# ... (the article's original YAML configuration is not reproduced here) ...
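Since the original file is not reproduced above, here is a rough sketch of what an Axolotl DPO configuration for this setup could look like. Treat it as an assumption-laden starting point: the dataset type, learning rate, and other hyperparameters below are illustrative, so check them against the Axolotl documentation before training.

# Illustrative sketch only, not the article's original config.yml.
base_model: Qwen/Qwen2.5-3B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

rl: dpo                           # enable Direct Preference Optimization
datasets:
  - path: olivermolenschot/alpaca_messages_dpo_test
    split: train
    type: chat_template.default   # assumption: must match the dataset's chosen/rejected schema

output_dir: /workspace/dpo-output
sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 5e-7               # DPO typically uses a much lower learning rate than SFT
optimizer: adamw_torch
lr_scheduler: cosine
warmup_steps: 10
bf16: true
gradient_checkpointing: true
flash_attention: true
logging_steps: 1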

Set up a cloud environment

To run the training, we will use a cloud hosting service such as Runpod or Vultr. Here is what you will need:

  • Docker image: use the winglian/axolotl-cloud:main Docker image provided by the Axolotl team (a sample docker invocation is sketched below).
  • Hardware requirements: a GPU with 80 GB of VRAM (such as a 1×A100 PCIe node) is more than enough for a model of this size.
  • Storage: 200 GB of volume storage will hold all the files we need.
  • CUDA version: your CUDA version should be at least 12.1.

(This type of training is considered a full fine-tune of the LLM and is therefore very VRAM-intensive. If you want to run the training locally without relying on cloud hosts, you could try QLoRA, a form of supervised fine-tuning. Although it is theoretically possible to combine DPO and QLoRA, this is rarely done.)
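If your provider gives you raw Docker access rather than a ready-made template, pulling and starting the image could look roughly like this (the volume mount and flags are assumptions; managed templates on Runpod or Vultr usually handle this for you):

docker pull winglian/axolotl-cloud:main
docker run --gpus all -it -v /workspace:/workspace winglian/axolotl-cloud:main /bin/bash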

Steps to start training

  1. Set the HuggingFace cache directory:
export HF_HOME=/workspace/hf

This ensures that the original model is downloaded to our persistent volume storage.

  2. Create the configuration file: Save the config.yml file we created earlier to /workspace/config.yml.

  3. Start the training:

python -m axolotl.cli.train /workspace/config.yml

And that's it: your training should begin. Once Axolotl has downloaded the model and the training data, you should see output similar to this:

# ... (example training output omitted) ...

Since this is a smaller dataset with only 264 rows, training should take only a few minutes. The fine-tuned model will be saved to /workspace/dpo-output.

Upload the model to HuggingFace

You can use the CLI to upload the model to HuggingFace:

  1. Install the HuggingFace Hub CLI:
pip install -U "huggingface_hub[cli]"
  2. Upload the model:
huggingface-cli upload yourname/yourrepo /workspace/dpo-output

Replace yourname/yourrepo with your actual HuggingFace username and repository name.

Evaluate the fine-tuned model

For evaluation, it is best to host both the original model and the fine-tuned model using a tool such as Text Generation Inference (TGI). Then prompt the two models with the temperature set to 0 (to ensure deterministic output) and manually compare their responses.

This hands-on approach gives you better insight than relying solely on the training evaluation loss metrics, which may not capture the subtleties of language generation in an LLM.
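As a rough illustration of that workflow (the endpoint URLs, ports, and prompt below are assumptions), you could run two TGI instances, one serving the original model and one serving the fine-tuned model, and query both with greedy decoding:

# Sketch: send the same prompt to two TGI endpoints and compare the answers.
# Assumes the original model is served on port 8080 and the fine-tuned one on 8081.
import requests

endpoints = {
    "original":   "http://localhost:8080/generate",
    "fine-tuned": "http://localhost:8081/generate",
}

prompt = "Explain the difference between a base model and an instruct model."

for name, url in endpoints.items():
    resp = requests.post(
        url,
        json={
            "inputs": prompt,
            "parameters": {"do_sample": False, "max_new_tokens": 256},  # greedy decoding, i.e. temperature 0
        },
        timeout=120,
    )
    print(f"--- {name} ---")
    print(resp.json()["generated_text"])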

Conclusion

Fine-tuning an LLM with DPO allows you to customize your model to better meet the needs of your application while keeping costs under control. By following the steps outlined in this article, you can leverage the power of open-source tools and datasets to create a model that meets your specific requirements. Whether you want to adjust the style of its responses or implement safety measures, DPO offers a practical way to improve your LLM.

Happy fine-tuning!

