The emergence of large language models (LLMs) has opened up countless new opportunities for AI applications. If you have always wanted to fine-tune your own model, this guide will show you how to do it easily, without writing any code. We will use Axolotl together with Direct Preference Optimization (DPO) to walk through the entire process step by step.
A large language model (LLM) is a powerful AI model trained on massive amounts of text (trillions of tokens) to predict the next token in a sequence. This has only become possible with the advances in GPU computing of the past two to three years, which allow such large models to be trained in a matter of weeks.
You may have already interacted with LLMs through products such as ChatGPT or Claude, and experienced first-hand their ability to understand and generate human-like responses.
Can't we just use GPT-4o to handle everything? While it is the most powerful model available at the time of writing, it is not always the most practical option. Fine-tuning a smaller model (in the range of 3 to 14 billion parameters) can achieve comparable results at a small fraction of the cost. Additionally, fine-tuning lets you own your intellectual property and reduces your reliance on third parties.
Before diving into fine-tuning, be sure to understand the different types of LLMs available:
Reinforcement learning (RL) is a technique in which a model learns by receiving feedback on its behavior. It is applied to instruct or chat models to further improve the quality of their outputs. RL is typically not applied on top of base models, because it uses a much lower learning rate, which is not enough to bring about significant changes.
DPO is a form of RL in which the model is trained on pairs of good and bad answers to the same prompt or conversation. By presenting these pairs, the model learns to favor the good examples and avoid the bad ones.
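To make this concrete, here is a minimal sketch of what a single preference pair might look like. The field names and contents are purely illustrative and are not taken from any particular dataset:

# Hypothetical DPO preference pair (illustrative field names and content)
prompt: "Summarize what DPO does in one sentence."
chosen: "DPO trains the model on pairs of preferred and rejected answers so it learns to favor the preferred style."
rejected: "dpo is when u train ai with good and bad stuff lol"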
DPO is especially useful when you want to adjust the style or behavior of your model, for example to tune the tone of its responses or to implement safety measures.
However, DPO is not suitable for teaching the model new knowledge or facts. For that purpose, supervised fine-tuning (SFT) or retrieval-augmented generation (RAG) are better suited.
In a production environment, you would typically build your DPO dataset from feedback collected from your own users.
If you lack user data, you can also create a synthetic dataset by leveraging larger, more capable LLMs. For example, you could generate the rejected answers with a smaller model and the chosen answers with GPT-4o.
For simplicity, we will use a ready-made dataset from HuggingFace: olivermolenschot/alpaca_messages_dpo_test. If you examine the dataset, you will notice that it contains prompts with chosen and rejected answers, i.e. the good and bad examples. This data was created synthetically using GPT-3.5-Turbo and GPT-4.
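If you would like to inspect it yourself, here is one quick way to do so; this is just a sketch, assuming the datasets Python package is installed:

pip install datasets
# Print the available splits and the first example of the first split
python -c "from datasets import load_dataset; ds = load_dataset('olivermolenschot/alpaca_messages_dpo_test'); print(ds); print(next(iter(ds.values()))[0])"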
You usually need at least 500 to 1,000 data pairs to train effectively without overfitting. The largest DPO datasets contain up to 15,000 to 20,000 pairs.
We will use Axolotl to fine-tune the Qwen2.5 3B Instruct model, which is currently ranked first in its size category on the OpenLLM Leaderboard. With Axolotl, you can fine-tune a model without writing any code; all that is required is a YAML configuration file. Here is the config.yml we will use:
# ... (YAML configuration omitted) ...
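Since the original configuration is not reproduced here, the following is only an illustrative sketch of what an Axolotl DPO config for this setup might look like. The dataset type and all hyperparameters are assumptions; check them against the Axolotl documentation for your version before training:

# Illustrative sketch only, not the article's original config.yml
base_model: Qwen/Qwen2.5-3B-Instruct

# Run DPO instead of plain supervised fine-tuning
rl: dpo

datasets:
  - path: olivermolenschot/alpaca_messages_dpo_test
    split: train              # assumption: adjust to the dataset's actual split name
    type: chatml.intel        # assumption: pick the DPO dataset type matching your data's schema

output_dir: /workspace/dpo-output

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 5e-7           # DPO typically uses a much lower learning rate than SFT
optimizer: adamw_torch
lr_scheduler: cosine
warmup_steps: 10

bf16: auto
gradient_checkpointing: true
flash_attention: true
logging_steps: 1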
To run the training, we will use a cloud GPU hosting service such as Runpod or Vultr. Here is what you will need:
(This type of training is considered a full fine-tune of the LLM and is therefore very VRAM-intensive. If you want to run training locally without relying on a cloud host, you could try QLoRA, a memory-efficient technique usually used for supervised fine-tuning. Although DPO and QLoRA can in theory be combined, this is rarely done.)
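As a rough back-of-envelope estimate (assuming AdamW in mixed precision and ignoring activations): a full fine-tune keeps roughly 16 bytes per parameter in GPU memory, about 2 bytes for bf16 weights, 2 for gradients, and around 12 for the fp32 master weights and optimizer moments. For a 3-billion-parameter model that is already on the order of 48 GB, and DPO typically also holds a frozen reference copy of the model (roughly another 6 GB in bf16). This is why a single 80 GB GPU such as an A100 or H100 is a comfortable choice here.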
export HF_HOME=/workspace/hf
This ensures that the original model is downloaded to our persistent volume storage.
Create a configuration file: Save the config.yml file we created earlier to /workspace/config.yml.
Start training:
python -m axolotl.cli.train /workspace/config.yml
That's it! Your training should now begin. Once Axolotl has downloaded the model and the training data, you should see output similar to this:
# ... (training log output omitted) ...
Since this is a small dataset of only 264 rows, training should take just a few minutes. The fine-tuned model will be saved to /workspace/dpo-output.
You can use the CLI to upload the model to HuggingFace:
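A minimal sketch, assuming the huggingface_hub CLI; yourname/yourrepo is a placeholder for your own repository:

pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli upload yourname/yourrepo /workspace/dpo-output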
Replace yourname/yourrepo with your actual HuggingFace username and repository name.
For evaluation, it is recommended to host both the original model and the fine-tuned model with a tool such as Text Generation Inference (TGI). Then query the two models with the temperature set to 0 (to ensure deterministic output) and manually compare their responses.
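As a sketch of how this might be run (assumptions: TGI's standard Docker image, the default port mapping, and placeholder model names; repeat the same steps for the base model on a different port):

# Serve the fine-tuned model with Text Generation Inference
docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest \
  --model-id yourname/yourrepo

# Greedy decoding (do_sample=false) gives deterministic output, the equivalent of temperature 0.
# Note: instruct models expect their chat template; the raw prompt here is only for illustration.
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Write a short product description for a travel mug.", "parameters": {"max_new_tokens": 128, "do_sample": false}}'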
This hands-on approach provides better insight than relying solely on the training loss metric, since the loss may not capture the subtleties of language generation by an LLM.
Fine-tuning an LLM with DPO allows you to customize a model to better meet the needs of your application while keeping costs under control. By following the steps outlined in this article, you can leverage the power of open-source tools and datasets to create a model that meets your specific requirements. Whether you want to adjust the style of its responses or implement safety measures, DPO offers a practical way to improve your LLM.
Happy fine-tuning!