Meta's LLaMA sparked a surge in Large Language Model (LLM) development, aiming to rival models like GPT-3.5. The open-source community rapidly produced increasingly powerful models, but these advancements weren't without challenges. Many open-source LLMs had restrictive licenses (research use only), required substantial budgets for fine-tuning, and were expensive to deploy.
Llama 2, the model's next iteration, addresses these issues: it ships with a license permitting commercial use, and techniques such as QLoRA enable fine-tuning on consumer-grade GPUs with limited memory. This democratizes AI, allowing even small organizations to build tailored models.
This guide demonstrates fine-tuning Llama-2 on Google Colab, utilizing efficient techniques to overcome resource constraints. We'll explore methodologies that minimize memory usage and accelerate training.
Fine-Tuning Llama-2: A Step-by-Step Guide
This tutorial fine-tunes the 7-billion-parameter Llama-2 chat model on a T4 GPU (available on Google Colab or Kaggle). A T4 provides 16GB of VRAM, while the 7B weights alone occupy roughly 14GB in 16-bit precision, leaving no headroom for gradients and optimizer states; this necessitates parameter-efficient fine-tuning, specifically QLoRA, which loads the base model in 4-bit precision and trains small LoRA adapters on top. We'll use the Hugging Face ecosystem: transformers, accelerate, peft, trl, and bitsandbytes.
1. Setup:
Install necessary libraries:
<code>
%%capture
%pip install accelerate peft bitsandbytes transformers trl
</code>
Import modules:
<code>
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer
</code>
2. Model & Dataset Selection:
We'll use NousResearch/Llama-2-7b-chat-hf (a readily accessible copy of the official Llama-2 chat weights that doesn't require Meta's access approval) as the base model, and mlabonne/guanaco-llama2-1k (a 1,000-sample subset of the Guanaco instruction dataset, reformatted for Llama-2's prompt template) as our small training dataset.
<code>base_model = "NousResearch/Llama-2-7b-chat-hf" guanaco_dataset = "mlabonne/guanaco-llama2-1k" new_model = "llama-2-7b-chat-guanaco"</code>
3. Loading Data & Model:
Load the dataset:
<code>dataset = load_dataset(guanaco_dataset, split="train")</code>
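To confirm the data is in the single-column format the trainer expects, you can peek at a sample (the slicing here is just for illustration; each row stores one full conversation as a single text string in Llama-2's prompt format):
<code>
# Inspect the dataset: it exposes a single "text" column containing the
# prompt/response pair wrapped in Llama-2 [INST] ... [/INST] tags.
print(dataset)
print(dataset[0]["text"][:200])  # first 200 characters of the first sample
</code>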
Configure 4-bit quantization using QLoRA:
<code>compute_dtype = getattr(torch, "float16") quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=compute_dtype, bnb_4bit_use_double_quant=False, )</code>
Load the Llama-2 model with 4-bit quantization:
<code>
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0},          # place the entire model on GPU 0
)
model.config.use_cache = False   # the KV cache isn't needed during training
model.config.pretraining_tp = 1  # disable simulated tensor-parallel computation
</code>
Load the tokenizer:
<code>
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token; reuse EOS
tokenizer.padding_side = "right"           # pad on the right to avoid fp16 issues
</code>
4. PEFT Configuration:
Define PEFT parameters for efficient fine-tuning:
<code>
peft_params = LoraConfig(
    lora_alpha=16,     # scaling factor applied to the adapter output
    lora_dropout=0.1,  # dropout on the adapter layers
    r=64,              # rank of the low-rank update matrices
    bias="none",       # leave bias terms frozen
    task_type="CAUSAL_LM",
)
</code>
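For intuition, here is a back-of-the-envelope estimate (not part of the original walkthrough): LoRA replaces the update to a frozen weight matrix of shape (d_out, d_in) with two small trainable matrices of shapes (r, d_in) and (d_out, r), so the added parameters per layer are r·(d_in + d_out) rather than d_in·d_out.
<code>
# Rough LoRA parameter count for a single 4096x4096 attention projection
# in Llama-2-7B (hidden size 4096), using the r=64 configured above.
d, r = 4096, 64
full_weights = d * d          # 16,777,216 frozen weights in the projection
lora_weights = r * d + d * r  # 524,288 trainable adapter weights (A and B)
print(f"adapters add {lora_weights / full_weights:.1%} of the matrix size")  # ~3.1%
</code>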
5. Training Parameters:
Set the training hyperparameters: output directory, number of epochs, batch size, optimizer, learning rate and scheduler, and logging/checkpoint frequency. These are collected in a TrainingArguments object, which the trainer in the next step receives as training_params; a sketch follows below.
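The values below are a representative configuration for a single T4-class GPU rather than a prescription; treat the exact numbers (epochs, batch size, learning rate) as assumptions to tune for your own run:
<code>
training_params = TrainingArguments(
    output_dir="./results",     # checkpoints and logs land here
    num_train_epochs=1,         # one pass over the 1k-sample dataset
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",  # paged optimizer to avoid GPU memory spikes
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,          # gradient clipping
    warmup_ratio=0.03,
    group_by_length=True,       # batch similar-length samples to cut padding
    lr_scheduler_type="constant",
    report_to="tensorboard",    # enables the TensorBoard step below
)
</code>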
6. Fine-tuning with SFT:
Use the SFTTrainer from the TRL library for supervised fine-tuning:
<code>
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,    # LoRA adapters are applied automatically
    dataset_text_field="text",  # dataset column holding the formatted prompts
    max_seq_length=None,        # fall back to the trainer's default length
    tokenizer=tokenizer,
    args=training_params,
    packing=False,              # one example per sequence, no packing
)

trainer.train()

# Save the trained LoRA adapters and the tokenizer
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)
</code>
7. Evaluation:
Use the transformers pipeline helper to generate text with the fine-tuned model and spot-check its responses, as sketched below.
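A minimal sketch of such a spot check, assuming the trained model and tokenizer from the previous step are still in memory (the prompt is illustrative; Llama-2 chat models expect instructions wrapped in [INST] ... [/INST] tags):
<code>
logging.set_verbosity(logging.CRITICAL)  # silence generation warnings

prompt = "Who is Leonardo Da Vinci?"     # illustrative test prompt
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=200,
)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]["generated_text"])
</code>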
8. TensorBoard Visualization:
Launch TensorBoard inside the notebook to monitor training metrics such as loss. The log path below follows from output_dir="./results" with TensorBoard reporting enabled, as configured above:
<code>
%load_ext tensorboard
%tensorboard --logdir results/runs
</code>
Conclusion:
This guide showcased efficient Llama-2 fine-tuning on limited hardware: with QLoRA's 4-bit quantization and LoRA adapters, a 7-billion-parameter model becomes trainable on a single 16GB GPU, putting advanced LLM customization within reach of a much wider audience. From here, you can apply the same recipe to your own instruction dataset, or consult the Hugging Face peft and trl documentation to go deeper.