This tutorial demonstrates fine-tuning Google's Gemma 2 model on a patient-doctor conversation dataset and deploying it for offline use. We'll cover model preparation, fine-tuning with LoRA, model merging, quantization, and local deployment with the Jan application.
Understanding Gemma 2
Gemma 2, Google's latest open-source large language model (LLM), offers 9B and 27B parameter versions under a permissive license. Its improved architecture provides faster inference across various hardware, integrating seamlessly with Hugging Face Transformers, JAX, PyTorch, and TensorFlow. Enhanced safety features and ethical AI deployment tools are also included.
Accessing and Running Gemma 2
This section details downloading the model and running inference with 4-bit quantization, which keeps memory usage low enough for consumer GPUs.
Install packages: Install bitsandbytes, transformers, and accelerate.
Hugging Face Authentication: Authenticate with a Hugging Face access token (generated in your account settings). Gemma 2 is a gated model, so you also need to accept Google's license terms on the model page.
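A minimal notebook-style sketch of this setup; the environment-variable name for the token is an assumption (a Colab/Kaggle secret works just as well):

```python
# Install first in a notebook cell: %pip install -q -U bitsandbytes transformers accelerate
import os
from huggingface_hub import login

# Token generated at https://huggingface.co/settings/tokens
login(token=os.environ["HF_TOKEN"])
```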
Load Model and Tokenizer: Load the google/gemma-2-9b-it model using 4-bit quantization and appropriate device mapping.
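A sketch of the 4-bit load, assuming a single CUDA GPU; the NF4 settings below are typical choices rather than requirements:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-2-9b-it"

# 4-bit NF4 quantization keeps the 9B model within a single consumer GPU's memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s)
)
```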
Inference: Create a prompt, tokenize it, generate a response, and decode it.
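A minimal generation example reusing the model and tokenizer loaded above (the prompt is just an illustration):

```python
messages = [{"role": "user", "content": "Why is the sky blue? Answer briefly."}]

# Gemma 2's chat template adds the expected turn markers and generation prompt
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```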
Fine-tuning Gemma 2 with LoRA
This section guides you through fine-tuning Gemma 2 on a healthcare dataset using LoRA (Low-Rank Adaptation) for efficient training.
Setup: Install the required packages (transformers, datasets, accelerate, peft, trl, bitsandbytes, wandb), then authenticate with Hugging Face and Weights & Biases.
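The authentication step might look like the sketch below; the project name and environment-variable keys are placeholders:

```python
import os
import wandb
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])            # pull Gemma 2 weights and push the adapter later
wandb.login(key=os.environ["WANDB_API_KEY"])   # experiment tracking
run = wandb.init(project="gemma-2-9b-chatdoctor", job_type="training")
```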
Model and Tokenizer Loading: Load Gemma 2 (9B-It) with 4-bit quantization, choosing the data type and attention implementation to match your GPU (for example, bfloat16 and flash attention on Ampere or newer cards). Configure the LoRA parameters.
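A sketch of a LoRA configuration with peft; the rank, alpha, and target modules below are common choices, not prescriptions (model loading mirrors the earlier inference section):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,              # rank of the low-rank update matrices
    lora_alpha=32,     # scaling factor applied to the update
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # attention and MLP projections commonly targeted in Gemma-style models
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```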
Dataset Loading: Load and preprocess the lavita/ChatDoctor-HealthCareMagic-100k dataset, creating a chat format suitable for the model.
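One way to build that chat format, reusing the tokenizer loaded earlier. The `input`/`output` column names and the 5,000-row subset are assumptions; check the dataset card and adjust:

```python
from datasets import load_dataset

dataset = load_dataset("lavita/ChatDoctor-HealthCareMagic-100k", split="train[:5000]")

def to_chat(example):
    # Assumed columns: "input" = patient question, "output" = doctor reply
    messages = [
        {"role": "user", "content": example["input"]},
        {"role": "assistant", "content": example["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(to_chat, remove_columns=dataset.column_names)
```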
Training: Set the training arguments (adjust hyperparameters as needed) and train the model using the SFTTrainer. Monitor training progress with Weights & Biases.
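A sketch of the trainer setup with trl, reusing the model, dataset, and LoRA config from the previous steps. The hyperparameters are placeholders, and the exact SFTTrainer keyword arguments vary a little between trl versions:

```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="gemma-2-9b-chatdoctor",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    report_to="wandb",   # stream training metrics to the Weights & Biases run
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,
    # older trl releases also expect tokenizer=... and dataset_text_field="text"
)
trainer.train()
```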
Evaluation: Finish the Weights & Biases run so the final metrics are logged and the training report is generated.
Saving the Model: Save the fine-tuned LoRA adapter locally and push it to the Hugging Face Hub.
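Saving and pushing the adapter might look like this; the repository name is a placeholder:

```python
adapter_repo = "gemma-2-9b-it-chatdoctor-adapter"   # placeholder name

trainer.model.save_pretrained(adapter_repo)   # writes only the LoRA weights, not the full model
trainer.model.push_to_hub(adapter_repo)       # uploads the adapter to your Hugging Face account
tokenizer.push_to_hub(adapter_repo)
```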
Merging the Adapter and Base Model
This step merges the fine-tuned LoRA adapter into the base Gemma 2 model, producing a single, deployable model. The merge is run on a CPU to work around GPU memory constraints.
Setup: Create a new notebook (CPU-based), install necessary packages, and authenticate with Hugging Face.
Load and Merge: Load the base model and the saved adapter, then merge them using PeftModel.merge_and_unload().
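A sketch of the CPU merge; the adapter repository name is a placeholder that should match whatever you pushed in the previous section:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "google/gemma-2-9b-it"
adapter_id = "your-username/gemma-2-9b-it-chatdoctor-adapter"   # placeholder

# Load in half precision on CPU: needs plenty of system RAM, but no GPU
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

model = PeftModel.from_pretrained(base_model, adapter_id)
merged_model = model.merge_and_unload()   # folds the LoRA deltas into the base weights
```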
Save and Push: Save the merged model and tokenizer locally and push them to the Hugging Face Hub.
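And then saving and pushing the merged model; the repository name is again a placeholder:

```python
merged_repo = "gemma-2-9b-it-chatdoctor"   # placeholder name

merged_model.save_pretrained(merged_repo, safe_serialization=True)
tokenizer.save_pretrained(merged_repo)

merged_model.push_to_hub(merged_repo)
tokenizer.push_to_hub(merged_repo)
```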
Quantizing with Hugging Face Space
Use the GGUF My Repo Hugging Face Space to convert the merged model to the GGUF format and quantize it for efficient local deployment.
Using the Fine-tuned Model Locally with Jan
Download and install the Jan application.
Download the quantized model from the Hugging Face Hub.
Load the model in Jan, adjust parameters (stop sequences, penalties, max tokens, instructions), and interact with the fine-tuned model.
Conclusion
This tutorial provides a comprehensive guide to fine-tuning and deploying Gemma 2. Remember to adjust hyperparameters and settings based on your hardware and dataset. Consider exploring Keras 3 for potentially faster training and inference.