Fine-tuning large language models (LLMs) like BERT, Llama, BART, and those from Mistral AI and others can be computationally intensive. If you don't have a local GPU, Google Colab provides a free alternative, but its transient sessions make it hard to preserve your progress. This guide demonstrates how to leverage Google Drive to overcome this limitation, enabling you to save and resume your LLM training across multiple Colab sessions.
The solution involves using Google Drive to store intermediate results and model checkpoints. This ensures your work persists even after the Colab environment is reset. You'll need a Google account with sufficient Drive space. Create two folders in your Drive: "data" (for your training dataset) and "checkpoints" (to store model checkpoints).
Mounting Google Drive in Colab:
Begin by mounting your Google Drive within your Colab notebook using this command:
from google.colab import drive

drive.mount('/content/drive')
Verify access by listing the contents of your data and checkpoints directories:
!ls /content/drive/MyDrive/data
!ls /content/drive/MyDrive/checkpoints
If authorization is required, a pop-up window will appear. Ensure you grant the necessary access permissions. If the commands fail, re-run the mounting cell and check your permissions.
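If you prefer not to create the folders through the Drive web interface, you can also create them from the notebook once Drive is mounted. The snippet below is a minimal sketch; it assumes the default mount point /content/drive/MyDrive and the folder names used in this guide:

import os

drive_root = '/content/drive/MyDrive'  # Default mount point used in this guide
for folder in ('data', 'checkpoints'):
    os.makedirs(os.path.join(drive_root, folder), exist_ok=True)  # No-op if the folder already exists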
Saving and Loading Checkpoints:
The core of the solution lies in creating functions to save and load model checkpoints. These functions serialize the model weights, the optimizer and scheduler states, and other relevant information to your "checkpoints" folder.
Save Checkpoint Function:
import torch
import os

def save_checkpoint(epoch, model, optimizer, scheduler, loss, model_name, overwrite=True):
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'loss': loss
    }
    direc = get_checkpoint_dir(model_name)  # Assumed function to construct the directory path
    if overwrite:
        file_path = os.path.join(direc, 'checkpoint.pth')
    else:
        file_path = os.path.join(direc, f'epoch_{epoch}_checkpoint.pth')
    os.makedirs(direc, exist_ok=True)  # Create the directory if it doesn't exist
    torch.save(checkpoint, file_path)
    print(f"Checkpoint saved at epoch {epoch}")

# Example get_checkpoint_dir function (adapt to your needs)
def get_checkpoint_dir(model_name):
    return os.path.join("/content/drive/MyDrive/checkpoints", model_name)
Load Checkpoint Function:
import torch
import os

def load_checkpoint(model_name, model, optimizer, scheduler):
    direc = get_checkpoint_dir(model_name)
    if os.path.exists(direc):
        # Find the checkpoint with the highest epoch (adapt to your naming convention)
        checkpoints = [f for f in os.listdir(direc) if f.endswith('.pth')]
        if checkpoints:
            latest_checkpoint = max(checkpoints, key=lambda x: int(x.split('_')[-2]) if '_' in x else 0)
            file_path = os.path.join(direc, latest_checkpoint)
            checkpoint = torch.load(file_path, map_location=torch.device('cpu'))
            model.load_state_dict(checkpoint['model_state_dict'])
            optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
            scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
            epoch = checkpoint['epoch']
            loss = checkpoint['loss']
            print(f"Checkpoint loaded from epoch {epoch}")
            return epoch, loss
        else:
            print("No checkpoints found in directory.")
            return 0, None
    else:
        print(f"No checkpoint directory found for {model_name}, starting from epoch 1.")
        return 0, None
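Before wiring these functions into a full training run, you can sanity-check the save/load round trip with throwaway components. The sketch below is illustrative only: the tiny linear model, SGD optimizer, StepLR scheduler, and the 'sanity_check' model name are stand-ins, not part of the original setup.

import torch
import torch.nn as nn

# Hypothetical stand-in components for a quick round-trip test
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)

save_checkpoint(epoch=1, model=model, optimizer=optimizer, scheduler=scheduler,
                loss=0.5, model_name='sanity_check', overwrite=True)
start_epoch, prev_loss = load_checkpoint('sanity_check', model, optimizer, scheduler)
print(start_epoch, prev_loss)  # Expected output: 1 0.5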
Integrating into your Training Loop:
Integrate these functions into your training loop. The loop should check for existing checkpoints before starting training. If a checkpoint is found, it resumes training from the saved epoch.
EPOCHS = 10

for exp in experiments:  # Assuming 'experiments' is a list of your experiment configurations
    model, optimizer, scheduler = initialise_model_components(exp)  # Your model initialization function
    train_loader, val_loader = generate_data_loaders(exp)  # Your data loader function
    start_epoch, prev_loss = load_checkpoint(exp, model, optimizer, scheduler)
    for epoch in range(start_epoch, EPOCHS):
        print(f'Epoch {epoch + 1}/{EPOCHS}')
        # YOUR TRAINING CODE HERE... (training loop that produces train_loss)
        save_checkpoint(epoch + 1, model, optimizer, scheduler, train_loss, exp)  # Save after each epoch
This structure allows training to resume seamlessly even if the Colab session terminates, and the load function handles missing directories and empty checkpoint folders gracefully, so a fresh run simply starts from the first epoch. Remember to adapt the get_checkpoint_dir function and the checkpoint file naming convention to your specific needs, and to replace the placeholder functions (initialise_model_components, generate_data_loaders) with your actual implementations.
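For reference, minimal stand-ins for those placeholders might look like the following. These are hypothetical sketches only, using a small classifier and random tensors in place of a real tokenized dataset; swap in your own architecture, data loading, and hyperparameters.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def initialise_model_components(exp):
    # Hypothetical sketch: a small classifier with an SGD optimizer and a step scheduler
    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
    return model, optimizer, scheduler

def generate_data_loaders(exp):
    # Hypothetical sketch: random tensors standing in for a real dataset stored in the "data" folder
    features = torch.randn(256, 128)
    labels = torch.randint(0, 2, (256,))
    dataset = TensorDataset(features, labels)
    train_loader = DataLoader(dataset, batch_size=32, shuffle=True)
    val_loader = DataLoader(dataset, batch_size=32)
    return train_loader, val_loader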