In recent years, the integration of artificial intelligence into various domains has revolutionized how we interact with technology. One of the most promising advancements is the development of multimodal models capable of understanding and processing both visual and textual information. Among these, the Llama 3.2 Vision model stands out as a powerful tool for applications that require intricate analysis of images. This article explores the process of fine-tuning the Llama 3.2 Vision model specifically for extracting calorie information from food images, using Unsloth AI.
The Llama 3.2 Vision model, developed by Meta, is a state-of-the-art multimodal large language model designed for advanced visual understanding and reasoning tasks. Here are the key details about the model:
Llama 3.2 Vision is designed for a variety of applications, including visual question answering, image captioning, document comprehension, and chart and diagram interpretation.
Unsloth AI is an innovative platform designed to enhance the fine-tuning of large language models (LLMs) like Llama-3, Mistral, Phi-3, and Gemma. It aims to streamline the complex process of adapting pre-trained models for specific tasks, making it faster and more efficient.
Unsloth AI represents a significant advancement in AI model training, making it accessible for developers and researchers looking to create high-performance custom models efficiently.
The Llama 3.2 vision models excel at interpreting charts and diagrams.
The 11 billion parameter model surpasses Claude 3 Haiku on visual benchmarks such as MMMU-Pro Vision (23.7), ChartQA (83.4), and AI2 Diagram (91.1), while the 90 billion parameter model surpasses Claude 3 Haiku on all of the visual interpretation tasks.
As a result, Llama 3.2 is an ideal option for tasks that require document comprehension, visual question answering, and extracting data from charts.
In this tutorial, we will walk through the process of fine-tuning the Llama 3.2 11B Vision model. By leveraging its advanced capabilities, we aim to enhance the model’s accuracy in recognizing food items and estimating their caloric content based on visual input.
Fine-tuning this model involves customizing it to better understand the nuances of food imagery and nutritional data, thereby improving its performance in real-world applications. We will delve into the key steps involved in this fine-tuning process, including preparing the dataset and configuring the training environment. We will also employ LoRA (Low-Rank Adaptation) to optimize model performance while minimizing resource usage, as sketched below.
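To make the LoRA idea concrete, here is a toy NumPy sketch; it illustrates the technique itself and is not Unsloth's actual implementation. The pretrained weight matrix W stays frozen, and only the two small low-rank factors B and A are trained, so the effective weight becomes W + BA.

import numpy as np

d, k, r = 1024, 1024, 16           # layer dimensions and LoRA rank
W = np.random.randn(d, k)          # pretrained weight, frozen during fine-tuning
A = np.random.randn(r, k) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))               # trainable, zero-initialized so the update starts at zero

W_effective = W + B @ A            # LoRA: W' = W + BA

# Trainable parameters: d*r + r*k = 32,768 versus d*k = 1,048,576 for full fine-tuning.

Because only A and B receive gradients, the memory and compute cost of training drops sharply while the frozen base weights retain the model's pretrained knowledge.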
We will be leveraging Unsloth AI to customize the model’s capabilities. The dataset we’ll be using consists of food images, each accompanied by information on the calorie content of the various food items. This will allow us to improve the model’s ability to analyze food-related data effectively.
So, let’s begin!
!pip install unsloth
from unsloth import FastVisionModel
import torch

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit = True,                     # 4-bit quantization to reduce memory usage
    use_gradient_checkpointing = "unsloth",  # saves memory during training
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers = True,     # fine-tune the vision encoder layers
    finetune_language_layers = True,   # fine-tune the language model layers
    finetune_attention_modules = True,
    finetune_mlp_modules = True,
    r = 16,                            # LoRA rank
    lora_alpha = 16,                   # LoRA scaling factor
    lora_dropout = 0,
    bias = "none",
    random_state = 3443,
    use_rslora = False,
    loftq_config = None,
)
get_peft_model: This method configures the model for fine-tuning using Parameter-Efficient Fine-Tuning (PEFT) techniques.
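As a quick sanity check (not part of the original walkthrough), we can confirm that the PEFT wrapping leaves only a small fraction of the parameters trainable:

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")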
from datasets import load_dataset

dataset = load_dataset("aryachakraborty/Food_Calorie_Dataset", split = "train[0:100]")
We load a dataset of food images, each paired with a text description of its calorie content.

The dataset has three columns: ‘image’, ‘Query’, and ‘Response’.
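Before converting the data, it helps to inspect one sample; this is an illustrative check based on the column names above:

sample = dataset[0]
print(sample["Query"])     # the user's question about the food image
print(sample["Response"])  # the ground-truth calorie breakdown
sample["image"]            # a PIL image (displays inline in a notebook)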
def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": sample["Query"]},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["Response"]}],
        },
    ]
    return {"messages": conversation}

converted_dataset = [convert_to_conversation(sample) for sample in dataset]
We convert each row of the dataset into a conversation with two roles – user and assistant – where the assistant answers the user’s query about the provided image.
FastVisionModel.for_inference(model)  # Enable for inference!

image = dataset[0]["image"]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "You are an expert nutritionist analyzing the image to identify food items and estimate their calorie content and calculate the total calories. Please provide a detailed report in the format: 1. Item 1 - estimated calories 2. Item 2 - estimated calories ..."},
        ],
    }
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=500,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
Output:
Item 1: Fried Dumplings – 400-600 calories
Item 2: Red Sauce – 200-300 calories
Total Calories – 600-900 calories
Based on serving sizes and ingredients, the estimated calorie count for the two items is 400-600 and 200-300 for the fried dumplings and red sauce respectively. When consumed together, the combined estimated calorie count for the entire dish is 600-900 calories.
Total Nutritional Information:
Conclusion: Based on the ingredients used to prepare the meal, the nutritional information can be estimated.
The output is generated for the below input image:
As seen from the output of the original model, the items mentioned in the text refer to “Fried Dumplings” even though the original input image contains steamed momos. Also, the calories of the lettuce present in the input image are not mentioned in the output of the original model.
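The fine-tuning itself follows Unsloth’s standard vision-model training recipe with TRL’s SFTTrainer on the converted dataset. The sketch below uses that recipe; the hyperparameters shown (batch size, step count, learning rate) are illustrative assumptions rather than tuned values.

from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model)  # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer),  # handles image + text batching
    train_dataset = converted_dataset,
    args = SFTConfig(
        per_device_train_batch_size = 2,   # assumed; adjust to GPU memory
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 30,                    # assumed; increase for better results
        learning_rate = 2e-4,
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
        # required for vision fine-tuning with SFTTrainer
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 2048,
    ),
)
trainer.train()

After training completes, we switch the model back to inference mode with FastVisionModel.for_inference(model) and rerun the earlier prompt on the same image.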
As seen from the output of the fine-tuned model, all three items are correctly mentioned in the text, along with their calories, in the required format.
We also test how well the fine-tuned model performs on unseen data by selecting rows of the dataset that the model has not seen before.
We select this as the input image.
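Below is a sketch of this evaluation, reusing the earlier inference pattern; the split indices are an assumption, chosen to fall outside the first 100 rows used for training.

from datasets import load_dataset
from transformers import TextStreamer

unseen = load_dataset("aryachakraborty/Food_Calorie_Dataset", split = "train[100:110]")

FastVisionModel.for_inference(model)  # make sure the model is in inference mode

image = unseen[0]["image"]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": unseen[0]["Query"]},
        ],
    }
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

_ = model.generate(
    **inputs,
    streamer = TextStreamer(tokenizer, skip_prompt=True),
    max_new_tokens = 500,
    use_cache = True,
)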
As we can see from the output of the fine-tuned model, all the components of the pizza have been accurately identified and their calories have been mentioned as well.
The integration of AI models like Llama 3.2 Vision is transforming the way we analyze and interact with visual data, particularly in fields like food recognition and nutritional analysis. By fine-tuning this powerful model with Unsloth AI, we can significantly improve its ability to understand food images and accurately estimate calorie content.
The fine-tuning process, leveraging advanced techniques such as LoRA and the efficient capabilities of Unsloth AI, ensures optimal performance while minimizing resource usage. This approach not only enhances the model’s accuracy but also opens the door for real-world applications in food analysis, health monitoring, and beyond. Through this tutorial, we’ve demonstrated how to adapt cutting-edge AI models for specialized tasks, driving innovation in both technology and nutrition.
Q1. What is the Llama 3.2 Vision model?
A. The Llama 3.2 Vision model is a multimodal AI model developed by Meta, capable of processing both text and images. It uses a transformer architecture and cross-attention layers to integrate image data with language models, enabling it to perform tasks like visual recognition, captioning, and image-text retrieval.
Q2. How does fine-tuning the Llama 3.2 Vision model improve its performance?
A. Fine-tuning customizes the model to specific tasks, such as extracting calorie information from food images. By training the model on a specialized dataset, it becomes more accurate at recognizing food items and estimating their nutritional content, making it more effective in real-world applications.

Q3. What role does Unsloth AI play in the fine-tuning process?
A. Unsloth AI enhances the fine-tuning process by making it faster and more efficient. It allows models to be fine-tuned up to 30 times faster while reducing memory usage by 60%. The platform also provides tools for easy setup and scalability, supporting both small teams and enterprise-level applications.

Q4. What is LoRA (Low-Rank Adaptation), and why is it used in the fine-tuning process?
A. LoRA is a technique used to optimize model performance while reducing resource usage. It helps fine-tune large language models more efficiently, making the training process faster and less computationally intensive without compromising accuracy. LoRA modifies only a small subset of parameters by introducing low-rank matrices into the model architecture.

Q5. What practical applications can the fine-tuned Llama 3.2 Vision model be used for?
A. The fine-tuned model can be used in various applications, including calorie extraction from food images, visual question answering, document understanding, and image captioning. It can significantly enhance tasks that require both visual and textual analysis, especially in fields like health and nutrition.