PaliGemma 2 Mix is a multimodal AI model developed by Google. It is an improved version of the PaliGemma vision language model (VLM), integrating advanced capabilities from the SigLIP vision model and Gemma 2 language models.
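In this tutorial, I’ll explain how to use PaliGemma 2 Mix to build an AI-powered bill scanner and spending analyzer capable of extracting the total amount from receipt images, categorizing each purchase (grocery, clothing, electronics, or other), and visualizing the spending distribution in a pie chart.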
While our focus will be on building a financial insights tool, you can use what you learn in this blog to explore other use cases of PaliGemma 2 Mix, such as image segmentation, object detection, and question answering.
PaliGemma 2 Mix is an advanced vision-language model (VLM) that processes both images and text as input and generates text-based outputs. It is designed to handle a diverse range of multimodal AI tasks while supporting multiple languages.
PaliGemma 2 is designed for a wide array of vision-language tasks, including image and short video captioning, visual question answering, optical character recognition (OCR), object detection, and segmentation.
Source of the images used in the diagram: Google
PaliGemma 2 Mix model is designed for:
You can find more information about the PaliGemma 2 Mix model in the official release article.
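Let’s outline the main steps that we’re going to take: load the quantized model and its processor, preprocess the bill images, run inference, extract the total amount and goods category from each bill, visualize the spending, process multiple bills in one go, and wrap everything in a Gradio UI.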
Before we start, let’s ensure that we have Python and the following libraries installed: PyTorch, Transformers, bitsandbytes, Gradio, pandas, Matplotlib, and Pillow.
Run the following commands to install the necessary dependencies:
pip install -U -q gradio bitsandbytes transformers
Once the above dependencies are installed, run the following import commands:
import re

import gradio as gr
import matplotlib.pyplot as plt
import pandas as pd
import torch
from PIL import Image
from transformers import (
    BitsAndBytesConfig,
    PaliGemmaForConditionalGeneration,
    PaliGemmaProcessor,
)
We configure and load the PaliGemma 2 Mix model with quantization to reduce its memory footprint. For this demo, we’ll be using the 10B-parameter model with 448 x 448 input image resolution. Plan for a Colab GPU runtime with about 40GB of memory. Note that this figure matches Colab’s A100 rather than the T4, which has 16GB.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Model setup
model_id = "google/paligemma2-10b-mix-448"
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,  # Change to load_in_4bit=True for even lower memory usage
    llm_int8_threshold=6.0,
)

# Load model with quantization
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config
).eval()

# Load processor
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Print success message
print("Model and processor loaded successfully!")
BitsAndBytes quantization helps to reduce memory usage while largely maintaining performance, making it possible to run large models on limited GPU resources. In this implementation, we use 8-bit quantization (load_in_8bit=True); switching to load_in_4bit=True reduces memory usage further.
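To get a rough feel for why quantization matters here, the back-of-the-envelope calculation below estimates the memory needed for the weights alone at different precisions. It assumes a round 10 billion parameters and ignores activations, the generation cache, and quantization overhead, so treat the numbers as lower bounds rather than measurements.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Assumption: a round 10 billion parameters for the "10b" checkpoint.
NUM_PARAMS = 10e9

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Halving the bits per parameter halves the weight footprint, which is why moving from 8-bit to 4-bit is the usual next step when a model does not fit.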
We load the model using the PaliGemmaForConditionalGeneration class from the transformers library by passing in the model ID and quantization configuration. Similarly, we load the processor, which converts the inputs into tensors before passing them to the model.
Once the model shards are loaded, we preprocess the images before passing them to the model. Converting every image to RGB keeps the input format consistent, whatever mode (for example, RGBA or grayscale) the original file uses:
def ensure_rgb(image: Image.Image) -> Image.Image:
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image
Now, our images are ready for inference.
Now, we set up the main function for running inference with the model. This function takes in input images and questions, incorporates them into prompts, and passes them to the model via the processor for inference.
def ask_model(image: Image.Image, question: str) -> str:
    prompt = f"<image> answer en {question}"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    with torch.inference_mode():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=False
        )
    result = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return result[0].strip()
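ask_model() hard-codes the question-answering prefix. As a sketch of how the same pattern could extend to the other tasks mentioned earlier, the hypothetical build_prompt() helper below assembles prompts from task prefixes. The helper is not part of the tutorial, and the prefixes (answer, caption, ocr, detect, segment) follow the PaliGemma prompt conventions as I understand them from the model card, so check them against the official documentation before relying on them.

```python
def build_prompt(task: str, text: str = "", lang: str = "en") -> str:
    """Build a PaliGemma-style prompt string (hypothetical helper, not from the tutorial)."""
    needs_text = {"answer", "detect", "segment"}
    if task in needs_text and not text:
        raise ValueError(f"Task {task!r} needs a question or object name")

    if task == "answer":            # visual question answering
        body = f"answer {lang} {text}"
    elif task == "caption":         # image captioning
        body = f"caption {lang}"
    elif task == "ocr":             # plain text extraction
        body = "ocr"
    elif task in ("detect", "segment"):
        body = f"{task} {text}"
    else:
        raise ValueError(f"Unknown task: {task!r}")
    return f"<image> {body}"


print(build_prompt("answer", "What is the total amount?"))
print(build_prompt("ocr"))
```

The first call produces the same prompt shape that ask_model() builds by hand.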
Now that we have the main function ready, we’ll work next on extracting the key parameters from the image—in our case, these are the total amount and goods category.
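Here is a minimal sketch of what an extract_total_amount() helper could look like. It is an illustration rather than the original listing: the question wording, the parse_amount() helper, and the choice to pass the inference function in as the `ask` parameter (so the block runs on its own) are all my assumptions. It also assumes commas are thousands separators, which will not hold for every locale.

```python
import re
from typing import Any, Callable


def parse_amount(response: str) -> float:
    """Pull the first number out of a model response; return 0.0 if none is found."""
    cleaned = response.replace(",", "")  # assumption: "1,234.50" means 1234.50
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else 0.0


def extract_total_amount(image: Any, ask: Callable[[Any, str], str]) -> float:
    """Ask the model for the receipt total and convert the reply to a float.

    `ask` is expected to behave like ask_model(image, question) -> str.
    """
    question = "What is the total amount on this receipt? Only provide the number."
    return parse_amount(ask(image, question))


# Stand-in for the real model so the sketch can be tried without a GPU.
print(extract_total_amount(None, lambda image, question: "Total: $1,234.56"))
```

With the real model loaded, the call would be extract_total_amount(ensure_rgb(image), ask_model).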
The extract_total_amount() function processes an image to extract the total amount from a receipt using OCR. It constructs a query (question) instructing the model to extract only numerical values, and then it calls the ask_model() function to generate a response from the model.
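The classification step can be sketched in the same way. Again, this is an assumption-laden illustration, not the original listing: the question text, the normalize_category() helper, and the injected `ask` parameter are mine, and the matching is a simple substring check, which is slightly more lenient than an exact match.

```python
from typing import Any, Callable

VALID_CATEGORIES = ("grocery", "clothing", "electronics", "other")


def normalize_category(response: str) -> str:
    """Map a free-text model reply onto one of the valid categories."""
    reply = response.strip().lower()
    for category in VALID_CATEGORIES:
        if category in reply:
            return category.capitalize()
    return "Other"  # fall back when nothing matches


def categorize_goods(image: Any, ask: Callable[[Any, str], str]) -> str:
    """Ask the model which category a bill belongs to.

    `ask` is expected to behave like ask_model(image, question) -> str.
    """
    question = "Is this receipt for grocery, clothing, electronics, or other?"
    return normalize_category(ask(image, question))


# Stand-in for the real model so the sketch can be tried without a GPU.
print(categorize_goods(None, lambda image, question: " Electronics "))
```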
The categorize_goods() function classifies the type of goods in an image by prompting the model with a predefined question listing possible categories: grocery, clothing, electronics, or other. The ask_model() function then processes the image and returns a textual response. If the processed response matches any of the predefined valid categories, it returns that category—otherwise, it defaults to the "Other" category.
We have all the key functions ready, so let’s analyse the outputs.
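A plotting helper for this step might look like the sketch below. The function names, figure size, and title are my assumptions. Matplotlib is imported inside create_pie_chart() so that the small filtering helper can be used and tested without it.

```python
from typing import Dict


def positive_spending(spending: Dict[str, float]) -> Dict[str, float]:
    """Keep only the categories with a positive amount."""
    return {category: amount for category, amount in spending.items() if amount > 0}


def create_pie_chart(spending: Dict[str, float]):
    """Return a Matplotlib figure showing the share of spending per category."""
    import matplotlib.pyplot as plt  # imported lazily; see the note above

    data = positive_spending(spending)
    fig, ax = plt.subplots(figsize=(5, 5))
    if not data:
        # Nothing to plot: show a placeholder message on a blank figure.
        ax.text(0.5, 0.5, "No Spending Data", ha="center", va="center")
        ax.axis("off")
        return fig

    ax.pie(
        list(data.values()),
        labels=list(data.keys()),
        autopct="%1.1f%%",  # percentage label on each slice
        startangle=90,
    )
    ax.axis("equal")  # keep the pie circular
    ax.set_title("Spending Distribution")
    return fig
```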
The above function creates a pie chart to visualize spending distribution across different categories. If no valid spending data exists, it generates a blank figure with a message indicating "No Spending Data." Otherwise, it creates a pie chart with category labels and percentage values, ensuring a proportional and well-aligned visualization.
We typically have multiple bills to analyze, so let’s create a function to process all our bills simultaneously.
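One way to structure such a function is sketched below. It is not the original listing: the signature, the row layout, and the decision to pass the two extraction steps in as callables are assumptions made so the example runs without the model. Note that the bills are handled one after another in a loop rather than in parallel.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def process_multiple_bills(
    images: Iterable[Any],
    get_total: Callable[[Any], float],
    get_category: Callable[[Any], str],
) -> Tuple[List[Dict[str, Any]], float, Dict[str, float]]:
    """Run extraction on every bill and aggregate the results.

    Returns (rows, grand_total, spending_by_category).
    """
    rows: List[Dict[str, Any]] = []
    by_category: Dict[str, float] = {}
    for index, image in enumerate(images, start=1):
        total = get_total(image)
        category = get_category(image)
        rows.append({"Bill": index, "Category": category, "Total": total})
        by_category[category] = by_category.get(category, 0.0) + total
    grand_total = sum(row["Total"] for row in rows)
    return rows, grand_total, by_category


# Fake extractors stand in for the model-backed helpers.
fake_totals = {"bill_a": 10.0, "bill_b": 5.0}
rows, grand_total, by_category = process_multiple_bills(
    ["bill_a", "bill_b"],
    get_total=lambda image: fake_totals[image],
    get_category=lambda image: "Grocery",
)
print(rows, grand_total, by_category)
```

The list of row dictionaries can be passed straight to pd.DataFrame(rows) to get the table, and the per-category totals are what the pie chart needs.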
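To analyze multiple bills at once, we go through each uploaded bill in turn: we convert the image to RGB, extract the total amount, and categorize the goods. We then collect the results in a table, sum the totals, and plot the spending distribution by category.
Now, we have all key logic functions in place. Next, we work on building an interactive UI with Gradio.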
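A Gradio layout for this app could be sketched as follows. The component labels, the page title, and the build_demo() wrapper are placeholders of mine, and the sketch assumes Gradio 4 or later, where a multi-file upload with type="filepath" passes a list of file paths to the callback. Gradio is imported inside the function so the definition itself does not require it.

```python
from typing import Callable


def build_demo(process_fn: Callable):
    """Assemble a Gradio Blocks app around a bill-processing callback.

    `process_fn` is assumed to accept a list of image file paths and return
    (gallery_images, table, total_text, figure), in that order.
    """
    import gradio as gr  # imported lazily; see the note above

    with gr.Blocks() as demo:
        gr.Markdown("# Bill Scanner and Spending Analyzer")  # placeholder title
        file_input = gr.File(
            label="Upload bill images",
            file_count="multiple",
            file_types=["image"],
            type="filepath",  # assumes Gradio 4+
        )
        submit = gr.Button("Analyze bills")
        gallery = gr.Gallery(label="Uploaded bills")
        table = gr.Dataframe(label="Extracted data")
        total_text = gr.Textbox(label="Total spending")
        chart = gr.Plot(label="Spending distribution")

        # Wire the button to the callback: one input, four outputs.
        submit.click(
            fn=process_fn,
            inputs=file_input,
            outputs=[gallery, table, total_text, chart],
        )
    return demo


# Usage, once the model-backed callback exists:
# demo = build_demo(my_process_fn)
# demo.launch()
```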
The above code creates a structured Gradio UI with a file uploader for multiple images and a submit button to trigger processing. Upon submission, uploaded bill images are displayed in a gallery, extracted data is shown in a table, total spending is summarized in text, and a spending distribution pie chart is generated.
The function connects user inputs to the process_multiple_bills() function, ensuring seamless data extraction and visualization. Finally, the demo.launch() function starts the Gradio app for real-time interaction.
I also tried this demo with two image-based bills (Amazon shopping invoices) and got the following results.
Note: VLMs can struggle to read numbers accurately, which may lead to incorrect results at times. For instance, the model extracted the wrong total amount for the second bill below. Using a larger model or fine-tuning the existing one may reduce such errors.
In this tutorial, we built an AI-powered multiple bill scanner using PaliGemma 2 Mix, which can help us extract and categorize our expenses from receipts. We used PaliGemma 2 Mix’s vision-language capabilities for OCR and classification to analyze spending insights effortlessly. I encourage you to adapt this tutorial to your own use case.