In this guide, I'll show you how to extract structured data from PDFs using vision-language models (VLMs) like Gemini Flash or GPT-4o.
Gemini, Google's latest series of vision-language models, has shown state-of-the-art performance in text and image understanding. Its multimodal capability and long context window make it particularly useful for processing visually complex PDF data that traditional extraction models struggle with, such as figures, charts, tables, and diagrams.
With these models, you can build your own data extraction tool for visually complex files and web pages. Here's how:
Before we dive into extraction, let's set up our development environment. This guide assumes you have Python installed on your system. If not, download and install it from https://www.python.org/downloads/.
⚠️ Note: if you don't want to use Python, you can use the cloud platform at thepi.pe to upload your files and download the results as a CSV without writing any code.
Open your terminal or command prompt and run the following commands:
pip install git+https://github.com/emcf/thepipe
pip install pandas
For those new to Python, pip is the package installer for Python, and these commands will download and install the necessary libraries.
To use thepipe, you need an API key.
Disclaimer: While thepi.pe is a free and open source tool, the API has a cost, roughly $0.00002 per token. If you want to avoid such costs, check out the local setup instructions on GitHub. Note that you will still have to pay your LLM provider of choice.
Here's how to get and set it up:
Now, you need to set this as an environment variable. The process varies depending on your operating system:
For Windows:
Open Command Prompt and run:
setx THEPIPE_API_KEY your_api_key_here
Then restart your terminal so the change takes effect.
For macOS and Linux:
Open your terminal and add this line to your shell configuration file (e.g., ~/.bashrc or ~/.zshrc):
export THEPIPE_API_KEY=your_api_key_here
Then, reload your configuration:
source ~/.bashrc # or ~/.zshrc
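Once the variable is set, a quick sanity check (independent of thepipe itself) confirms that Python can actually see it. The helper function below is just for illustration, not part of the thepipe API:

```python
import os

def load_api_key(var_name="THEPIPE_API_KEY"):
    """Return the API key from the environment, or None if it is unset."""
    return os.environ.get(var_name)

key = load_api_key()
print("API key found" if key else "THEPIPE_API_KEY is not set")
```

If the second message prints, the variable was not exported into the shell session that launched Python.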
The key to successful extraction is defining a clear schema for the data you want to pull out. Let's say we're extracting data from a Bill of Quantity document:
An example page from the Bill of Quantity document. The data on each page is independent of the other pages, so we extract "per page". There are multiple pieces of data to extract per page, so we set multiple_extractions=True.
Looking at the column names, we might want to extract a schema like this:
schema = {
    "item": "string",
    "unit": "string",
    "quantity": "int",
}
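Before feeding extracted rows into a pipeline, it can help to check them against the schema's declared types. The small helper below is my own sketch (not part of thepipe), mapping the schema's type names onto Python types:

```python
# Map schema type names to Python types; this helper is illustrative,
# not part of the thepipe library.
TYPE_MAP = {"string": str, "int": int, "float": float}

schema = {"item": "string", "unit": "string", "quantity": "int"}

def matches_schema(row, schema):
    """True if every schema field is present with the expected Python type."""
    return all(
        field in row and isinstance(row[field], TYPE_MAP[type_name])
        for field, type_name in schema.items()
    )

good = {"item": "Excavation in soil", "unit": "m3", "quantity": 120}
bad = {"item": "Excavation in soil", "quantity": "120"}  # missing unit, wrong type
print(matches_schema(good, schema), matches_schema(bad, schema))
```

Rows that fail the check can be logged and re-extracted rather than silently corrupting the output table.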
You can modify the schema to your liking on thepi.pe Platform. Clicking "View Schema" will give you a schema you can copy and paste for use with the Python API
Now, let's use extract_from_file to pull data from a PDF:
from thepipe.extract import extract_from_file

results = extract_from_file(
    file_path="bill_of_quantity.pdf",
    schema=schema,
    ai_model="google/gemini-flash-1.5",
    chunking_method="chunk_by_page",
    multiple_extractions=True,
)
Here, we set chunking_method="chunk_by_page" because we want to send each page to the AI model individually (the PDF is too large to feed all at once). We also set multiple_extractions=True because each PDF page contains multiple rows of data.
The results of the extraction for the Bill of Quantity PDF as viewed on thepi.pe Platform
The extraction results are returned as a list of dictionaries. We can process these results to create a pandas DataFrame:
import pandas as pd

df = pd.DataFrame(results)

# Display the first few rows of the DataFrame
print(df.head())
This creates a DataFrame with all the extracted information, including textual content and descriptions of visual elements like figures and tables.
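Model output can be slightly inconsistent, for example numbers returned as strings, so a defensive cleaning step is worthwhile before analysis. The rows below are hypothetical, for illustration only:

```python
import pandas as pd

# Hypothetical extraction results; numeric fields sometimes come back as strings.
results = [
    {"item": "Excavation in soil", "unit": "m3", "quantity": "150"},
    {"item": "Backfilling", "unit": "m3", "quantity": 90},
]

df = pd.DataFrame(results)

# Coerce quantity to numeric; unparseable values become NaN instead of raising.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
print(df["quantity"].sum())
```

Using errors="coerce" lets you spot bad extractions afterwards with df["quantity"].isna() instead of crashing mid-pipeline.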
Now that we have our data in a DataFrame, we can easily export it to various formats. Here are some options:
df.to_excel("bill_of_quantity.xlsx", index=False, sheet_name="Bill of Quantity")

This creates an Excel file named "bill_of_quantity.xlsx" with a sheet named "Bill of Quantity". The index=False parameter prevents the DataFrame index from being included as a separate column. Note that to_excel requires an Excel engine such as openpyxl (pip install openpyxl).

If you prefer a simpler format, you can export to CSV:

df.to_csv("bill_of_quantity.csv", index=False)

This creates a CSV file that can be opened in Excel or any text editor.
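A quick round-trip check confirms the export loses nothing. The DataFrame contents below are made up for the example, and a temporary file is used so the snippet runs anywhere:

```python
import os
import tempfile

import pandas as pd

# Round-trip check: write the DataFrame to CSV, read it back, and compare.
df = pd.DataFrame(
    {
        "item": ["Excavation in soil", "Backfilling"],
        "unit": ["m3", "m3"],
        "quantity": [150, 90],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "extracted.csv")
    df.to_csv(path, index=False)
    df_back = pd.read_csv(path)

print(df.equals(df_back))
```

If the comparison fails, the usual culprits are dtype changes (e.g., integers becoming floats when NaNs are present) rather than lost rows.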
The key to successful extraction lies in defining a clear schema and utilizing the AI model's multimodal capabilities. As you become more comfortable with these techniques, you can explore more advanced features like custom chunking methods, custom extraction prompts, and integrating the extraction process into larger data pipelines.