In this build, we’ll create a tool for the logistics industry that automates the extraction of structured data from PDF attachments in emails (such as requests for quotes or shipping information sheets), so the extracted data can be used elsewhere in the workflow.
To make things easier to understand, let’s use Nova Logistics as an example—a fictional company specializing in transporting fragile electronics across various cities.
At Nova Logistics, customers reach out by email to request quotes for shipping items between cities and they usually attach a PDF that contains all the necessary shipping details. Currently, the process is manual: someone at Nova has to open each email, download the attached PDF, read through it, and extract key information like the item names and quantities before calculating the shipping cost.
This can take hours, especially when there are multiple emails per day, each with lengthy PDF documents.
In this article, we’ll walk through building a tool to automate this entire process—from fetching the emails and extracting the PDF data to sending the extracted information to Google Sheets.
To build this tool, we’ll need the following packages: googleapis, @supabase/supabase-js, documind, dotenv, and @nangohq/node.
Before we start writing the code, we need to set up a few things. Don’t worry; I’ll guide you through each step.
We’ll be using Node.js to run our code. If you don’t have Node.js installed, go to the Node.js website and download the latest version.
Once Node.js is installed, we need to install the packages that will help us interact with Gmail, Google Sheets, Supabase, and Documind.
Create a new folder for your project by running:
mkdir nova
cd nova
Initialize the project:
npm init -y
Install the required packages:
npm install googleapis @supabase/supabase-js documind dotenv @nangohq/node
Before we can start writing the code, you need to set up accounts and gather the credentials for the Google APIs (Gmail and Google Sheets), Supabase, and Documind. Here’s a quick guide for each:
Google APIs
Since we’re also using the Google Sheets API, you can simply go through step 6 again to create another integration on Nango. Search for the Google Sheets integration and use the same Client ID and Secret you copied. In the space for scopes, add https://www.googleapis.com/auth/spreadsheets
To publish your app, go to the OAuth consent screen in the Google console and click on the Publish button.
Supabase
Create a Supabase project, copy the project URL and API key from the project settings, and create a storage bucket to hold the uploaded PDFs.
Now let’s write the code in small steps.
Create a .env file to store all the important variables that will be used throughout the code. Here’s an example:
SUPABASE_API_KEY=<Supabase API Key>
SUPABASE_URL=<Supabase URL>
OPENAI_API_KEY=<OpenAI API Key>
NANGO_KEY=<Nango secret key>
We’ll walk through how to get and use each of these variables as we work through the code.
We’ll begin by using the Gmail API to fetch emails that don’t have the Processed label and contain attachments.
To retrieve the necessary access token, we’ll use Nango, which will automatically handle token refreshes if they expire, so you won’t need to worry about managing token lifecycles yourself.
All you need are a Gmail connection in Nango and your Nango secret key. You can easily add a new connection directly through the Nango UI using your own Gmail account, and your secret key can be found in the environment settings section of the Nango dashboard.
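Here’s a minimal sketch of retrieving the Gmail access token with the Nango Node SDK. The integration ID google-mail and the connection ID nova-gmail are assumptions; use whatever IDs you set up in your Nango dashboard.

const { Nango } = require('@nangohq/node');

const nango = new Nango({ secretKey: process.env.NANGO_KEY });

// Nango stores the OAuth connection and refreshes the token automatically,
// so we just ask it for the current credentials.
async function getGmailAccessToken() {
  const connection = await nango.getConnection('google-mail', 'nova-gmail');
  return connection.credentials.access_token;
}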
For simplicity, we’ll limit the results to just five emails at a time, and we’ll specifically filter to only fetch emails that have PDF attachments. From those, we’ll retrieve just the first attachment for processing. After downloading the attachment, we’ll mark the email as processed by applying a label, ensuring that it won't be fetched again in future polling cycles.
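A sketch of that polling logic with the googleapis client might look like the following. The Processed label ID is a placeholder (Gmail’s modify call expects a label ID, not the label name), and the access token comes from the Nango sketch above.

const { google } = require('googleapis');

async function fetchUnprocessedEmails(accessToken) {
  // Wrap the Nango-issued token in an OAuth2 client for googleapis.
  const auth = new google.auth.OAuth2();
  auth.setCredentials({ access_token: accessToken });
  const gmail = google.gmail({ version: 'v1', auth });

  // Only unprocessed emails with PDF attachments, five at a time.
  const { data } = await gmail.users.messages.list({
    userId: 'me',
    q: 'has:attachment filename:pdf -label:Processed',
    maxResults: 5,
  });

  const results = [];
  for (const { id } of data.messages || []) {
    const message = await gmail.users.messages.get({ userId: 'me', id });

    // Take the first PDF attachment on the message.
    const part = (message.data.payload.parts || []).find(
      (p) => p.filename && p.filename.toLowerCase().endsWith('.pdf')
    );
    if (!part) continue;

    const attachment = await gmail.users.messages.attachments.get({
      userId: 'me',
      messageId: id,
      id: part.body.attachmentId,
    });

    // Gmail returns the attachment as base64url-encoded data.
    const pdfBuffer = Buffer.from(attachment.data.data, 'base64');

    // Label the email as processed so the next polling cycle skips it.
    await gmail.users.messages.modify({
      userId: 'me',
      id,
      requestBody: { addLabelIds: ['PROCESSED_LABEL_ID'] },
    });

    results.push({ id, filename: part.filename, pdfBuffer });
  }
  return results;
}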
Next, we need to upload the downloaded PDFs to Supabase. Make sure you replace the bucket name in the code with yours.
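Here’s a rough sketch of the upload step, assuming a public storage bucket named pdfs (swap in your own bucket name):

const { createClient } = require('@supabase/supabase-js');

const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_API_KEY);

async function uploadToSupabase(filename, pdfBuffer) {
  // Upload the raw PDF bytes to the storage bucket.
  const { error } = await supabase.storage
    .from('pdfs')
    .upload(filename, pdfBuffer, { contentType: 'application/pdf', upsert: true });
  if (error) throw error;

  // Return a URL that Documind can fetch the file from.
  // This only works if the bucket is public; otherwise create a signed URL.
  const { data } = supabase.storage.from('pdfs').getPublicUrl(filename);
  return data.publicUrl;
}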
Once the PDF is stored in Supabase, we’ll use Documind to extract the relevant data. Since it leverages OpenAI for processing, make sure your API Key is added to the .env file.
Documind works with schemas that you define to extract the structured data you need. We’ll go over schema definition shortly, but feel free to check the documentation for more details.
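Assuming Documind exposes an extract function that takes the file URL and your schema (check the documentation for the exact call and the shape of the result), the extraction step could look roughly like this:

async function extractShipmentData(fileUrl, schema) {
  // A dynamic import works whether documind ships as CommonJS or an ES module.
  const { extract } = await import('documind');

  // Documind fetches the PDF, runs it through OpenAI, and returns output
  // structured according to the schema.
  const result = await extract({ file: fileUrl, schema });

  // Adjust this if the result shape differs in your Documind version.
  return result.data;
}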
After extracting the data from the PDF, we’ll send it to Google Sheets.
Before proceeding, ensure that your Google Sheet is set up and that you’ve created a connection for your account through Nango. If you haven’t already, here’s a template you can use to get started.
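A sketch of appending rows, assuming the spreadsheet ID lives in the .env file as SHEET_ID (a variable name of my choosing, not from the template) and the access token comes from your Google Sheets connection in Nango, fetched the same way as the Gmail token:

const { google } = require('googleapis');

async function appendToSheet(accessToken, rows) {
  const auth = new google.auth.OAuth2();
  auth.setCredentials({ access_token: accessToken });
  const sheets = google.sheets({ version: 'v4', auth });

  // Append one row per extracted item to the first sheet.
  await sheets.spreadsheets.values.append({
    spreadsheetId: process.env.SHEET_ID,
    range: 'Sheet1!A1',
    valueInputOption: 'USER_ENTERED',
    requestBody: { values: rows },
  });
}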
Now that we’ve written the individual functions, we need to bring everything together.
In this step, we’ll define the schema that Documind will use to extract the required data. This schema will guide the AI in identifying and structuring the relevant information from the PDFs.
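For illustration, a schema for Nova’s quote PDFs might look like the one below. The field names are assumptions based on the shipping details described earlier, so adapt them to your own documents and double-check the schema format against the Documind docs.

const schema = [
  { name: 'customer_name', type: 'string', description: 'Name of the customer requesting the quote' },
  { name: 'origin_city', type: 'string', description: 'City the shipment is sent from' },
  { name: 'destination_city', type: 'string', description: 'City the shipment is going to' },
  {
    name: 'items',
    type: 'array',
    description: 'List of items to be shipped',
    children: [
      { name: 'item_name', type: 'string', description: 'Name of the item' },
      { name: 'quantity', type: 'number', description: 'Quantity of the item' },
    ],
  },
];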
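Finally, here’s a rough sketch of how the pieces could be wired together. The function names are the hypothetical helpers from the earlier sketches rather than the exact functions used in the full source code.

require('dotenv').config();

async function run() {
  const accessToken = await getGmailAccessToken();
  const emails = await fetchUnprocessedEmails(accessToken);

  for (const email of emails) {
    const fileUrl = await uploadToSupabase(email.filename, email.pdfBuffer);
    const extracted = await extractShipmentData(fileUrl, schema);

    // One row per item: customer, route, item name, quantity.
    const rows = (extracted.items || []).map((item) => [
      extracted.customer_name,
      extracted.origin_city,
      extracted.destination_city,
      item.item_name,
      item.quantity,
    ]);

    // In practice the Sheets token comes from the separate Google Sheets
    // connection in Nango, fetched the same way as the Gmail token.
    await appendToSheet(accessToken, rows);
  }
}

run().catch(console.error);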
The full source code is available on GitHub, along with a sample PDF for testing. However, you’re welcome to create and use your own documents as well. Simply clone the repository, modify the code to fit your requirements, and try it out for your own use case.