Weekend Coding: Turn PDF Payslips Into a Single CSV Report-Python Tutorial-php.cn

Ever programmed with PDF files? Write a Python script with me!

Weekend Coding: Turn PDF Payslips Into a Single CSV Report

As you will see in the upcoming blog posts, I am in my financial literacy era. With the end of the year coming, I wanted to look at my numbers: How much taxes did I pay? How much did I earn for on-call shifts? Multiple PDF files are not the most comfortable way to see this data, and I wanted a single CSV file I could play with in Excel.

Like many good developers, I was too lazy to insert the numbers manually, so I wrote a script. If you enjoy programming — join me on an adventure! And if you are not in the mood — I’ll show you how to tweak the code to match your payslip structure :D


This script receives a directory with payslip PDFs and returns a CSV file with the desired data

The Plan

main.py:
# translate 1 pdf to 1 dict
# loop over the pdf dir
# save all dicts to 1 json file
# translate json report to csv report

Copy after login

We will start by writing the code to read a PDF, including deciding which fields we want in our report. This is the part you’ll need to tweak to match your payslip structure. Once that is figured out, we’ll iterate the whole payslips directory.

In the third step, I chose to add an extra step between the PDF and the CSV — a JSON report. Once we see everything works, we will remove the use of that file.

Lastly, we’ll translate that JSON data into a CSV file. That CSV can then be easily converted to Google Sheets (just click "open with") or to Excel (instructions can be found here).

That’s an easy nice plan, but you know how it goes — challenges are discovered along the way… Can you guess where things might get complex?

Before we start — an important note: KEEP YOUR PAYSLIPS PRIVATE! If you upload your project to GitHub — be sure not to share those personal details! You can use .gitignore for this:

/payslips_pdf
pdf_rows.txt
report.json
report.csv

Copy after login

Shall we begin?

Weekend Coding: Turn PDF Payslips Into a Single CSV Report

Read PDF File

We’ll start by reading the PDF and printing all the rows. That way we will know what appears in each row. This only needs to be done once (while a report will probably be created once a month or once a year), and it’s not part of the report -so we will create that in a separate file.

Start by creating a new Python file (I called mine pdf_to_txt.py) and write a function that reads the pdf and prints the result into a .txt file:

This will work, but there are 3 changes to make it more user-friendly:

Get the file path as a command line argument, so the user can run it without touching the code.
Add some error-catching with instructions in case the user runs the command wrong. (I added a warning color but you don’t have to)
We will read PDF files in the main script as well, so moving this function there is better.

Give it a go! Try running the command with the right and the wrong arguments :)

When you run py ./main.py you will see a new file in the project directory called pdf_rows.txt.

Process The PDF Data

Now that we know the structure in which the PDF is read — we can fish the desired values. In my case — here is the information I was interested in:

Weekend Coding: Turn PDF Payslips Into a Single CSV Report

Notice there is data within the table (categories that might vary every month) and data outside the table.

Data outside the table:

Pay Period — Can be found on row 19

Gross Pay — This one was tricky to find a rule for because it appears after the payments list and doesn’t have the title “Gross Pay”.

As mentioned earlier, payments and deductions can vary, and not every month is the same. Therefore, gross pay might appear in a different row in different months.

I did notice it appears right after the employee name — so that is what I used. Start by adding it hard-coded, and later we will get it externally.

Nett Pay: This one is easy — it appears in line 17.

I gathered those out-of-table values into a function:

Data inside the table

Payment and Deduction Details: This is the juicy part! We’ll start by cutting the rows array to save a few milliseconds in the for loop coming up. Then, I needed to differentiate between list items

and other rows.

I’ve noticed that within the whole file, list items are the only ones that match this rule: Start with an alphabetic character and end with a numeric character and contain a space (the last condition is to filter out wrong rows in my
payslip, you might not need that).

For example, we’ll look at the pension item:

main.py:
# translate 1 pdf to 1 dict
# loop over the pdf dir
# save all dicts to 1 json file
# translate json report to csv report

Copy after login

I don’t care about the balance (the number on the right), but I do care about the code (G means it’s deducted from the Gross pay — before taxes — and N means it is deducted from the Nett pay — after taxes). So ideally, we’ll have json_obj["Pension (G)"]=150.00.

We’ll use the spaces to split the line. It’s good there are duplicate spaces — that way we can differentiate between space splitting between a couple of words and space splitting between a couple of fields.

The description:

We will find the first double-space and split by it.

The code:

The amount of spaces is dependent on the length of the description, so we can’t know in advance how many are there — that’s why I’ll use lstrip() as well. Now the rest of the line starts with a non-space character.

Not all list items have a code, so we want to check if the line starts with a code or a digit. If it’s a code — I wrap it in () (including a space before the opening parenthesis) , and attach it to the description string. and if not — add nothing.

The amount:

If there was code — we’ll have more spaces to strip. If not, our line might contain two amounts: The monthly and the balance.

There are 4 cases I’ve noticed:

/payslips_pdf
pdf_rows.txt
report.json
report.csv

Copy after login

After extracting the category and code, we are left with:

PENSION     G   150.00   587.49

Copy after login

To cover cases 2–3, we’ll find the index of the spaces separating the amounts and cut the tail. It also works for the first case, where there is no space (aka no tail).

To cover case 4, I’m relying on the difference between two types of categories with a single amount in the row: The first one is like the salary — where we want to save the amount, and the second type is like the tax withholds — where we want to ignore it. The difference is that only deductions keep track of the annual balance in the table — so I am checking for -.

All together, that’s how it looks like:

Write to JSON File

This is not a mandatory step — we can work with a JSON object without exporting the values. I prefer seeing what it looks like, at least for the coding stage.

Scale to Multiple PDF Files

The only reason this step is getting a dedicated section is because wrapping the pdf_to_dict in a for loop reveals an unpleasant surprise. To demonstrate it, I created a function called iterate_over_pdfs():

Originally, I thought I’d have to rename the files (Payslip1.pdf -> Payslip01.pdf), but there is a better solution:

Once we sort the list of file names by length, 10 will appear after 2. On the last line, I decoded the names to get rid of the b'' default structure.

Create the CSV Report

Because the items in payments and deductions might vary from payslip to payslip, this section is more than just direct translation. CSV is a relational dataset, which means we need to know in advance all the categories in payments and deductions and keep the entry empty for a payslip where it doesn’t exist. JSON, on the other hand, is non-relational and each entry specifies its keys.

With that in mind, the first step in our CSV report is to collect the categories. All the categories.

Collect the categories:

Now, at first glance, you might think to use Set for that — because we want all categories to appear only once. I’ve tried that. The problem with this is that sets are unlisted, and I find it important to match the order of items that appear in the original payslips. When using lists, don’t forget to check if the item exists in the list before appending it:

Now that we have this figured out: Remember earlier when we said we don’t care about which item list is a payment and which is a deduction? Well, we do care now! We don’t

have to
separate, but I’d expect a payslip report to have all payments on the right and all deductions on the left, not mixed.

Although each payslip might have different list items — some will always exist (because you’ll always pay your taxes ;) ). We can use this to our benefit — and flag PAYE as the start of the deductions! (I'm pretty sure PAYE is only in Ireland, so you'll need to change it to match your payslip)

Finally, I return a single list, because there is no use in separating the payments from the deductions — the split was to assure payments will appear on the right and deductions will appear on the left.

Populate the CSV table:

Now that we have the categories, we can start populating the CSV table:

Each payslip will be a row, and each row will have the fields in a specific order split by a comma. I find it easier to organize the fields in a list, and then join them. Fields that appear in the categories but not in the payslip — will remain empty:

You can make it easier to read by downloading the VS extension RainbowCSV (or any parallel of another IDE)

Remove the use of the .json file

Once we know things work, we don’t need to write to and read from the JSON file — we can use the JSON object directly:

Instead of using json_object (as in json_object = json.dumps(json_payslips)) – we will use json_payslips directly:

Write: No need to write to report.json - we can remove this section.
Read: Pass json_payslips directly to json_to_csv() function:

Get the employee’s name as an argument

Once you have your script ready — you’ll want to share it with your colleagues and friends! For a nice user experience, we’ll export the employee name to come from the command line, rather than ask them to open the code.

Read argument

We’ll start with the happy path — assuming the user entered the employee name — and add code that uses it:

In pdf_to_dict(), instead of hard coding EMPLOYEE_NAME = "IFAT NEUMANN", we'll read it from the arguments: employee_name = sys.argv[1]. Don't forget to import sys!

Now we’ll think of other scenarios:

No employee name was given

What if the user did not enter any employee name? We’d wanna catch it as soon as possible, and notify them!

Therefore, we’ll add a check in the first line of the main function. Now, the intuition is to initialize the employee_name variable there - but this will cause bubbling function properties until it reaches the function that uses this variable - and I don't find it a very clean approach.

Lastly, I’ll just try accessing this field — and catch if it’s not there:

main.py:
# translate 1 pdf to 1 dict
# loop over the pdf dir
# save all dicts to 1 json file
# translate json report to csv report

Copy after login

Note that adding exceptions means the print_warning() function moves to main.py. Otherwise, you’ll get an error:

Employee name without quotes

We are asking the user to wrap the name with quotes because command line arguments are split by spaces. The only argument we expect is the employee name, so if there is another argument — we know they did not use quotes and we can notify them:

You can skip the quotation requirements and loop over arguments, collecting all the parts of the user name — but I find this approach adds unnecessary complexity.

The employee’s name doesn’t appear on the payslip

We won’t be able to find gross pay if we don’t have the employee’s name as it appears on the payslip.

The earlier spot to identify the mismatch is when we read the pdf. Therefore — well add the check at the beginning of the pdf_to_dict() function:

Lastly, we catch the error back in the main function:

And to wrap it all up — don’t forget to update your README file with the new instructions.

The full script

Here is the full code you can use for playing with your payslips:


https://cupofcode.blog/

Hope you enjoyed this one! Interested in another weekend coding project? Check out my automated email-sending blog post!

Blogging is my hobby, so I happily spend time and money on it. If you enjoyed this blog post, putting 1 euro in my tipping jar will let me know :) Thank you for your support! <script></script> <script></script>

The above is the detailed content of Weekend Coding: Turn PDF Payslips Into a Single CSV Report. For more information, please follow other related articles on the PHP Chinese website!