As you will see in the upcoming blog posts, I am in my financial literacy era. With the end of the year coming, I wanted to look at my numbers: How much taxes did I pay? How much did I earn for on-call shifts? Multiple PDF files are not the most comfortable way to see this data, and I wanted a single CSV file I could play with in Excel.
Like many good developers, I was too lazy to insert the numbers manually, so I wrote a script. If you enjoy programming — join me on an adventure! And if you are not in the mood — I’ll show you how to tweak the code to match your payslip structure :D
This script receives a directory with payslip PDFs and returns a CSV file with the desired data |
main.py: # translate 1 pdf to 1 dict # loop over the pdf dir # save all dicts to 1 json file # translate json report to csv report
We will start by writing the code to read a PDF, including deciding which fields we want in our report. This is the part you’ll need to tweak to match your payslip structure. Once that is figured out, we’ll iterate the whole payslips directory.
In the third step, I chose to add an extra step between the PDF and the CSV — a JSON report. Once we see everything works, we will remove the use of that file.
Lastly, we’ll translate that JSON data into a CSV file. That CSV can then be easily converted to Google Sheets (just click "open with") or to Excel (instructions can be found here).
That’s an easy nice plan, but you know how it goes — challenges are discovered along the way… Can you guess where things might get complex?
Before we start — an important note: KEEP YOUR PAYSLIPS PRIVATE! If you upload your project to GitHub — be sure not to share those personal details! You can use .gitignore for this:
/payslips_pdf pdf_rows.txt report.json report.csv
Shall we begin?
We’ll start by reading the PDF and printing all the rows. That way we will know what appears in each row. This only needs to be done once (while a report will probably be created once a month or once a year), and it’s not part of the report -so we will create that in a separate file.
Start by creating a new Python file (I called mine pdf_to_txt.py) and write a function that reads the pdf and prints the result into a .txt file:
We will read PDF files in the main script as well, so moving this function there is better.
Now that we know the structure in which the PDF is read — we can fish the desired values. In my case — here is the information I was interested in:
Notice there is data within the table (categories that might vary every month) and data outside the table.
Data outside the table:
Pay Period — Can be found on row 19
Gross Pay — This one was tricky to find a rule for because it appears after the payments list and doesn’t have the title “Gross Pay”.
As mentioned earlier, payments and deductions can vary, and not every month is the same. Therefore, gross pay might appear in a different row in different months.
I did notice it appears right after the employee name — so that is what I used. Start by adding it hard-coded, and later we will get it externally.
Nett Pay: This one is easy — it appears in line 17.
I gathered those out-of-table values into a function:
Payment and Deduction Details: This is the juicy part! We’ll start by cutting the rows array to save a few milliseconds in the for loop coming up. Then, I needed to differentiate between list items
and other rows.
I’ve noticed that within the whole file, list items are the only ones that match this rule: Start with an alphabetic character and end with a numeric character and contain a space (the last condition is to filter out wrong rows in my
payslip, you might not need that).
For example, we’ll look at the pension item:
main.py: # translate 1 pdf to 1 dict # loop over the pdf dir # save all dicts to 1 json file # translate json report to csv report
I don’t care about the balance (the number on the right), but I do care about the code (G means it’s deducted from the Gross pay — before taxes — and N means it is deducted from the Nett pay — after taxes). So ideally, we’ll have json_obj["Pension (G)"]=150.00.
We’ll use the spaces to split the line. It’s good there are duplicate spaces — that way we can differentiate between space splitting between a couple of words and space splitting between a couple of fields.
The description:
We will find the first double-space and split by it.
The code:
The amount of spaces is dependent on the length of the description, so we can’t know in advance how many are there — that’s why I’ll use lstrip() as well. Now the rest of the line starts with a non-space character.
Not all list items have a code, so we want to check if the line starts with a code or a digit. If it’s a code — I wrap it in () (including a space before the opening parenthesis) , and attach it to the description string. and if not — add nothing.
The amount:
If there was code — we’ll have more spaces to strip. If not, our line might contain two amounts: The monthly and the balance.
There are 4 cases I’ve noticed:
/payslips_pdf pdf_rows.txt report.json report.csv
After extracting the category and code, we are left with:
PENSION G 150.00 587.49
To cover cases 2–3, we’ll find the index of the spaces separating the amounts and cut the tail. It also works for the first case, where there is no space (aka no tail).
To cover case 4, I’m relying on the difference between two types of categories with a single amount in the row: The first one is like the salary — where we want to save the amount, and the second type is like the tax withholds — where we want to ignore it. The difference is that only deductions keep track of the annual balance in the table — so I am checking for -.
All together, that’s how it looks like:
This is not a mandatory step — we can work with a JSON object without exporting the values. I prefer seeing what it looks like, at least for the coding stage.
Scale to Multiple PDF Files
Originally, I thought I’d have to rename the files (Payslip1.pdf -> Payslip01.pdf), but there is a better solution:
Because the items in payments and deductions might vary from payslip to payslip, this section is more than just direct translation. CSV is a relational dataset, which means we need to know in advance all the categories in payments and deductions and keep the entry empty for a payslip where it doesn’t exist. JSON, on the other hand, is non-relational and each entry specifies its keys.
With that in mind, the first step in our CSV report is to collect the categories. All the categories.
Collect the categories:
Now, at first glance, you might think to use Set for that — because we want all categories to appear only once. I’ve tried that. The problem with this is that sets are unlisted, and I find it important to match the order of items that appear in the original payslips. When using lists, don’t forget to check if the item exists in the list before appending it:
have to
separate, but I’d expect a payslip report to have all payments on the right and all deductions on the left, not mixed.
Finally, I return a single list, because there is no use in separating the payments from the deductions — the split was to assure payments will appear on the right and deductions will appear on the left.
Populate the CSV table:
Now that we have the categories, we can start populating the CSV table:
You can make it easier to read by downloading the VS extension RainbowCSV (or any parallel of another IDE)
Once we know things work, we don’t need to write to and read from the JSON file — we can use the JSON object directly:
Instead of using json_object (as in json_object = json.dumps(json_payslips)) – we will use json_payslips directly:
Write: No need to write to report.json - we can remove this section.
Read: Pass json_payslips directly to json_to_csv() function:
Once you have your script ready — you’ll want to share it with your colleagues and friends! For a nice user experience, we’ll export the employee name to come from the command line, rather than ask them to open the code.
Read argument
We’ll start with the happy path — assuming the user entered the employee name — and add code that uses it:
In pdf_to_dict(), instead of hard coding EMPLOYEE_NAME = "IFAT NEUMANN", we'll read it from the arguments: employee_name = sys.argv[1]. Don't forget to import sys!
Now we’ll think of other scenarios:
No employee name was given
What if the user did not enter any employee name? We’d wanna catch it as soon as possible, and notify them!
Therefore, we’ll add a check in the first line of the main function. Now, the intuition is to initialize the employee_name variable there - but this will cause bubbling function properties until it reaches the function that uses this variable - and I don't find it a very clean approach.
Lastly, I’ll just try accessing this field — and catch if it’s not there:
main.py: # translate 1 pdf to 1 dict # loop over the pdf dir # save all dicts to 1 json file # translate json report to csv report
Note that adding exceptions means the print_warning() function moves to main.py. Otherwise, you’ll get an error:
Employee name without quotes
You can skip the quotation requirements and loop over arguments, collecting all the parts of the user name — but I find this approach adds unnecessary complexity.
The employee’s name doesn’t appear on the payslip
We won’t be able to find gross pay if we don’t have the employee’s name as it appears on the payslip.
Lastly, we catch the error back in the main function:
Here is the full code you can use for playing with your payslips:
https://cupofcode.blog/ |
The above is the detailed content of Weekend Coding: Turn PDF Payslips Into a Single CSV Report. For more information, please follow other related articles on the PHP Chinese website!