How two simple Python functions are saving us a bunch of time at work

Jan 28, 2021

For years, my colleagues and I here in the Indonesia branch office spent a lot of time manually splitting and renaming PDF files of tax invoices (Faktur Pajak in Indonesian) generated by the separate tax system (the e-faktur system).

The thing is, when you bill VAT to your customers, you have to issue a VAT invoice in a separate government-provided system in addition to the invoice generated by your own system. That alone creates a risk of mismatch: you always need to make sure the VAT you bill is the same amount in both systems. We monitor this by reconciling the two account reports on a monthly basis.

Usually there is a uniform template that we use to upload data into the e-faktur system, which issues VAT tax slips based on that data and generates PDF files, either in one file or separately page by page. The tax staff then need to spend an hour or two every day (on average, depending on how many invoices were issued) splitting those files one by one and renaming each with the correct customer name, unique invoice number and current date. Say someone spends a minute per invoice typing all of that, multiplied by a hundred invoices a day: you do the math! Splitting is no problem, you might say; there are hundreds of online converters of all types a Google search away. But would you be willing to upload commercially sensitive data such as invoices or tax invoices to third-party servers and compromise your data privacy?

So what we came up with was about 50 lines of code: two very simple functions that split PDF files by page, scrape data out of each page and rename each file using the values extracted from it. Python, some of its available libraries and powerful RegEx made this all possible! And guess what, it now takes 10 seconds to split, rename and save the PDFs.
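To actually "do the math", here is a rough back-of-the-envelope sketch. It uses only the figures quoted above (about a hundred invoices a day, about a minute each by hand, about 10 seconds for the script), and the variable names are just for illustration.

```python
# Figures taken from the text: ~100 invoices a day, ~1 minute each by hand,
# ~10 seconds for the whole batch with the script.
invoices_per_day = 100
manual_seconds_per_invoice = 60
script_seconds_per_batch = 10

# Time spent per day when renaming by hand, in minutes
manual_minutes = invoices_per_day * manual_seconds_per_invoice / 60

# Time saved per day once the script does the work, in minutes
saved_minutes = manual_minutes - script_seconds_per_batch / 60

print(f"Manual: {manual_minutes:.0f} min/day")
print(f"Saved:  {saved_minutes:.1f} min/day")
```

On those rough numbers, that is well over an hour and a half of typing saved every working day.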

Let’s have a look at these functions:

The tools we need are PDF libraries such as PyPDF2 and pdfminer (which gives excellent parsing), plus built-in modules such as os, datetime and re.

# Import the necessary packages
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
import os, PyPDF2, re
import datetime

The first function takes each PDF file and splits every page into a separate file:

def split_pdf_pages(root_directory, extract_folder):
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            # Pick only the PDF files
            if extension == ".pdf":
                # Build the full path to the PDF
                path = os.path.join(root, filename)
                # Open the file with PdfFileReader
                # (PyPDF2 3.x renames PdfFileReader/PdfFileWriter to PdfReader/PdfWriter)
                with open(path, "rb") as source:
                    open_pdf = PyPDF2.PdfFileReader(source)
                    # Loop through the file and write each page to its own PDF
                    for i in range(open_pdf.numPages):
                        output = PyPDF2.PdfFileWriter()
                        output.addPage(open_pdf.getPage(i))
                        # Save into the folder passed in as extract_folder
                        out_path = os.path.join(extract_folder, f"{basename}-{i}.pdf")
                        with open(out_path, "wb") as output_pdf:
                            output.write(output_pdf)
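PyPDF2 3.0 removed the PdfFileReader and PdfFileWriter names used above in favour of PdfReader and PdfWriter. If you are on a newer release, a minimal sketch of the same split might look like the following. The names split_pdf_pages_v3 and page_filename are hypothetical, and the sketch assumes PyPDF2 3.x is installed.

```python
import os


def page_filename(basename, page_index):
    """Name for one split page, e.g. 'invoices-0.pdf' (hypothetical helper)."""
    return f"{basename}-{page_index}.pdf"


def split_pdf_pages_v3(root_directory, extract_folder):
    """Split every PDF under root_directory into one-page PDFs (sketch)."""
    # Imported here so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    # Make sure the output folder exists before writing into it
    os.makedirs(extract_folder, exist_ok=True)
    for root, _dirs, files in os.walk(root_directory):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            if extension.lower() != ".pdf":
                continue
            reader = PdfReader(os.path.join(root, filename))
            # Write each page into its own single-page PDF
            for i, page in enumerate(reader.pages):
                writer = PdfWriter()
                writer.add_page(page)
                out_path = os.path.join(extract_folder, page_filename(basename, i))
                with open(out_path, "wb") as output_pdf:
                    writer.write(output_pdf)
```

The sketch also builds paths with os.path.join instead of a hard-coded backslash, so it is not tied to Windows.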

The second function then opens each split PDF file and renames it based on the text it finds. Here the new name is built from the customer name, the commercial invoice number, the month and the year. We use the re (RegEx) and datetime modules:

def rename_pdfs(root_directory, rename_folder):
    months = ("January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December")
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            if extension == ".pdf":
                path = os.path.join(root, filename)
                # Extract the text of the page with pdfminer
                output_string = StringIO()
                with open(path, "rb") as pdf_file:
                    parser = PDFParser(pdf_file)
                    doc = PDFDocument(parser)
                    rsrcmgr = PDFResourceManager()
                    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
                    interpreter = PDFPageInterpreter(rsrcmgr, device)
                    for page in PDFPage.create_pages(doc):
                        interpreter.process_page(page)
                num = output_string.getvalue()
                # Placeholders so a page without a match does not reuse
                # the values found in the previous file
                cust = inv_num = date = "UNKNOWN"
                # Customer name: "CV." followed by a word
                for cust_name in re.findall(r"CV.\s[A-Za-z]+", num):
                    cust = cust_name[0:50]
                # Invoice number: "#" followed by digits, cut to 8 characters
                for inv in re.findall("#[0-9]+", num):
                    inv_num = inv[:8]
                # Month: the first month name (in calendar order) found in the text
                for month in months:
                    if month in num:
                        date = month
                        break
                year = datetime.date.today().year
                # Move the file into rename_folder under its new name
                os.rename(path, os.path.join(rename_folder, f"{cust} {inv_num} {date} {year}.pdf"))
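To see what the regular expressions in rename_pdfs pull out of a page, here is a small sketch run on made-up text. The sample text, customer name and invoice number are invented for illustration, and extract_fields is a hypothetical helper. Unlike the function above, it escapes the dot after "CV" and takes the first match rather than the last.

```python
import re

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def extract_fields(text):
    """Return (customer, invoice number, month) found in text, or None for each miss."""
    cust = re.search(r"CV\.\s[A-Za-z]+", text)
    inv = re.search(r"#[0-9]+", text)
    # First month name, in calendar order, that appears anywhere in the text
    month = next((m for m in MONTHS if m in text), None)
    return (
        cust.group(0)[:50] if cust else None,
        inv.group(0)[:8] if inv else None,  # "#" plus at most 7 digits
        month,
    )


# Invented sample text, not a real tax invoice
sample = "Customer: CV. Example Trading\nInvoice ref: #12345678901\nDate: 28 January 2021"
print(extract_fields(sample))  # ('CV. Example', '#1234567', 'January')
```

Note that the customer pattern stops at the first space after the first word, so only "CV. Example" is captured from "CV. Example Trading". If your customer names have several words, the pattern would need to be widened.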

Finally, we set the folder paths:
Root directory: where we access the initial file
Extract folder: where we extract the pages
Rename folder: where we save the renamed, ready-to-send PDF tax invoices
Then we run the functions!

# Folder paths
root_dir = r"C:\Users\*****\******\initial"
extract_to = r"C:\Users\*****\******\extract"
rename_to = r"C:\Users\*****\******\final"

# Run the functions
split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to, rename_to)
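Both functions expect the extract and final folders to exist already; otherwise writing or renaming into them fails. A small sketch that creates them up front is shown below. The helper ensure_folders is hypothetical, and the layout under a temporary directory is only there so the example can run anywhere.

```python
import os
import tempfile


def ensure_folders(*folders):
    """Create each folder if it does not exist yet and return the paths."""
    for folder in folders:
        os.makedirs(folder, exist_ok=True)
    return list(folders)


# Hypothetical layout under a temporary directory, for demonstration only
base = tempfile.mkdtemp()
root_dir, extract_to, rename_to = ensure_folders(
    os.path.join(base, "initial"),
    os.path.join(base, "extract"),
    os.path.join(base, "final"),
)
print(extract_to, rename_to)
```

In real use you would pass your own three paths instead of the temporary ones, then call the two functions as above.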

Written by Botir Rakhimov