Python function to convert one or multiple PDF files to JPG image files

Botir Rakhimov
2 min read · Feb 1, 2021

I recently started working on an invoice classification model using transfer learning with TensorFlow models. My end goal is to automate invoice processing: the model classifies which vendor an invoice comes from and, based on OCR, enters the data into the system. I will write about that once the project is finished.

In the meantime, I’d like to share how a seemingly simple challenge, converting invoices from PDF to JPG to train the model, gave me a bit of a headache. To train an image classification model, the files need to be in an image format so that we can convert them to tensors. I had more than 3,000 PDF files to convert, and Python came to the rescue! Well, who else would…

The entire conversion of 3k files took me around 40 minutes. I used the pdf2image library together with Poppler, the PDF rendering package it relies on.

So, for folks who come across a similar problem, here is the function I used to convert my files.

Let’s install our libraries first:

pip3 install pdf2image

We also need to download our PDF rendering tool, Poppler, and set its path:

poppler_path=r'C:\Users\*****\Downloads\poppler-0.68.0_x86 (1)\poppler-0.68.0\bin'
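Before kicking off a long batch, it can help to confirm that the folder really contains the Poppler executables. Here is a minimal sketch of such a check, assuming it is enough to look for `pdftoppm`, the Poppler program pdf2image calls to render pages. The helper name `poppler_is_available` is made up for illustration and is not part of pdf2image.

```python
import shutil


def poppler_is_available(poppler_path=None):
    """Return True if the pdftoppm executable can be found.

    poppler_is_available is a hypothetical helper, not a pdf2image function.
    """
    # With poppler_path=None, shutil.which searches the system PATH instead
    return shutil.which("pdftoppm", path=poppler_path) is not None
```

Calling `poppler_is_available(poppler_path)` with the path set above should return `True` if the `bin` folder is the right one.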

Let’s import our pdf2image library as well:

from pdf2image import convert_from_path

We are good to go! Let’s go ahead with a function:

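import os

# Folder where the PDF files are stored
# (no trailing backslash: a raw string cannot end with one)
pdf_path = r'C:\Users\******\Folder'
# Prefix added to the name of every converted image
img_name = "Test_image"

def convert_pdf_to_jpg(pdf_path):
    # Walk the folder tree and convert every PDF found
    for root, dirs, files in os.walk(pdf_path):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            if extension.lower() == ".pdf":
                path = os.path.join(root, filename)
                # Pass poppler_path by keyword: the second positional
                # argument of convert_from_path is dpi
                images = convert_from_path(path, poppler_path=poppler_path)
                # Number the pages so a multi-page PDF does not overwrite its own output
                for page_number, image in enumerate(images, start=1):
                    img = img_name + '-' + basename + '-' + str(page_number) + '.jpg'
                    # Images are saved to the current working directory
                    image.save(img, "JPEG")

# Run the conversion
convert_pdf_to_jpg(pdf_path)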
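One possible variation is to keep file discovery and naming separate from the conversion itself, and to write the images into a dedicated output folder. The sketch below is only an illustration under a few assumptions: the helper names `find_pdfs`, `jpg_name` and `convert_folder`, the `out_dir` parameter and the page-number suffix are all made up here and are not part of pdf2image. Only `convert_from_path` and its `poppler_path` keyword come from the library.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return every PDF under folder (any depth), ignoring extension case."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def jpg_name(prefix, pdf_file, page_number):
    """Build an output name from prefix, PDF name and page number (hypothetical scheme)."""
    return f"{prefix}-{Path(pdf_file).stem}-{page_number}.jpg"


def convert_folder(folder, out_dir, prefix="Test_image", poppler_path=None):
    """Convert each page of each PDF under folder into a JPG inside out_dir."""
    # Imported here so the two helpers above work even without pdf2image installed
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf_file in find_pdfs(folder):
        pages = convert_from_path(str(pdf_file), poppler_path=poppler_path)
        # Number pages from 1 so multi-page PDFs get one file per page
        for page_number, page in enumerate(pages, start=1):
            page.save(out_dir / jpg_name(prefix, pdf_file, page_number), "JPEG")
```

With this layout, the discovery and naming steps can be checked on a handful of files before committing to a run over thousands of PDFs.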


Botir Rakhimov

Implementing Python to give automated solutions in Finance