1. Introduction
I was reading the book "Python Network Data Collection" at night and saw the code for reading PDF content. I remembered that a few days ago, Jisouke had just released a crawling rule for grabbing PDF content from web pages. This rule can treat pdf content as html for web crawling. The magic is due to Firefox's ability to parse PDF and convert the PDF format into HTML tags, such as div tags, so that GooSeeker web crawling software can be used to crawl structured content just like ordinary web pages.
This raises a question: How far can it be achieved using Python crawlers. An experimental process and source code will be described below.
2. Python source code to convert pdf into text
The python source code below reads the content of the pdf file (on the Internet or locally), converts it into text, and prints it out. This code mainly uses a third-party library PDFMiner3K to read PDF into a string, and then uses StringIO to convert it into a file object. (See the GitHub source at the end of the article for the source code download address)
from urllib.request import urlopen from pdfminer.pdfinterp import PDFResourceManager, process_pdf from pdfminer.converter import TextConverter from pdfminer.layout import LAParams from io import StringIO from io import open def readPDF(pdfFile): rsrcmgr = PDFResourceManager() retstr = StringIO() laparams = LAParams() device = TextConverter(rsrcmgr, retstr, laparams=laparams) process_pdf(rsrcmgr, device, pdfFile) device.close() content = retstr.getvalue() retstr.close() return content pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf") outputString = readPDF(pdfFile) print(outputString) pdfFile.close()
If the PDF file is on your computer, replace the pdfFile object returned by urlopen with a normal open() file object.
3. Outlook
This experiment only converts pdf into text, but does not convert it into html tags as mentioned at the beginning. So whether this capability is available in the Python programming environment remains to be explored in the future.