Python reads PDF content-Python Tutorial-php.cn

Python reads PDF content

高洛峰

Released: 2016-11-22 16:44:51

1. Introduction

While reading the book "Python Network Data Collection" one evening, I came across code for reading PDF content. It reminded me that a few days earlier Jisouke (GooSeeker) had released a crawling rule for grabbing PDF content from web pages. The rule treats PDF content as HTML for the purposes of web crawling. This works because Firefox can parse a PDF and render it as HTML tags, such as div tags, so the GooSeeker web crawling software can extract structured content from it just as it would from an ordinary web page.

This raises a question: how far can a Python crawler get on the same task? The experiment and its source code are described below.

2. Python source code to convert PDF into text

The Python source code below reads a PDF file (from the Internet or from local disk), converts it to text, and prints the result. It relies on the third-party library PDFMiner3K: a TextConverter writes the extracted text into a StringIO buffer, and the buffer's contents are then returned as a string. (See the GitHub source at the end of the article for the source code download address.)

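from io import StringIO
from urllib.request import urlopen

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def readPDF(pdfFile):
    """Extract the text of an open PDF file object and return it as a string."""
    rsrcmgr = PDFResourceManager()  # holds resources shared across pages, such as fonts
    retstr = StringIO()             # in-memory buffer that receives the extracted text
    laparams = LAParams()           # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    # Process the PDF and write its text into retstr through the converter
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()

    content = retstr.getvalue()
    retstr.close()
    return content


# Download the sample PDF, extract its text, and print it
pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()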
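
The role StringIO plays in readPDF can be seen in isolation with a small sketch that uses only the standard library. The function name collect_text is hypothetical and not part of PDFMiner3K; it simply imitates a converter that calls write() on an output stream several times before the caller reads the accumulated text back with getvalue().

```python
from io import StringIO


def collect_text(chunks):
    """Write each chunk to an in-memory text buffer and return the combined string.

    Hypothetical helper for illustration only: it stands in for a converter
    that writes extracted text piece by piece into the stream it was given.
    """
    buffer = StringIO()
    try:
        for chunk in chunks:
            buffer.write(chunk)  # same call a converter would make on its output stream
        return buffer.getvalue()  # everything written so far, as one string
    finally:
        buffer.close()


if __name__ == "__main__":
    print(collect_text(["Page one text\n", "Page two text\n"]))
```

Because the buffer lives in memory, no temporary file is needed to hand the extracted text back to the caller.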

If the PDF file is on your computer, replace the pdfFile object returned by urlopen with a file object returned by the built-in open(), opened in binary mode ("rb").
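
One way to handle both cases with the same calling code is sketched below. The helper name open_pdf_source is an assumption made for this example and does not come from the original article or from PDFMiner3K; it returns a binary file-like object either from urlopen or from open(), depending on whether the argument looks like an http(s) URL.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def open_pdf_source(source):
    """Return a binary file-like object for a URL or a local path.

    Hypothetical helper: anything with an http or https scheme is fetched
    with urlopen, everything else is treated as a local file path.
    """
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        return urlopen(source)
    # PDF data is binary, so the local file is opened with "rb"
    return open(source, "rb")
```

Either return value could then be passed to readPDF and closed afterwards, for example pdfFile = open_pdf_source("chapter1.pdf"), where the file name is only a placeholder.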

3. Outlook

This experiment only converts a PDF into plain text; it does not convert it into HTML tags as described at the beginning. Whether the Python programming environment offers that capability remains to be explored.
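
As a very rough first step in that direction, the extracted text could be wrapped in div tags by hand. The sketch below uses only the standard library and a hypothetical function name, text_to_divs. It assumes that paragraphs in the extracted text are separated by blank lines, and it does not reproduce the layout-aware conversion that Firefox performs.

```python
from html import escape


def text_to_divs(text):
    """Wrap each blank-line-separated paragraph in a <div> element.

    Hypothetical helper: whitespace inside a paragraph is collapsed to
    single spaces and HTML special characters are escaped.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return "\n".join(
        "<div>{}</div>".format(escape(" ".join(p.split())))
        for p in paragraphs
    )


if __name__ == "__main__":
    print(text_to_divs("First paragraph,\nstill first.\n\nSecond paragraph."))
```

This produces flat markup only; positions, fonts, and table structure from the original PDF are lost, which is why the question above remains open.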