How to Extract Text from PDF Files using PDFMiner in Python with the Latest API Changes?

Linda Hamilton

Release: 2024-10-17 14:23:29

Text Extraction from PDF Files Using PDFMiner in Python

Extracting text from PDF files is a common task in document processing. The third-party PDFMiner library, maintained today as pdfminer.six, handles this in Python. However, the PDFMiner API has changed over time, so many older examples no longer run as written.

To address this, let's explore a working example of text extraction using the current version of PDFMiner:

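<code class="python">from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of the PDF at `path` as a single string."""
    rsrcmgr = PDFResourceManager()  # shared resource cache (fonts, images)
    retstr = StringIO()  # in-memory buffer that receives the extracted text
    # With a text buffer such as StringIO, no codec argument is needed.
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""    # set this to open a password-protected PDF
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages; otherwise zero-based indices

    try:
        # The with-block closes the file even if a page fails to parse.
        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                          password=password, caching=caching,
                                          check_extractable=True):
                interpreter.process_page(page)
        # Read the buffer before it is closed in the finally clause.
        return retstr.getvalue()
    finally:
        device.close()
        retstr.close()</code>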

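This function takes a PDF file path as input and returns the extracted text as a single string. With the values shown (`password = ""`, `maxpages = 0`, and an empty `pagenos` set), it processes every page of an unencrypted document. To open a password-protected PDF, set `password`; to restrict extraction to specific pages, fill `pagenos` with zero-based page indices.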
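Because `PDFPage.get_pages` treats `pagenos` as zero-based indices, it can help to convert a human-friendly page specification first. The following is a minimal sketch of one way to do that; `parse_page_ranges` is a hypothetical helper written for this illustration and is not part of PDFMiner.

```python
def parse_page_ranges(spec):
    """Convert a 1-based spec such as "1-3,5" into a set of zero-based indices.

    Hypothetical helper for illustration; not part of PDFMiner.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas and empty input
        first, sep, last = part.partition("-")
        start = int(first)
        end = int(last) if sep else start  # a single number is a one-page range
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Shift from 1-based page numbers to zero-based indices.
        pages.update(range(start - 1, end))
    return pages
```

The resulting set could replace the empty `pagenos` set inside `convert_pdf_to_txt`, for example `pagenos = parse_page_ranges("1-3,5")` to extract the first three pages and the fifth.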

With a current PDFMiner release installed, this function gives you a reusable way to extract text from PDF files in your Python applications.
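If the installed package is pdfminer.six, its `pdfminer.high_level` module also offers an `extract_text` function that wraps the same resource manager, converter, and interpreter steps in a single call. The sketch below assumes pdfminer.six is installed; the wrapper name `pdf_to_text` is an assumption made for this example, not something from the original article.

```python
def pdf_to_text(path, password="", page_numbers=None):
    """Return the text of the PDF at `path` via pdfminer.six's high-level helper.

    Assumes pdfminer.six is installed; `pdf_to_text` is an illustrative name.
    """
    # Imported inside the function so the dependency is only needed when called.
    from pdfminer.high_level import extract_text

    # page_numbers holds zero-based page indices; None means all pages.
    return extract_text(path, password=password, page_numbers=page_numbers)
```

A call such as `pdf_to_text("report.pdf", page_numbers=[0, 1])` would return the text of the first two pages, where `report.pdf` stands in for any local file. The lower-level function above remains useful when you need direct control over the converter or layout parameters.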
