How to Extract Text from PDF Files Using the Latest Version of PDFMiner in Python? - Python Tutorial - PHP中文網


Patricia Arquette

Published: 2024-10-17 14:29:30



Extracting Text from PDF Files with PDFMiner in Python

Question:

How can I extract text from a PDF file using the latest version of PDFMiner in Python?

Answer:

PDFMiner's API changed significantly in a recent update. The following function extracts text with the current version:

<code class="python">from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of the PDF at `path` as a single string."""
    rsrcmgr = PDFResourceManager()  # holds shared resources such as fonts
    retstr = StringIO()             # in-memory buffer that collects the text
    codec = 'utf-8'
    laparams = LAParams()           # default layout-analysis settings
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    password = ""     # empty string for unencrypted PDFs
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # empty set means every page is processed

    try:
        # The with-block closes the file even if a page fails to parse.
        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
                interpreter.process_page(page)
        return retstr.getvalue()
    finally:
        device.close()
        retstr.close()</code>


Note: This solution accounts for the API changes introduced by PDFMiner's recent update, so it is compatible with the current version of the library.
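For comparison, here is a shorter sketch that assumes the installed package is the maintained `pdfminer.six` fork, which provides `pdfminer.high_level.extract_text`. The helper names `to_zero_based` and `extract_pdf_text` are made up for this illustration and are not part of the library. The sketch also shows one way to limit extraction to particular pages: PDFMiner counts pages from 0, so page numbers as a reader sees them are shifted down by one first.

```python
from typing import Iterable, Optional, Set


def to_zero_based(pages: Iterable[int]) -> Set[int]:
    """Convert 1-based page numbers to the 0-based indices PDFMiner expects."""
    result = set()
    for page in pages:
        if page < 1:
            raise ValueError(f"page numbers start at 1, got {page}")
        result.add(page - 1)
    return result


def extract_pdf_text(path: str, pages: Optional[Iterable[int]] = None) -> str:
    """Return the text of a PDF, optionally limited to the given 1-based pages."""
    # Imported here so the helper above can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    # None (or an empty set) means every page is processed.
    page_numbers = to_zero_based(pages) if pages is not None else None
    return extract_text(path, page_numbers=page_numbers)
```

A call such as `extract_pdf_text("report.pdf", pages=[1, 2])` would return the text of the first two pages; `report.pdf` is a placeholder file name. The longer function above remains useful when you need direct control over the converter or the layout parameters.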
