如何在 Python 中使用 PDFMiner 從 PDF 中提取文字?

Patricia Arquette
發布: 2024-10-17 14:26:02
原創
684 人瀏覽過

How to Extract Text from PDFs with PDFMiner in Python?

Extracting Text from PDFs with PDFMiner in Python

Question:

How can I extract text from a PDF file using PDFMiner in Python?

Answer:

Due to recent updates in PDFMiner's API, some existing documentation may contain outdated code. To extract text from a PDF file using the latest version of PDFMiner, follow these steps:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def extract_pdf_text(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text
登入後複製

This updated code addresses the changes in PDFMiner's syntax. It successfully extracts text from PDF files, as verified with Python 3.x, 3.7, and October 3, 2019 Python 3.7 using pdfminer.six, released in November 2018.

以上是如何在 Python 中使用 PDFMiner 從 PDF 中提取文字?的詳細內容。更多資訊請關注PHP中文網其他相關文章!

來源:php
本網站聲明
本文內容由網友自願投稿,版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容,請聯絡admin@php.cn
作者最新文章
熱門教學
更多>
最新下載
更多>
網站特效
網站源碼
網站素材
前端模板
關於我們 免責聲明 Sitemap
PHP中文網:公益線上PHP培訓,幫助PHP學習者快速成長!