Unlocking Text from Embedded-Font PDFs: A pytesseract OCR Tutorial

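Extracting text from a PDF is usually straightforward when the PDF is in English and has no embedded fonts. Once those assumptions no longer hold, basic Python libraries such as pdfminer or pdfplumber become hard to use. Last month, I was tasked with extracting text from a Gujarati PDF and importing data fields such as name, address, and city into JSON format.

If the font is embedded in the PDF itself, a simple copy and paste will not work, and pdfplumber returns unreadable garbage text. I therefore had to convert each PDF page to an image and then apply OCR with the pytesseract library to "scan" the page rather than just read it. This tutorial shows you how to do that.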

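What you need

pdfplumber (Python library)
pdf2image (Python library)
pytesseract (Python library)
tesseract-ocr

You can install the Python libraries with pip, as shown below. For Tesseract-OCR, download and install the software from the official website; pytesseract is only a wrapper around the Tesseract program. The OCR step also needs the Gujarati language data (guj) installed for Tesseract, and pdf2image relies on the Poppler utilities being available on your system.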

pip install pdfplumber
pip install pdf2image
pip install pytesseract
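
Before running OCR, it can help to confirm that Tesseract actually has the language data you need. The following is a minimal sketch, not part of the original tutorial: the helper names missing_languages() and installed_tesseract_languages() are hypothetical, and the parsing assumes that `tesseract --list-langs` prints a header line followed by one language code per line.

```python
import shutil
import subprocess

def missing_languages(required, available):
    """Return the required language codes that are absent from `available`."""
    available_set = set(available)
    return [code for code in required if code not in available_set]

def installed_tesseract_languages():
    """Ask the tesseract CLI for its languages; return [] if it is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True
    )
    # Assumption: some versions print the list to stderr instead of stdout
    output = result.stdout or result.stderr
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header
        if line and not line.startswith("List of"):
            langs.append(line)
    return langs

if __name__ == "__main__":
    # An empty list means both guj and eng are available
    print(missing_languages(["guj", "eng"], installed_tesseract_languages()))
```

If the printed list contains guj, install the Gujarati trained data for Tesseract before continuing.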

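Converting PDF pages to images

The first step is to convert the PDF page to an image. The extract_text_from_pdf() function does exactly that, then passes the image on to OCR and parsing. You pass it the PDF path and page_num (zero-indexed) as arguments. Note that I first convert the page to black and white for clarity; this step is optional.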

import pdfplumber
import pytesseract
from pdf2image import convert_from_path

# Extract text from a specific page of a PDF
def extract_text_from_pdf(pdf_path, page_num):
    # Use pdfplumber to open the PDF and confirm the page exists
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
    print(f"extracting page {page_num}..")
    # pdf2image counts pages from 1, so shift the zero-indexed page_num
    images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1)
    image = images[0]
    # Convert to black and white
    bw_image = convert_to_bw(image)
    # Save the B&W image for debugging (optional)
    #bw_image.save("bw_page.png")
    # Perform OCR on the B&W image
    e_text = ocr_image(bw_image)
    if e_text is None:  # ocr_image() returns None when OCR fails
        return
    with open('out.txt', 'w', encoding='utf-8') as f:
        f.write(e_text)
    #print("output written to file.")
    try:
        process_text(page_num, e_text)
    except Exception as e:
        print("Error occurred:", e)
    print("done..")

# Convert image to black and white
def convert_to_bw(image):
    # Convert to grayscale
    gray = image.convert('L')
    # Apply threshold to convert to pure black and white
    bw = gray.point(lambda x: 0 if x < 128 else 255, '1')
    return bw

# Perform OCR using Tesseract on a given PIL image
def ocr_image(image):
    try:
        # --oem 3 selects the default OCR engine; --psm 6 treats the image as a single block of text
        custom_config = r'--oem 3 --psm 6 -l guj+eng'
        text = pytesseract.image_to_string(image, config=custom_config)
        return text
    except Exception as e:
        print(f"Error during OCR: {e}")
        return None

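The ocr_image() function uses pytesseract to extract text from the image through OCR. Technical parameters such as --oem and --psm control how the image is processed, and the -l guj+eng parameter sets the languages to read. Because this PDF occasionally contains English text, I used guj+eng.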
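
To make the option string easier to adapt to other languages or layouts, it can be assembled from its parts. This is a small sketch under my own assumptions: build_tesseract_config() is a hypothetical helper, not a pytesseract function. It only produces the string that is passed as the config argument, joining language codes with "+" the way Tesseract expects.

```python
def build_tesseract_config(languages, oem=3, psm=6):
    """Return a Tesseract option string such as '--oem 3 --psm 6 -l guj+eng'.

    oem: OCR engine mode (3 is the default engine selection).
    psm: page segmentation mode (6 assumes a single uniform block of text).
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract accepts several languages joined with '+'
    lang_spec = "+".join(languages)
    return f"--oem {oem} --psm {psm} -l {lang_spec}"

print(build_tesseract_config(["guj", "eng"]))  # --oem 3 --psm 6 -l guj+eng
```

If a page is laid out differently, for example as a single line or a sparse form, trying another --psm value is often the first thing worth experimenting with.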

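Processing the text

Once the text has been captured with OCR, you can parse it into whatever format you need. This part is similar to working with other PDF libraries such as pdfplumber or PyPDF2.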
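
Two small building blocks of this kind of parsing are cleaning a line into tokens and recognising a date token. The sketch below is illustrative only: tokenize_line() and looks_like_date() are hypothetical names, and the date pattern assumes an eight-character dd-mm-yy layout, which is my reading of the length and hyphen check in the parsing code rather than something the source states.

```python
import re

# Table-border characters that OCR tends to leave behind in each row
BORDER_CHARS = "|[]"

def tokenize_line(line):
    """Strip border characters and split the line on whitespace."""
    for ch in BORDER_CHARS:
        line = line.replace(ch, "")
    return line.split()

# In Python 3, \d also matches Gujarati digits, not only ASCII ones
DATE_PATTERN = re.compile(r"\d{2}-\d{2}-\d{2}")

def looks_like_date(token):
    """Return True for tokens shaped like dd-mm-yy (assumed layout)."""
    return DATE_PATTERN.fullmatch(token) is not None

print(tokenize_line("| 0૧ [પટેલ] |"))
print(looks_like_date("12-05-90"), looks_like_date("વર્ષ"))
```

A regular expression is stricter than checking only the length and the third character, so it rejects more OCR noise, at the cost of also rejecting dates that were misread slightly.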

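import json

# Characters allowed in a serial number: an ASCII zero plus the Gujarati digits 1-9
nums = ['0', '૧', '૨', '૩', '૪', '૫', '૬', '૭', '૮', '૯']

# Parsed records, accumulated across pages
records = []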

def process_text(page_num, e_text):
    obj = None
    last_surname = None
    last_kramank = None
    print(f"processing page {page_num}..")
    for line in e_text.splitlines():
        line = line.replace('|', '').replace('[', '').replace(']', '')
        parts = [word for word in line.split(' ') if word]
        if len(parts) == 0: continue
        new_rec = True
        for char in parts[0]:
            if char not in nums:
                new_rec = False
                break
        if len(parts) < 2: continue

        if new_rec and len(parts[0]) >= 2: # numbered line
            if len(parts) < 9: continue
            if obj: records.append(obj)
            obj = {}
            last_surname = parts[1]
            obj['kramank'] = parts[0]
            last_kramank = parts[0]
            obj['full_name'] = ' '.join(parts[1:4])
            obj['surname'] = parts[1]
            obj['pdf_page_num'] = page_num + 1
            obj['registered_by'] = parts[4]
            obj['village_vatan'] = parts[5]
            obj['village_mosal'] = parts[6]
            if parts[8] == 'વર્ષ':
                idx = 7
                obj['dob'] = parts[idx] + ' વર્ષ'
                idx += 1
            elif len(parts[7]) == 8 and parts[7][2] == '-':
                idx = 7
                obj['dob'] = parts[idx]
            else:
                print("warning: no date")
                idx = 6
            obj['marital_status'] = parts[idx+1]
            obj['extra_fields'] = '::'.join(parts[idx+2:-2])
            obj['blood_group'] = parts[-1]
        elif parts[0] == last_surname: # new member in existing family
            if len(parts) < 6: continue # too short to hold the fixed fields
            if obj: records.append(obj)
            obj = {}
            obj['kramank'] = last_kramank
            obj['surname'] = last_surname
            obj['full_name'] = ' '.join(parts[0:3])
            obj['pdf_page_num'] = page_num + 1
            obj['registered_by'] = parts[3]
            obj['village_vatan'] = parts[4]
            obj['village_mosal'] = parts[5]
            obj['extra_fields'] = '' # lets continuation lines append safely
            if len(parts) <= 6: continue
            if len(parts) > 7 and parts[7] == 'વર્ષ': # date exists
                idx = 6
                obj['dob'] = parts[idx] + ' વર્ષ'
                idx += 1
            elif len(parts[6]) == 8 and parts[6][2] == '-':
                idx = 6
                obj['dob'] = parts[idx]
            else:
                print("warning: no date")
                idx = 5
            obj['marital_status'] = parts[idx+1]
            obj['extra_fields'] = '::'.join(parts[idx+2:-2])
            obj['blood_group'] = parts[-1]
        elif obj: # continuation lines
            if ("(" in line and ")" in line) or "મો.ઃ" in line:
                obj['extra_fields'] += ' ' + '::'.join(parts[0:])
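    if obj: records.append(obj)
    # ensure_ascii=False keeps the Gujarati text readable instead of \uXXXX escapes
    jstr = json.dumps(records, indent=4, ensure_ascii=False)
    with open("guj.json", 'w', encoding='utf-8') as f:
        f.write(jstr)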
    print(f"written page {page_num} to json..")

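Every PDF has its own quirks that you must account for. In this case, a new serial number in the first field (such as 0૧ or 0૨) marks the start of a new group, which coincides with a change in the following field (the surname).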
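
The serial-number check can be isolated into a small helper, which also makes it easy to turn a serial such as 0૧ into an ordinary integer. This is a sketch with hypothetical names (is_serial() and serial_to_int()); unlike the parsing code, it accepts all ASCII and Gujarati digits, which is an assumption about what OCR may produce. It relies on Python's int() accepting any Unicode decimal digits.

```python
GUJARATI_DIGITS = "૦૧૨૩૪૫૬૭૮૯"
ALLOWED_DIGITS = set(GUJARATI_DIGITS + "0123456789")

def is_serial(token, min_len=2):
    """Return True if the token is at least min_len characters, all digits."""
    return len(token) >= min_len and all(ch in ALLOWED_DIGITS for ch in token)

def serial_to_int(token):
    """Convert a serial made of ASCII and/or Gujarati digits to an int."""
    return int(token)

print(is_serial("0૧"), serial_to_int("0૧"))  # True 1
print(is_serial("પટેલ"))  # False
```

Storing the integer alongside the original string makes it simpler to sort records or spot gaps in the numbering after OCR.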

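pytesseract is a testament to how far technology has come. About ten years ago, reading or parsing PDF images with non-English OCR on a modestly configured PC or laptop was nearly impossible. That is real progress! Happy coding, and let me know in the comments below how it goes.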