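Extracting text from a PDF is usually straightforward when the PDF is in English and has no embedded fonts. Once you remove those assumptions, however, it becomes challenging with basic Python libraries such as pdfminer or pdfplumber. Last month, I was tasked with extracting text from a Gujarati PDF and importing data fields such as name, address, and city into JSON format.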
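If the font is embedded in the PDF itself, a simple copy-paste will not work, and pdfplumber will return unreadable junk text. I therefore had to convert each PDF page to an image and then apply OCR with the pytesseract library to "scan" the page rather than just read it. This tutorial shows you how to do that.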
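Before falling back to OCR, it can help to check whether the directly extracted text is actually usable. The following is a minimal sketch, not part of the original workflow: the helper names `gujarati_ratio` and `needs_ocr` and the 0.5 threshold are assumptions made for illustration, and the heuristic only makes sense for pages that are expected to be mostly Gujarati.

```python
def gujarati_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Gujarati Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # The Gujarati block spans U+0A80 to U+0AFF
    gujarati = sum(1 for c in chars if "\u0a80" <= c <= "\u0aff")
    return gujarati / len(chars)


def needs_ocr(text: str, threshold: float = 0.5) -> bool:
    """Guess whether extracted text is junk, so the page should be OCR'd instead.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return gujarati_ratio(text) < threshold
```

With pdfplumber, the input would typically be something like `page.extract_text() or ""`; a page that comes back empty or as mis-decoded symbols is then flagged for OCR.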
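You can install the Python libraries with pip, as shown below. For Tesseract-OCR, download and install the software from the official website; pytesseract is only a wrapper around the tesseract software. Note that pdf2image additionally relies on the Poppler utilities being installed on your system.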
pip install pdfplumber
pip install pdf2image
pip install pytesseract
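Tesseract needs trained data for each language it reads, so it is worth confirming that both `guj` and `eng` are available before running OCR. This is a rough sketch that shells out to `tesseract --list-langs`; the helper names `parse_langs` and `missing_langs` are made up for this example, and it assumes the `tesseract` executable is on your PATH.

```python
import shutil
import subprocess


def parse_langs(listing: str) -> set[str]:
    """Parse the output of `tesseract --list-langs` into a set of language codes."""
    lines = [ln.strip() for ln in listing.splitlines() if ln.strip()]
    # The first line is a header such as: List of available languages ... (3):
    return {ln for ln in lines if not ln.lower().startswith("list of")}


def missing_langs(required=("guj", "eng")) -> set[str]:
    """Return the required language codes that the local Tesseract lacks."""
    exe = shutil.which("tesseract")
    if exe is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Older builds may print the listing to stderr, so both streams are checked.
    return set(required) - parse_langs(result.stdout + "\n" + result.stderr)
```

If `missing_langs()` returns a non-empty set, the corresponding `.traineddata` files still need to be installed.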
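The first step is to convert the PDF page to an image. The extract_text_from_pdf() function below does exactly that: you pass the PDF path and page_num (zero-indexed) as arguments. Because pdf2image numbers pages from 1, the function passes page_num+1 to it. Note that I first convert the page to black and white for clarity; this step is optional.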
import pdfplumber
import pytesseract
from pdf2image import convert_from_path


# Extract text from a specific page of a PDF
def extract_text_from_pdf(pdf_path, page_num):
    # Use pdfplumber to open the PDF; indexing the page fails early
    # if page_num is out of range
    with pdfplumber.open(pdf_path) as pdf:
        _ = pdf.pages[page_num]
    print(f"extracting page {page_num}..")
    # pdf2image numbers pages from 1, while page_num is zero-indexed
    images = convert_from_path(pdf_path, first_page=page_num + 1, last_page=page_num + 1)
    image = images[0]
    # Convert to black and white
    bw_image = convert_to_bw(image)
    # Save the B&W image for debugging (optional)
    # bw_image.save("bw_page.png")
    # Perform OCR on the B&W image
    e_text = ocr_image(bw_image)
    if e_text is None:
        # OCR failed; ocr_image() has already printed the error
        return
    with open('out.txt', 'w', encoding='utf-8') as f:
        f.write(e_text)
    # print("output written to file.")
    try:
        process_text(page_num, e_text)
    except Exception as e:
        print("Error occurred:", e)
    print("done..")


# Convert image to black and white
def convert_to_bw(image):
    # Convert to grayscale
    gray = image.convert('L')
    # Apply threshold to convert to pure black and white
    bw = gray.point(lambda x: 0 if x < 128 else 255, '1')
    return bw


# Perform OCR using Tesseract on a given image (a PIL image object)
def ocr_image(image):
    try:
        # --psm 6 treats the image as a single block of text
        custom_config = r'--oem 3 --psm 6 -l guj+eng'
        text = pytesseract.image_to_string(image, config=custom_config)
        return text
    except Exception as e:
        print(f"Error during OCR: {e}")
        return None
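The ocr_image() function uses pytesseract to extract text from the image via OCR. Technical parameters such as --oem and --psm control how the image is processed, and the -l guj+eng parameter sets the languages to read. Because this PDF occasionally contains English text, I used guj+eng.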
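To make the structure of that configuration string explicit, here is a small sketch that assembles it from its parts. The helper name `build_tesseract_config` is hypothetical; the flags themselves are standard Tesseract options: `--oem` selects the OCR engine mode (3 is the default), `--psm` selects the page segmentation mode (6 assumes a single uniform block of text), and multiple languages are joined with `+`.

```python
def build_tesseract_config(langs=("guj", "eng"), oem=3, psm=6) -> str:
    """Assemble a Tesseract config string such as '--oem 3 --psm 6 -l guj+eng'."""
    if not langs:
        raise ValueError("at least one language code is required")
    # Tesseract expects several languages as a single '+'-separated value
    return f"--oem {oem} --psm {psm} -l {'+'.join(langs)}"


config = build_tesseract_config()
```

The resulting string can be passed as the `config` argument of `pytesseract.image_to_string()`. That function also accepts a separate `lang` argument, so passing `lang="guj+eng"` alongside a config of `--oem 3 --psm 6` should be equivalent.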
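Once the text has been imported with OCR, you can parse it into whatever format you need. This part is similar to working with other PDF libraries such as pdfplumber or PyPDF2.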
import json

# Digits accepted in a serial number: ASCII '0' plus the Gujarati digits 1-9
nums = ['0', '૧', '૨', '૩', '૪', '૫', '૬', '૭', '૮', '૯']
# Records accumulated across all processed pages
records = []


def process_text(page_num, e_text):
    obj = None
    last_surname = None
    last_kramank = None
    print(f"processing page {page_num}..")
    for line in e_text.splitlines():
        # Strip table-border characters that OCR picks up
        line = line.replace('|', '').replace('[', '').replace(']', '')
        parts = [word for word in line.split(' ') if word]
        if len(parts) == 0:
            continue
        # A line starts a new record if its first token is all digits
        new_rec = True
        for char in parts[0]:
            if char not in nums:
                new_rec = False
                break
        if len(parts) < 2:
            continue
        if new_rec and len(parts[0]) >= 2:  # numbered line
            if len(parts) < 9:
                continue
            if obj:
                records.append(obj)
            obj = {}
            last_surname = parts[1]
            obj['kramank'] = parts[0]
            last_kramank = parts[0]
            obj['full_name'] = ' '.join(parts[1:4])
            obj['surname'] = parts[1]
            obj['pdf_page_num'] = page_num + 1
            obj['registered_by'] = parts[4]
            obj['village_vatan'] = parts[5]
            obj['village_mosal'] = parts[6]
            if parts[8] == 'વર્ષ':  # age given in years
                idx = 7
                obj['dob'] = parts[idx] + ' વર્ષ'
                idx += 1
            elif len(parts[7]) == 8 and parts[7][2] == '-':  # 8-character date
                idx = 7
                obj['dob'] = parts[idx]
            else:
                print("warning: no date")
                idx = 6
            # Guard the index so a short line does not abort the whole page
            obj['marital_status'] = parts[idx+1] if len(parts) > idx + 1 else ''
            obj['extra_fields'] = '::'.join(parts[idx+2:-2])
            obj['blood_group'] = parts[-1]
        elif parts[0] == last_surname:  # new member in existing family
            if len(parts) < 6:
                continue  # too short to hold the fixed fields below
            if obj:
                records.append(obj)
            obj = {}
            obj['kramank'] = last_kramank
            obj['surname'] = last_surname
            obj['full_name'] = ' '.join(parts[0:3])
            obj['pdf_page_num'] = page_num + 1
            obj['registered_by'] = parts[3]
            obj['village_vatan'] = parts[4]
            obj['village_mosal'] = parts[5]
            if len(parts) <= 6:
                continue
            if len(parts) > 7 and parts[7] == 'વર્ષ':  # date exists
                idx = 6
                obj['dob'] = parts[idx] + ' વર્ષ'
                idx += 1
            elif len(parts[6]) == 8 and parts[6][2] == '-':
                idx = 6
                obj['dob'] = parts[idx]
            else:
                print("warning: no date")
                idx = 5
            obj['marital_status'] = parts[idx+1] if len(parts) > idx + 1 else ''
            obj['extra_fields'] = '::'.join(parts[idx+2:-2])
            obj['blood_group'] = parts[-1]
        elif obj:  # continuation lines
            if ("(" in line and ")" in line) or "મો.ઃ" in line:
                # .get() covers records that ended before extra_fields was set
                obj['extra_fields'] = obj.get('extra_fields', '') + ' ' + '::'.join(parts[0:])
    if obj:
        records.append(obj)
    # ensure_ascii=False keeps the Gujarati text readable instead of \u escapes
    jstr = json.dumps(records, indent=4, ensure_ascii=False)
    with open("guj.json", 'w', encoding='utf-8') as f:
        f.write(jstr)
    print(f"written page {page_num} to json..")
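The parsing step depends on cleaning each OCR line and recognising Gujarati digits. As a small illustration, the sketch below strips the same border characters as the parser and maps the Gujarati digits (U+0AE6 to U+0AEF) to ASCII, which can be handy if serial numbers need to be sorted or compared as integers. The helper names `clean_line` and `normalize_digits` are assumptions for this example and are not part of the code above.

```python
# Gujarati digits zero through nine occupy U+0AE6 .. U+0AEF
GUJARATI_DIGITS = "".join(chr(0x0AE6 + i) for i in range(10))
TO_ASCII = str.maketrans(GUJARATI_DIGITS, "0123456789")


def normalize_digits(token: str) -> str:
    """Replace Gujarati digits with ASCII digits; other characters are kept."""
    return token.translate(TO_ASCII)


def clean_line(line: str) -> list[str]:
    """Drop table-border characters and split the line into non-empty tokens."""
    for ch in "|[]":
        line = line.replace(ch, "")
    return line.split()
```

For instance, `int(normalize_digits(parts[0]))` turns a serial token into a plain integer once the line has been split.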
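Every PDF has its own nuances that must be taken into account. In this case, a new serial number in the first field (such as 0૧ or 0૨) indicates a new group whenever the following field (the surname) changes.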
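The grouping rule can be shown in isolation. The sketch below is a simplified illustration and not the full parser: a row whose first token is a serial number opens a new group, and a later row that begins with the same surname is attached to that group. The function names and the sample rows (with made-up Latin placeholder names) are hypothetical.

```python
# Characters allowed in a serial number, matching the parser's digit list
SERIAL_CHARS = set("0૧૨૩૪૫૬૭૮૯")


def is_serial(token: str) -> bool:
    """True if the token looks like a serial number such as '0૧'."""
    return len(token) >= 2 and all(c in SERIAL_CHARS for c in token)


def group_rows(rows):
    """Group tokenized rows into families keyed by serial number and surname."""
    groups = []
    for parts in rows:
        if not parts:
            continue
        if is_serial(parts[0]) and len(parts) > 1:
            # Numbered row: starts a new group; the surname follows the serial
            groups.append({"kramank": parts[0], "surname": parts[1],
                           "members": [parts[1:]]})
        elif groups and parts[0] == groups[-1]["surname"]:
            # Row starting with the current surname: another family member
            groups[-1]["members"].append(parts)
    return groups


# Made-up sample rows, already split into tokens
sample = [["0૧", "Patel", "Ramesh"], ["Patel", "Suresh"], ["0૨", "Shah", "Mina"]]
groups = group_rows(sample)
```

Here the first two rows end up in one group and the third row opens a second one.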
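pytesseract is a testament to how far IT has evolved and advanced. About a decade ago, reading or parsing PDF images with non-English OCR on a modestly configured PC or laptop was nearly impossible. This is real progress! Happy coding, and let me know how it goes in the comments below.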