How to extract PDF text in Python

(*-*)浩
Release: 2019-07-09 10:21:49

This article shows you how to use Python to extract the text content of many PDF files in batches.

First, we import two standard-library modules for working with files.

import glob
import os

The demo directory contains two folders, pdf and newpdf.

The PDF files are stored in the pdf folder, so we set that as the path.

pdf_path = "pdf/"

We want the paths of all the PDF files in that folder. With glob, this takes a single call.

pdfs = glob.glob(os.path.join(pdf_path, "*.pdf"))

Let's check that the file paths we obtained are correct.

pdfs
['pdf/复杂系统仿真的微博客虚假信息扩散模型研究.pdf',
'pdf/面向影子分析的社交媒体竞争情报搜集.pdf',
'pdf/面向人机协同的移动互联网政务门户探析.pdf']

The paths are correct.
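glob returns its matches in arbitrary order, so when the processing order matters it may help to sort the result. The following is a minimal sketch of that idea; the helper name list_pdfs is hypothetical and not part of the original demo.

```python
import glob
import os


def list_pdfs(folder):
    """Return the paths of all .pdf files directly inside folder, sorted by name."""
    # os.path.join copes with folder names given with or without a trailing slash.
    pattern = os.path.join(folder, "*.pdf")
    # glob.glob makes no promise about ordering, so sort for reproducible runs.
    return sorted(glob.glob(pattern))
```

With the demo layout, pdfs = list_pdfs("pdf/") would give the same files as the call above, in a stable order.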

Next, we use pdfminer to extract the content of the PDF files. To do so, we import the function extract_pdf_content from the helper Python file pdf_extractor.py.

from pdf_extractor import extract_pdf_content
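The contents of pdf_extractor.py are not shown in this article. As a rough idea of what such a helper might look like, here is a minimal sketch that assumes the pdfminer.six package and its high-level extract_text function. The original helper may well be implemented differently.

```python
import os


def extract_pdf_content(pdf_path):
    """Return the text of one PDF file as a single string.

    Assumption: pdfminer.six is installed (pip install pdfminer.six).
    This is a stand-in for the article's helper, not its actual code.
    """
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(pdf_path)
    # Imported here so that a wrong path is reported before pdfminer is needed.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)
```

Saving this sketch as pdf_extractor.py next to the notebook or script would let the import above succeed.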

Using this function, we extract the content of the first file in the list and store the text in the content variable.

content = extract_pdf_content(pdfs[0])

The extraction is not perfect: headers, footers and other page elements are mixed in with the body text. For many text analysis tasks, however, this does not matter.
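To process the whole list rather than only the first file, the same call can be placed in a loop. The sketch below is one possible way to do this, not the article's own code: the function name extract_all and the output folder name txt are assumptions chosen for illustration, and the extraction function is passed in as an argument.

```python
import os


def extract_all(pdf_paths, extract, out_dir="txt"):
    """Run extract on each PDF path and save the text as a .txt file in out_dir.

    extract is any function that maps a PDF path to a string,
    for example the extract_pdf_content helper imported earlier.
    Returns the list of text files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for pdf in pdf_paths:
        # "pdf/report.pdf" -> "report"
        stem = os.path.splitext(os.path.basename(pdf))[0]
        out_path = os.path.join(out_dir, stem + ".txt")
        # UTF-8 keeps non-ASCII text such as Chinese intact.
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(extract(pdf))
        written.append(out_path)
    return written
```

A call such as extract_all(pdfs, extract_pdf_content) would then write one text file per PDF.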
