How to Extract Images from PDFs Without Resampling in Python?
To extract images from a PDF document at their original resolution, without resampling, you can use the PyMuPDF module (imported as fitz). This Python module lets you open PDF files and access their embedded objects directly. Here's how you can use PyMuPDF to extract images:
<code class="python">import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
for page_num in range(len(doc)):
    # Legacy camelCase API; newer PyMuPDF releases name these
    # get_page_images() and Pixmap.save()
    for img in doc.getPageImageList(page_num):
        xref = img[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # grayscale or RGB (alpha not counted)
            pix.writePNG(f"page_{page_num}_img_{xref}.png")
        else:  # CMYK: convert to RGB before saving
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG(f"page_{page_num}_img_{xref}.png")</code>
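In this code, we iterate through the pages of the PDF and the images referenced on each page. The 'xref' variable is the image's cross-reference number, which identifies the image object inside the PDF. The check on 'pix.n' (the number of components per pixel) separates grayscale and RGB images, which can be written to PNG directly, from CMYK images, which must first be converted to RGB because PNG has no CMYK mode.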
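The colour-space decision described above can be isolated as a small pure function. This is only a sketch: needs_rgb_conversion is a hypothetical helper name, not part of PyMuPDF, and it assumes the component count and alpha flag are taken from a Pixmap's n and alpha attributes. Subtracting the alpha channel keeps an RGB image with transparency (4 components) from being mistaken for CMYK.

```python
def needs_rgb_conversion(n: int, alpha: bool) -> bool:
    """Return True if a pixmap with n components per pixel must be
    converted to RGB before it can be saved as PNG.

    n counts every component, including the alpha channel if present,
    so the alpha channel is subtracted first:
      1 colour component  -> grayscale
      3 colour components -> RGB
      4 colour components -> CMYK (not supported by PNG)
    """
    colour_components = n - int(alpha)
    return colour_components >= 4


if __name__ == "__main__":
    # (n, alpha) pairs for a few common cases
    for label, n, alpha in [
        ("gray", 1, False),
        ("RGB", 3, False),
        ("RGB + alpha", 4, True),
        ("CMYK", 4, False),
    ]:
        print(label, needs_rgb_conversion(n, alpha))
```

With a real pixmap, the call would be needs_rgb_conversion(pix.n, pix.alpha).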
Alternatively, if you're using PyMuPDF (fitz) version 1.19.6, you can use the following code, which processes every PDF in a folder and uses tqdm to show a progress bar:
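<code class="python">import os

import fitz  # PyMuPDF
from tqdm import tqdm

workdir = "path_to_pdf_folder"

for each_path in os.listdir(workdir):
    # Match on the file extension only, not ".pdf" anywhere in the name
    if each_path.lower().endswith(".pdf"):
        doc = fitz.Document(os.path.join(workdir, each_path))
        for i in tqdm(range(len(doc)), desc="pages"):
            for img in tqdm(doc.get_page_images(i), desc="page_images"):
                xref = img[0]  # cross-reference number of the image object
                pix = fitz.Pixmap(doc, xref)
                if pix.n - pix.alpha > 3:  # CMYK: PNG needs gray or RGB
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                # each_path[:-4] strips the ".pdf" extension
                pix.save(os.path.join(workdir, f"{each_path[:-4]}_p{i}-{xref}.png"))</code>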
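Both snippets keep each image's original pixel dimensions, since no resampling takes place, but they re-encode the data as PNG. If you also need the original encoding (for example JPEG), doc.extract_image(xref) returns the embedded image bytes together with a matching file extension.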
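Because the snippets above write PNG files, keeping the embedded encoding takes a slightly different route: writing the bytes returned by doc.extract_image(xref) straight to disk. The following is a minimal sketch under stated assumptions, not a definitive implementation. The function names save_original_images and image_filename, and the output naming scheme, are made up for illustration; the sketch relies on the "image" and "ext" keys of the dictionary that extract_image returns. Images that carry a separate soft mask would likely be written without their transparency.

```python
import os


def image_filename(page_num: int, xref: int, ext: str) -> str:
    """Build an output file name (hypothetical naming scheme)."""
    return f"page_{page_num}_img_{xref}.{ext}"


def save_original_images(pdf_path: str, out_dir: str) -> list:
    """Write every embedded image to out_dir in its stored encoding.

    Returns the list of file paths that were written.
    """
    # Imported here so image_filename() stays usable without PyMuPDF
    import fitz  # PyMuPDF

    os.makedirs(out_dir, exist_ok=True)
    saved = []
    seen = set()  # an image reused on several pages is saved only once
    with fitz.open(pdf_path) as doc:
        for page_num in range(len(doc)):
            for img in doc.get_page_images(page_num):
                xref = img[0]
                if xref in seen:
                    continue
                seen.add(xref)
                info = doc.extract_image(xref)
                out_path = os.path.join(
                    out_dir, image_filename(page_num, xref, info["ext"])
                )
                # info["image"] holds the bytes as stored in the PDF,
                # so nothing is decoded, resampled, or re-encoded here
                with open(out_path, "wb") as f:
                    f.write(info["image"])
                saved.append(out_path)
    return saved
```

A call such as save_original_images("input.pdf", "extracted") would then produce files like page_0_img_12.jpeg, with the extension depending on how each image is stored in the PDF.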