Extracting Images from PDFs without Resampling Using Python
To extract all images from a PDF document at their native resolution, without resampling, you can use the PyMuPDF module (imported as fitz). The scripts below look up each embedded image by its cross-reference number (xref) and write it out as a PNG file. The pixel data is not rescaled, but the output is always PNG, whatever format the image had inside the PDF.
Using PyMuPDF:
<code class="python">import fitz  # PyMuPDF

# Open the PDF document
doc = fitz.open("file.pdf")

# Iterate through the pages
for i in range(len(doc)):
    # Extract images from the current page.
    # getPageImageList() and writePNG() are the legacy camelCase names;
    # newer releases call them get_page_images() and Pixmap.save().
    for img in doc.getPageImageList(i):
        # Retrieve the image's xref and create a Pixmap at native size
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:
            # Gray or RGB (with or without alpha): save directly as PNG
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:
            # CMYK: convert to RGB first, then save
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        # Release the Pixmap
        pix = None</code>
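The channel test that decides between writing a PNG directly and converting to RGB first can be isolated as a small pure function. This is a minimal sketch: `needs_rgb_conversion` is a hypothetical helper name, and it assumes `n` and `alpha` carry the meaning of PyMuPDF's `Pixmap.n` (components per pixel, alpha included) and `Pixmap.alpha`.

```python
def needs_rgb_conversion(n, alpha):
    """Return True when a pixmap must be converted to RGB before saving as PNG.

    n     -- components per pixel, including the alpha channel if present
    alpha -- truthy when the pixmap has an alpha channel
    """
    # PNG stores gray (1 channel) or RGB (3 channels), optionally with alpha.
    # Four or more color channels, not counting alpha, indicates CMYK.
    color_channels = n - int(bool(alpha))
    return color_channels >= 4
```

Subtracting the alpha channel matters because a plain `n < 5` comparison would treat a CMYK pixmap without alpha (`n == 4`) the same as an RGB pixmap with alpha.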
Enhancements:
The following updated version of the script targets PyMuPDF (fitz) 1.19.6 and uses the snake_case method names such as get_page_images:
<code class="python">import os

import fitz  # PyMuPDF
from tqdm import tqdm

# Specify the work directory
workdir = "your_folder"

# Iterate through the PDFs in the directory
for each_path in os.listdir(workdir):
    if each_path.lower().endswith(".pdf"):
        # Open the PDF document
        doc = fitz.Document(os.path.join(workdir, each_path))
        for i in tqdm(range(len(doc)), desc="pages"):
            for img in tqdm(doc.get_page_images(i), desc="page_images"):
                # Build a Pixmap from the embedded image at its native size
                xref = img[0]
                pix = fitz.Pixmap(doc, xref)
                # PNG cannot store CMYK, so convert such images to RGB first
                if pix.n - pix.alpha >= 4:
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                # Save as PNG, named after the source PDF, page index, and xref
                pix.save(os.path.join(workdir, "%s_p%s-%s.png" % (each_path[:-4], i, xref)))
        doc.close()</code>
This version processes every PDF in the working directory, shows tqdm progress bars for the pages and for the images on each page, and names each output file after its source PDF, page index, and xref.
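Saving through a Pixmap always re-encodes the image as PNG. To keep the format the image had inside the PDF (JPEG, for example), one option is to write the raw bytes returned by `doc.extract_image(xref)`, a dictionary that includes the keys "ext" and "image". The sketch below assumes that approach; `image_filename` and `extract_native_images` are hypothetical helper names, and the naming scheme simply mirrors the one used above.

```python
import os


def image_filename(pdf_name, page_index, xref, ext):
    """Build an output name from the PDF name, page index, xref, and extension."""
    stem = os.path.splitext(os.path.basename(pdf_name))[0]
    return "%s_p%s-%s.%s" % (stem, page_index, xref, ext)


def extract_native_images(pdf_path, out_dir):
    """Write each embedded image in its stored format; return the paths written."""
    # Imported here so the naming helper above works without PyMuPDF installed.
    import fitz  # PyMuPDF

    written = []
    seen = set()  # an image reused on several pages keeps the same xref
    doc = fitz.open(pdf_path)
    try:
        for i in range(len(doc)):
            for img in doc.get_page_images(i):
                xref = img[0]
                if xref in seen:
                    continue
                seen.add(xref)
                info = doc.extract_image(xref)
                if not info:
                    continue  # nothing extractable for this xref
                name = image_filename(pdf_path, i, xref, info["ext"])
                out_path = os.path.join(out_dir, name)
                # Write the image bytes unchanged: no decoding, no resampling
                with open(out_path, "wb") as fh:
                    fh.write(info["image"])
                written.append(out_path)
    finally:
        doc.close()
    return written
```

Calling `extract_native_images("file.pdf", "your_folder")` would write one file per distinct image, with the extension taken from the "ext" value reported for each image rather than a fixed ".png".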