How Can You Extract Images from PDFs Using Python While Preserving Their Original Resolution?

DDD

Released: 2024-10-22 07:52:30


Extracting Images from PDFs without Resampling Using Python

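To extract every image from a PDF document at its native pixel dimensions, without resampling, you can use the PyMuPDF module (imported as fitz). The script below looks up each embedded image by its cross-reference number (xref) and writes it out as a PNG file. PNG is lossless, so the pixel data is not resampled; note, however, that images stored in another format (such as JPEG) are re-encoded, and CMYK images are converted to RGB before saving.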

Using PyMuPDF:

<code class="python">import fitz  # PyMuPDF

# Note: getPageImageList and writePNG are legacy camelCase names from older
# PyMuPDF releases; newer releases use get_page_images and Pixmap.save.

# Open the PDF document
doc = fitz.open("file.pdf")

# Iterate through the pages
for i in range(len(doc)):
    # Extract images from the current page
    for img in doc.getPageImageList(i):
        # Retrieve the image's XREF and create a Pixmap
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)

        # Grayscale or RGB (not counting an alpha channel): save as PNG directly
        if pix.n - pix.alpha < 4:
            pix.writePNG("p%s-%s.png" % (i, xref))

        # CMYK: PNG cannot store it, so convert to RGB first and then save
        else:
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None

        # Release the Pixmaps
        pix = None</code>
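
To check that an extracted image really keeps its native pixel dimensions, one option is to compare the Pixmap against the size the PDF itself declares. The following is only a minimal sketch: the helper names declared_image_sizes and is_unresampled are hypothetical, and it assumes the image tuples returned by get_page_images carry the width and height at indexes 2 and 3, as described in the PyMuPDF documentation.

```python
def declared_image_sizes(doc):
    """Map each image xref to the (width, height) in pixels declared in the PDF.

    `doc` is assumed to be an open PyMuPDF document (fitz.Document).
    """
    sizes = {}
    for page_index in range(len(doc)):
        for img in doc.get_page_images(page_index):
            # Assumed tuple layout: (xref, smask, width, height, ...)
            xref, width, height = img[0], img[2], img[3]
            sizes[xref] = (width, height)
    return sizes


def is_unresampled(pix, declared_size):
    """Return True if a Pixmap has exactly the pixel size the PDF declares."""
    return (pix.width, pix.height) == tuple(declared_size)


# Usage sketch (assumes PyMuPDF is installed and "file.pdf" exists):
#   import fitz
#   doc = fitz.open("file.pdf")
#   sizes = declared_image_sizes(doc)
#   for xref, size in sizes.items():
#       print(xref, size, is_unresampled(fitz.Pixmap(doc, xref), size))
```

The helpers only read attributes from the objects passed in, so they can be applied to any document opened with a PyMuPDF release that provides get_page_images.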


Enhancements:

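The following updated version of the script works with PyMuPDF (fitz) 1.19.6, which uses snake_case method names. It processes every PDF in a folder and shows progress bars with tqdm: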

<code class="python">import os
import fitz  # PyMuPDF
from tqdm import tqdm

# Specify the work directory
workdir = "your_folder"

# Iterate through the PDFs in the directory
for each_path in os.listdir(workdir):
    # Match on the file extension only, not on ".pdf" anywhere in the name
    if each_path.lower().endswith(".pdf"):
        # Open the PDF document
        doc = fitz.Document(os.path.join(workdir, each_path))
        stem = os.path.splitext(each_path)[0]

        for i in tqdm(range(len(doc)), desc="pages"):
            for img in tqdm(doc.get_page_images(i), desc="page_images"):
                # Build a Pixmap from the image's xref
                xref = img[0]
                pix = fitz.Pixmap(doc, xref)
                # PNG cannot store CMYK, so convert such images to RGB first
                if pix.n - pix.alpha >= 4:
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                # Save as PNG next to the source PDF
                pix.save(os.path.join(workdir, "%s_p%s-%s.png" % (stem, i, xref)))

        doc.close()</code>


This enhanced script shows progress bars for pages and for the images on each page, and names every output file after its source PDF, page index, and xref, in the pattern NAME_pPAGE-XREF.png.
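
Both scripts above re-encode every image as PNG. If the goal is also to keep the original file format (for example, leaving an embedded JPEG as a JPEG), the extract_image method is one possible route: it returns a dictionary that includes the stored image bytes under "image" and a file extension under "ext". The sketch below is a hedged illustration rather than a definitive implementation: the function name extract_native_images and the output naming are hypothetical, and it skips images already seen under the same xref on an earlier page.

```python
import os


def extract_native_images(doc, out_dir):
    """Write each embedded image in its stored format; return the paths written.

    `doc` is assumed to be an open PyMuPDF document (fitz.Document).
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    seen = set()  # an image reused on several pages keeps the same xref
    for page_index in range(len(doc)):
        for img in doc.get_page_images(page_index):
            xref = img[0]
            if xref in seen:
                continue
            seen.add(xref)
            info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
            if not info:
                continue  # nothing extractable for this xref
            name = "p%s-%s.%s" % (page_index, xref, info["ext"])
            path = os.path.join(out_dir, name)
            with open(path, "wb") as f:
                f.write(info["image"])  # bytes are written without re-encoding
            written.append(path)
    return written


# Usage sketch (assumes PyMuPDF is installed and "file.pdf" exists):
#   import fitz
#   doc = fitz.open("file.pdf")
#   print(extract_native_images(doc, "images"))
#   doc.close()
```

Because the bytes are written as stored, this approach does not apply any transparency mask that the PDF keeps as a separate object; when that matters, the Pixmap route shown earlier may be the better fit.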
