How to Extract High-Resolution Images from PDFs Without Resampling Using Python?-Python Tutorial-php.cn

Home

Backend Development

Python Tutorial

How to Extract High-Resolution Images from PDFs Without Resampling Using Python?

Mary-Kate Olsen

Oct 22, 2024 am 07:52 AM

How to Extract High-Resolution Images from PDFs Without Resampling Using Python?

How to Extract Images from PDFs Without Resampling in Python?

To extract images from a PDF document with their original resolution and format, without resampling, you can utilize the PyMuPDF module. This Python module allows you to efficiently process PDF files and manipulate their content. Here's how you can use PyMuPDF to extract images:

<code class="python">import fitz

doc = fitz.open("input.pdf")
for page_num in range(len(doc)):
    for img in doc.getPageImageList(page_num):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:  # Check if it's grayscale or RGB
            pix.writePNG(f"page_{page_num}_img_{xref}.png")
        else:  # Convert CMYK to RGB before saving
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG(f"page_{page_num}_img_{xref}.png")</code>

Copy after login

In this code, we iterate through the pages and images within the PDF. The 'xref' variable represents the image's unique identifier. Depending on the image's color space (RGB or CMYK), we either write the PNG image directly or convert CMYK to RGB before saving it.

Alternatively, if you're using fitz version 1.19.6, you can use the following code to perform the extraction with a progress bar for better visibility:

<code class="python">import os
import fitz
from tqdm import tqdm

workdir = "path_to_pdf_folder"

for each_path in os.listdir(workdir):
    if ".pdf" in each_path:
        doc = fitz.Document(os.path.join(workdir, each_path))

        for i in tqdm(range(len(doc)), desc="pages"):
            for img in tqdm(doc.get_page_images(i), desc="page_images"):
                xref = img[0]
                image = doc.extract_image(xref)
                pix = fitz.Pixmap(doc, xref)
                pix.save(os.path.join(workdir, f"{each_path[:-4]}_p{i}-{xref}.png"))</code>

Copy after login

These code snippets will enable you to extract images from a PDF, preserving their original resolution and format.

The above is the detailed content of How to Extract High-Resolution Images from PDFs Without Resampling Using Python?. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn