How to Convert PDFs to Markdown Using PyMuPDF4LLM and Its Evaluation

Linda Hamilton
Published: 2024-10-07 18:12:31

PyMuPDF4LLM is a library designed to convert PDFs into Markdown format. Here, I’ll share my experience testing this library.

Installation

Start by installing the library using the following command:


pip install pymupdf4llm



Usage

The basic usage is quite simple, requiring just three lines of code to convert a PDF to Markdown:


import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")
print(md_text)


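In practice the converted text is usually written to a file rather than printed. The following is a minimal sketch, assuming the output should be stored as UTF-8; the helper names `save_markdown` and `convert_pdf` are illustrative and not part of PyMuPDF4LLM.

```python
from pathlib import Path


def save_markdown(md_text: str, out_path: str) -> Path:
    """Write Markdown text to out_path as UTF-8 and return the path."""
    path = Path(out_path)
    path.write_text(md_text, encoding="utf-8")
    return path


def convert_pdf(pdf_path: str, out_path: str) -> Path:
    """Convert a PDF and save the result (requires pymupdf4llm)."""
    # Imported here so save_markdown stays usable without the library.
    import pymupdf4llm

    md_text = pymupdf4llm.to_markdown(pdf_path)
    return save_markdown(md_text, out_path)
```

Calling `convert_pdf("input.pdf", "output.md")` would then produce a Markdown file next to the script.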

You can specify arguments to adjust how content is extracted.

Extracting Text by Page

By default, the entire PDF is converted into a single text output. However, you can extract text page by page by specifying page_chunks=True.


md_text = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)


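With page_chunks=True the call returns a list with one dictionary per page instead of a single string. The sketch below assumes each dictionary stores the page's Markdown under a "text" key, which is how the PyMuPDF4LLM documentation describes it; the helper `join_pages` and the page-marker comment format are illustrative only.

```python
def join_pages(chunks: list[dict]) -> str:
    """Concatenate per-page Markdown, marking where each page starts."""
    parts = []
    for number, chunk in enumerate(chunks, start=1):
        # Assumption: the page's Markdown lives under the "text" key.
        parts.append(f"<!-- page {number} -->\n{chunk['text'].strip()}")
    return "\n\n".join(parts)


def convert_by_page(pdf_path: str) -> str:
    """Convert a PDF page by page (requires pymupdf4llm)."""
    import pymupdf4llm

    chunks = pymupdf4llm.to_markdown(pdf_path, page_chunks=True)
    return join_pages(chunks)
```

Keeping pages separate is handy when each page is to be indexed or summarized on its own.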

Extracting Images

To extract images as files, use the write_images=True option:


md_text = pymupdf4llm.to_markdown("input.pdf", write_images=True)



It’s also possible to embed images directly in the Markdown using base64 encoding:


md_text = pymupdf4llm.to_markdown("input.pdf", embed_images=True)


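Embedding works by placing the image bytes inside the Markdown itself as a base64 data URI, so no separate image file is needed. The sketch below illustrates that idea with the standard library only; the helper `image_to_data_uri` is hypothetical, and the exact string PyMuPDF4LLM emits may differ.

```python
import base64


def image_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Return a Markdown image tag with the bytes embedded as base64."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"![](data:{mime_type};base64,{encoded})"


# The 8-byte PNG signature stands in for real image data.
print(image_to_data_uri(b"\x89PNG\r\n\x1a\n"))
```

Base64 adds roughly a third to the size of the image data, so embedded images make the Markdown noticeably larger.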

Evaluation of Conversion Results

For testing, I used PDFs containing a range of Markdown elements: headers, bold and italic text, lists, links, images, tables, code, and multi-line text.


Header Conversion

Headers are correctly converted into Markdown format. Here is a portion of the result:


# Sample Markdown Guide

This is a sample markdown file that includes various features for quick reference.

## 1. Headers

...

## 3. Lists



Bold and Italic Text

Bold and italic formatting is also properly converted:


**Bold: **Bold Text****

_Italic: *Italic Text*_

**_Bold and Italic: ***Bold and Italic***_**



List Conversion

First-level ordered lists are converted without issues, but unordered lists and nested lists are not: unordered items lose their bullet markers, and nested items lose their indentation.



## 3. Lists

### Unordered List

Item 1

Item 2

Sub-item 1

Sub-item 2

### Ordered List

1. First item

2. Second item

1. Sub-item A

2. Sub-item B



Link Conversion

The URLs of links are extracted, but the entire line containing the link becomes a hyperlink, deviating from the original format.



## 4. Links and Images

[You can add links using [Link Text](URL).](https://www.example.com/)



Image Extraction

Images are not extracted by default but can be saved locally with write_images=True.


md_text = pymupdf4llm.to_markdown("input.pdf", write_images=True)



The saved images are then referenced in the Markdown as follows:


### Image Example

![](input.pdf-1-0.png)

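If the saved image files need to be moved or checked, their paths can be recovered from the Markdown. The following is a minimal sketch, assuming references use the plain `![alt](path)` form seen in the output above; the helper `find_image_paths` is illustrative and not part of the library.

```python
import re

# Matches ![alt](path) and captures the path.
# Link titles and paths containing spaces or parentheses are not handled.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def find_image_paths(md_text: str) -> list[str]:
    """Return image paths in the order they appear in the Markdown."""
    return IMAGE_REF.findall(md_text)


sample = "### Image Example\n\n![](input.pdf-1-0.png)\n"
print(find_image_paths(sample))  # ['input.pdf-1-0.png']
```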




Table Conversion

Simple tables without vertical borders are not accurately converted (likely because ambiguous column boundaries result in tables being treated as plain text).



## 5. Tables

**Column 1** **Column 2** **Column 3**

Row 1 Data A Data B

Row 2 Data C Data D





Code Conversion

Code blocks are converted correctly, but the language specifier (e.g., python) is not retained. Inline code is also affected: the surrounding backticks are dropped.



## 6. Code

### Inline Code

Use backticks for inline code: print("Hello, world!")

### Code Block

Use triple backticks for code blocks:

```
def greet(name):
  return f"Hello, {name}!"
print(greet("Markdown"))
```





Multi-Line Text

For multi-line text, the line breaks are preserved as they appear in the original PDF.



Markdown is a lightweight and versatile markup language favored by developers, writers, and bloggers alike

due to its simplicity in formatting text, enabling users to create readable and well-structured documents—

whether for documentation, blog posts, or articles—without the complexity of HTML, while also offering the

ability to convert content seamlessly into other formats like HTML, PDF, and even slideshows, making it an

ideal choice for projects that require both clarity and flexibility in presentation.

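Because each PDF line ends up as its own paragraph, text meant for further processing may need to be re-joined. The sketch below is one possible post-processing step, not a feature of PyMuPDF4LLM. The helper name `unwrap_lines` and its heuristic are assumptions: a paragraph that does not end with sentence punctuation is treated as continuing in the next one, and paragraphs that start with a Markdown marker are never merged. Fenced code blocks that contain blank lines are not handled.

```python
import re

# Paragraphs starting with these markers are structural and never merged:
# headings, list items, code fences, table rows, block quotes.
STRUCTURAL = re.compile(r"^(#{1,6} |[-*+] |\d+\. |`{3}|\||>)")
SENTENCE_END = (".", "!", "?", ":")


def unwrap_lines(md_text: str) -> str:
    """Merge paragraphs that look like one sentence split across PDF lines."""
    merged: list[str] = []
    for block in re.split(r"\n\s*\n", md_text.strip()):
        block = block.strip()
        can_join = (
            merged
            and not STRUCTURAL.match(block)
            and not STRUCTURAL.match(merged[-1])
            and not merged[-1].endswith(SENTENCE_END)
        )
        if can_join:
            merged[-1] = f"{merged[-1]} {block}"
        else:
            merged.append(block)
    return "\n\n".join(merged)
```

The punctuation heuristic is deliberately simple; a line that happens to end with a period mid-paragraph will still start a new paragraph.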




Conclusion

Despite challenges in accurately converting lists, links, and borderless tables, PyMuPDF4LLM is a useful tool for converting PDFs to Markdown. It runs locally without the need for external language models, making it suitable for environments where internet access is unavailable.


Source: dev.to