How to Extract Text with Formatting from PDFs Using iTextSharp?

Mary-Kate Olsen
Release: 2025-01-11 10:46:41

Extract formatted text using iTextSharp

Introduction:

iTextSharp is a powerful library for generating and manipulating PDF documents, but extracting text together with its formatting can be difficult. This article shows one way to extract text and formatting information from a PDF using iTextSharp.

Custom extraction strategy:

To extract formatted text, you can create a custom ITextExtractionStrategy implementation. The strategy defines how each piece of text rendering information is handled.

Code snippet:

The following code defines a custom strategy that tracks changes in baseline, font name, and font size and generates HTML with appropriate styling:

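<code>public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
{
    // ... (omitted here)

    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
    {
        // Determine the font properties
        string curFont = renderInfo.GetFont().PostscriptFontName;
        // Treat fill-then-stroke rendering as faux bold.
        // TextRenderMode is presumably an enum declared in the omitted part of the class.
        if (renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText)
        {
            curFont += "-Bold";
        }

        // Check for a change in baseline, font, or font size
        Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
        // Approximate the font size as the distance from the baseline to the ascent line
        Single curFontSize = renderInfo.GetAscentLine().GetEndPoint()[Vector.I2] - curBaseline[Vector.I2];
        if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) ||
            (curFontSize != lastFontSize) || (curFont != lastFont))
        {
            // Emit an HTML span with the updated style.
            // Assumed format string: the original literal is cut off in the source.
            result.AppendFormat("<span style=\"font-family:{0};font-size:{1}pt\">", curFont, curFontSize);
        }

        // ... (omitted here)
    }
}</code>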
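The change-detection idea described above can be tried without a PDF or iTextSharp. The following is a minimal sketch under stated assumptions, not the article's implementation: `StyledSpanBuilder` and its members are hypothetical names, and the font name, font size, and baseline position are passed in as plain values instead of being read from a `TextRenderInfo`. It also HTML-encodes the text and formats the size with the invariant culture, which the snippet above does not show.

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Text;

// Hypothetical helper, not part of iTextSharp: it mirrors the change-detection
// idea using plain values instead of TextRenderInfo.
public sealed class StyledSpanBuilder
{
    private readonly StringBuilder result = new StringBuilder();
    private float? lastBaselineY;   // null until the first chunk arrives
    private float lastFontSize;
    private string lastFont;

    // Call once per text chunk, in reading order.
    public void Append(string text, string font, float fontSize, float baselineY)
    {
        bool first = lastBaselineY == null;
        bool newLine = !first && baselineY != lastBaselineY.Value;
        bool styleChanged = first || newLine || fontSize != lastFontSize || font != lastFont;

        if (styleChanged)
        {
            if (!first) result.Append("</span>");   // close the previous span
            if (newLine) result.Append("<br />");    // baseline moved: treat as a new line
            result.AppendFormat(
                CultureInfo.InvariantCulture,        // keeps "9.5" from becoming "9,5"
                "<span style=\"font-family:{0};font-size:{1}pt\">",
                WebUtility.HtmlEncode(font), fontSize);
        }

        result.Append(WebUtility.HtmlEncode(text));  // escape &, <, > in the PDF text
        lastBaselineY = baselineY;
        lastFontSize = fontSize;
        lastFont = font;
    }

    // Returns the HTML, closing the last open span if anything was written.
    public string Build()
    {
        return result.Length > 0 ? result.ToString() + "</span>" : string.Empty;
    }
}
```

Consecutive chunks that share a baseline, font, and size end up in the same span, so a line of uniformly styled text produces a single tag pair. Exact floating-point comparison is kept here to match the snippet above; a small tolerance may be more robust for real documents.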

Usage:

To use a custom strategy, you can specify it when extracting text:

<code>PdfReader reader = new PdfReader("MyDocument.pdf");
TextWithFontExtractionStategy strategy = new TextWithFontExtractionStategy();
string textWithFormatting = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);</code>

Output:

The textWithFormatting variable will contain the extracted text wrapped in HTML tags that reflect the formatting information, including font name and font size.

Conclusion:

This custom extraction strategy lets you extract PDF text together with its formatting. Because the styles are inferred from rendering information such as the baseline and the ascent line, the resulting HTML approximates the text and styles of the PDF document.
