How to Extract Text and Images from PDFs using iTextSharp in .NET?

DDD

Released: 2025-01-06 07:51:41

Extracting PDF Content with iTextSharp in .NET

iTextSharp, the .NET port of the iText library, can read existing PDF documents as well as create them. Its parser namespace, iTextSharp.text.pdf.parser, supports extracting both text and images from the pages of a PDF.

Reading Plain Text from PDFs

To read plain text from a PDF using iTextSharp, you can use code along these lines:

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.IO;
using System.Text; // required for StringBuilder

public string ReadPdfText(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // iTextSharp page numbers start at 1, not 0.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text does not carry over.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                text.Append(currentText);
            }
        }
        finally
        {
            // Release the file even if extraction throws.
            pdfReader.Close();
        }
    }
    return text.ToString();
}

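In this example, the ReadPdfText method reads the contents of a PDF file and accumulates the text in a StringBuilder. A new SimpleTextExtractionStrategy is created for each page; it returns text in the order it appears in the page's content stream, which usually, but not always, matches reading order. If the file does not exist, the method returns an empty string.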
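
Because ReadPdfText appends every page to a single StringBuilder, page boundaries are lost in the result. The sketch below is one possible variant, not part of the original example: it returns one string per page and swaps in LocationTextExtractionStrategy, another strategy from iTextSharp.text.pdf.parser that orders text by its position on the page. The names PdfTextPages and ReadPdfPages are hypothetical, and the code assumes iTextSharp 5.x.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextPages // hypothetical helper name
{
    // Returns the text of each page as a separate list entry.
    // A missing file yields an empty list, mirroring ReadPdfText.
    public static List<string> ReadPdfPages(string fileName)
    {
        List<string> pages = new List<string>();
        if (!File.Exists(fileName))
        {
            return pages;
        }

        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // A fresh strategy per page, since strategies accumulate text internally.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                pages.Add(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            // Release the file even if extraction throws.
            pdfReader.Close();
        }
        return pages;
    }
}
```

A call such as PdfTextPages.ReadPdfPages("report.pdf"), where the file name is only a placeholder, would then give pages[0] for the first page, pages[1] for the second, and so on.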

Handling Images in PDFs

While the above code focuses on extracting text, iTextSharp also enables you to extract images from PDFs. You can use the following approach:

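using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Drawing;
using System.IO;

public void ReadPdfImages(string fileName)
{
    if (!File.Exists(fileName))
    {
        return;
    }

    PdfReader pdfReader = new PdfReader(fileName);
    try
    {
        // One parser can be reused for every page of the document.
        PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);

        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // ImageRenderListener is not part of iTextSharp; it is a class you
            // write that implements IRenderListener and handles RenderImage.
            // ProcessContent returns the listener it was given, not a string.
            parser.ProcessContent(page, new ImageRenderListener());
        }
    }
    finally
    {
        // Release the file even if parsing throws.
        pdfReader.Close();
    }
}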

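In this code, a PdfReaderContentParser parses the content of each page. ImageRenderListener is not shipped with iTextSharp: it is a class you supply that implements the IRenderListener interface, and the parser calls its RenderImage method for every image it encounters on the page. Inside that callback the image can be converted to a System.Drawing object such as a Bitmap, which can be further processed or saved.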
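
The example above instantiates ImageRenderListener without defining it. What follows is a minimal sketch of one way such a class could look, assuming iTextSharp 5.x and a reference to System.Drawing; it is not the article author's implementation. The Images property is a hypothetical addition for collecting the results, and some image encodings may not be convertible, so production code would likely need error handling around the conversion.

```csharp
using System.Collections.Generic;
using System.Drawing;
using iTextSharp.text.pdf.parser;

// A minimal listener sketch; iTextSharp itself does not provide this class.
public class ImageRenderListener : IRenderListener
{
    private readonly List<Image> images = new List<Image>();

    // Images collected so far (hypothetical convenience property).
    public IList<Image> Images
    {
        get { return images; }
    }

    // Called by the parser once for each image drawn on the page.
    public void RenderImage(ImageRenderInfo renderInfo)
    {
        PdfImageObject imageObject = renderInfo.GetImage();
        if (imageObject == null)
        {
            return; // nothing to convert
        }

        // Convert the PDF image data into a System.Drawing.Image.
        Image image = imageObject.GetDrawingImage();
        if (image != null)
        {
            images.Add(image);
        }
    }

    // Text callbacks are required by the interface but unused here.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }
}
```

ProcessContent returns the listener that was passed to it, so the collected images can be read from its Images property after the call and, for instance, saved with Image.Save.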
