How to Extract PDF Content Efficiently using iTextSharp in C# or VB.NET?

Barbara Streisand

Release: 2025-01-06 07:46:40

Extracting PDF Content using iTextSharp

Question:

How to effectively retrieve the content of a PDF document using iTextSharp in either VB.NET or C#?

Answer:

iTextSharp reads PDF documents through its PdfReader class, and the PdfTextExtractor helper in iTextSharp.text.pdf.parser pulls the text out of each page. The following C# program extracts the text of a PDF document page by page (it does not extract images):

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.IO;
using System.Text;

namespace PdfContentReader
{
    public static class Program
    {
        public static string ReadPdfFile(string fileName)
        {
            StringBuilder text = new StringBuilder();

            if (File.Exists(fileName))
            {
                PdfReader pdfReader = new PdfReader(fileName);
                try
                {
                    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                    {
                        // A fresh strategy per page, so text from one page does not carry over into the next.
                        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                        // The result is already a .NET (UTF-16) string; no re-encoding is needed.
                        text.Append(currentText);
                    }
                }
                finally
                {
                    // Release the file even if extraction throws.
                    pdfReader.Close();
                }
            }
            return text.ToString();
        }

        public static void Main(string[] args)
        {
            string fileName = @"path\to\file.pdf";
            string extractedText = ReadPdfFile(fileName);

            Console.WriteLine(extractedText);
        }
    }
}

In this implementation:

The ReadPdfFile method takes the file name as an argument and extracts the text content from each page of the PDF document. If the file does not exist, it returns an empty string.
We use the SimpleTextExtractionStrategy to extract plain text from each page.
We append the extracted text as-is. PdfTextExtractor returns an ordinary .NET string, so no encoding conversion is needed; round-tripping it through Encoding.Default would only replace characters that the system code page cannot represent.
We close the PdfReader in a finally block so the file is released even if extraction fails.
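
To make the encoding point in the list above concrete, here is a minimal sketch, using only the .NET base class library, of what a GetBytes/Convert/GetString round trip through a single-byte encoding does to a string. The class name EncodingRoundTripDemo, the helper RoundTrip, and the sample text are hypothetical and not part of iTextSharp. Encoding.ASCII stands in for a legacy Encoding.Default code page so that the behaviour is the same on every machine.

```csharp
using System;
using System.Text;

public static class EncodingRoundTripDemo
{
    // Hypothetical helper: converts a string to bytes in a legacy encoding,
    // transcodes those bytes to UTF-8, and decodes them back into a string.
    public static string RoundTrip(string input, Encoding legacy)
    {
        byte[] legacyBytes = legacy.GetBytes(input);
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // "Caf\u00E9" contains one character (U+00E9) that ASCII cannot represent.
        string extracted = "Caf\u00E9";
        string roundTripped = RoundTrip(extracted, Encoding.ASCII);

        // The unrepresentable character is replaced during GetBytes,
        // so the round trip does not return the original string.
        Console.WriteLine(roundTripped == extracted); // prints "False"
    }
}
```

Text made up only of characters the legacy encoding can represent passes through unchanged, so the round trip adds nothing in that case and loses information in every other case.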

This solution extracts the text content of every page in the PDF document. It does not extract embedded images, and pages that consist only of scanned images yield no text.
