How to Extract PDF Content Efficiently using iTextSharp in C# or VB.NET?

Barbara Streisand
Release: 2025-01-06 07:46:40

Extracting PDF Content using iTextSharp

Question:

How can the content of a PDF document be retrieved effectively using iTextSharp in either VB.NET or C#?

Answer:

iTextSharp opens PDF files through its PdfReader class, and the PdfTextExtractor helper in the iTextSharp.text.pdf.parser namespace returns the text of a given page. The following C# program extracts the text of every page in a PDF document:

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.IO;
using System.Text;

namespace PdfContentReader
{
    public static class Program
    {
        // Returns the text of every page, or an empty string if the file does not exist.
        public static string ReadPdfFile(string fileName)
        {
            StringBuilder text = new StringBuilder();

            if (File.Exists(fileName))
            {
                PdfReader pdfReader = new PdfReader(fileName);
                try
                {
                    // iTextSharp page numbers start at 1.
                    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                    {
                        // A fresh strategy per page keeps text from earlier pages out of the result.
                        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                        string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                        // The result is already a Unicode .NET string; append it unchanged,
                        // with a line break so that consecutive pages do not run together.
                        text.AppendLine(currentText);
                    }
                }
                finally
                {
                    // Release the file even if a page fails to parse.
                    pdfReader.Close();
                }
            }
            return text.ToString();
        }

        public static void Main(string[] args)
        {
            string fileName = @"path\to\file.pdf";
            string extractedText = ReadPdfFile(fileName);

            Console.WriteLine(extractedText);
        }
    }
}

In this implementation:

  • The ReadPdfFile method takes the file name as an argument and extracts the text content from each page of the PDF document. If the file does not exist, it returns an empty string.
  • SimpleTextExtractionStrategy returns plain text in the order in which it appears in each page's content stream.
  • GetTextFromPage returns an ordinary .NET string, which is already Unicode, so no encoding conversion is applied. Round-tripping the text through Encoding.Default can replace characters outside the system's ANSI code page with "?" on .NET Framework.
  • The reader is closed in a finally block so the file is released even when extraction throws.
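
When the text order matters (for example, on pages whose content stream is not written top to bottom), a different strategy can be swapped in. The following is a minimal sketch, assuming iTextSharp 5.x; it uses LocationTextExtractionStrategy, which orders text by its position on the page, and returns one string per page. The PdfPageReader class and ReadPages method are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfContentReader
{
    // Hypothetical helper, not part of iTextSharp.
    public static class PdfPageReader
    {
        // Returns the text of each page as a separate list entry.
        public static List<string> ReadPages(string fileName)
        {
            List<string> pages = new List<string>();
            PdfReader reader = new PdfReader(fileName);
            try
            {
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    // LocationTextExtractionStrategy sorts text chunks by position
                    // rather than by their order in the content stream.
                    ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                    pages.Add(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
                }
            }
            finally
            {
                // Release the file even if a page fails to parse.
                reader.Close();
            }
            return pages;
        }
    }
}
```

Keeping pages separate makes it easier to report which page a piece of text came from, for instance by iterating over the list returned by PdfPageReader.ReadPages(@"path\to\file.pdf").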

This solution extracts the text content of each page of the PDF document. It does not extract embedded images, and text that exists only as pixels in a scanned page is not returned.
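
Embedded images are not returned by PdfTextExtractor. One way they might be collected is with a render listener, sketched below under the assumption of iTextSharp 5.x. ImageCollector and PdfImageReader are hypothetical names, and GetImage can throw for image encodings that the library does not support, so production code would need to handle that case.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PdfContentReader
{
    // Hypothetical listener that stores the bytes of every image drawn on a page.
    public class ImageCollector : IRenderListener
    {
        private readonly List<byte[]> images = new List<byte[]>();

        public List<byte[]> Images
        {
            get { return images; }
        }

        // Text events are ignored; only images are of interest here.
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderText(TextRenderInfo renderInfo) { }

        public void RenderImage(ImageRenderInfo renderInfo)
        {
            PdfImageObject image = renderInfo.GetImage();
            if (image != null)
            {
                images.Add(image.GetImageAsBytes());
            }
        }
    }

    // Hypothetical helper, not part of iTextSharp.
    public static class PdfImageReader
    {
        // Returns the raw bytes of every image found in the document.
        public static List<byte[]> ReadImages(string fileName)
        {
            PdfReader reader = new PdfReader(fileName);
            try
            {
                PdfReaderContentParser parser = new PdfReaderContentParser(reader);
                ImageCollector collector = new ImageCollector();
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    // Replays the page content and calls RenderImage for each image.
                    parser.ProcessContent(page, collector);
                }
                return collector.Images;
            }
            finally
            {
                // Release the file even if a page fails to parse.
                reader.Close();
            }
        }
    }
}
```

Each byte array can then be written to disk with File.WriteAllBytes, using a file extension taken from the image's GetFileType() if the format needs to be preserved.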

source:php.cn