Update
February 27, 2014: This article originally only described using PDFBox to parse PDF files. It has now been extended to include routines for using IFilter and iTextSharp.
This article and the corresponding Visual Studio project have been updated to the latest PDFBox version (1.8.4). The complete project, including all dependencies, can be downloaded from http://www.squarepdf.net/how-to-convert-pdf-to-text-in-net-sample-project/ (getting the dependencies right is a bit tricky).
How to parse PDF files
The main options for extracting text from PDF files in .NET are:
Microsoft’s IFilter interface and Adobe’s IFilter implementation;
iTextSharp;
PDFBox.
Unfortunately, none of these PDF parsing solutions is perfect. Each one is discussed below.
Adobe PDF IFilter
To use the IFilter interface to parse PDF files, you need:
Windows 2000 or later
Adobe Acrobat or Reader 7.0.5+ (or standalone Adobe PDF IFilter [adobe.com])
IFilter COM encapsulation class [dotlucene.net]
Sample code:
using IFilter;

// ...

public static string ExtractTextFromPdf(string path)
{
    return DefaultParser.Extract(path);
}
Disadvantages:
Relies on unreliable COM interop to drive the IFilter interface (and the combination of IFilter COM and Adobe PDF IFilter is particularly troublesome).
Requires Adobe IFilter to be installed separately on the target system, which is a pain if you need to distribute your indexing solution to others.
iTextSharp
iTextSharp (http://sourceforge.net/projects/itextsharp/) is a .NET port of iText (http://itextpdf.com/), a Java library for PDF manipulation. It focuses primarily on creating and editing PDFs rather than reading them, but it also supports extracting text from PDFs (although it is a bit of an overkill for that task alone).
Sample code:
using System.Text; // StringBuilder
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// ...

public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();

        // iTextSharp page numbers are 1-based.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }

        return text.ToString();
    }
}
Credit: Member number 10364982
Disadvantages:
Requires a commercial license if the AGPL license does not suit you.
PDFBox
PDFBox is another Java PDF library. It can also be used with the original Java Lucene (see LucenePDFDocument).
Fortunately, PDFBox has a .NET version built using IKVM.NET (just visit the PDFBox download page).
To use PDFBox in .NET, you need to reference:
IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.SwingAWT.dll
pdfbox-1.8.4.dll
And copy the following files to the bin folder:
commons-logging.dll
fontbox-1.8.4.dll
IKVM.OpenJDK.Util.dll
IKVM.Runtime.dll
Using PDFBox to parse a PDF is straightforward:
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// ...

private static string ExtractTextFromPdf(string path)
{
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load(path);
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
    finally
    {
        // Always release the document, even if extraction fails.
        if (doc != null)
        {
            doc.close();
        }
    }
}
The size of the compiled application increases by almost 18 MB in total:
IKVM.OpenJDK.Core.dll (4 MB)
IKVM.OpenJDK.SwingAWT.dll (6 MB)
pdfbox-1.8.4.dll (4 MB)
commons-logging.dll (82 kB)
fontbox-1.8.4.dll (180 kB)
IKVM.OpenJDK.Util.dll (2 MB)
IKVM.Runtime.dll (1 MB)
Speed is OK: parsing the U.S. Copyright Act PDF (5.1 MB) took 13 seconds.
Thanks to bobrien100 for the improvement suggestions.
Disadvantages:
IKVM.NET dependency (18 MB)
Speed (especially the startup time of IKVM.NET)
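To see how much of the measured time is IKVM.NET startup rather than actual parsing, it can help to run the same extraction several times in one process and compare the first run with the later ones. The following is only a minimal sketch: the ExtractionTimer class and its Measure method are hypothetical helpers, not part of PDFBox, IKVM.NET or iTextSharp, and the extractor is passed in as a delegate so that any of the ExtractTextFromPdf methods above could be plugged in.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of any library discussed in this article).
public static class ExtractionTimer
{
    // Runs the given extractor 'runs' times on the same file and returns
    // the elapsed wall-clock time of each run, in call order.
    public static TimeSpan[] Measure(Func<string, string> extract, string path, int runs)
    {
        if (extract == null) throw new ArgumentNullException("extract");
        if (runs < 1) throw new ArgumentOutOfRangeException("runs");

        TimeSpan[] timings = new TimeSpan[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            extract(path); // the extracted text is discarded; only timing matters here
            watch.Stop();
            timings[i] = watch.Elapsed;
        }
        return timings;
    }
}
```

Called from the class that defines the extractor, for example as ExtractionTimer.Measure(ExtractTextFromPdf, path, 3), the first element of the result includes any one-time startup cost, while the remaining elements approximate the steady-state parsing time. A large gap between the first and later runs would point to startup overhead rather than to the parser itself.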