Use iTextSharp to read non-English PDF content
When using iTextSharp in C# to extract text from PDF documents, you may run into problems if the content is in a non-English language such as Farsi or Arabic. The extracted text can come out garbled when it is pushed through an encoding conversion that cannot represent these character sets.
To resolve this issue, avoid performing any unnecessary encoding conversions on text obtained from the PDF. In iTextSharp, the PdfTextExtractor.GetTextFromPage() method returns the text of a PDF page as a .NET string. A .NET string is already Unicode (UTF-16), so it does not need to be converted again; any choice of encoding belongs at the point where the text is written out or displayed.
The provided code snippet attempts to use Encoding.UTF8 to re-encode the text, which is the wrong approach. The following simplified code snippet illustrates the correct approach:
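<code class="language-csharp">using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Place this method inside a class of your own.
public string ReadPdfFileWithoutEncoding(string fileName)
{
    StringBuilder text = new StringBuilder();
    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            // Page numbers in iTextSharp start at 1.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Append the extracted string as-is, with no re-encoding.
                text.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page));
            }
        }
        finally
        {
            // Release the reader even if extraction throws.
            pdfReader.Close();
        }
    }
    return text.ToString();
}</code>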
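To illustrate why an extra re-encoding step is harmful, here is a minimal sketch that uses only System.Text and does not involve iTextSharp at all. The class name ReEncodingDemo, the helpers MisDecode and RoundTripUtf8, and the sample word are assumptions made for this illustration. The mismatched UTF-8 / ISO-8859-1 pair stands in for any wrong conversion and is not necessarily the one behind a particular garbled PDF.

```csharp
using System;
using System.Text;

public static class ReEncodingDemo
{
    // Hypothetical helper: simulates a mismatched conversion by encoding
    // the text as UTF-8 and then decoding those bytes as ISO-8859-1.
    public static string MisDecode(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    // Hypothetical helper: a matched round trip (UTF-8 both ways).
    // It is lossless for this text, but it also achieves nothing.
    public static string RoundTripUtf8(string text)
    {
        return Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        // The word "salam" written with Arabic-script letters,
        // given as escapes so the source file encoding does not matter.
        string original = "\u0633\u0644\u0627\u0645";

        string garbled = MisDecode(original);
        string unchanged = RoundTripUtf8(original);

        Console.WriteLine(garbled == original);   // False: the text is corrupted
        Console.WriteLine(garbled.Length);        // 8: each letter became two characters
        Console.WriteLine(unchanged == original); // True: nothing was gained
    }
}
```

In both cases the conversion is at best a no-op and at worst destructive, which is why the extraction method above appends the returned string directly.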
Also make sure that your application uses the latest version of iTextSharp, since older versions may have limitations in handling non-English text. In addition, whatever application displays the extracted text must itself support Unicode characters.
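One way to keep the output side Unicode-safe is to choose the encoding explicitly when the text leaves your program. The sketch below assumes the text has already been extracted; the class name ExtractedTextWriter, the helper SaveAsUtf8, and the temporary file name are hypothetical and exist only for this example.

```csharp
using System;
using System.IO;
using System.Text;

public static class ExtractedTextWriter
{
    // Hypothetical helper: writes already-extracted text to disk as UTF-8.
    // The encoding is applied once, at the output boundary.
    public static void SaveAsUtf8(string path, string extractedText)
    {
        File.WriteAllText(path, extractedText, Encoding.UTF8);
    }

    public static void Main()
    {
        // Stand-in for text returned by the extraction method.
        string sample = "\u0633\u0644\u0627\u0645";

        string path = System.IO.Path.Combine(
            System.IO.Path.GetTempPath(), "extracted-sample.txt");
        SaveAsUtf8(path, sample);

        // Reading the file back with the same encoding should
        // reproduce the original string.
        string reloaded = File.ReadAllText(path, Encoding.UTF8);
        Console.WriteLine(reloaded == sample);

        File.Delete(path);
    }
}
```

For console output, setting Console.OutputEncoding to Encoding.UTF8 plays a similar role, although whether the characters render correctly still depends on the terminal and its font.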