How to extract text formatting using iTextSharp
Although iTextSharp's built-in text extraction is efficient, it returns plain text and does not retain formatting details such as fonts, colors, and sizes. A custom extraction strategy can work around this limitation.
Customized text extraction strategy
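The custom TextWithFontExtractionStategy class implements the ITextExtractionStrategy interface to capture formatting information. iTextSharp calls its RenderText method for each chunk of text on the page; there the strategy reads the chunk's font details and wraps the text in HTML tags that record them.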
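A minimal sketch of such a strategy, assuming iTextSharp 5.x, might look like the following. The `<span>` markup, the private field names, and the size estimate (vertical distance between the ascent and descent lines) are illustrative assumptions, not the only way to do this.

```csharp
using System.Globalization;
using System.Net;
using System.Text;
using iTextSharp.text.pdf.parser;

public class TextWithFontExtractionStategy : ITextExtractionStrategy
{
    private readonly StringBuilder result = new StringBuilder(); // HTML buffer
    private string lastFont;     // font name of the previous chunk
    private float lastFontSize;  // estimated size of the previous chunk
    private bool spanOpen;       // true while a <span> is still unclosed

    public void RenderText(TextRenderInfo renderInfo)
    {
        string curFont = renderInfo.GetFont().PostscriptFontName;

        // Assumption: approximate the size as the gap between the ascent and
        // descent lines. This only holds for unrotated, horizontal text.
        float top = renderInfo.GetAscentLine().GetStartPoint()[Vector.I2];
        float bottom = renderInfo.GetDescentLine().GetStartPoint()[Vector.I2];
        float curFontSize = top - bottom;

        // Open a new span whenever the font name or size changes
        if (!spanOpen || curFont != lastFont || curFontSize != lastFontSize)
        {
            if (spanOpen) result.Append("</span>");
            result.AppendFormat(CultureInfo.InvariantCulture,
                "<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);
            spanOpen = true;
        }

        // Escape the text so characters such as '<' do not break the HTML
        result.Append(WebUtility.HtmlEncode(renderInfo.GetText()));
        lastFont = curFont;
        lastFontSize = curFontSize;
    }

    public string GetResultantText()
    {
        // Close the last span, if any, before handing back the HTML
        if (spanOpen) { result.Append("</span>"); spanOpen = false; }
        return result.ToString();
    }

    // Required by the interface but not needed for this sketch
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}
```

Because the buffer accumulates across calls, a fresh instance per page is the safer choice if the strategy is reused for several pages.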
Example output
The following C# code demonstrates how to extract text and font-related formatting from a PDF:
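<code class="language-csharp">// Requires: using System; using iTextSharp.text.pdf;
// Open Document.pdf from the current user's desktop
string path = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Document.pdf");
PdfReader reader = new PdfReader(path);

// Extract page 1 through the custom strategy, which returns HTML rather than plain text
TextWithFontExtractionStategy strategy = new TextWithFontExtractionStategy();
string html = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
Console.WriteLine(html);

// Release the file handle
reader.Close();</code>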
The generated HTML output contains tags for font family, font size, and font style.
Other considerations
PostscriptFontName may contain additional characters at the start of the name. These are usually related to font subsetting: when only part of a font is embedded, PDF producers prefix the name with a tag of six uppercase letters and a plus sign, as in ABCDEF+Arial.
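If the extra characters are a subset tag, they can be stripped before the name is written to the output. The helper below is a hypothetical sketch (the `FontNameHelper` class and `StripSubsetPrefix` method are names introduced here for illustration) and assumes the tag follows the usual pattern of six uppercase letters and a plus sign.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontNameHelper
{
    // Matches a leading subset tag: six uppercase letters followed by '+'
    private static readonly Regex SubsetPrefix = new Regex(@"^[A-Z]{6}\+", RegexOptions.Compiled);

    // Returns the font name without its subset tag; other names pass through unchanged
    public static string StripSubsetPrefix(string postscriptFontName)
    {
        if (string.IsNullOrEmpty(postscriptFontName)) return postscriptFontName;
        return SubsetPrefix.Replace(postscriptFontName, string.Empty);
    }
}
```

For example, `FontNameHelper.StripSubsetPrefix("ABCDEF+Arial-BoldMT")` returns `"Arial-BoldMT"`, while a name with no tag, such as `"Arial"`, is returned as-is.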