How Can I Retrieve Text Formatting (Font, Size, Style) from a PDF Using iTextSharp?

Barbara Streisand
Release: 2025-01-11 10:56:42

How to extract text formatting using iTextSharp

iTextSharp extracts text efficiently, but its standard text extraction does not retain formatting details such as fonts, colors, and sizes. A custom extraction strategy can work around this limitation for font-related formatting.

Customized text extraction strategy

The custom TextWithFontExtractionStategy class implements the ITextExtractionStrategy interface to capture formatting information. In its RenderText method:

  • It tracks the font name, the use of faux (simulated) bold, the baseline position, and the font size of each chunk of text.
  • If any of these attributes differs from the previous chunk, it closes the current HTML span tag and opens a new one with the corresponding styles.
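
The change-tracking logic described above can be sketched independently of iTextSharp. The following is a minimal illustration, not the original class: StyledSpanWriter, Append, and GetResult are hypothetical names, and the HTML encoding of the text is an added assumption. In a real strategy, RenderText would supply the values from TextRenderInfo (for example GetFont().PostscriptFontName, GetBaseline(), and GetText()).

```csharp
using System;
using System.Globalization;
using System.Net;
using System.Text;

// Hypothetical helper (not part of iTextSharp): remembers the last-seen
// style and opens a new HTML span whenever that style changes.
public sealed class StyledSpanWriter
{
    private readonly StringBuilder html = new StringBuilder();
    private bool hasOpenSpan;
    private string lastFont;
    private float lastSize;
    private float lastBaselineY;

    // Call once per text chunk, e.g. from inside RenderText.
    public void Append(string text, string fontName, float fontSize,
                       float baselineY, bool fauxBold)
    {
        // Faux bold has no separate font, so mark it in the font name.
        string font = fauxBold ? fontName + "-Bold" : fontName;

        bool baselineChanged = hasOpenSpan && baselineY != lastBaselineY;
        bool styleChanged = !hasOpenSpan || baselineChanged
                            || font != lastFont || fontSize != lastSize;

        if (styleChanged)
        {
            // Close the previous span, if there is one.
            if (hasOpenSpan) html.Append("</span>");
            // Assumption carried over from the article: a new baseline means a new line.
            if (baselineChanged) html.Append("<br />");
            html.AppendFormat(CultureInfo.InvariantCulture,
                "<span style=\"font-family:{0};font-size:{1}\">", font, fontSize);
            hasOpenSpan = true;
        }

        // Encode the text so characters like '<' do not break the HTML.
        html.Append(WebUtility.HtmlEncode(text));

        lastFont = font;
        lastSize = fontSize;
        lastBaselineY = baselineY;
    }

    // Returns the HTML built so far, closing the last open span.
    public string GetResult()
    {
        return hasOpenSpan ? html.ToString() + "</span>" : html.ToString();
    }
}
```

Two consecutive chunks with the same font, size, and baseline end up in a single span; a chunk on a different baseline closes that span, emits a line break, and starts a new one.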

Example usage

The following C# code demonstrates how to extract text and font-related formatting from a PDF:

<code class="language-csharp">// Assumes: using System; using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
string path = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Document.pdf");
PdfReader reader = new PdfReader(path);
// The custom strategy emits HTML spans that carry the font information.
TextWithFontExtractionStategy strategy = new TextWithFontExtractionStategy();
string html = PdfTextExtractor.GetTextFromPage(reader, 1, strategy); // page numbers are 1-based
Console.WriteLine(html);
reader.Close();</code>

The returned string is HTML: each run of uniformly formatted text is wrapped in a span tag whose styles carry the font family, font size, and font style.

Other considerations

  • PostscriptFontName may include extra leading characters. In a subsetted font, the name is prefixed with a six-letter tag and a plus sign (for example, ABCDEF+Arial).
  • The example code treats any change in the baseline as a new line in the HTML output, so text that is merely raised or lowered, such as a superscript, may also be treated as a new line.
  • Color information is not captured by this approach, although it may be possible to retrieve it manually.
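
If the subset tag in the font name is unwanted, it can be removed before the name is written to the HTML. The helper below is a small sketch under that assumption: FontNames and StripSubsetTag are hypothetical names, not part of iTextSharp, and the check relies on the PDF convention of six uppercase letters followed by a plus sign.

```csharp
using System;

// Hypothetical helper (not part of iTextSharp).
public static class FontNames
{
    // Removes a leading subset tag such as "ABCDEF+" from a PostScript font name.
    // Names without a well-formed tag are returned unchanged.
    public static string StripSubsetTag(string postscriptName)
    {
        // A tagged name needs six letters, '+', and at least one more character.
        if (postscriptName == null || postscriptName.Length < 8 || postscriptName[6] != '+')
            return postscriptName;

        // All six tag characters must be uppercase A-Z.
        for (int i = 0; i < 6; i++)
        {
            char c = postscriptName[i];
            if (c < 'A' || c > 'Z') return postscriptName;
        }

        return postscriptName.Substring(7);
    }
}
```

For example, FontNames.StripSubsetTag("ABCDEF+Arial-BoldMT") returns "Arial-BoldMT", while a name such as "Helvetica" passes through unchanged.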
