How to Extract Text Formatting Information from PDFs using iTextSharp?-C++-php.cn

How to Extract Text Formatting Information from PDFs using iTextSharp?

DDD

Release： 2025-01-11 11:13:44

Original

358 people have browsed it

How to Extract Text Formatting Information from PDFs using iTextSharp?

Use iTextSharp to obtain text format information

iTextSharp provides a simple text extraction system that can handle some basic markup. Although it does not handle color information, you can implement this functionality yourself. Here's a modified code snippet that combines various questions and answers to extract the text as HTML while capturing font information, including size and bold:

using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace WindowsFormsApplication2
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            PdfReader reader = new PdfReader("Document.pdf");
            TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();
            string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
            Console.WriteLine(F);

            this.Close();
        }

        public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
        {
            //HTML缓冲区
            private StringBuilder result = new StringBuilder();

            //存储最后使用的属性
            private Vector lastBaseLine;
            private string lastFont;
            private float lastFontSize;

            //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
            private enum TextRenderMode
            {
                FillText = 0,
                StrokeText = 1,
                FillThenStrokeText = 2,
                Invisible = 3,
                FillTextAndAddToPathForClipping = 4,
                StrokeTextAndAddToPathForClipping = 5,
                FillThenStrokeTextAndAddToPathForClipping = 6,
                AddTextToPaddForClipping = 7
            }



            public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
            {
                string curFont = renderInfo.GetFont().PostscriptFontName;
                //检查是否使用了伪粗体
                if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
                {
                    curFont += "-Bold";
                }

                //此代码假设如果基线发生变化，则表示换行
                Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
                Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
                iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
                Single curFontSize = rect.Height;

                //查看是否有任何更改，例如基线、字体或字体大小
                if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
                {
                    //如果我们已经放置了一个span标签，则关闭它
                    if ((this.lastBaseLine != null))
                    {
                        this.result.AppendLine("");
                    }
                    //如果基线已更改，则插入换行符
                    if ((this.lastBaseLine != null) &
                    curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
                    {
                        this.result.AppendLine("<br />");
                    }
                    //创建具有适当样式的HTML标签
                    this.result.AppendFormat("

Copy after login

Using this code, you can extract text from a PDF document while also capturing font properties such as font family, size, and bold. The code snippet is incomplete and needs to be supplemented to create and close the <span> tag and add text content to run completely.

The above is the detailed content of How to Extract Text Formatting Information from PDFs using iTextSharp?. For more information, please follow other related articles on the PHP Chinese website!