iTextSharp を使用して PDF からテキスト書式設定情報を抽出する方法-C++-php.cn

iTextSharp を使用して PDF からテキスト書式設定情報を抽出する方法

DDD

リリース： 2025-01-11 11:13:44

オリジナル

314 人が閲覧しました

How to Extract Text Formatting Information from PDFs using iTextSharp?

iTextSharp を使用してテキスト形式情報を取得します

iTextSharp は、いくつかの基本的なマークアップを処理できるシンプルなテキスト抽出システムを提供します。色情報は処理しませんが、この機能は自分で実装できます。以下は、さまざまな質問と回答を組み合わせて、サイズや太字などのフォント情報を取得しながらテキストを HTML として抽出する、変更されたコードスニペットです:

<code class="language-csharp">using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf.parser;
using iTextSharp.text.pdf;

namespace WindowsFormsApplication2
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            PdfReader reader = new PdfReader("Document.pdf");
            TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();
            string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
            Console.WriteLine(F);

            this.Close();
        }

        public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
        {
            //HTML缓冲区
            private StringBuilder result = new StringBuilder();

            //存储最后使用的属性
            private Vector lastBaseLine;
            private string lastFont;
            private float lastFontSize;

            //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
            private enum TextRenderMode
            {
                FillText = 0,
                StrokeText = 1,
                FillThenStrokeText = 2,
                Invisible = 3,
                FillTextAndAddToPathForClipping = 4,
                StrokeTextAndAddToPathForClipping = 5,
                FillThenStrokeTextAndAddToPathForClipping = 6,
                AddTextToPaddForClipping = 7
            }



            public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
            {
                string curFont = renderInfo.GetFont().PostscriptFontName;
                //检查是否使用了伪粗体
                if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
                {
                    curFont += "-Bold";
                }

                //此代码假设如果基线发生变化，则表示换行
                Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
                Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
                iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
                Single curFontSize = rect.Height;

                //查看是否有任何更改，例如基线、字体或字体大小
                if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
                {
                    //如果我们已经放置了一个span标签，则关闭它
                    if ((this.lastBaseLine != null))
                    {
                        this.result.AppendLine("");
                    }
                    //如果基线已更改，则插入换行符
                    if ((this.lastBaseLine != null) &
                    curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
                    {
                        this.result.AppendLine("<br />");
                    }
                    //创建具有适当样式的HTML标签
                    this.result.AppendFormat("</code>

ログイン後にコピー

このコードを使用すると、フォントファミリ、サイズ、太字などのフォントプロパティをキャプチャしながら、PDF ドキュメントからテキストを抽出できます。コードスニペットは不完全であるため、完全に実行するには、<span> タグを作成して閉じ、テキストコンテンツを追加するために補足する必要があります。

以上がiTextSharp を使用して PDF からテキスト書式設定情報を抽出する方法の詳細内容です。詳細については、PHP 中国語 Web サイトの他の関連記事を参照してください。