How to Retrieve Dynamically Generated HTML Code Using .NET without Limitations

Linda Hamilton

Released: 2024-10-18 08:40:03

How can dynamically generated HTML code be retrieved using .NET's WebBrowser or mshtml.HTMLDocument?

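Overview

Retrieving the HTML that a web page generates dynamically (for example, through scripts that run after the initial load) is a common task in web automation and scraping scenarios. .NET offers two options for this: the System.Windows.Forms.WebBrowser class and the mshtml.HTMLDocument interface. Using either of them effectively can be difficult, however.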

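The WebBrowser class

The System.Windows.Forms.WebBrowser class is designed for embedding web pages in an application. It supports custom navigation and document events, but its ability to capture dynamically generated HTML is limited: DocumentCompleted fires once the document has loaded, whereas scripts may keep modifying the DOM afterwards.

The following code snippet demonstrates the use of WebBrowser: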

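<code class="csharp">using System;
using System.Windows.Forms;
using mshtml; // requires a reference to the Microsoft.mshtml interop assembly

namespace WebBrowserTest
{
    public class Program
    {
        [STAThread] // WebBrowser wraps an ActiveX control and must run on an STA thread
        public static void Main()
        {
            WebBrowser wb = new WebBrowser();

            // Subscribe before navigating so the event cannot be missed.
            wb.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
            {
                // Dump the DOM as it exists when DocumentCompleted fires;
                // content that scripts add later is not captured here.
                IHTMLDocument2 doc = (IHTMLDocument2)wb.Document.DomDocument;
                foreach (IHTMLElement element in doc.all)
                {
                    System.Diagnostics.Debug.WriteLine(element.outerHTML);
                }
            };

            Form f = new Form();
            f.Controls.Add(wb);
            wb.Navigate("https://www.google.com/#q=where+am+i");
            Application.Run(f); // the message loop drives navigation and events
        }
    }
}</code>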

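The mshtml.HTMLDocument interface

The mshtml.HTMLDocument interface gives direct access to the underlying HTML document object. However, it requires manual navigation and rendering, which makes it less convenient for dynamically generated content.

The following code snippet demonstrates the use of mshtml.HTMLDocument: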

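<code class="csharp">using System;
using System.Net;
using mshtml; // requires a reference to the Microsoft.mshtml interop assembly

namespace HTMLDocumentTest
{
    public class Program
    {
        [STAThread] // MSHTML COM objects are apartment-threaded
        public static void Main()
        {
            string source;
            using (var client = new WebClient())
            {
                // Downloads only the static markup returned by the server;
                // the "#q=..." fragment is never sent, and no scripts run here.
                source = client.DownloadString("https://www.google.com/#q=where+am+i");
            }

            IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
            doc.write(source);
            doc.close(); // finish the write so the parser can complete the document

            foreach (IHTMLElement e in doc.all)
            {
                System.Diagnostics.Debug.WriteLine(e.outerHTML);
            }
        }
    }
}</code>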

A more robust approach

To overcome the limitations of WebBrowser and mshtml.HTMLDocument, the following approach can be used:

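Create a WebBrowser control.
Navigate to the target URL and handle the DocumentCompleted event, after which the underlying document object (exposed natively as mshtml.IHTMLDocument2) is available.
Use polling combined with a WebBrowser.IsBusy check to detect when the page has finished rendering.
Get the root html element and read its outer HTML once it has stopped changing between polls.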

Sample code

The following C# code demonstrates this approach:

<code class="csharp">using System;
using System.ComponentModel;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;
using mshtml;

namespace DynamicHTMLFetcher
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            InitializeComponent();
            this.webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
            this.Load += MainForm_Load;
        }

        private async void MainForm_Load(object sender, EventArgs e)
        {
            try
            {
                var cts = new CancellationTokenSource(10000); // cancel in 10s
                var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token);
                // Show only the first 1024 characters; Math.Min avoids an exception on shorter pages.
                MessageBox.Show(html.Substring(0, Math.Min(html.Length, 1024)) + "...");
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }

        private async Task<string> LoadDynamicPage(string url, CancellationToken token)
        {
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler handler = (s, arg) => tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                this.webBrowser.DocumentCompleted += handler;
                try
                {
                    this.webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    this.webBrowser.DocumentCompleted -= handler;
                }
            }

            var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];
            var html = documentElement.OuterHtml;
            while (true)
            {
                await Task.Delay(500, token);
                if (this.webBrowser.IsBusy)
                    continue;
                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop
                html = htmlNow;
            }

            token.ThrowIfCancellationRequested();
            return html;
        }

        private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            // Intentional no-op handler, we receive the DocumentCompleted event in the calling method.
        }
    }
}</code>

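With this approach, you can obtain the rendered HTML even when the page generates it dynamically. Note that the polling is a heuristic: the page is treated as finished once its markup stops changing between two consecutive 500 ms checks, and the whole operation is cancelled after 10 seconds.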
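The "poll until the markup stops changing" step can also be looked at in isolation from the WebBrowser control. The following is a minimal sketch, not part of the original sample: `StablePoller`, `WaitUntilStableAsync`, and the fake page in `Demo` are hypothetical names introduced here for illustration, and the sketch assumes a compiler that supports `async Task Main` (C# 7.1 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class StablePoller
{
    // Hypothetical helper (not part of .NET): calls getSnapshot once per interval
    // until two consecutive snapshots are equal, then returns that snapshot.
    public static async Task<string> WaitUntilStableAsync(
        Func<string> getSnapshot, TimeSpan interval, CancellationToken token)
    {
        string previous = getSnapshot();
        while (true)
        {
            await Task.Delay(interval, token); // throws if the token is cancelled
            string current = getSnapshot();
            if (current == previous)
                return current; // unchanged over one interval: treat as settled
            previous = current;
        }
    }
}

public static class Demo
{
    public static async Task Main()
    {
        // Stand-in for a page whose markup changes twice and then settles.
        var states = new Queue<string>(new[] { "<p>1</p>", "<p>2</p>", "<p>3</p>" });
        string last = null;
        Func<string> fakePage = () => states.Count > 0 ? (last = states.Dequeue()) : last;

        using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10)))
        {
            string html = await StablePoller.WaitUntilStableAsync(
                fakePage, TimeSpan.FromMilliseconds(50), cts.Token);
            Console.WriteLine(html); // the settled markup
        }
    }
}
```

Inside the form, the snapshot function would presumably be something like `() => documentElement.OuterHtml` with a 500 ms interval. This sketch leaves out the WebBrowser.IsBusy check that the sample performs before each comparison, so it illustrates only the stability test itself.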
