How to Retrieve Dynamically Generated HTML Code Using .NET without Limitations?

Linda Hamilton

Released: 2024-10-18 08:40:03

How to Retrieve Dynamically Generated HTML Code Using .NET's WebBrowser or mshtml.HTMLDocument?

Introduction

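Retrieving HTML that a page generates dynamically (for example, through AJAX calls or other client-side script) is a common task in web automation and scraping. On Windows, .NET offers two ways to do this: the System.Windows.Forms.WebBrowser control and the mshtml.HTMLDocument COM object, accessed through interfaces such as mshtml.IHTMLDocument2. Neither signals when scripts have finished changing the page, so capturing the final markup takes extra work.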

The WebBrowser Class

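The System.Windows.Forms.WebBrowser control embeds a web page in your application and exposes navigation and document events. Its DocumentCompleted event fires when a document has finished loading, which can be before scripts have finished modifying the DOM, so HTML read in that handler may be missing dynamically generated content.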

The following code snippet illustrates using WebBrowser:

<code class="csharp">using System;
using System.Windows.Forms;
using mshtml;

namespace WebBrowserTest
{
    public class Program
    {
        [STAThread] // the WebBrowser control requires a single-threaded apartment
        public static void Main()
        {
            WebBrowser wb = new WebBrowser();

            // Subscribe before navigating so the event cannot be missed.
            wb.DocumentCompleted += delegate(object sender, WebBrowserDocumentCompletedEventArgs e)
            {
                // Dump the markup of every element as it stands when loading completes.
                mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)wb.Document.DomDocument;
                foreach (IHTMLElement element in doc.all)
                {
                    System.Diagnostics.Debug.WriteLine(element.outerHTML);
                }
            };
            wb.Navigate("https://www.google.com/#q=where+am+i");

            Form f = new Form();
            f.Controls.Add(wb);
            Application.Run(f);
        }
    }
}</code>

The mshtml.HTMLDocument Interface

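The mshtml.HTMLDocument COM object gives direct access to the underlying HTML document through interfaces such as mshtml.IHTMLDocument2. However, you have to download the markup and write it into the document yourself, and no browser control drives navigation or rendering for you, which makes it less convenient for dynamically generated content.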

The following code snippet illustrates using mshtml.HTMLDocument:

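<code class="csharp">using System;
using System.Net;
using mshtml;

namespace HTMLDocumentTest
{
    public class Program
    {
        [STAThread] // MSHTML COM objects expect a single-threaded apartment
        public static void Main()
        {
            string html;
            using (var client = new WebClient())
            {
                // Fetches only the markup the server returns; no scripts have run yet.
                html = client.DownloadString("https://www.google.com/#q=where+am+i");
            }

            mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)new mshtml.HTMLDocument();
            doc.write(html);
            doc.close(); // finish the write so the document can complete parsing

            foreach (IHTMLElement e in doc.all)
            {
                System.Diagnostics.Debug.WriteLine(e.outerHTML);
            }
        }
    }
}</code>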
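
One detail worth checking in the snippet above: the part of a URL after # is a fragment, and HTTP clients do not send it to the server. A plain download of https://www.google.com/#q=where+am+i therefore requests only the base page, and anything driven by the fragment depends on client-side script. The following minimal sketch uses System.Uri to show the split. The class name FragmentDemo and its two methods are illustrative and not part of the original example.

<code class="csharp">using System;

public static class FragmentDemo
{
    // Returns the part of the URL that is sent in an HTTP request:
    // scheme, authority, path, and query (everything except the fragment).
    public static string RequestedPart(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Query);
    }

    // Returns the fragment, including the leading '#',
    // or an empty string when the URL has no fragment.
    public static string FragmentPart(string url)
    {
        return new Uri(url).Fragment;
    }
}</code>

For the URL used in this article, RequestedPart returns https://www.google.com/ and FragmentPart returns #q=where+am+i, which suggests why the downloaded markup alone is unlikely to contain the search results.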

A More Robust Approach

To overcome the limitations of WebBrowser and mshtml.HTMLDocument, you can use the following approach:

Create a WebBrowser control.
Navigate to the target URL and wait for the DocumentCompleted event, after which the document is available through WebBrowser.Document (or as an mshtml.IHTMLDocument2 via Document.DomDocument).
Use a combination of polling and checking WebBrowser.IsBusy to detect when the page has finished rendering.
Get the root element and poll its OuterHtml property until it becomes stable.

Example Code

The following C# code demonstrates this approach:

<code class="csharp">using System;
using System.ComponentModel;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;
using mshtml;

namespace DynamicHTMLFetcher
{
    public partial class MainForm : Form
    {
        public MainForm()
        {
            InitializeComponent();
            this.webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;
            this.Load += MainForm_Load;
        }

        private async void MainForm_Load(object sender, EventArgs e)
        {
            try
            {
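                using (var cts = new CancellationTokenSource(10000)) // cancel after 10 s
                {
                    var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token);
                    // Show at most the first 1024 characters; Substring(0, 1024) alone throws on shorter pages.
                    MessageBox.Show(html.Substring(0, Math.Min(html.Length, 1024)) + "...");
                }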
            }
            catch (Exception ex)
            {
                MessageBox.Show(ex.Message);
            }
        }

        private async Task<string> LoadDynamicPage(string url, CancellationToken token)
        {
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler handler = (s, arg) => tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                this.webBrowser.DocumentCompleted += handler;
                try
                {
                    this.webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    this.webBrowser.DocumentCompleted -= handler;
                }
            }

            var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0];
            var html = documentElement.OuterHtml;
            while (true)
            {
                await Task.Delay(500, token);
                if (this.webBrowser.IsBusy)
                    continue;
                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop
                html = htmlNow;
            }

            token.ThrowIfCancellationRequested();
            return html;
        }

        private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            // Intentional no-op handler, we receive the DocumentCompleted event in the calling method.
        }
    }
}</code>

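This approach makes it likely that you obtain the fully rendered HTML even when it is dynamically generated, but it is a heuristic rather than a guarantee. A page that keeps updating never looks stable and ends in the 10-second cancellation, and a page that pauses for longer than the 500 ms poll interval between updates may be captured too early.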
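
The "poll until stable" step can also be looked at in isolation from the WebBrowser control. The following is a minimal sketch, assuming the content is available through a delegate that returns the current snapshot. The names StablePoller and WaitUntilStableAsync are hypothetical helpers introduced for illustration and do not come from the original code.

<code class="csharp">using System;
using System.Threading;
using System.Threading.Tasks;

public static class StablePoller
{
    // Calls 'read' once per 'interval' until two consecutive snapshots are equal,
    // then returns that snapshot. If 'token' is cancelled first, Task.Delay throws
    // an OperationCanceledException, which propagates to the caller.
    public static async Task<string> WaitUntilStableAsync(
        Func<string> read, TimeSpan interval, CancellationToken token)
    {
        string previous = read();
        while (true)
        {
            await Task.Delay(interval, token);
            string current = read();
            if (string.Equals(previous, current, StringComparison.Ordinal))
                return current; // no change since the last poll
            previous = current;
        }
    }
}</code>

In LoadDynamicPage, the loop could then be replaced by a call such as StablePoller.WaitUntilStableAsync(() => documentElement.OuterHtml, TimeSpan.FromMilliseconds(500), token), although that version drops the WebBrowser.IsBusy check unless the delegate accounts for it. When awaited from the form's UI thread, the continuation resumes on that thread, so the delegate still reads the document from the thread that owns the control.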
