Extracting Dynamically Generated HTML using .NET's WebBrowser and mshtml.HTMLDocument
Fetching dynamically generated HTML content presents a challenge when using .NET's WebBrowser
or mshtml.HTMLDocument
individually. A superior method combines both, as shown in the code example below:
<code class="language-csharp">using Microsoft.Win32; using System; using System.ComponentModel; using System.Diagnostics; using System.Threading; using System.Threading.Tasks; using System.Windows.Forms; namespace WbFetchPage { public partial class MainForm : Form { public MainForm() { SetFeatureBrowserEmulation(); InitializeComponent(); this.Load += MainForm_Load; } // Initiate the asynchronous HTML retrieval async void MainForm_Load(object sender, EventArgs e) { try { var cts = new CancellationTokenSource(10000); // 10-second timeout var html = await LoadDynamicPage("https://www.google.com/#q=where+am+i", cts.Token); MessageBox.Show(html.Substring(0, 1024) + "..." ); // Display a truncated result } catch (Exception ex) { MessageBox.Show(ex.Message); } } // Asynchronous function to retrieve the HTML content async Task<string> LoadDynamicPage(string url, CancellationToken token) { // Navigate and wait for DocumentCompleted event var tcs = new TaskCompletionSource<bool>(); WebBrowserDocumentCompletedEventHandler handler = (s, arg) => tcs.TrySetResult(true); using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true)) { this.webBrowser.DocumentCompleted += handler; try { this.webBrowser.Navigate(url); await tcs.Task; // Wait for page load } finally { this.webBrowser.DocumentCompleted -= handler; } } // Get the root HTML element var documentElement = this.webBrowser.Document.GetElementsByTagName("html")[0]; // Asynchronously poll for HTML changes string html = documentElement.OuterHtml; while (true) { // Wait asynchronously (cancellation possible) await Task.Delay(500, token); // Continue polling if the browser is busy if (this.webBrowser.IsBusy) continue; string htmlNow = documentElement.OuterHtml; if (html == htmlNow) break; // No changes, exit loop html = htmlNow; } // Check for cancellation token.ThrowIfCancellationRequested(); return html; } // Enable HTML5 emulation (for IE10+) // More details: https://stackoverflow.com/a/18333982/1768303 static void SetFeatureBrowserEmulation() { if (LicenseManager.UsageMode != LicenseUsageMode.Runtime) return; var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName); Registry.SetValue(@"HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION", appName, 10000, RegistryValueKind.DWord); } } }</code>
This code uses the WebBrowser
to navigate and the DocumentCompleted
event for initial page load. It then employs an asynchronous polling mechanism (Task.Delay()
) to monitor changes in the OuterHtml
property. The loop terminates when no further changes are detected and the browser is idle, returning the fully rendered HTML. This robust approach effectively handles dynamic web content.
The above is the detailed content of How to Efficiently Retrieve Dynamically Generated HTML Using .NET's WebBrowser and mshtml.HTMLDocument?. For more information, please follow other related articles on the PHP Chinese website!