Why does DOMDocument fail to handle UTF-8 characters correctly when loading HTML?

Nov 04, 2024, 10:12 AM

DOMDocument's Inability to Handle UTF-8 Characters

Consider a setup where the web server sends its responses as UTF-8, all files are saved as UTF-8, and every relevant setting is configured for UTF-8. Even so, a small test program that passes an HTML document through DOMDocument and prints it back produces garbled output.

Running the program (listed below) produces the following output:

<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>Test!</title></head><body>
    <h1>&acirc;&#152;&#134; Hello &acirc;&#152;&#134; World &acirc;&#152;&#134;</h1>    
</body></html>

which a browser renders as:

â Hello â World â


The program:

<code class="php">$html = <<<HTML
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;

$dom = new DOMDocument("1.0", "utf-8");
$dom->loadHTML($html);

header("Content-Type: text/html; charset=utf-8");
echo($dom->saveHTML());</code>

Cause

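DOMDocument::loadHTML() expects a string of HTML and parses it by HTML 4 rules, under which the default character encoding is ISO-8859-1 (ISO Latin Alphabet No. 1). Unless the input declares its encoding in a form the parser recognizes, each byte of a multi-byte UTF-8 sequence is read as a separate ISO-8859-1 character, so any character outside the ASCII range comes out garbled. In the test program, the HTML5 shorthand <meta charset="utf-8"> was evidently not accepted as such a declaration.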
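
As a rough illustration of this misreading (a sketch that assumes the mbstring extension is available; the variable names are only illustrative), the three UTF-8 bytes of ☆ can be reinterpreted as ISO-8859-1 by hand:

```php
<?php
// U+2606 WHITE STAR, written as an escape so the file's own encoding does not matter.
$star = "\u{2606}";

// UTF-8 stores the character as three bytes.
echo bin2hex($star), "\n"; // e29886

// Treat those bytes as ISO-8859-1: each byte becomes a character of its own.
$misread = mb_convert_encoding($star, 'UTF-8', 'ISO-8859-1');
echo mb_strlen($misread, 'UTF-8'), "\n"; // 3

// List the resulting code points: U+00E2 (â), U+0098 and U+0086.
foreach (mb_str_split($misread, 1, 'UTF-8') as $char) {
    printf("U+%04X\n", mb_ord($char, 'UTF-8'));
}
```

The last two code points are 152 and 134 in decimal, which would explain why saveHTML() can emit a sequence such as &acirc;&#152;&#134; in place of each star.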

Solution

Converting Non-ASCII Characters to Entities

To fix this, convert every character outside the ASCII range (above code point 127, 0x7F) into an HTML entity before parsing. One way is mb_convert_encoding with the HTML-ENTITIES target encoding (note that PHP 8.2 deprecated this pseudo-encoding, so newer code should prefer a different conversion function):

<code class="php">$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");</code>
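
Because the HTML-ENTITIES target is deprecated as of PHP 8.2, one possible substitute is mb_encode_numericentity. The following is a minimal sketch under that assumption, not the only way to do it; the helper name toNumericEntities is hypothetical, and the file is assumed to be saved as UTF-8:

```php
<?php
/**
 * Hypothetical helper: replace every code point from U+0080 upward
 * with a numeric character reference such as &#9734;.
 */
function toNumericEntities(string $html): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

$html = '<!doctype html><html><head><title>Test!</title></head>'
      . '<body><h1>☆ Hello ☆ World ☆</h1></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML(toNumericEntities($html));

// DOM strings are returned as UTF-8, so the stars survive the round trip.
$h1 = $dom->getElementsByTagName('h1')->item(0);
echo $h1->textContent, "\n";
```

After this conversion the parser only ever sees ASCII bytes, so its assumed input encoding no longer matters.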

Adding Content-Type Meta Tag

Alternatively, declare the encoding inside the document itself with an HTML 4-style <meta http-equiv> tag that sets the charset to UTF-8:

<code class="html"><meta http-equiv="content-type" content="text/html; charset=utf-8"></code>

The parser treats this tag as an encoding declaration and interprets the rest of the input as UTF-8. If the input might not already contain such a tag, it can be prepended to the string before loading: even when the tag ends up outside the <head> section, the parser places it in the head of the resulting document.
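
A minimal sketch of the prepending approach, assuming the input is an HTML fragment with no encoding declaration of its own (the variable names are illustrative, and the file is assumed to be saved as UTF-8):

```php
<?php
// A fragment with no <head> and no encoding declaration.
$fragment = '<h1>☆ Hello ☆ World ☆</h1>';

// The declaration to put in front of the input.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($meta . $fragment);

// The heading text should come back as intact UTF-8.
$text = $dom->getElementsByTagName('h1')->item(0)->textContent;
echo $text, "\n";

// Show which element the parser attached the <meta> tag to.
$metaNode = $dom->getElementsByTagName('meta')->item(0);
echo $metaNode->parentNode->nodeName, "\n";
```

Note that the prepended tag remains part of the parsed document, so it also appears in whatever saveHTML() returns unless it is removed again.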
