Why does DOMDocument struggle with UTF-8 encoding when loading HTML strings in PHP?

DDD

Release: 2024-11-04 09:33:30

DOMDocument Encoding Woes

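The PHP DOMDocument documentation suggests that it supports UTF-8 encoding out of the box, but in practice UTF-8 text loaded from an HTML string often comes back garbled. The issue arises because DOMDocument::loadHTML() expects an HTML string in a specific encoding: unless the markup itself declares a charset, the parser has historically assumed ISO-8859-1 (Latin-1). The encoding passed to the DOMDocument constructor does not change how loadHTML() reads its input.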

Converting Strings to HTML Entities

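To resolve this issue, we need to convert the string into a form that DOMDocument can handle. One option is to convert non-ASCII characters to HTML entities, effectively escaping them so that the string contains only ASCII. This can be achieved using the mb_convert_encoding() function with the 'HTML-ENTITIES' target encoding. Note that the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2, so newer code may prefer mb_encode_numericentity().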
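The entity approach can be sketched as follows. This is a minimal illustration, not the article's original sample: it swaps in mb_encode_numericentity() for mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8') because the latter is deprecated from PHP 8.2, and it assumes the mbstring and dom extensions are loaded. The variable names are illustrative.

```php
<?php
// Assumes the mbstring and dom extensions are available.
$html = '<html><head><title>Test!</title></head><body><h1>☆ Hello ☆ World ☆</h1></body></html>';

// Escape every code point from U+0080 upward as a decimal numeric entity,
// so the string handed to the parser is pure ASCII.
// Map format: [first code point, last code point, offset, mask].
$escaped = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($escaped);

// DOM properties are always returned as UTF-8, so the stars come back intact.
$heading = $dom->getElementsByTagName('h1')->item(0)->textContent;
```

Because the escaped string no longer contains any non-ASCII bytes, the parser's guess about the input encoding stops mattering.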

Adding a Content-Type Meta Tag

Another approach is to hint at the encoding of the document by adding a <meta> tag to the beginning of the HTML string. This tag specifies the charset, in this case UTF-8:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

This meta tag will be automatically placed in the <head> section of the document, ensuring that DOMDocument properly recognizes the encoding.

Example Code

Here's an example that demonstrates the meta tag approach:

// The meta tag at the start of the string tells the parser the markup is UTF-8.
$html = '<meta http-equiv="content-type" content="text/html; charset=utf-8">
<html><head><title>Test!</title></head><body><h1>☆ Hello ☆ World ☆</h1></body></html>';

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);

// Serve the serialized document as UTF-8.
header('Content-Type: text/html; charset=utf-8');
echo $dom->saveHTML();
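If the same fix is needed in several places, the meta tag approach could be wrapped in a small helper. The sketch below is an assumption-laden illustration: loadUtf8Html() is a hypothetical name, not a PHP or DOM function, and the libxml_use_internal_errors() calls are included only because prepending a tag ahead of <html> may make some libxml versions report "misplaced tag" warnings.

```php
<?php
// Hypothetical helper (not part of PHP): prepend a charset hint, then parse.
function loadUtf8Html(string $html): DOMDocument
{
    $meta = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The hint has to travel inside the markup itself.
    $dom->loadHTML($meta . $html);

    // Discard collected warnings and restore the caller's error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadUtf8Html('<html><head><title>Test!</title></head><body><h1>☆ Hello ☆ World ☆</h1></body></html>');
$heading = $dom->getElementsByTagName('h1')->item(0)->textContent;
```

Reading the text back through textContent, rather than through saveHTML(), is a quick way to check that the characters were parsed as intended.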

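With either method, DOMDocument handles the UTF-8 characters correctly, and the program outputs the desired result (libxml may also emit a doctype line, and the exact whitespace can differ between versions):

<html>
<head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>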