The PHP DOMDocument documentation suggests that it supports UTF-8 out of the box, but as the code sample provided demonstrates, this is not always the case. The issue arises because DOMDocument::loadHTML() does not assume UTF-8: unless the input declares its own encoding, the string is treated as ISO-8859-1 (Latin-1), the historical default, so each byte of a multibyte UTF-8 character is read as a separate character.
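The misreading can be reproduced with a short sketch. It assumes the dom extension is enabled and that the script file is saved as UTF-8; the sample markup and variable names are illustrative, and the exact garbled characters may vary with the underlying libxml version.

```php
<?php
// UTF-8 markup with no encoding declaration anywhere in the string.
$original = '☆ Hello ☆ World ☆';
$undeclaredHtml = '<html><body><h1>' . $original . '</h1></body></html>';

$undeclaredDom = new DOMDocument();
$undeclaredDom->loadHTML($undeclaredHtml);

// DOM string properties are always returned as UTF-8, so a mismatch here
// means the parser misinterpreted the input bytes while loading.
$parsedText = $undeclaredDom->getElementsByTagName('h1')->item(0)->textContent;

// Expected to report a mismatch when the parser falls back to a
// single-byte encoding such as ISO-8859-1.
echo $parsedText === $original ? 'match' : 'mismatch', PHP_EOL;
```

The ASCII parts of the heading survive untouched; only the multibyte characters are affected.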
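To resolve this issue, we need to convert the string into a form that DOMDocument can handle. One option is to convert non-ASCII characters to HTML entities, effectively escaping them so that the input is pure ASCII. This can be achieved using the mb_convert_encoding() function with the 'HTML-ENTITIES' target encoding. Note that this target encoding is deprecated as of PHP 8.2, where mb_encode_numericentity() can be used to the same effect.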
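A minimal sketch of the entity approach follows. It assumes the mbstring and dom extensions are available, and it swaps in mb_encode_numericentity() for the 'HTML-ENTITIES' conversion because that target encoding is deprecated as of PHP 8.2. The sample markup and variable names are illustrative.

```php
<?php
// UTF-8 input; the script file itself is assumed to be saved as UTF-8.
$utf8Html = '<html><head><title>Test!</title></head><body><h1>☆ Hello ☆ World ☆</h1></body></html>';

// Replace every code point from U+0080 to U+10FFFF with a numeric entity.
// Map format: [start, end, offset, mask].
$entityMap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$asciiHtml = mb_encode_numericentity($utf8Html, $entityMap, 'UTF-8');
// Similar in spirit to the call named above (deprecated since PHP 8.2):
// $asciiHtml = mb_convert_encoding($utf8Html, 'HTML-ENTITIES', 'UTF-8');

// The string is now pure ASCII, so the parser's default encoding no longer matters.
$entityDom = new DOMDocument();
$entityDom->loadHTML($asciiHtml);

// The parser decodes the entities, and DOM returns the text as UTF-8.
$entityText = $entityDom->getElementsByTagName('h1')->item(0)->textContent;
echo $entityText, PHP_EOL;
```

Because only characters outside the ASCII range are escaped, the markup itself (angle brackets, quotes, ampersands) is left as it was.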
Another approach is to hint at the encoding of the document by adding a meta tag to the beginning of the HTML string. This tag declares the charset, in this case UTF-8:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
This meta tag will be automatically placed in the <head> section of the document, ensuring that DOMDocument properly recognizes the encoding.
Here's an example that demonstrates the meta tag approach:
// Prepend a meta tag so the parser treats the input as UTF-8.
$html = '<meta http-equiv="content-type" content="text/html; charset=utf-8">
<html><head><title>Test!</title></head><body><h1>☆ Hello ☆ World ☆</h1></body></html>';
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);
// Tell the browser that the response is UTF-8 as well.
header('Content-Type: text/html; charset=utf-8');
echo $dom->saveHTML();
By using either method, we can ensure that DOMDocument handles the UTF-8 characters correctly, so the page renders the intended heading:
☆ Hello ☆ World ☆