PHP DOMDocument loadHTML Cannot Encode UTF-8 Correctly
Unless the markup declares its own character encoding, DOMDocument's loadHTML method treats the input as ISO-8859-1, so multi-byte UTF-8 characters come out garbled.
The underlying parser also expects HTML4 input, which can cause further problems with HTML5 documents.
Solution:
To resolve this issue, specify the character encoding of your HTML using one of the following methods:
XML Encoding Declaration:
Content-Type Header (meta tag):
XML Encoding Prefix:
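As a minimal sketch of declaring the encoding, the following prepends either an XML encoding declaration or a Content-Type meta tag to the markup before loading it. The sample string and variable names are illustrative assumptions, and the exact behavior may vary with the installed libxml2 version.

```php
<?php
// Hypothetical sample input; any UTF-8 HTML fragment could be used here.
$html = '<p>イリノイ州シカゴ</p>';

// Option A: prepend an XML encoding declaration so the parser reads the bytes as UTF-8.
$domA = new DOMDocument();
$domA->loadHTML('<?xml encoding="UTF-8">' . $html);
$textA = $domA->getElementsByTagName('p')->item(0)->textContent;

// Option B: prepend a Content-Type meta tag that declares the charset.
$domB = new DOMDocument();
$domB->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
$textB = $domB->getElementsByTagName('p')->item(0)->textContent;
```

Both options assume you already know the input is UTF-8. Note that the prepended declaration becomes part of the parsed document, so you may want to remove that node before serializing.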
Workaround for Unknown HTML Content:
If you cannot make assumptions about the encoding, employ a workaround like SmartDOMDocument or the following PHP code:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();
Caution for PHP 8.2:
As of PHP 8.2, passing 'HTML-ENTITIES' to mb_convert_encoding generates a deprecation warning. As an alternative:
$dom->loadHTML(mb_encode_numericentity($profile, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'));
While not ideal, this method is safe: every non-ASCII character is converted to a numeric entity, so the resulting string is pure ASCII and reads the same when interpreted as ISO-8859-1.
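The entity-based workaround can be wrapped in a small helper, sketched below. The function name loadUtf8Html is hypothetical and not part of PHP or any library, and the sketch assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper; the name is illustrative only.
function loadUtf8Html(string $html): DOMDocument
{
    // Convert every code point from U+0080 upward into a numeric entity,
    // leaving a pure-ASCII string that is unaffected by the ISO-8859-1 default.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, ~0], 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    return $dom;
}

$dom = loadUtf8Html('<p>イリノイ州シカゴ</p>');
// The parser decodes the entities, so DOM text nodes hold the original characters.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

Reading text through the DOM returns UTF-8 strings, while saveHTML() may still serialize non-ASCII characters as entities.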