PHP DOMDocument loadHTML Not Encoding UTF-8 Correctly
When you parse HTML with DOMDocument::loadHTML(), UTF-8 text may come out garbled. By default, DOMDocument treats input strings as ISO-8859-1 unless the markup declares another encoding, so multi-byte UTF-8 characters are misread as sequences of single-byte characters.
Solution:
To ensure correct encoding, you can employ various methods:
Prepend Encoding Declarations: Add an XML encoding declaration or an HTML meta charset declaration in front of the markup so that the parser treats the input as UTF-8:
$contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$dom->loadHTML($contentType . $profile);
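Use the SmartDOMDocument workaround: If you cannot know whether the input HTML already contains such a declaration, you can apply the workaround used by the SmartDOMDocument library, which converts all non-ASCII characters to HTML entities before parsing: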
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
Alternative: As of PHP 8.2, passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated. Use mb_encode_numericentity() instead:
$dom->loadHTML(mb_encode_numericentity($profile, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'));
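The declaration and numeric-entity approaches above can be compared side by side. The following is a minimal sketch, assuming the dom and mbstring extensions are enabled; the helper names loadHtmlWithMeta() and loadHtmlWithEntities() and the sample markup are hypothetical and exist only for illustration.

```php
<?php
// Hypothetical helpers for illustration; they are not part of PHP or any library.

/** Load UTF-8 HTML by prepending a meta charset declaration. */
function loadHtmlWithMeta(string $html): DOMDocument
{
    $dom = new DOMDocument();
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $dom->loadHTML($meta . $html);
    return $dom;
}

/** Load UTF-8 HTML by escaping every non-ASCII character as a numeric entity. */
function loadHtmlWithEntities(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // Conversion map: code points 0x80..0x10FFFF, offset 0, mask ~0 (keep all bits).
    $map = [0x80, 0x10FFFF, 0, ~0];
    $dom->loadHTML(mb_encode_numericentity($html, $map, 'UTF-8'));
    return $dom;
}

// Sample input (an assumption, chosen to contain multi-byte characters).
$html = '<p>日本語 café</p>';

foreach ([loadHtmlWithMeta($html), loadHtmlWithEntities($html)] as $dom) {
    // Read the paragraph text back from the parsed tree.
    echo $dom->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
}
```

Both helpers should return the paragraph text unchanged. The entity variant leaves the markup's own declarations untouched, which is why it suits input that may already carry a charset declaration.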
HTML5 Considerations:
DOMDocument uses an HTML4 parser. For HTML5 documents, consider using alternative HTML parsers designed for HTML5 compliance.
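One such alternative, on PHP 8.4 and later, is the built-in Dom\HTMLDocument class. The sketch below assumes that PHP version; the sample markup is made up for illustration, and the class_exists() guard lets the script run on older versions without failing.

```php
<?php
// Sketch assuming PHP 8.4+, where Dom\HTMLDocument provides an HTML5 parser.
// The sample markup is an assumption made for illustration.
$html = '<!DOCTYPE html><html><body><p>日本語 café</p></body></html>';

if (class_exists('Dom\HTMLDocument')) {
    // LIBXML_NOERROR suppresses parser warnings; the third argument
    // states the input encoding explicitly instead of relying on detection.
    $doc = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR, 'UTF-8');
    $text = $doc->querySelector('p')->textContent;
    echo $text, PHP_EOL;
} else {
    echo 'Dom\HTMLDocument is not available on this PHP version', PHP_EOL;
}
```

On older PHP versions, a userland HTML5 parser installed through Composer would fill the same role.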
Example:
The following code uses mb_convert_encoding() so that UTF-8 text survives parsing:
$profile = "イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として";
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();