Failed to Encode UTF-8 with PHP DOMDocument::loadHTML
In certain scenarios, attempting to parse HTML using DOMDocument::loadHTML can result in encoding issues, particularly when UTF-8 encoding is involved. This article explores the reasons behind these problems and provides several solutions to address them effectively.
Cause of the Issue
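By default, DOMDocument::loadHTML treats its input string as ISO-8859-1, the HTTP/1.1 default character set, unless the markup itself declares another encoding. A UTF-8 string passed without such a declaration is therefore misread: each byte of a multi-byte character is decoded as a separate single-byte character, which produces garbled text in the resulting document.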
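The following minimal sketch reproduces the problem. The sample string and the variable names are assumptions chosen for illustration; the exact garbled output can vary with the libxml2 version, so the script only checks that the text no longer matches the input.

```php
<?php
// "\u{00E9}" is the character "é"; the escape guarantees UTF-8 bytes
// regardless of how this source file is saved.
$profile = "<p>caf\u{00E9}</p>";

$dom = new DOMDocument();
$dom->loadHTML($profile); // no encoding hint is given to the parser

$text = $dom->getElementsByTagName('p')->item(0)->textContent;

// The two UTF-8 bytes of "é" (0xC3 0xA9) are decoded as two separate
// single-byte characters, so the parsed text differs from the input.
var_dump($text === "caf\u{00E9}");
```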
Alternative Solutions
1. Prepending Encoding Declarations
For straightforward (X)HTML snippets, prepend an XML or meta charset declaration to instruct the parser to treat the string as UTF-8:
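$dom = new DOMDocument();
// Option A: prepend an XML encoding declaration.
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
// Option B: prepend an http-equiv meta tag.
$contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$dom->loadHTML($contentType . $profile);
// Option C: prepend an HTML5-style meta charset tag.
$dom->loadHTML('<meta charset="utf-8">' . $profile);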
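As a hedged, self-contained illustration of the prepend technique, the sketch below uses the http-equiv meta tag from the snippet above and checks that the text survives parsing. The sample string is an assumption made for the example.

```php
<?php
// Sample UTF-8 fragment (illustrative); "\u{00E9}" is "é".
$profile = "<p>caf\u{00E9}</p>";

// Declaration that tells the HTML parser the bytes are UTF-8.
$contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();
$dom->loadHTML($contentType . $profile);

// DOM methods always return UTF-8, so the text should match the input.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === "caf\u{00E9}");
```

Note that the prepended meta element becomes part of the parsed document, so it may need to be removed again if the document is later serialized.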
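2. Converting to HTML Entities (the SmartDOMDocument Workaround)
If you cannot know whether the string already contains an encoding declaration, use the workaround taken from SmartDOMDocument, which converts every non-ASCII character to an HTML entity before parsing: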
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
3. PHP 8.2 Workaround
As of PHP 8.2, passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated. On PHP 8.2 and later, use mb_encode_numericentity instead:
$dom->loadHTML(mb_encode_numericentity($profile, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'));
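To make the entity-based approach concrete, here is a minimal sketch that wraps the call above in a helper and verifies the round trip. The function name loadHtmlAsUtf8 and the sample string are hypothetical and not part of the original article.

```php
<?php
// Hypothetical helper; the name loadHtmlAsUtf8 is illustrative only.
function loadHtmlAsUtf8(string $html): DOMDocument
{
    // Convert every code point from U+0080 to U+10FFFF into a numeric
    // entity such as &#233;, leaving the ASCII markup untouched.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, ~0], 'UTF-8');

    $dom = new DOMDocument();
    $dom->loadHTML($ascii); // input is now pure ASCII, so no encoding guess is needed
    return $dom;
}

// Sample UTF-8 fragment (illustrative); "\u{00E9}" is "é".
$profile = "<p>caf\u{00E9}</p>";
$dom = loadHtmlAsUtf8($profile);

// The parser decodes the entity back into the original character.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === "caf\u{00E9}");
```

Because the input handed to the parser contains only ASCII, this variant does not depend on whether the markup already carries an encoding declaration.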
Conclusion
By understanding the cause of encoding problems and employing the appropriate solutions, developers can effectively parse HTML with UTF-8 encoding using PHP's DOMDocument::loadHTML method.