DOMDocument's Inability to Handle UTF-8 Characters
Consider a setup where the web server sends its responses as UTF-8, every file is saved as UTF-8, and every relevant setting is configured for UTF-8. Even so, a small test program written to check the output does not behave as expected.
Running the program produces the following output:
<!DOCTYPE html> <html><head><meta charset="utf-8"><title>Test!</title></head><body> <h1>☆ Hello ☆ World ☆</h1> </body></html>
which the browser displays with garbled characters where the stars should be.
The program:
<code class="php">$html = <<<HTML
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Test!</title>
</head>
<body>
<h1>☆ Hello ☆ World ☆</h1>
</body>
</html>
HTML;

$dom = new DOMDocument("1.0", "utf-8");
$dom->loadHTML($html);

header("Content-Type: text/html; charset=utf-8");
echo($dom->saveHTML());</code>
The underlying cause is that DOMDocument::loadHTML() expects a string in HTML format, and its parser follows HTML 4, where the default character encoding is ISO-8859-1 (ISO Latin Alphabet No. 1). Unless told otherwise, the parser reads each byte of a multi-byte UTF-8 sequence as a separate ISO-8859-1 character, so characters outside that encoding come out corrupted. The "utf-8" argument passed to the DOMDocument constructor does not change how loadHTML() decodes its input.
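To make the byte-level effect concrete, here is a minimal sketch that does not involve DOMDocument at all. It only mimics the misreading by decoding the UTF-8 bytes of one star as ISO-8859-1; the variable names are illustrative and not part of the original program.

```php
<?php
// One star (U+2606) occupies three bytes in UTF-8: E2 98 86.
$star = "\u{2606}";

// Mimic the misreading: treat those three bytes as ISO-8859-1
// and write the result back out as UTF-8.
$misread = mb_convert_encoding($star, "UTF-8", "ISO-8859-1");

echo bin2hex($star), "\n";              // bytes of the original character
echo mb_strlen($star, "UTF-8"), "\n";   // character count when read as UTF-8
echo mb_strlen($misread, "UTF-8"), "\n"; // character count after the misreading
```

The single star turns into three unrelated characters, which is the kind of corruption seen in the browser.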
Converting Non-ASCII Characters to Entities
To fix this, convert every character outside the ASCII range (above 127, or 0x7F) into an HTML entity. mb_convert_encoding does this when given HTML-ENTITIES as the target encoding:
<code class="php">$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");</code>
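The HTML-ENTITIES target encoding is deprecated as of PHP 8.2, so on newer versions the same idea may need a different call. The following is a sketch of one possible substitute using mb_encode_numericentity; the helper name encodeNonAscii and the shortened test document are assumptions made for illustration, not part of the original program.

```php
<?php
// Hypothetical helper: replaces every code point above ASCII with a
// numeric entity such as &#9734; and leaves ASCII untouched.
function encodeNonAscii(string $html): string
{
    // Convmap layout: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $convmap, "UTF-8");
}

$html = "<!doctype html><html><head><title>Test!</title></head>"
      . "<body><h1>☆ Hello ☆ World ☆</h1></body></html>";

$dom = new DOMDocument();
// The parser now only sees ASCII, so its default encoding no longer matters.
$dom->loadHTML(encodeNonAscii($html));

// textContent is returned as UTF-8 with the stars intact.
$heading = $dom->getElementsByTagName("h1")->item(0)->textContent;
echo $heading, "\n";
```

Because the input is pure ASCII after the conversion, this approach does not depend on how the parser guesses the encoding.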
Adding Content-Type Meta Tag
Alternatively, add a <meta> tag to the document itself that declares the charset as UTF-8:
<code class="html"><meta http-equiv="content-type" content="text/html; charset=utf-8"></code>
This tag acts as a hint to DOMDocument, so that it interprets the input as UTF-8. Even if the tag is positioned outside the <head> section, it is moved into the head automatically, in line with the HTML 2.0 specification.
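As a rough sketch of how this hint can be applied in code, the tag can be prepended to the markup before it is loaded. The fragment and the variable names below are illustrative assumptions; libxml_use_internal_errors() is used only to keep any parser warnings about the markup out of the output.

```php
<?php
// Illustrative fragment with no <head> of its own.
$fragment = "<h1>☆ Hello ☆ World ☆</h1>";

// The encoding hint described above, placed in front of the markup.
$hint = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($hint . $fragment);

libxml_clear_errors();

// With the hint in place, the stars survive the round trip.
$heading = $dom->getElementsByTagName("h1")->item(0)->textContent;
echo $heading, "\n";
```

Compared with converting to entities, this leaves the original text untouched but adds a <meta> element to the resulting document.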