Why does DOMDocument struggle with UTF-8 encoding when loading HTML strings in PHP?

DDD
Release: 2024-11-04 09:33:30

DOMDocument Encoding Woes

PHP's DOMDocument works with UTF-8 internally, so it is easy to assume that UTF-8 input is handled out of the box, but this is not always the case. The issue arises because DOMDocument::loadHTML() hands the string to libxml2's HTML parser, which historically treats the input as ISO-8859-1 (Latin-1) unless the markup itself declares another encoding. Multi-byte UTF-8 characters such as ☆ are then read one byte at a time and come out garbled.
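To see the symptom, the following minimal sketch loads a UTF-8 string with no encoding hint and compares the text that comes back. It assumes the dom extension is enabled; the variable names are illustrative. On builds where the parser falls back to a single-byte charset, the two values differ, though the exact result may depend on the PHP and libxml2 versions in use.

```php
<?php
// Illustrative only: $source and $parsed are hypothetical names.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$source = '☆ Hello ☆ World ☆';

$dom = new DOMDocument();
$dom->loadHTML('<h1>' . $source . '</h1>'); // no charset hint anywhere in the input

$parsed = $dom->getElementsByTagName('h1')->item(0)->textContent;

// DOMDocument always returns UTF-8, so a mismatch here means the input
// bytes were decoded with the wrong charset while parsing.
var_dump($parsed === $source);
echo bin2hex($source), "\n", bin2hex($parsed), "\n";
```

Comparing the two hex dumps makes it easy to spot whether each original byte was expanded into a separate character.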

Converting Strings to HTML Entities

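To resolve this issue, we need to hand DOMDocument a string it can interpret unambiguously. One option is to convert non-ASCII characters to HTML entities, effectively escaping them so that the string contains only ASCII. This can be achieved using the mb_convert_encoding() function with the 'HTML-ENTITIES' target encoding, for example mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'). Note that this target encoding is deprecated as of PHP 8.2, so newer code may prefer another way of producing the entities.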
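As a variation on this idea that avoids the 'HTML-ENTITIES' pseudo-encoding, the sketch below escapes non-ASCII characters with mb_encode_numericentity(). This is a swapped-in technique rather than the call named above, and toNumericEntities() is a hypothetical helper name; it assumes the mbstring and dom extensions are available.

```php
<?php
// Sketch: escape every non-ASCII character as a numeric entity before parsing.
// toNumericEntities() is a hypothetical helper, not part of PHP.
function toNumericEntities(string $html): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$html = '<h1>☆ Hello ☆ World ☆</h1>';
$escaped = toNumericEntities($html); // ASCII-only; the tags are left untouched

$dom = new DOMDocument();
$dom->loadHTML($escaped);

// The parser decodes the entities, and DOMDocument returns the text as UTF-8.
$text = $dom->getElementsByTagName('h1')->item(0)->textContent;
echo $text, "\n";
```

Because the escaped string contains only ASCII, the parser's default charset no longer matters for these characters.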

Adding a Content-Type Meta Tag

Another approach is to hint at the encoding of the document by adding a <meta> tag to the beginning of the HTML string. This tag specifies the charset, in this case UTF-8:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

This meta tag will be automatically placed in the <head> section of the document, ensuring that DOMDocument properly recognizes the encoding.
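The placement and its effect can be checked directly. The sketch below prepends the hint to a bare fragment and inspects the resulting tree; the fragment and variable names are hypothetical, and the dom extension is assumed to be enabled.

```php
<?php
// Sketch: prepend a charset hint to a UTF-8 fragment before parsing.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$fragment = '<h1>☆ Hello ☆ World ☆</h1>'; // hypothetical input
$hint = '<meta http-equiv="content-type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();
$dom->loadHTML($hint . $fragment);

// Show which element the parser attached the hint to.
$meta = $dom->getElementsByTagName('meta')->item(0);
echo $meta->parentNode->nodeName, "\n";

// Read the heading back to confirm the characters survived parsing.
$heading = $dom->getElementsByTagName('h1')->item(0)->textContent;
echo $heading, "\n";
```

If the meta element should not appear in the final markup, it can be removed from the tree after loading, since the hint is only needed while parsing.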

Example Code

Here's an example that demonstrates the meta tag approach:

// The meta tag at the start of the string tells the parser to read the input as UTF-8.
$html = '<meta http-equiv="content-type" content="text/html; charset=utf-8">
<html><head><title>Test!</title></head><body><h1>☆ Hello ☆ World ☆</h1></body></html>';

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);

// Serve the serialized document as UTF-8 as well.
header('Content-Type: text/html; charset=utf-8');
echo $dom->saveHTML();

With either method, DOMDocument handles the UTF-8 characters correctly, and the program produces output along these lines (any DOCTYPE line that saveHTML() adds is not shown, and the exact whitespace may differ):

<html>
<head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title>Test!</title>
</head>
<body>
    <h1>☆ Hello ☆ World ☆</h1>
</body>
</html>

source:php.cn