Why is my PHP DOMDocument loadHTML function not handling UTF-8 encoding correctly?

Patricia Arquette
Release: 2024-12-11 19:59:15

PHP DOMDocument loadHTML Not Encoding UTF-8 Correctly

Issue

You're using DOMDocument to parse HTML, but the encoding appears to be lost when the HTML is loaded. Japanese characters come back as garbled text, even though the same string displays correctly when you echo it directly.

Cause

By default, DOMDocument::loadHTML() treats the input string as ISO-8859-1 (the HTTP/1.1 default character set) unless the markup itself declares another encoding. A UTF-8 string is then decoded one byte at a time, so each multibyte character turns into several unrelated characters, which is the garbled text you see.
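
To make the misreading concrete, here is a minimal sketch that simulates it with mb_convert_encoding rather than with DOMDocument itself. It assumes the mbstring extension is available; the variable names are illustrative only.

```php
<?php
// Illustration only: simulate reading UTF-8 bytes as ISO-8859-1.
$original = "\xC3\xA9"; // "é" encoded as UTF-8 (two bytes)

// Treat each byte as one ISO-8859-1 character, then re-encode as UTF-8.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex($original), "\n";            // c3a9
echo $garbled, "\n";                      // Ã©
echo mb_strlen($garbled, 'UTF-8'), "\n";  // 2
```

One character has become two. Japanese text suffers the same way, except that each character spans three bytes and so becomes three unrelated characters.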

Solution

To ensure DOMDocument loads the HTML string with the correct encoding, you have several options:

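  1. Prepend an XML Encoding Declaration or Meta Charset Declaration: Before loading the HTML string, add <?xml encoding="UTF-8"> or <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> to the front of it. This forces the string to be treated as UTF-8.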
  2. Use SmartDOMDocument: This external library offers a loadHTMLCharset function that automatically detects and handles the correct encoding.
  3. Convert the String to HTML Entities: mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8') rewrites every non-ASCII character as an HTML entity, so the string you load into the DOMDocument is plain ASCII. Note that the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
  4. Use mb_encode_numericentity: This function replaces non-ASCII characters with numeric entities, so the string parses correctly even under the ISO-8859-1 assumption. It is a common replacement for option 3 on PHP 8.2 and later.
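
As a rough sketch of option 4, the snippet below converts every character from U+0080 upward into a numeric entity before parsing. The conversion map and variable names are assumptions chosen for illustration, and the mbstring and dom extensions are assumed to be loaded.

```php
<?php
$html = '<p>イリノイ州シカゴにて</p>';

// Conversion map: [first code point, last code point, offset, mask].
// This one covers everything outside ASCII and leaves the markup untouched.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// The result is pure ASCII, so the parser's default encoding no longer matters.
$entities = mb_encode_numericentity($html, $convmap, 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($entities);

// DOM string properties such as textContent are always UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Because the input contains only ASCII bytes after conversion, this approach does not depend on any encoding declaration being present in the markup.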

Example

Here's an example using a meta charset declaration:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();

// Add meta charset declaration
$contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$dom->loadHTML($contentType . $profile);

echo $dom->saveHTML();

This will load the HTML string with the correct UTF-8 encoding, preserving the original Japanese characters.
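
If you want to confirm that the text survived parsing, one option is to compare the parsed node's text against the original string. This is a minimal check under the same assumptions as the example above; $expected is a name introduced here for illustration.

```php
<?php
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();
$dom->loadHTML($contentType . $profile);

// textContent is returned as UTF-8, so it can be compared directly.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
$expected = 'イリノイ州シカゴにて、アイルランド系の家庭に、9';

echo $text === $expected ? "text preserved\n" : "text was altered\n";
```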


source:php.cn