UTF-8 Characters Corrupted When Using file_get_contents()
The issue arises when loading HTML from an external server with UTF-8 encoding. Characters like ľ, š, č, ť, ž are corrupted and replaced with invalid characters.
The Root of the Problem
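file_get_contents() does not interpret or convert the data: it returns the raw bytes exactly as the server sent them. The corruption appears later, when those bytes are decoded with the wrong character set. This happens, for example, when the script outputs them without declaring UTF-8, or when the remote page is actually in a different encoding than the script assumes.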
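A small sketch of how this kind of corruption typically arises. The bytes themselves are intact; they only look broken once they are decoded as a single-byte character set. ISO-8859-1 is used here purely as an illustrative assumption about the wrong charset.

```php
<?php
// "š" (U+0161) is two bytes in UTF-8: 0xC5 0xA1.
$utf8 = "\xC5\xA1";

// Reading those bytes as ISO-8859-1 treats each byte as its own character.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), PHP_EOL; // c5a1
echo $misread, PHP_EOL;       // Å¡
```

The two-character result "Å¡" in place of "š" is the usual sign that UTF-8 data was decoded as a Latin-1 style charset.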
Proposed Solution
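To resolve this, make sure the fetched data is converted to one known encoding and that the output declares that same encoding. Which of the following techniques applies depends on where the mismatch occurs.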
1. Manual Encoding Conversion
Use the mb_convert_encoding() function to convert the fetched HTML to UTF-8:
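$html = file_get_contents('http://example.com/foreign.html');
// Detect among likely source encodings; adjust the candidate list to your source.
$encoding = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-2'], true);
$utf8_html = mb_convert_encoding($html, 'UTF-8', $encoding ?: 'UTF-8');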
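Detection from content alone is a guess, so a charset declared by the server is usually a better source when one is available. The following is a minimal sketch under that assumption. The helpers charset_from_headers() and to_utf8() are hypothetical names introduced for illustration, and the ISO-8859-2 fallback is an assumption suited to Central European pages.

```php
<?php
// Hypothetical helper: pull the charset out of raw HTTP header lines,
// e.g. "Content-Type: text/html; charset=ISO-8859-2".
function charset_from_headers(array $headers): ?string
{
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset=["\']?([\w.:-]+)/i', $line, $m)) {
            return $m[1];
        }
    }
    return null;
}

// Hypothetical helper: return $body as UTF-8, preferring the declared charset.
function to_utf8(string $body, ?string $declared = null, string $fallback = 'ISO-8859-2'): string
{
    // Bytes that are already valid UTF-8 are left untouched.
    if (mb_check_encoding($body, 'UTF-8')) {
        return $body;
    }
    return mb_convert_encoding($body, 'UTF-8', $declared ?? $fallback);
}

$headers = ['HTTP/1.1 200 OK', 'Content-Type: text/html; charset=ISO-8859-2'];
$body    = "\xB9"; // "š" encoded as ISO-8859-2

echo bin2hex(to_utf8($body, charset_from_headers($headers))), PHP_EOL; // c5a1
```

After an HTTP request made with file_get_contents(), the response header lines are available in $http_response_header and could be passed to charset_from_headers(). Note that mbstring does not support every charset name a server may send, so real code would need to handle an unknown name.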
2. Output Encoding
Tell the browser that the output is UTF-8 by sending the following header before the script produces any output:
header('Content-Type: text/html; charset=UTF-8');
3. HTML Entity Conversion
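Convert the fetched HTML to HTML entities before outputting it. Note that htmlentities() escapes the markup as well (for example, < becomes &lt;), so this suits showing the fetched source as text rather than rendering it:
$html = file_get_contents('http://example.com/foreign.html');
$html_entities = htmlentities($html, ENT_COMPAT, 'UTF-8');
echo $html_entities;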
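If the goal is to keep the markup renderable and turn only the non-ASCII characters into entities, one option (not part of the original suggestion) is mb_encode_numericentity(). A minimal sketch, using a short hard-coded sample string in place of a fetched page:

```php
<?php
$html = "<p>\u{0161}koda</p>"; // "<p>škoda</p>" as UTF-8

// Escapes the tags too: suitable for showing the source as text.
$asText = htmlentities($html, ENT_COMPAT, 'UTF-8');

// Encodes only code points from U+0080 upward, leaving the tags intact.
$asMarkup = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

echo $asMarkup, PHP_EOL; // <p>&#353;koda</p>
```

Because the result is pure ASCII, it displays correctly regardless of which charset the page declares.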
4. JSON Decoding
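If the external resource is JSON that contains the HTML, decode it with the json_decode() function, which expects UTF-8 input:
$json = file_get_contents('http://example.com/foreign.html');
$decoded_json = json_decode($json, true);
$html = $decoded_json['html'] ?? ''; // json_decode() returns null on invalid input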
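Decoding fails when the payload is not valid UTF-8, so it helps to surface that failure rather than continue with a null value. A minimal sketch, assuming a hypothetical helper html_from_json() and the same "html" key used above:

```php
<?php
// Hypothetical helper: extract the "html" field from a JSON payload.
function html_from_json(string $json): string
{
    // JSON_THROW_ON_ERROR (PHP 7.3+) raises JsonException instead of returning null.
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    if (!is_array($data) || !isset($data['html']) || !is_string($data['html'])) {
        throw new UnexpectedValueException('Payload has no string "html" field');
    }
    return $data['html'];
}

// Single quotes keep \u0161 intact for the JSON parser to decode.
$sample = '{"html":"<p>\u0161koda</p>"}';
echo html_from_json($sample), PHP_EOL; // <p>škoda</p>
```

A JsonException raised here on real data is a hint that the payload should first be converted to UTF-8.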
With these techniques, you can avoid encoding problems in content fetched through file_get_contents() and ensure that UTF-8 characters display correctly.