function utf8_gb2312($str, $default = 'gb2312')
{
    // Step 1: strip all ASCII bytes; a pure-ASCII string is valid in both
    // encodings, so fall back to the default.
    $str = preg_replace("/[\x01-\x7f]+/", "", $str);
    if (empty($str)) {
        return $default;
    }

    $preg = array(
        // GB2312: every character is a high-byte/low-byte pair in these ranges.
        "gb2312" => "/^([\xa1-\xf7][\xa0-\xfe])+$/",
        // UTF-8: the CJK Unified Ideographs block; this range also covers the
        // traditional Chinese characters encoded there.
        "utf-8"  => "/^[\x{4e00}-\x{9fa5}]+$/u",
    );

    $option = ($default == 'gb2312') ? 'utf-8' : 'gb2312';

    // Step 2: if the string does not match the pattern for $default,
    // assume it is the other encoding.
    if (!preg_match($preg[$default], $str)) {
        return $option;
    }

    // Step 3: try converting from $default to $option. If the conversion
    // fails, the original was probably not really $default-encoded.
    $str = @iconv($default, $option, $str);
    if (empty($str)) {
        return $option;
    }

    // Step 4: otherwise the string really is in the default encoding.
    return $default;
}
The default encoding is gb2312; I counted and found that roughly 90% of the input is gb2312, so the detection function must above all not misreport a gb2312 string as utf-8. The basic idea:
1. Strip all ASCII bytes. If the whole string was ASCII, report gb2312.
2. Assume the string is gb2312 and check with a regular expression whether it really is; if not, it is utf-8.
3. Otherwise, use iconv to convert the string to utf-8. If the conversion fails, the original was probably not genuine gb2312 (I made the regular expression as precise as I could, but the gb2312 code space is not contiguous and still has holes), so the final answer is utf-8.
4. Otherwise it is gb2312.
After adding this check, only one out of 1000 keywords came out garbled, far fewer than the nearly 100 before.
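On PHP builds with the mbstring extension, the same heuristic can be sketched more simply, because a non-ASCII byte string that validates as UTF-8 is almost never valid GB2312 at the same time. This is a minimal alternative of my own (the function name `detect_gb2312_utf8` is hypothetical, not from the original), not the author's code:

```php
<?php
// Sketch of the same two-way GB2312/UTF-8 guess using mbstring.
// Assumes ext-mbstring is loaded. Pure-ASCII input falls back to
// $default, matching the behavior of the function above.
function detect_gb2312_utf8(string $str, string $default = 'gb2312'): string
{
    // Step 1: strip ASCII; an all-ASCII string is valid in both encodings.
    $str = preg_replace('/[\x01-\x7f]+/', '', $str);
    if ($str === '' || $str === null) {
        return $default;
    }
    // Steps 2-3 collapsed: valid UTF-8 multi-byte sequences have a strict
    // lead-byte/continuation-byte structure that GB2312 byte pairs
    // virtually never satisfy, so a UTF-8 validity check decides it.
    if (mb_check_encoding($str, 'UTF-8')) {
        return 'utf-8';
    }
    return 'gb2312';
}
```

The trade-off is the reverse of the author's: instead of pattern-matching the GB2312 byte ranges and falling back to iconv, it trusts UTF-8's self-synchronizing structure, which is cheap to validate and leaves fewer holes than the GB2312 regex.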