A long time ago, I found a function that converts GB-encoded text to UTF-8, and I used it together with a GB-to-Unicode mapping table (gb2312.txt) to output Chinese characters in GD. Later I found that the output was garbled whenever the text contained Western characters. I then came across a modified version of the code that solved the problem. A comparative analysis of the two functions follows.
First, here is the function that converts a Unicode code point to UTF-8. This part is the same before and after the modification:
function u2utf8($c)
{
    // Each UTF-8 byte is appended as a decimal number, not as a raw byte;
    // the caller has to turn the digits back into bytes.
    $str = "";
    if ($c < 0x80) {
        $str .= $c;
    }
    else if ($c < 0x800) {
        $str .= (0xC0 | ($c >> 6));
        $str .= (0x80 | ($c & 0x3F));
    }
    else if ($c < 0x10000) {
        $str .= (0xE0 | ($c >> 12));
        $str .= (0x80 | (($c >> 6) & 0x3F));
        $str .= (0x80 | ($c & 0x3F));
    }
    else if ($c < 0x200000) {
        $str .= (0xF0 | ($c >> 18));
        $str .= (0x80 | (($c >> 12) & 0x3F));
        $str .= (0x80 | (($c >> 6) & 0x3F));
        $str .= (0x80 | ($c & 0x3F));
    }
    return $str;
}
This follows the UTF-8 encoding rules: depending on which Unicode range the code point falls in, the function applies different shifts and bitwise AND/OR operations to produce the UTF-8 bytes. Note that each byte value is appended to $str as a decimal number rather than as a raw byte, so the caller still has to convert those numbers into bytes. For details of the encoding rules, see http://www.utf8.org/.
This is the function that converted GB to UTF-8 before the modification. It calls the u2utf8 function above.
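As a hedged illustration of those range rules, the sketch below encodes a single code point directly into raw bytes with chr(), rather than into the decimal text that u2utf8 builds. The function name codepoint_to_utf8 is an assumption made for this example and is not part of the original code.

```php
<?php
// Minimal sketch: encode one Unicode code point as a UTF-8 byte string.
// codepoint_to_utf8 is a hypothetical name, not from the original article.
function codepoint_to_utf8(int $c): string
{
    if ($c < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($c);
    }
    if ($c < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($c >> 6))
             . chr(0x80 | ($c & 0x3F));
    }
    if ($c < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($c >> 12))
             . chr(0x80 | (($c >> 6) & 0x3F))
             . chr(0x80 | ($c & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($c >> 18))
         . chr(0x80 | (($c >> 12) & 0x3F))
         . chr(0x80 | (($c >> 6) & 0x3F))
         . chr(0x80 | ($c & 0x3F));
}

// U+4E2D is a common Chinese character; it needs three bytes.
echo bin2hex(codepoint_to_utf8(0x4E2D)), "\n"; // e4b8ad
```

Code points below 0x80 come out as a single byte, which is why Western characters and Chinese characters have different lengths in UTF-8.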
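function gb2utf8($gb) /* Program written by sadly www.phpx.com */
{
    if (!trim($gb))
        return $gb;
    $filename = "gb2312.txt";
    $tmp = file($filename);
    $codetable = array();
    // Build the lookup table: GB code => Unicode code point (hex string)
    foreach ($tmp as $value)
        $codetable[hexdec(substr($value, 0, 6))] = substr($value, 7, 6);
    $utf8 = "";
    // strlen() rather than a truthiness test, so a remaining "0" is not dropped
    while (strlen($gb) > 0)
    {
        if (ord(substr($gb, 0, 1)) > 127)
        {
            // Two-byte GB character: look up its Unicode code point
            $pair = substr($gb, 0, 2);
            $gb = substr($gb, 2, strlen($gb));
            $utf8 .= u2utf8(hexdec($codetable[hexdec(bin2hex($pair)) - 0x8080]));
        }
        else
        {
            // Western character: the byte is removed from $gb before it is
            // read, so it is the following byte that gets passed on
            $gb = substr($gb, 1, strlen($gb));
            $utf8 .= u2utf8(substr($gb, 0, 1));
        }
    }
    $ret = "";
    // Decode in fixed groups of three decimal digits, one byte per group
    for ($i = 0; $i < strlen($utf8); $i += 3)
        $ret .= chr(substr($utf8, $i, 3));
    return $ret;
}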
In the while loop, each Chinese character is converted to Unicode through the lookup table and then passed to u2utf8. After the while loop ends, a for loop cuts the accumulated string into fixed groups of three and turns each group back into a byte. That works for pure Chinese text: every Chinese character encodes to three UTF-8 bytes (see the rule description at http://www.utf8.org/), and because u2utf8 appends each byte as a decimal number of at least 128, every byte occupies exactly three digits. The loop takes no account of Western characters, whose UTF-8 encoding is a single byte and which therefore do not contribute a full group of three. So if the content begins with Western characters, or mixes Western characters with Chinese characters, the fixed "every three" split falls out of alignment and the output is garbled.
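The misalignment can be reproduced in isolation. The sketch below assumes, as the u2utf8 code suggests, that each byte arrives as three decimal digits and that the decoding loop reads the string three characters at a time. The helper name digits_to_bytes is hypothetical, and the explicit & 0xFF only makes the wrap-around of out-of-range values visible.

```php
<?php
// Sketch of the decoding step: every 3 decimal digits become one byte.
// digits_to_bytes is a hypothetical helper that mirrors the for loop.
function digits_to_bytes(string $digits): string
{
    $out = "";
    for ($i = 0; $i < strlen($digits); $i += 3) {
        // Non-numeric groups cast to 0; values above 255 are wrapped.
        $out .= chr(((int) substr($digits, $i, 3)) & 0xFF);
    }
    return $out;
}

// "228", "184", "173" are the bytes E4 B8 AD of U+4E2D.
echo bin2hex(digits_to_bytes("228184173")), "\n"; // e4b8ad

// One leading Western character shifts every later group by one position,
// so the groups become "A22", "818", "417", "3" and nothing decodes correctly.
echo bin2hex(digits_to_bytes("A228184173")), "\n";
```

The second call shows why a single Western character is enough to corrupt everything that follows it, not just the character itself.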
The following is the modified function:
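function gb2utf8($gb) /* Program written by sadly, modified by agun */
{
    if (!trim($gb))
        return $gb;
    $filename = "gb2312.txt";
    $tmp = file($filename);
    $codetable = array();
    // Build the lookup table: GB code => Unicode code point (hex string)
    foreach ($tmp as $value)
        $codetable[hexdec(substr($value, 0, 6))] = substr($value, 7, 6);
    $ret = "";
    $utf8 = "";
    // strlen() rather than a truthiness test, so a remaining "0" is not dropped
    while (strlen($gb) > 0)
    {
        if (ord(substr($gb, 0, 1)) > 127)
        {
            // Two-byte GB character: look up its Unicode code point
            $pair = substr($gb, 0, 2);
            $gb = substr($gb, 2, strlen($gb));
            $utf8 = u2utf8(hexdec($codetable[hexdec(bin2hex($pair)) - 0x8080]));
            // Decode only this character's digits, three per byte
            for ($i = 0; $i < strlen($utf8); $i += 3)
                $ret .= chr(substr($utf8, $i, 3));
        }
        else
        {
            // Western character: copy the single byte unchanged
            $ret .= substr($gb, 0, 1);
            $gb = substr($gb, 1, strlen($gb));
        }
    }
    return $ret;
}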
The modified function performs all three steps (GB to Unicode, Unicode to UTF-8, and assembling the UTF-8 bytes into a character) inside a single loop. The decisive change is the assembly step: it now runs inside the branch that has already determined whether the character is Western or Chinese, so a Western character is copied as one byte while a Chinese character is assembled from its three bytes. So the result is correct!
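The same single-pass idea can be sketched with raw bytes instead of decimal digits. This is only an illustration under stated assumptions: utf8_bytes, gb_to_utf8_sketch and the two-entry table are hypothetical stand-ins, and a real table would be loaded from gb2312.txt as in the functions above.

```php
<?php
// Sketch of the single-pass approach. All names here are hypothetical.

// Encode one code point below 0x10000 (every GB2312 character is).
function utf8_bytes(int $c): string
{
    if ($c < 0x80) {
        return chr($c);
    }
    if ($c < 0x800) {
        return chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
    }
    return chr(0xE0 | ($c >> 12))
         . chr(0x80 | (($c >> 6) & 0x3F))
         . chr(0x80 | ($c & 0x3F));
}

function gb_to_utf8_sketch(string $gb, array $table): string
{
    $ret = "";
    $len = strlen($gb);
    $i = 0;
    while ($i < $len) {
        $byte = ord($gb[$i]);
        if ($byte > 127 && $i + 1 < $len) {
            // Chinese character: two GB bytes, keyed as (code - 0x8080).
            $key = (($byte << 8) | ord($gb[$i + 1])) - 0x8080;
            // Assumption: unmapped codes become "?" rather than failing.
            $ret .= utf8_bytes($table[$key] ?? 0x3F);
            $i += 2;
        } else {
            // Western character: copy the single byte unchanged.
            $ret .= $gb[$i];
            $i += 1;
        }
    }
    return $ret;
}

// Keys are GB codes minus 0x8080: 0xD6D0 -> 0x5650, 0xCEC4 -> 0x4E44.
$table = [0x5650 => 0x4E2D, 0x4E44 => 0x6587];
$mixed = "a" . "\xD6\xD0" . "b" . "\xCE\xC4" . "0";
echo bin2hex(gb_to_utf8_sketch($mixed, $table)), "\n";
// 61e4b8ad62e6968730
```

Because each branch appends its own finished bytes to the result, Western and Chinese characters can be interleaved freely, and a trailing "0" is preserved because the loop is driven by an index rather than by the truthiness of the remaining string.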