How to extract substrings from Chinese strings in PHP

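The easiest way to extract part of a string in PHP is the substr() function. However, substr() counts bytes, so it is only safe for English (single-byte) text; applied to Chinese text it can cut a character in half and produce garbled output. The usual alternative is mb_substr(), which counts characters and handles mixed Chinese and English text correctly when given the right encoding, but it requires the mbstring extension. The functions below do the job without it.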
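
To make the difference concrete, here is a minimal sketch comparing the two built-in functions. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the sample string and variable names are purely illustrative.

```php
<?php
// Illustrative sample: 3 ASCII letters followed by 2 Chinese characters (UTF-8).
$text = "abc中文";

// substr() counts bytes. Each of these Chinese characters takes 3 bytes in UTF-8,
// so a 4-byte slice ends after the first byte of "中".
$byteSlice = substr($text, 0, 4);

// mb_substr() counts characters when it is told the encoding.
$charSlice = mb_substr($text, 0, 4, 'UTF-8');

echo bin2hex($byteSlice), "\n"; // hex dump: the slice ends in a lone lead byte
echo $charSlice, "\n";
```

The byte slice is not valid UTF-8 on its own, which is what shows up as garbled text in a browser.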

The following function extracts a substring from a GB2312-encoded Chinese string:

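<?php
// Description: extract a substring from a GB2312-encoded string.
// $start and $len are byte counts; a double-byte character is never split.
function mysubstr($str, $start, $len) {
    $tmpstr = "";
    $end = min($start + $len, strlen($str));
    for ($i = 0; $i < $end; $i++) {
        if (ord(substr($str, $i, 1)) > 0xa0) {
            // Lead byte of a double-byte character: take both bytes together
            if ($i >= $start) {
                $tmpstr .= substr($str, $i, 2);
            }
            $i++;
        } else {
            if ($i >= $start) {
                $tmpstr .= substr($str, $i, 1);
            }
        }
    }
    return $tmpstr;
}
?>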
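
The lead-byte test used above can also be seen in isolation. The following sketch splits a GB2312 string into whole characters so that offsets can be given in characters rather than bytes. The function name gb_chars() is hypothetical, and the sample bytes are the GB2312 encoding of two Chinese characters written as hex escapes so that the example does not depend on the encoding of the script file.

```php
<?php
// Hypothetical helper: split a GB2312 byte string into an array of characters.
function gb_chars($str) {
    $chars = array();
    $n = strlen($str);
    for ($i = 0; $i < $n; $i++) {
        if (ord($str[$i]) > 0xa0 && $i + 1 < $n) {
            // Lead byte of a double-byte character: keep both bytes together.
            $chars[] = substr($str, $i, 2);
            $i++;
        } else {
            $chars[] = $str[$i];
        }
    }
    return $chars;
}

// "ab" followed by the GB2312 bytes of two Chinese characters.
$sample = "ab\xD6\xD0\xCE\xC4";
$chars = gb_chars($sample);

// Take 2 characters starting at character offset 1.
$piece = implode('', array_slice($chars, 1, 2));
echo count($chars), " ", bin2hex($piece), "\n";
```

Working on an array of characters costs more memory than walking the bytes once, but it makes character-based offsets straightforward.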

Substring function supporting both UTF-8 and GB2312

Extracting substrings from UTF-8 strings

To support multiple languages, strings in the database may be stored as UTF-8. During website development you may need PHP to extract part of such a string. To avoid garbled characters, use a UTF-8-aware substring function such as the one below.

For the principles of UTF-8, see the UTF-8 FAQ.

A UTF-8 encoded character consists of 1 to 4 bytes, and the exact number can be determined from the first byte. (The rules below assume no more than 3 bytes, which covers all commonly used Chinese characters.)
If the first byte is 224 or greater, it and the following 2 bytes form one UTF-8 character.
If the first byte is at least 192 and less than 224, it and the following byte form one UTF-8 character.
Otherwise the first byte is itself an ASCII character (letters, digits, and basic punctuation).
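
The lead-byte rule can be sketched as follows. The function names utf8_char_len() and utf8_chars() are hypothetical, the input is assumed to be valid UTF-8, and the sample bytes are written as hex escapes so the example does not depend on the encoding of the script file.

```php
<?php
// Hypothetical helper: byte length of the UTF-8 character that starts with $byte.
function utf8_char_len($byte) {
    $b = ord($byte);
    if ($b >= 240) return 4; // 4-byte sequence (beyond the 3-byte assumption above)
    if ($b >= 224) return 3;
    if ($b >= 192) return 2;
    return 1;                // ASCII
}

// Walk a string character by character using the lead-byte rule.
function utf8_chars($str) {
    $chars = array();
    $i = 0;
    $n = strlen($str);
    while ($i < $n) {
        $len = utf8_char_len($str[$i]);
        $chars[] = substr($str, $i, $len);
        $i += $len;
    }
    return $chars;
}

// One 1-byte, one 2-byte, and one 3-byte character.
$chars = utf8_chars("a\xC3\xA9\xE4\xB8\xAD");
echo count($chars), "\n";
```

Once the string is split this way, array_slice() and implode() give a character-based substring.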

The code of cut_str() is as follows:

<?php
/*
Chinese substring function supporting both UTF-8 and GB2312
cut_str(string, length, start, encoding);
The encoding defaults to UTF-8
The start position defaults to 0
*/

function cut_str($string, $sublen, $start = 0, $code = 'UTF-8')
{
    if ($code == 'UTF-8')
    {
        // Match one complete UTF-8 character (1 to 4 bytes) at a time
        $pa = '/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]|[\xe1-\xef][\x80-\xbf][\x80-\xbf]|\xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]|[\xf1-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf]/';
        preg_match_all($pa, $string, $t_string);

        // Append "..." only when characters remain after the extracted part
        if (count($t_string[0]) - $start > $sublen) return join('', array_slice($t_string[0], $start, $sublen))."...";
        return join('', array_slice($t_string[0], $start, $sublen));
    }
    else
    {
        // Double-byte encoding: lengths are given in Chinese characters, 2 bytes each
        $start = $start*2;
        $sublen = $sublen*2;
        $strlen = strlen($string);
        $tmpstr = '';

        for ($i=0; $i<$strlen; $i++)
        {
            if ($i>=$start && $i<($start+$sublen))
            {
                if (ord(substr($string, $i, 1))>129)
                {
                    // Lead byte: copy the whole double-byte character
                    $tmpstr.= substr($string, $i, 2);
                }
                else
                {
                    $tmpstr.= substr($string, $i, 1);
                }
            }
            // Skip the second byte of a double-byte character
            if (ord(substr($string, $i, 1))>129) $i++;
        }
        if (strlen($tmpstr)<$strlen ) $tmpstr.= "...";
        return $tmpstr;
    }
}

// This call assumes the script file itself is saved in GB2312
$str = "abcd需要截取的字符串";
echo cut_str($str, 8, 0, 'gb2312');
?>
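
As an alternative sketch for the UTF-8 branch only, PCRE's u modifier can split a string into characters without spelling out the byte ranges. The function name utf8_cut() is hypothetical; it assumes the script file is saved as UTF-8 and, like the function above, appends "..." only when something was cut off the end.

```php
<?php
// Hypothetical UTF-8-only variant of the approach above.
function utf8_cut($string, $sublen, $start = 0) {
    // With the u modifier, an empty pattern splits between code points.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // the input was not valid UTF-8
    }
    $piece = implode('', array_slice($chars, $start, $sublen));
    // Add an ellipsis when characters remain after the extracted part.
    return (count($chars) - $start > $sublen) ? $piece . '...' : $piece;
}

echo utf8_cut("abcd需要截取的字符串", 8), "\n";
```

Unlike the hand-written byte ranges, this version rejects invalid UTF-8 outright instead of silently skipping stray bytes, which may or may not be what a given application wants.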


Note: the FSubstr() function given at the end of this article is especially suitable for strings encoded with htmlspecialchars().

The utf8Substr() function below extracts a substring from a UTF-8 string using a single regular expression:

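function utf8Substr($str, $from, $len)
{
    // Skip $from characters, capture up to $len characters, and discard the rest
    return preg_replace('#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){0,'.(int)$from.'}'.
        '((?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){0,'.(int)$len.'}).*#s',
        '$1', $str);
}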

This function handles UTF-8 strings only.

Notes on the FSubstr() function below:

1. The $len parameter is measured in Chinese characters: one unit of $len equals two English characters. This keeps truncated text visually aligned.

2. If the $magic parameter is set to false, Chinese and English characters are treated equally, and exactly $len characters are taken.
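
Before the full function, the width rule from item 1 can be sketched in isolation. The function name cut_by_width() is hypothetical, and unlike FSubstr() this sketch assumes valid UTF-8 input and a script file saved as UTF-8; it counts every non-ASCII character as two units wide.

```php
<?php
// Hypothetical sketch of the width rule: ASCII counts 1, anything else counts 2.
function cut_by_width($str, $len) {
    $limit = $len * 2; // $len is given in Chinese-character units
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $width = 0;
    $out = '';
    foreach ($chars as $ch) {
        $w = (strlen($ch) > 1) ? 2 : 1;
        if ($width + $w > $limit) {
            break; // the next character would not fit
        }
        $out .= $ch;
        $width += $w;
    }
    return $out;
}

echo cut_by_width("PHP中文截取", 3), "\n";
```

With a limit of 3 units (6 columns), the three ASCII letters use 3 columns and one Chinese character uses 2; the next Chinese character would exceed the limit, so it is left out.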

function FSubstr($title, $start, $len = "", $magic = true)
{
    /**
     * powered by Smartpig
     * mailto:d.einstein@263.net
     */

    $length = 0;
    if ($len == "") $len = strlen($title);

    // If the start offset falls in the middle of a double-byte character, step back one byte
    if ($start > 0)
    {
        $cnum = 0;
        for ($i = 0; $i < $start; $i++)
        {
            if (ord(substr($title, $i, 1)) >= 128) $cnum++;
        }
        if ($cnum % 2 != 0) $start--;
    }

    if (strlen($title) <= $len) return substr($title, $start, $len);

    $alen = 0;    // single-width units (ASCII characters and named entities)
    $blen = 0;    // double-width units (Chinese characters and numeric entities)
    $realnum = 0; // characters taken so far, regardless of width

    // Entities produced by htmlspecialchars(); each counts as one single-width character
    $entities = array("&lt;", "&gt;", "&amp;", "&quot;", "&#039;");

    // The loop bound is assumed to be the string length; the original line was truncated
    $total = strlen($title);
    for ($i = $start; $i < $total; $i++)
    {
        $ctype = 0; // 1 if the current unit is double-width
        $cstep = 0; // byte length of the current unit
        $cur = substr($title, $i, 1);
        if ($cur == "&")
        {
            foreach ($entities as $entity)
            {
                if (substr($title, $i, strlen($entity)) == $entity)
                {
                    $cstep = strlen($entity);
                    break;
                }
            }
            if ($cstep > 0)
            {
                if ($magic) $alen++;
            }
            else if (preg_match('/^&#(\d+);/', substr($title, $i, 8), $match))
            {
                // Numeric entity: treated as one double-width character
                $cstep = strlen($match[0]);
                if ($magic)
                {
                    $blen++;
                    $ctype = 1;
                }
            }
            else
            {
                // A bare "&" that does not start an entity counts as one ordinary character
                $cstep = 1;
                if ($magic) $alen++;
            }
        }
        else if (ord($cur) >= 128)
        {
            // Double-byte (GB2312) character
            $cstep = 2;
            if ($magic)
            {
                $blen++;
                $ctype = 1;
            }
        }
        else
        {
            $cstep = 1;
            if ($magic) $alen++;
        }

        $length += $cstep;
        $i += $cstep - 1;
        $realnum++;

        if ($magic)
        {
            if (($blen * 2 + $alen) == ($len * 2)) break;
            if (($blen * 2 + $alen) == ($len * 2 + 1))
            {
                // A double-width character overshot the limit by one column: drop it
                if ($ctype == 1) $length -= $cstep;
                break;
            }
        }
        else
        {
            if ($realnum == $len) break;
        }
    }

    return substr($title, $start, $length);
}

FSubstr() also correctly handles numeric character entities (of the form &#NNNNN;) in GB2312 text.
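
The idea of counting an entity as a single character can be sketched on its own. The function names entity_units() and entity_cut() are hypothetical, and this sketch assumes valid UTF-8 input rather than GB2312; it does not apply the double-width rule.

```php
<?php
// Hypothetical sketch: split text into units, keeping each entity in one piece.
function entity_units($str) {
    // Either a named or numeric entity such as &lt; or &#1234;, or any single character.
    preg_match_all('/&(?:[a-zA-Z]+|#\d+);|./us', $str, $m);
    return $m[0];
}

// Take the first $count units, never cutting through an entity.
function entity_cut($str, $count) {
    return implode('', array_slice(entity_units($str), 0, $count));
}

echo entity_cut("a&lt;b&amp;c", 4), "\n";
```

A plain substr() of 4 bytes on the same input would stop inside the first entity and leave a broken "&lt" in the output.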
