Function in PHP to detect whether a file is UTF-8 encoded

WBOY
Release: 2016-07-28 08:25:47
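//
// Test whether a text is UTF-8 encoded.
//
// Return value:
//   1 - content that starts with a BOM
//   2 - pure UTF-8 content
//   3 - content that is more likely UTF-8
//   4 - content that is less likely UTF-8
//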
function utf8_check($text)
{
  $utf8_bom = chr(0xEF).chr(0xBB).chr(0xBF);
  
  // BOM check: the text must start with the three BOM bytes
  if (strncmp($text, $utf8_bom, 3) === 0)
    return 1;
  
  $text_len = strlen($text);
  
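  // UTF-8 is a variable-length encoding. A character encoded in a single byte has its highest bit set to 0.
  // In a multi-byte character, the number of consecutive 1 bits at the top of the first byte gives the
  // number of bytes in the encoding, and every following byte starts with 10.
  // The original UTF-8 design allowed up to 6 bytes; RFC 3629 limits valid UTF-8 to 4 bytes,
  // but the 5- and 6-byte forms are still accepted here.
  //
  // Table:
  // < 0x80 1 byte  0xxxxxxx
  // < 0xE0 2 bytes 110xxxxx 10xxxxxx
  // < 0xF0 3 bytes 1110xxxx 10xxxxxx 10xxxxxx
  // < 0xF8 4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  // < 0xFC 5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  // < 0xFE 6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx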
  
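  $bad   = 0; // number of characters that violate the UTF-8 rules
  $good  = 0; // number of characters that follow the UTF-8 rules
  
  $need_check = 0; // number of continuation bytes to check after a multi-byte lead byte
  $have_check = 0; // number of continuation bytes checked so far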
  
  for ($i = 0; $i < $text_len; $i++) {
    $c = ord($text[$i]);

    if ($need_check > 0) {
      $c &= 0xC0; // keep only the top two bits
      
      $have_check++;
      
      // a continuation byte must be 10xxxxxx (0x80 ~ 0xBF)
      if ($c != 0x80) {
        // rewind so that the loop re-examines the byte right after the lead byte
        $i -= $have_check;
        $need_check = 0;
        $have_check = 0;
        $bad++;
      }
      else if ($need_check == $have_check) {
        $need_check = 0;
        $have_check = 0;
        $good++;
      }
      
      continue;
    }
    
    if ($c < 0x80)      // 0xxxxxxx
      $good++;
    else if ($c < 0xC0) // 10xxxxxx: continuation byte without a lead byte
      $bad++;
    else if ($c < 0xE0) // 110xxxxx
      $need_check = 1;
    else if ($c < 0xF0) // 1110xxxx
      $need_check = 2;
    else if ($c < 0xF8) // 11110xxx
      $need_check = 3;
    else if ($c < 0xFC) // 111110xx
      $need_check = 4;
    else if ($c < 0xFE) // 1111110x
      $need_check = 5;
    else
      $bad++;
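  }
  
  // a multi-byte character cut off at the end of the text is incomplete
  if ($need_check > 0)
    $bad++;
  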
  if ($bad == 0)
    return 2;
  else if ($good > $bad)
    return 3;
  else
    return 4;
}
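
The byte patterns listed in the table inside the function are easier to follow on a concrete character. The following is a minimal sketch, assuming a hypothetical helper named `byte_patterns()` that is not part of the original code; it prints each byte of a string as eight binary digits so that the lead byte and the `10xxxxxx` continuation bytes are visible.

```php
<?php
// Hypothetical helper, not part of the original function:
// returns every byte of $text as an 8-digit binary string.
function byte_patterns(string $text): array
{
    $patterns = [];
    for ($i = 0, $n = strlen($text); $i < $n; $i++) {
        // decbin() drops leading zeros, so pad back to 8 digits
        $patterns[] = str_pad(decbin(ord($text[$i])), 8, '0', STR_PAD_LEFT);
    }
    return $patterns;
}

// The character U+4E2D is encoded in UTF-8 as the three bytes E4 B8 AD:
// the lead byte starts with 1110, both following bytes start with 10.
print_r(byte_patterns("\xE4\xB8\xAD")); // 11100100, 10111000, 10101101

// An ASCII character is a single byte with the highest bit set to 0.
print_r(byte_patterns("A"));            // 01000001
```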

utf8_check() takes the contents of a file as a string (for example, as read with file_get_contents()) and returns a code from 1 to 4 that indicates how likely the content is to be UTF-8 encoded.
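
Because utf8_check() is a byte-counting heuristic, its result can be cross-checked against a strict yes/no validation. The following is a sketch only, assuming two hypothetical helpers, `has_utf8_bom()` and `is_strict_utf8()`, whose names do not come from the original article. It relies on the documented behavior that `preg_match()` with the `/u` modifier returns false when the subject is not valid UTF-8.

```php
<?php
// Hypothetical helpers for cross-checking; not part of the original function.

// True when $text starts with the UTF-8 BOM (EF BB BF).
function has_utf8_bom(string $text): bool
{
    return strncmp($text, "\xEF\xBB\xBF", 3) === 0;
}

// True when $text is well-formed UTF-8 according to PCRE.
// The empty pattern matches any valid subject, so the result is 1;
// on malformed UTF-8 preg_match() returns false instead.
function is_strict_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

var_dump(has_utf8_bom("\xEF\xBB\xBFhello")); // bool(true)
var_dump(is_strict_utf8("\xE4\xB8\xAD"));    // bool(true):  U+4E2D in UTF-8
var_dump(is_strict_utf8("\xD6\xD0"));        // bool(false): the same character in GBK
```

This strict check is not equivalent to utf8_check(): it gives no "more likely / less likely" grading, and it rejects the 5- and 6-byte forms that the function above accepts.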

source:php.cn