Table of Contents
简易分词演示
Home php教程 php手册 用PHP简易实现中文分词

用PHP简易实现中文分词

Jun 21, 2016 am 09:05 AM
char gt nbsp this

中文

hehe, 用PHP去做中文分词并不是一个太明智的举动, :p

下面是我根据网上找的一个字典档, 简易实现的一个分词程序.

(注: 字典档是gdbm格式, key是词 value是词频, 约4万个常用词)

完整的程序演示及下载请参见: http://root.twomice.net/my_php4/dict/chinese_segment.php

//中文分词系统简易实现办法
//切句单位:凡是ascii值//常见双字节符号:《》,。、?“”;:!¥…… %$#@^&*()[]{}|\/"'
//可以考虑加入超常见中文字: 的 和 是 不 了 啊 (不过有特殊字比如 "打的" "郑和" .. :p)

//计算时间
function getmicrotime(){
    list($usec, $sec) = explode(" ",microtime());
    return ((float)$usec + (float)$sec);
}
$time_start = getmicrotime();


//词典类
class ch_dictionary {
    var $_id;

    function ch_dictionary($fname = "") {
        if ($fname != "") {
            $this->load($fname);
        }
    }

    // 根据文件名载入字典 (gdbm数据档案)
    function load($fname) {
        $this->_id = dba_popen($fname, "r", "gdbm");
        if (!$this->_id) {
            echo "failed to open the dictionary.($fname)
\n";
            exit;
        }
    }

    // 根据词语返回频率, 不存在返回-1
    function find($word) {
        $freq = dba_fetch($word, $this->_id);
        if (is_bool($freq)) $freq = -1;
        return $freq;
    }
}

// 分词类: (逆向)
// 先将输入的字串正向切成句子, 然后一句一句的分词, 返回由词组成的数组.
class ch_word_split {
    var $_mb_mark_list;    // 常见切分句子的全角标点
    var $_word_maxlen;    // 单个词最大可能长度(汉字字数)
    var $_dic;        // 词典...
    var $_ignore_mark;    // true or false
   
    function ch_word_split () {
        $this->_mb_mark_list = array(","," ","。","!","?",":","……","、","“","”","《","》","(",")");
        $this->_word_maxlen  = 12;    // 12个汉字
        $this->_dic = NULL;
        $this->_ignore_mark = true;
    }

    // 设定字典
    function set_dic($fname) {
        $this->_dic = new ch_dictionary($fname);
    }

    function set_ignore_mark($set) {
        if (is_bool($set)) $this->_ignore_mark = $set;
    }

    // 将字串切成句子再加以切分成词
    function string_split($str, $func = "") {       
        $ret = array();
       
        if ($func == "" || !function_exists($func)) $func = "";       
       
        $len = strlen($str);
        $qtr = "";

        for ($i = 0; $i             $char = $str[$i];

            if (ord($char)                 // 读取到一个半角字符
                if (!empty($qtr)) {
                    $tmp = $this->_sen_split($qtr);
                    $qtr = "";

                    if ($func != "") call_user_func($func, $tmp);                   
                    else $ret = array_merge($ret, $tmp);                   
                }

                // 如果是单词或数字. 根据 char 将数据读取到 >= 0xa1为止
                if ($this->_is_alnum($char)) {
                    do {
                        if (($i+1) >= $len) break;
                        $char2 = substr($str, $i + 1, 1);
                        if (!$this->_is_alnum($char2)) break;

                        $char .= $char2;
                        $i++;
                    } while (1);

                    if ($func != "") call_user_func($func, array($char));
                    else $ret[] = $char;                   
                }
                elseif ($char == ' ' || $char == "\t") {
                    // nothing.
                    continue;
                }
                elseif (!$this->_ignore_mark) {
                    if ($func != "") call_user_func($func, array($char));
                    else $ret[] = $char;                   
                }
            }
            else {
                // 双字节字符.
                $i++;
                $char .= $str[$i];
               
                if (in_array($char, $this->_mb_mark_list)) {
                    if (!empty($qtr)) {
                        $tmp = $this->_sen_split($qtr);
                        $qtr = "";

                        if ($func != "") call_user_func($func, $tmp);
                        else $ret = array_merge($ret, $tmp);
                    }

                    if (!$this->_ignore_mark) {
                        if ($func != "") call_user_func($func, array($char));
                        else $ret[] = $char;
                    }
                }
                else {
                    $qtr .= $char;
                }
            }
        }
       
        if (strlen($qtr) > 0) {
            $tmp = $this->_sen_split($qtr);

            if ($func != "") call_user_func($func, $tmp);           
            else $ret = array_merge($ret, $tmp);           
        }

        // return value
        if ($func == "") {
            return $ret;
        }
        else {
            return true;
        }
    }

    // 将句子切成词, 逆向
    function _sen_split($sen) {
        $len = strlen($sen) / 2;
        $ret = array();

        for ($i = $len - 1; $i >= 0; $i--) {
            // 如: 这是一个分词程序
           
            // 先取得最后一个字
            $w = substr($sen, $i * 2, 2);

            // 最终的词长
            $wlen = 1;
           
            // 开始逆向匹配到最大长度.
            $lf = 0; // last freq
            for ($j = 1; $j _word_maxlen; $j++) {
                $o = $i - $j;
                if ($o                 $w2 = substr($sen, $o * 2, ($j + 1) * 2);
               
                $tmp_f = $this->_dic->find($w2);
                //echo "{$i}.{$j}: $w2 (f: $tmp_f)\n";
                if ($tmp_f > $lf) {
                    $lf = $tmp_f;
                    $wlen = $j + 1;
                    $w = $w2;
                }
            }
            // 根据 $wlen 将 $i 偏移了
            $i = $i - $wlen + 1;
            array_push($ret, $w);
        }

        $ret = array_reverse($ret);
        return $ret;
    }

    // 判断字符是不是 字母数字_- [0-9a-z_-]
    function _is_alnum($char) {
        $ord = ord($char);
        if ($ord == 45 || $ord == 95 || ($ord >= 48 && $ord             return true;
        if (($ord >= 97 && $ord = 65 && $ord             return true;
        return false;
    }
}


// 分词后的回调函数
function call_back($ar) {   
    foreach ($ar as $tmp) {
        echo $tmp . " ";
        //flush();
    }
}

// 实例(如果没有输入就从 sample.txt中读取):
$wp = new ch_word_split();
$wp->set_dic("dic.db");

if (!isset($_REQUEST['testdat']) || empty($_REQUEST['testdat'])) {
    $data = file_get_contents("sample.txt");
}
else {
    $data = & $_REQUEST['testdat'];
}

// output
echo "

简易分词演示

\n";
echo "
\n";
echo "分词结果(" . strlen($data) . " chars):
\n
\n本次分词耗时: $time seconds
\n";
?>



您也可以在下面文本框中输入文字,提交后试验分词效果:







附:

  • 本程序源码: chinese_segment.php (简易实现方式)

  • 需要的字典: dic.db (gdbm格式)

  •  


    附:
    (简易中文分词实现完整代码及字典下载)
    http://php.twomice.net/show_hdr.php?xname=BORRG11&dname=P7SRG11&xpos=19
    (C版简易中文分词服务程序(cscwsd))
    http://php.twomice.net/show_hdr.php?xname=BORRG11&dname=P7SRG11&xpos=40


     

     



    Statement of this Website
    The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

    Hot AI Tools

    Undresser.AI Undress

    Undresser.AI Undress

    AI-powered app for creating realistic nude photos

    AI Clothes Remover

    AI Clothes Remover

    Online AI tool for removing clothes from photos.

    Undress AI Tool

    Undress AI Tool

    Undress images for free

    Clothoff.io

    Clothoff.io

    AI clothes remover

    AI Hentai Generator

    AI Hentai Generator

    Generate AI Hentai for free.

    Hot Article

    R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
    2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
    Repo: How To Revive Teammates
    1 months ago By 尊渡假赌尊渡假赌尊渡假赌
    Hello Kitty Island Adventure: How To Get Giant Seeds
    4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

    Hot Tools

    Notepad++7.3.1

    Notepad++7.3.1

    Easy-to-use and free code editor

    SublimeText3 Chinese version

    SublimeText3 Chinese version

    Chinese version, very easy to use

    Zend Studio 13.0.1

    Zend Studio 13.0.1

    Powerful PHP integrated development environment

    Dreamweaver CS6

    Dreamweaver CS6

    Visual web development tools

    SublimeText3 Mac version

    SublimeText3 Mac version

    God-level code editing software (SublimeText3)

    Solution: Your organization requires you to change your PIN Solution: Your organization requires you to change your PIN Oct 04, 2023 pm 05:45 PM

    The message "Your organization has asked you to change your PIN" will appear on the login screen. This happens when the PIN expiration limit is reached on a computer using organization-based account settings, where they have control over personal devices. However, if you set up Windows using a personal account, the error message should ideally not appear. Although this is not always the case. Most users who encounter errors report using their personal accounts. Why does my organization ask me to change my PIN on Windows 11? It's possible that your account is associated with an organization, and your primary approach should be to verify this. Contacting your domain administrator can help! Additionally, misconfigured local policy settings or incorrect registry keys can cause errors. Right now

    How to adjust window border settings on Windows 11: Change color and size How to adjust window border settings on Windows 11: Change color and size Sep 22, 2023 am 11:37 AM

    Windows 11 brings fresh and elegant design to the forefront; the modern interface allows you to personalize and change the finest details, such as window borders. In this guide, we'll discuss step-by-step instructions to help you create an environment that reflects your style in the Windows operating system. How to change window border settings? Press + to open the Settings app. WindowsI go to Personalization and click Color Settings. Color Change Window Borders Settings Window 11" Width="643" Height="500" > Find the Show accent color on title bar and window borders option, and toggle the switch next to it. To display accent colors on the Start menu and taskbar To display the theme color on the Start menu and taskbar, turn on Show theme on the Start menu and taskbar

    How to change title bar color on Windows 11? How to change title bar color on Windows 11? Sep 14, 2023 pm 03:33 PM

    By default, the title bar color on Windows 11 depends on the dark/light theme you choose. However, you can change it to any color you want. In this guide, we'll discuss step-by-step instructions for three ways to change it and personalize your desktop experience to make it visually appealing. Is it possible to change the title bar color of active and inactive windows? Yes, you can change the title bar color of active windows using the Settings app, or you can change the title bar color of inactive windows using Registry Editor. To learn these steps, go to the next section. How to change title bar color in Windows 11? 1. Using the Settings app press + to open the settings window. WindowsI go to "Personalization" and then

    OOBELANGUAGE Error Problems in Windows 11/10 Repair OOBELANGUAGE Error Problems in Windows 11/10 Repair Jul 16, 2023 pm 03:29 PM

    Do you see "A problem occurred" along with the "OOBELANGUAGE" statement on the Windows Installer page? The installation of Windows sometimes stops due to such errors. OOBE means out-of-the-box experience. As the error message indicates, this is an issue related to OOBE language selection. There is nothing to worry about, you can solve this problem with nifty registry editing from the OOBE screen itself. Quick Fix – 1. Click the “Retry” button at the bottom of the OOBE app. This will continue the process without further hiccups. 2. Use the power button to force shut down the system. After the system restarts, OOBE should continue. 3. Disconnect the system from the Internet. Complete all aspects of OOBE in offline mode

    How to enable or disable taskbar thumbnail previews on Windows 11 How to enable or disable taskbar thumbnail previews on Windows 11 Sep 15, 2023 pm 03:57 PM

    Taskbar thumbnails can be fun, but they can also be distracting or annoying. Considering how often you hover over this area, you may have inadvertently closed important windows a few times. Another disadvantage is that it uses more system resources, so if you've been looking for a way to be more resource efficient, we'll show you how to disable it. However, if your hardware specs can handle it and you like the preview, you can enable it. How to enable taskbar thumbnail preview in Windows 11? 1. Using the Settings app tap the key and click Settings. Windows click System and select About. Click Advanced system settings. Navigate to the Advanced tab and select Settings under Performance. Select "Visual Effects"

    What are the differences between Huawei GT3 Pro and GT4? What are the differences between Huawei GT3 Pro and GT4? Dec 29, 2023 pm 02:27 PM

    Many users will choose the Huawei brand when choosing smart watches. Among them, Huawei GT3pro and GT4 are very popular choices. Many users are curious about the difference between Huawei GT3pro and GT4. Let’s introduce the two to you. . What are the differences between Huawei GT3pro and GT4? 1. Appearance GT4: 46mm and 41mm, the material is glass mirror + stainless steel body + high-resolution fiber back shell. GT3pro: 46.6mm and 42.9mm, the material is sapphire glass + titanium body/ceramic body + ceramic back shell 2. Healthy GT4: Using the latest Huawei Truseen5.5+ algorithm, the results will be more accurate. GT3pro: Added ECG electrocardiogram and blood vessel and safety

    Display scaling guide on Windows 11 Display scaling guide on Windows 11 Sep 19, 2023 pm 06:45 PM

    We all have different preferences when it comes to display scaling on Windows 11. Some people like big icons, some like small icons. However, we all agree that having the right scaling is important. Poor font scaling or over-scaling of images can be a real productivity killer when working, so you need to know how to customize it to get the most out of your system's capabilities. Advantages of Custom Zoom: This is a useful feature for people who have difficulty reading text on the screen. It helps you see more on the screen at one time. You can create custom extension profiles that apply only to certain monitors and applications. Can help improve the performance of low-end hardware. It gives you more control over what's on your screen. How to use Windows 11

    10 Ways to Adjust Brightness on Windows 11 10 Ways to Adjust Brightness on Windows 11 Dec 18, 2023 pm 02:21 PM

    Screen brightness is an integral part of using modern computing devices, especially when you look at the screen for long periods of time. It helps you reduce eye strain, improve legibility, and view content easily and efficiently. However, depending on your settings, it can sometimes be difficult to manage brightness, especially on Windows 11 with the new UI changes. If you're having trouble adjusting brightness, here are all the ways to manage brightness on Windows 11. How to Change Brightness on Windows 11 [10 Ways Explained] Single monitor users can use the following methods to adjust brightness on Windows 11. This includes desktop systems using a single monitor as well as laptops. let's start. Method 1: Use the Action Center The Action Center is accessible

    See all articles