
Building a Keyword Matching Project (Search Engine) Step by Step: Day 21

WBOY
Published: 2016-06-13 09:25:49


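Guest posts: The Underdog's Sneaky Form Tool; A Few Things About Databases

Object-oriented deep dive: Understanding OOP: A Newcomer's First Look; OOP Side Story: Sleepwalking Through Ideas (1); Understanding OOP: How to Find the Classes

Load balancing: Load Balancing: Concepts; Load Balancing: Implementation and Configuration (Nginx)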

 

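Rant: the articles I still owe are Understanding OOP: Converting Classes; OOP Side Story: Sleepwalking Through Ideas (2); Load Balancing: File Service Strategy; and Building a Keyword Matching Project (Search Engine) Step by Step. That really is a lot. Can I get a short break?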

 

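Day 21

Starting point: Building a Keyword Matching Project (Search Engine) Step by Step: Day 1

Recap: Building a Keyword Matching Project (Search Engine) Step by Step: Day 20

There is one piece of theory to understand today: test-driven development. I introduced the concept earlier, in Building a Keyword Matching Project (Search Engine) Step by Step: Day 11.

Today Xiao Shuaishuai had a flash of inspiration and put the idea to use.

OK, the main text starts here.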

 

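Xiao Shuaishuai showed Boss Yu the business term-splitting method he had written, and Boss Yu was pleased.

But the business dictionary only covers a limited set of phrases, and as its data grows, the splitting takes longer and longer to run.

So Boss Yu asked whether some other word-segmentation extension could make up for what business splitting misses.

After all, it is built by specialists, so it should be more dependable.

Being experienced, Boss Yu recommended that Xiao Shuaishuai look into how to use SCWS.

SCWS stands for Simple Chinese Word Segmentation (a simple Chinese word-segmentation system).
Official site: http://www.xunsearch.com/scws/index.php

Xiao Shuaishuai was delighted, of course, because it meant something new to learn.

He installed SCWS by following its installation guide.

He also installed the PHP extension and tried writing some test code: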

<?php
class TestSCWS {

    public static function split($keyword){

        if (!extension_loaded("scws")) {
            throw new Exception("scws extension load fail");
        }

        $so = scws_new();
        $so->set_charset('utf8');

        $so->send_text($keyword);
        $ret = array();
        // Keep reading batches of tokens until nothing is left
        while ($res = $so->get_result()) {
            foreach ($res as $tmp) {
                if (self::isValidate($tmp)) {
                    $ret[] = $tmp;
                }
            }
        }
        $so->close();
        return $ret;
    }

    // Reject tokens that are a bare carriage return or line feed
    public static function isValidate($scws_words)
    {
        if ($scws_words['len'] == 1 && ($scws_words['word'] == "\r" || $scws_words['word'] == "\n")) {
            return false;
        }
        return true;
    }

}


var_dump(TestSCWS::split("连衣裙xxl裙连衣裙"));

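The test passed, exactly as hoped. Xiao Shuaishuai was thrilled and went to ask Boss Yu: "Boss Yu, I know how to use SCWS now. What's the next step?"

Boss Yu, unhurried, told him: "Start by writing a ScwsSplitter that splits keywords."

Xiao Shuaishuai was very happy to have learned something new, and told Boss Yu he would.

He was as good as his word. The code: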

class ScwsSplitter {

    public $keyword;

    public function split(){

        if (!extension_loaded("scws")) {
            throw new Exception("scws extension load fail");
        }

        $keywordEntity = new KeywordEntity($this->keyword);

        $so = scws_new();
        $so->set_charset('utf8');

        $so->send_text($this->keyword);

        // Collect every valid token into the keyword entity
        while ($res = $so->get_result()) {
            foreach ($res as $tmp) {
                if ($this->isValidate($tmp)) {
                    $keywordEntity->addElement($tmp["word"]);
                }
            }
        }
        $so->close();
        return $keywordEntity;
    }

    // Reject tokens that are a bare carriage return or line feed
    public function isValidate($scws_words)
    {
        if ($scws_words['len'] == 1 && ($scws_words['word'] == "\r" || $scws_words['word'] == "\n")) {
            return false;
        }
        return true;
    }

}
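The isValidate check can be tried without SCWS installed. The sketch below applies the same rule to a few hand-written rows; the rows are made up for illustration and only carry the `word` and `len` keys that the code above reads, with `len` assumed to be a byte length. The function name `isValidToken` is hypothetical.

```php
<?php
// Hypothetical sample rows; real rows would come from $so->get_result().
// Only the keys the article's code reads ('word', 'len') are included,
// and 'len' is assumed to be the byte length of the token.
$sampleRows = array(
    array('word' => '连衣裙', 'len' => 9),
    array('word' => "\n",     'len' => 1),
    array('word' => 'xxl',    'len' => 3),
    array('word' => "\r",     'len' => 1),
);

// Same rule as isValidate(): reject a one-byte token that is a bare CR or LF.
function isValidToken(array $row)
{
    $isLineBreak = ($row['word'] === "\r" || $row['word'] === "\n");
    return !($row['len'] == 1 && $isLineBreak);
}

$kept = array();
foreach ($sampleRows as $row) {
    if (isValidToken($row)) {
        $kept[] = $row['word'];
    }
}

print_r($kept);
```

Only the two line-break rows are dropped; every other token passes through unchanged.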

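Xiao Shuaishuai ran back to Boss Yu and said: "I've finished the SCWS segmentation code."

Boss Yu was impressed by how fast he worked.

Then he asked: "Suppose I use both at once: business segmentation first, then SCWS on whatever is left over. Do you have a good way to do that?"

Xiao Shuaishuai asked: "Why do it that way? Isn't that redundant?"

Boss Yu replied: "The business has some specialized terms that SCWS cannot segment. What do we do about those?"

Xiao Shuaishuai said: "When I read the docs I saw settings for dictionaries and rule files. How about using those?"

Boss Yu said: "That would work, but how do we make sure the operations staff can maintain them? We need to learn to hand this kind of work off."

Xiao Shuaishuai: ......

After a moment of silence, he decided that since both classes were already written, using them together was the fastest option, and agreed: "All right, I'll go back and rework it..."

First, following the idea of test-driven development, he wrote the entry-point code: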

class SplitterApp {

    public static function split($keyword, $cid){

        $keywordEntity = new KeywordEntity($keyword);

        # Business segmentation
        $termSplitter = new TermSplitter($keywordEntity);
        $seg = new DBSegmentation();
        $seg->cid = $cid;
        $termSplitter->setDictionary($seg->transferDictionary());
        $termSplitter->split();

        # SCWS segmentation
        $scwsSplitter = new ScwsSplitter($keywordEntity);
        $scwsSplitter->split();

        # Handle any words or phrases that are still left over
        $elementWords = $keywordEntity->getElementWords();
        $remainKeyword = str_replace($elementWords, "::", $keywordEntity->keyword);
        $remainElements = explode("::", $remainKeyword);
        foreach ($remainElements as $element) {
            // Compare against '' so that a leftover "0" is not discarded, as empty() would do
            if ($element !== '') {
                $keywordEntity->addElement($element);
            }
        }
        return $keywordEntity;
    }
}
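The leftover step relies on str_replace processing its search array in order, which is why the known words need to be tried longest first. A minimal sketch of that step on its own follows; `remainingParts` is a hypothetical helper name, and it sorts by byte length with strlen rather than the UTF8 helper used elsewhere in the series.

```php
<?php
// Replace every known word with the "::" marker, then split on the marker;
// whatever remains between markers was not matched by any splitter.
function remainingParts($keyword, array $knownWords)
{
    // Longest words first, so a short word cannot break up a longer one.
    usort($knownWords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    $marked = str_replace($knownWords, '::', $keyword);

    $parts = array();
    foreach (explode('::', $marked) as $part) {
        if ($part !== '') {
            $parts[] = $part;
        }
    }
    return $parts;
}

print_r(remainingParts('连衣裙xl裙宽衣裙', array('裙', '连衣裙')));

// Without the sort, '裙' is replaced first and '连衣裙' can no longer match.
echo str_replace(array('裙', '连衣裙'), '::', '连衣裙xl裙宽衣裙'), "\n";
```

One caveat of the marker approach: a keyword that already contains "::" would be split at that point too.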

Xiao Shuaishuai chuckled: with the test entry point in place, the rest would be no trouble.

First up was KeywordEntity's getElementWords, so he dealt with that.

class KeywordEntity
{

    public $keyword;
    public $elements = array();

    public function __construct($keyword)
    {
        $this->keyword = $keyword;
    }

    public function addElement($word, $times = 1)
    {
        // Accumulate the count when the word has been seen before
        if (isset($this->elements[$word])) {
            $this->elements[$word]->times += $times;
        } else {
            $this->elements[$word] = new KeywordElement($word, $times);
        }
    }

    public function getElementWords()
    {
        $elementWords = array_keys($this->elements);
        // Longest words first, so longer words are replaced before shorter ones
        usort($elementWords, function ($a, $b) {
            return UTF8::length($b) - UTF8::length($a);
        });
        return $elementWords;
    }

    /**
     * @desc Compute the weight of a word within the keyword (byte-length based)
     * @param string $word
     * @return float
     */
    public function calculateWeight($word)
    {
        $element = $this->elements[$word];
        return round(strlen($element->word) * $element->times / strlen($this->keyword), 3);
    }
}

class KeywordElement
{
    public $word;
    public $times;

    public function __construct($word, $times)
    {
        $this->word = $word;
        $this->times = $times;
    }
}
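To make the weight formula concrete, here is a small worked example. It restates the calculateWeight expression as a free function (`weightOf` is a hypothetical name) and assumes the source file is saved as UTF-8, so each Chinese character counts as 3 bytes under strlen.

```php
<?php
// Same expression as calculateWeight(), written as a free function for the demo.
function weightOf($word, $times, $keyword)
{
    return round(strlen($word) * $times / strlen($keyword), 3);
}

$keyword = '连衣裙xxl裙连衣裙';   // 24 bytes in UTF-8: 9 + 3 + 3 + 9

echo weightOf('连衣裙', 2, $keyword), "\n"; // 18 / 24
echo weightOf('xxl', 1, $keyword), "\n";    //  3 / 24
echo weightOf('裙', 1, $keyword), "\n";     //  3 / 24
```

Because strlen counts bytes, a single Chinese character weighs as much as three ASCII letters here, and the three weights add up to 1 when the elements cover the whole keyword.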

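Next came the segmentation itself. He began by extracting a common base class, Splitter. What does it need?

1. An abstract split method

2. A way to get the parts of the keyword that still need splitting

3. A check for whether a part needs splitting at all

Following that list, Xiao Shuaishuai wrote this code: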

abstract class Splitter {

    /**
     * @var KeywordEntity $keywordEntity
     */
    public $keywordEntity;

    public function __construct($keywordEntity){
        $this->keywordEntity = $keywordEntity;
    }

    public abstract function split();


    /**
     * Get the parts of the keyword that have not been split yet,
     * filtering out single words
     *
     * @return array
     */
    public function getRemainKeywords()
    {
        $elementWords = $this->keywordEntity->getElementWords();

        $remainKeyword = str_replace($elementWords, "::", $this->keywordEntity->keyword);
        $remainElements = explode("::", $remainKeyword);
        $ret = array();
        foreach ($remainElements as $element) {
            if ($this->isSplit($element)) {
                $ret[] = $element;
            }
        }
        return $ret;
    }

    /**
     * Whether the element needs to be split
     *
     * @param $element
     * @return bool
     */
    public function isSplit($element)
    {
        if (UTF8::isPhrase($element)) {
            return true;
        }

        return false;
    }
}
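The Splitter code calls UTF8::length and UTF8::isPhrase, which are not defined in this installment. A hypothetical stand-in is sketched below so the classes can be exercised; it assumes the mbstring extension is available, and it assumes a "phrase" simply means more than one character, which may differ from the helper used in the real project.

```php
<?php
// Hypothetical stand-in for the UTF8 helper; the real class is not shown
// in this article. Requires the mbstring extension.
class UTF8
{
    // Number of characters (not bytes) in a UTF-8 string.
    public static function length($str)
    {
        return mb_strlen($str, 'UTF-8');
    }

    // Assumption: a "phrase" is anything longer than one character, so
    // empty strings and single characters are never sent for splitting.
    public static function isPhrase($str)
    {
        return self::length($str) > 1;
    }
}

var_dump(UTF8::length('连衣裙xl'));   // counts characters, not bytes
var_dump(UTF8::isPhrase('裙'));
var_dump(UTF8::isPhrase('宽衣裙'));
```

Under this assumption, the empty strings produced by explode in getRemainKeywords are filtered out as well, since their length is 0.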

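Then Xiao Shuaishuai went on to implement the business splitting algorithm and the SCWS splitting algorithm. He grinned to himself: this much he could handle.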

class TermSplitter extends Splitter {

    private $dictionary = array();

    public function setDictionary($dictionary = array())
    {
        // Longest phrases first, so longer phrases are matched before shorter ones
        usort($dictionary, function ($a, $b) {
            return UTF8::length($b) - UTF8::length($a);
        });

        $this->dictionary = $dictionary;
    }

    public function getDictionary()
    {
        return $this->dictionary;
    }

    /**
     * Split the keyword into phrases or words and record them
     * on the keyword entity
     *
     * @return void
     */
    public function split()
    {
        foreach ($this->dictionary as $phrase) {
            // Recompute the remainder each time, because earlier matches shrink it
            $remainKeyword = implode("::", $this->getRemainKeywords());
            // preg_quote() keeps regex metacharacters in a phrase from being interpreted
            $pattern = "/" . preg_quote($phrase, "/") . "/";
            $matchTimes = preg_match_all($pattern, $remainKeyword, $matches);
            if ($matchTimes > 0) {
                $this->keywordEntity->addElement($phrase, $matchTimes);
            }
        }
    }
}


class ScwsSplitter extends Splitter
{
    public function split()
    {
        if (!extension_loaded("scws")) {
            throw new Exception("scws extension load fail");
        }

        // Only the parts that business segmentation left behind go to SCWS
        $remainElements = $this->getRemainKeywords();
        foreach ($remainElements as $element) {

            $so = scws_new();
            $so->set_charset('utf8');
            $so->send_text($element);
            while ($res = $so->get_result()) {
                foreach ($res as $tmp) {
                    if ($this->isValidate($tmp)) {
                        $this->keywordEntity->addElement($tmp['word']);
                    }
                }
            }
            $so->close();
        }
    }

    /**
     * @param array $scws_words
     * @return bool
     */
    public function isValidate($scws_words)
    {
        if ($scws_words['len'] == 1 && ($scws_words['word'] == "\r" || $scws_words['word'] == "\n")) {
            return false;
        }
        return true;
    }

}
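TermSplitter turns each dictionary phrase into a regular expression, so any metacharacter inside a phrase changes what gets counted. The sketch below shows literal counting with preg_quote; `countPhrase` is a hypothetical helper name, and the `3.5mm` phrase is a made-up dictionary entry.

```php
<?php
// Count non-overlapping occurrences of a literal phrase in a UTF-8 string.
function countPhrase($phrase, $text)
{
    // preg_quote() escapes regex metacharacters; '/' is the delimiter used here.
    $pattern = '/' . preg_quote($phrase, '/') . '/u';
    return (int) preg_match_all($pattern, $text, $matches);
}

echo countPhrase('连衣裙', '连衣裙xxl裙连衣裙'), "\n";
echo countPhrase('3.5mm', '3.5mm 305mm'), "\n";

// Unescaped, the dot matches any character, so "305mm" is counted as well.
echo preg_match_all('/3.5mm/', '3.5mm 305mm', $m), "\n";
```

For purely literal phrases, substr_count would give the same non-overlapping counts without involving the regex engine.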

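Xiao Shuaishuai finally had all of the code done. In his excitement he even drew a UML diagram of the classes to share with everyone.

He really was growing fast. When Boss Yu saw the result, he praised it three times in a row.

For testing, Xiao Shuaishuai wrote the following test code: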

class SplitterAppTest {

    public static function split($keyword){

        $keywordEntity = new KeywordEntity($keyword);

        # Business segmentation, using a test dictionary instead of the database
        $termSplitter = new TermSplitter($keywordEntity);
        $seg = new TestSegmentation();
        $termSplitter->setDictionary($seg->transferDictionary());
        $termSplitter->split();

        # SCWS segmentation
        $scwsSplitter = new ScwsSplitter($keywordEntity);
        $scwsSplitter->split();

        # Handle any words or phrases that are still left over
        $elementWords = $keywordEntity->getElementWords();
        $remainKeyword = str_replace($elementWords, "::", $keywordEntity->keyword);
        $remainElements = explode("::", $remainKeyword);
        foreach ($remainElements as $element) {
            // Compare against '' so that a leftover "0" is not discarded, as empty() would do
            if ($element !== '') {
                $keywordEntity->addElement($element);
            }
        }
        return $keywordEntity;
    }
}


// Dump the result so the split can be inspected
var_dump(SplitterAppTest::split("连衣裙xl裙宽衣裙"));
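The test relies on a TestSegmentation class that is not shown. One plausible shape is a test double that returns a fixed dictionary instead of reading phrases from the database; the version below is a hypothetical sketch, and its dictionary entries are invented for illustration.

```php
<?php
// Hypothetical test double: the article does not show TestSegmentation.
// It returns a fixed list of phrases instead of loading them from a database,
// so the splitters can be tested without any external data.
class TestSegmentation
{
    public function transferDictionary()
    {
        // A flat array of phrases, the shape setDictionary() sorts and iterates
        return array('连衣裙', '宽衣裙', 'xl');
    }
}

$seg = new TestSegmentation();
print_r($seg->transferDictionary());
```

Keeping the dictionary in code makes the test repeatable, since the result no longer depends on whatever the operations staff have entered into the database.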

Xiao Shuaishuai daydreamed that one day he would leave them all far behind.

 

 

Source: php.cn