目錄
手把手教你做关键词匹配项目(搜索引擎)---- 第二十天,教你做第二十天
首頁 php教程 php手册 手把手教你做关键词匹配项目(搜索引擎)---- 第二十天,教你做第二十天

手把手教你做关键词匹配项目(搜索引擎)---- 第二十天,教你做第二十天

Jun 13, 2016 am 09:25 AM
- 關鍵字 匹配 手把手 搜尋引擎 教你 專案

手把手教你做关键词匹配项目(搜索引擎)---- 第二十天,教你做第二十天

客串:屌丝的坑人表单神器、数据库那点事儿

面向对象升华:面向对象的认识----新生的初识、面向对象的番外----思想的梦游篇(1)、面向对象的认识---如何找出类

负载均衡:负载均衡----概念认识篇、负载均衡----实现配置篇(Nginx)

吐槽:有人反馈了这样的一个信息,说该文章越到最后越难看懂,跟不上节奏,也有的人说小帅帅的能力怎么飙的那么快,是不是我比较蠢。也有的直接看文字,不看代码,代码太难懂了。

其实我这几天也一直在思考这个问题,所以没办法就去开展了一些面向对象的课程,希望对那些跟不上的有些帮助。其实说真的,读者不反馈的话,我只好按照我认为的小帅帅去开展课程了。

 

第二十天

起点:手把手教你做关键词匹配项目(搜索引擎)---- 第一天

回顾:手把手教你做关键词匹配项目(搜索引擎)---- 第十九天

话说小帅帅为了解决那个分词算法写出了初版,他拿给于老大看的时候,被要求重写了。

原因有以下几点:

    1. 如何测试,测试数据呢?

    2. Splitter是不是做了太多事情?

    3. 连衣裙xxl裙连衣裙这种 有重复词组怎么办?

小帅帅拿着这些问题,开始重构。

首先他发现了这点,中文、英文和中英文的判断,以及长度的计算,他把这个写成了类:

<?<span>php

</span><span>class</span><span> UTF8 {

    </span><span>/*</span><span>*
     * 检测是否utf8
     * @param $char
     * @return bool
     </span><span>*/</span>
    <span>public</span> <span>static</span> <span>function</span> is(<span>$char</span><span>){
        </span><span>return</span> (<span>preg_match</span>("/^([".<span>chr</span>(228)."-".<span>chr</span>(233)."]{1}[".<span>chr</span>(128)."-".<span>chr</span>(191)."]{1}[".<span>chr</span>(128)."-".<span>chr</span>(191)."]{1}){1}/",<span>$char</span>) ||
            <span>preg_match</span>("/([".<span>chr</span>(228)."-".<span>chr</span>(233)."]{1}[".<span>chr</span>(128)."-".<span>chr</span>(191)."]{1}[".<span>chr</span>(128)."-".<span>chr</span>(191)."]{1}){1}$/",<span>$char</span>) ||
            <span>preg_match</span>("/([".<span>chr</span>(228)."-".<span>chr</span>(233)."]{1}[".<span>chr</span>(128)."-".<span>chr</span>(191)."]{1}[".<span>chr</span>(128)."-".<span>chr</span>(191)."]{1}){2,}/",<span>$char</span><span>));
    }

    </span><span>/*</span><span>*
     * 计算utf8字的个数
     * @param $char
     * @return float|int
     </span><span>*/</span>
    <span>public</span> <span>static</span> <span>function</span> length(<span>$char</span><span>) {

        </span><span>if</span>(self::is(<span>$char</span><span>))
            </span><span>return</span> <span>ceil</span>(<span>strlen</span>(<span>$char</span>)/3<span>);
        </span><span>return</span> <span>strlen</span>(<span>$char</span><span>);
    }

    </span><span>/*</span><span>*
     * 检测是否为词组
     * @param $word
     * @return bool
     </span><span>*/</span>
    <span>public</span> <span>static</span> <span>function</span> isPhrase(<span>$word</span><span>){

        </span><span>if</span>(self::length(<span>$word</span>)<=1<span>)
            </span><span>return</span> <span>false</span><span>;
        </span><span>return</span> <span>true</span><span>;
    }

}</span>
登入後複製

小帅帅又考虑到词典的来源有可能来自多个地方,比如我给的测试数据,这样不就是可以解决于老大说到无法测试的问题了,小帅帅把词典的来源抽成了个类,类如下:

<?<span>php

</span><span>class</span><span> DBSegmentation {

    </span><span>public</span> <span>$cid</span><span>;

    </span><span>/*</span><span>*
     * 获取类目下分词的词组数据
     * @return array
     </span><span>*/</span>
    <span>public</span> <span>function</span><span> transferDictionary(){
        </span><span>$ret</span> = <span>array</span><span>();
        </span><span>$sql</span> = "select word from category_linklist where cid='<span>$this</span>->cid'"<span>;
        </span><span>$words</span> = DB::makeArray(<span>$sql</span><span>);
        </span><span>foreach</span>(<span>$words</span> <span>as</span> <span>$strWords</span><span>){
            </span><span>$words</span> = <span>explode</span>(",",<span>$strWords</span><span>);

            </span><span>foreach</span>(<span>$words</span> <span>as</span> <span>$word</span><span>){
                </span><span>if</span>(UTF8::isPhrase(<span>$word</span><span>)){
                    </span><span>$ret</span>[] = <span>$word</span><span>;
                }
            }
        }
        </span><span>return</span> <span>$ret</span><span>;
    }
} 

</span><span>class</span><span> TestSegmentation {
    
    </span><span>public</span> <span>function</span><span> transferDictionary(){
        </span><span>$words</span> = <span>array</span><span>(
            </span>"连衣裙,连衣",
            "XXL,xxl,加大,加大码",
            "X码,中码",
            "外套,衣,衣服,外衣,上衣",
            "女款,女士,女生,女性"<span>
        );

        </span><span>$ret</span> = <span>array</span><span>();
        </span><span>foreach</span>(<span>$words</span> <span>as</span> <span>$strWords</span><span>){
            </span><span>$words</span> = <span>explode</span>(",",<span>$strWords</span><span>);

            </span><span>foreach</span>(<span>$words</span> <span>as</span> <span>$word</span><span>){
                </span><span>if</span>(UTF8::isPhrase(<span>$word</span><span>)){
                    </span><span>$ret</span>[] = <span>$word</span><span>;
                }
            }
        }
        </span><span>return</span> <span>$ret</span><span>;

    }
}</span>
登入後複製

那么Splitter 就专心分词把,代码如下:

<span>class</span><span> Splitter {

    </span><span>public</span> <span>$keyword</span><span>;
    </span><span>private</span> <span>$dictionary</span> = <span>array</span><span>();

    </span><span>public</span> <span>function</span> setDictionary(<span>$dictionary</span> = <span>array</span><span>()){

        </span><span>usort</span>(<span>$dictionary</span>,<span>function</span>(<span>$a</span>,<span>$b</span><span>){
            </span><span>return</span> (UTF8::length(<span>$a</span>)>UTF8::length(<span>$b</span>))?1:-1<span>;
        });

        </span><span>$this</span>->dictionary = <span>$dictionary</span><span>;
    }

    </span><span>public</span> <span>function</span><span> getDictionary(){
        </span><span>return</span> <span>$this</span>-><span>dictionary;
    }

    </span><span>/*</span><span>*
     * 把关键词拆分成词组或者单词
     * @return KeywordEntity $keywordEntity
     </span><span>*/</span>
    <span>public</span> <span>function</span> <span>split</span><span>(){

        </span><span>$remainKeyword</span> = <span>$this</span>-><span>keyword;

        </span><span>$keywordEntity</span> = <span>new</span> KeywordEntity(<span>$this</span>-><span>keyword);

        </span><span>foreach</span>(<span>$this</span>->dictionary <span>as</span> <span>$phrase</span><span>){

            </span><span>$matchTimes</span> = <span>preg_match_all</span>("/<span>$phrase</span>/",<span>$remainKeyword</span>,<span>$matches</span><span>);
            </span><span>if</span>(<span>$matchTimes</span>>0<span>){
                </span><span>$keywordEntity</span>->addElement(<span>$phrase</span>,<span>$matchTimes</span><span>);

                </span><span>$remainKeyword</span> = <span>str_replace</span>(<span>$phrase</span>,"::",<span>$remainKeyword</span><span>);
            }
        }

        </span><span>$remainKeywords</span> = <span>explode</span>("::",<span>$remainKeyword</span><span>);
        </span><span>foreach</span>(<span>$remainKeywords</span> <span>as</span> <span>$splitWord</span><span>){

            </span><span>if</span>(!<span>empty</span>(<span>$splitWord</span><span>)){
                </span><span>$keywordEntity</span>->addElement(<span>$splitWord</span><span>);
            }
        }

        </span><span>return</span> <span>$keywordEntity</span><span>;

    }

}


</span><span>class</span><span> KeywordEntity {

    </span><span>public</span> <span>$keyword</span><span>;
    </span><span>public</span> <span>$elements</span> = <span>array</span><span>();

    </span><span>public</span> <span>function</span> __construct(<span>$keyword</span><span>){
        </span><span>$this</span>->keyword = <span>$keyword</span><span>;
    }

    </span><span>public</span> <span>function</span> addElement(<span>$word</span>,<span>$times</span>=1<span>){

        </span><span>if</span>(<span>isset</span>(<span>$this</span>->elements[<span>$word</span><span>])){
            </span><span>$this</span>->elements[<span>$word</span>]->times += <span>$times</span><span>;
        }</span><span>else</span>
            <span>$this</span>->elements[] = <span>new</span> KeywordElement(<span>$word</span>,<span>$times</span><span>);
    }

    </span><span>/*</span><span>*
     * @desc 计算UTF8字符串权重
     * @param string $word
     * @return float
     </span><span>*/</span>
    <span>public</span> <span>function</span> calculateWeight(<span>$word</span><span>)
    {
        </span><span>$element</span> = <span>$this</span>->elements[<span>$word</span><span>];
        </span><span>return</span> <span>ROUND</span>(<span>strlen</span>(<span>$element</span>->word)*<span>$element</span>->times / <span>strlen</span>(<span>$this</span>->keyword), 3<span>);
    }
}


</span><span>class</span><span> KeywordElement {
    </span><span>public</span> <span>$word</span><span>;
    </span><span>public</span> <span>$times</span><span>;

    </span><span>public</span> <span>function</span> __construct(<span>$word</span>,<span>$times</span><span>){
        </span><span>$this</span>->word = <span>$word</span><span>;
        </span><span>$this</span>->times = <span>$times</span><span>;
    }
}</span>
登入後複製

他把算权重的也丢给了一个类专门去处理。

小帅帅写完之后,也顺手写了测试实例:

<?<span>php

</span><span>$segmentation</span> = <span>new</span><span> TestSegmentation();

</span><span>$splitter</span> = <span>new</span><span> Splitter();
</span><span>$splitter</span>->setDictionary(<span>$segmentation</span>-><span>transferDictionary());
</span><span>$splitter</span>->keyword = "连衣裙xxl裙连衣裙"<span>;
</span><span>$keywordEntity</span> = <span>$splitter</span>-><span>split</span><span>();

</span><span>var_dump</span>(<span>$keywordEntity</span>);
登入後複製

 

这样就算你的算法怎么改,它也能从容面对了。

 

小帅帅理解了这个,当你觉得类做的事情太多的时候,可以考虑下单一职责原则。

 

单一职责原则:一个类,只有一个引起它变化的原因。应该只有一个职责。每一个职责都是变化的一个轴线,如果一个类有一个以上的职责,这些职责就耦合在了一起。这会导致脆弱的设计。当一个职责发生变化时,可能会影响其它的职责。另外,多个职责耦合在一起,会影响复用性。例如:要实现逻辑和界面的分离。【来自百度百科】

 

当于老大提到是不是有其他分词算法的时候,我们能不能拿来用,小帅帅很高兴,因为现在它的代码是多么美好。

小帅帅如何玩转第三方分词扩展,请继续关注下回分解:手把手教你做关键词匹配项目(搜索引擎)---- 第二十一天

 

本網站聲明
本文內容由網友自願投稿,版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容,請聯絡admin@php.cn

熱AI工具

Undresser.AI Undress

Undresser.AI Undress

人工智慧驅動的應用程序,用於創建逼真的裸體照片

AI Clothes Remover

AI Clothes Remover

用於從照片中去除衣服的線上人工智慧工具。

Undress AI Tool

Undress AI Tool

免費脫衣圖片

Clothoff.io

Clothoff.io

AI脫衣器

AI Hentai Generator

AI Hentai Generator

免費產生 AI 無盡。

熱門文章

R.E.P.O.能量晶體解釋及其做什麼(黃色晶體)
2 週前 By 尊渡假赌尊渡假赌尊渡假赌
倉庫:如何復興隊友
4 週前 By 尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island冒險:如何獲得巨型種子
3 週前 By 尊渡假赌尊渡假赌尊渡假赌

熱工具

記事本++7.3.1

記事本++7.3.1

好用且免費的程式碼編輯器

SublimeText3漢化版

SublimeText3漢化版

中文版,非常好用

禪工作室 13.0.1

禪工作室 13.0.1

強大的PHP整合開發環境

Dreamweaver CS6

Dreamweaver CS6

視覺化網頁開發工具

SublimeText3 Mac版

SublimeText3 Mac版

神級程式碼編輯軟體(SublimeText3)

小米 14 Ultra怎麼調整光圈? 小米 14 Ultra怎麼調整光圈? Mar 19, 2024 am 09:01 AM

小米 14 Ultra怎麼調整光圈?

AI攻克費馬大定理?數學家放棄5年職業生涯,將100頁證明變代碼 AI攻克費馬大定理?數學家放棄5年職業生涯,將100頁證明變代碼 Apr 09, 2024 pm 03:20 PM

AI攻克費馬大定理?數學家放棄5年職業生涯,將100頁證明變代碼

Cheat Engine怎麼設定中文?ce修改器設定中文的方法 Cheat Engine怎麼設定中文?ce修改器設定中文的方法 Mar 18, 2024 pm 01:20 PM

Cheat Engine怎麼設定中文?ce修改器設定中文的方法

DaVinci Resolve Studio 已支援AMD顯示卡的AV1硬體編碼 DaVinci Resolve Studio 已支援AMD顯示卡的AV1硬體編碼 Mar 06, 2024 pm 10:04 PM

DaVinci Resolve Studio 已支援AMD顯示卡的AV1硬體編碼

教你使用 iOS 17.4「失竊裝置保護」新進階功能 教你使用 iOS 17.4「失竊裝置保護」新進階功能 Mar 10, 2024 pm 04:34 PM

教你使用 iOS 17.4「失竊裝置保護」新進階功能

榮耀 90 GT怎麼更新榮耀MagicOS 8.0? 榮耀 90 GT怎麼更新榮耀MagicOS 8.0? Mar 18, 2024 pm 06:46 PM

榮耀 90 GT怎麼更新榮耀MagicOS 8.0?

Planet Mojo:從自走棋遊戲Mojo Melee建起Web3遊戲元宇宙 Planet Mojo:從自走棋遊戲Mojo Melee建起Web3遊戲元宇宙 Mar 14, 2024 pm 05:55 PM

Planet Mojo:從自走棋遊戲Mojo Melee建起Web3遊戲元宇宙

用Golang函數簡化檔案上傳處理 用Golang函數簡化檔案上傳處理 May 02, 2024 pm 06:45 PM

用Golang函數簡化檔案上傳處理

See all articles