Optimize the matching of huge amounts of keywords-PHP Tutorial-php.cn

There are 600,000 short message logs, each of which 50 words, 50,000 keywords, 2-8 words in length, most of them in Chinese. It is required to extract all the keywords contained in these 600,000 records and count the number of hits for each keyword.

This article gives a complete introduction to my implementation method to see how I can optimize a task that takes ten hours to run to within ten minutes. Although the implementation language is PHP, more ideas introduced in this article should be able to give you some help.

Original – grep

Design

When I first received the task, my little mind immediately changed, Log + Keywords + Statistics, I did not think of writing the code myself, but first thought of the commonly used log statistics command under Linux grep.

grep I won’t mention the usage of the command anymore. Use grep 'keyword' | wc -l which can be easily It is convenient to count the number of information items hit by keywords, and PHP's exec() function allows us to directly call Linux shell commands, although there will be security risks when executing dangerous commands. .

Code

Pseudo code:

foreach ($word_list as $keyword) {
    $count = intval(exec("grep &#39;{$keyword}&#39; file.log | wc -l"));
    record($keyword, $count);
}

Copy after login

It was run on an old machine. The efficiency of the old machine was really poor, and it took 6 hours to run. It is estimated that the latest machine will take 2-3 hours. All subsequent optimizations will use new machines, and the requirements have changed. The main text has just begun.

Original, original in ideas and methods.

Evolution – Regular

Design

After handing in the job, the product came up with new ideas the next day, saying that it wanted to connect a certain data source in the future, the news Delivered as a data stream, not a file. It also requires the real-time nature of message statistics. My idea of writing the data to a file and then counting it was overturned. For the scalability of the solution, the current statistical object is no longer a whole, but needs to be considered. A single message was matched.

At this time, I was a little confused and had to resort to the most traditional tool - regularity. The implementation of regular expressions is not difficult, and each language has encapsulated regular matching functions. The focus is on the construction of patterns.

Of course, the pattern construction here is not difficult, /keywordOptimize the matching of huge amounts of keywords|keword2|.../, use |Just connect the keywords.

Regular pitfalls

Here are two pitfalls encountered in use:

The length of the regular pattern is too long, causing matching failure: PHP's The regex has a backtracking limit to prevent consuming all the available stack of the process, eventually causing PHP to crash. Patterns that are too long will cause PHP to detect too many tracebacks and interrupt matching. After testing, the maximum pattern length is about 32,000 bytes in the default setting. The pcre.backtrack_limit parameter in php.ini is the maximum number of backtracking times. The default value is Optimize the matching of huge amounts of keywords000000. Modify it or php.ini or in Use ini_set('pcre.backtrack_limit', n); at the beginning of the script. Setting it to a larger number can increase the maximum pattern length for a single match. Of course, you can also count keywords in batches (I used this =_=).
The pattern contains special characters, causing a large number of warnings: During the matching process, it was found that PHP reported a large number of warnings: unknown modifier Garbled characters, careful inspection found that there are / characters in the keywords, you can use the preg_quote() function to filter the keywords.

Code

Pseudo code:

$end = 0;
$step = Optimize the matching of huge amounts of keywords500;
$pattern = array();
// 先将pattern 拆成多个小块
while ($end < count($word_list)) {
    $tmp_arr = array_slice($word_list, $end, $step);
    $end += $step;
    $item = implode(&#39;|&#39;, $tmp_arr);
    $pattern[] = preg_quote($item);
}
 
$content = file_get_contents($log_file);
$lines = explode("\n", $content);
foreach ($lines as $line) {
    // 使用各小块pattern分别匹配
    for ($i = 0; $i < count($pattern); $i++) {
        preg_match_all("/{$pattern[$i]}/", $line, $match);
    }
    $match = array_unique(array_filter($match));
    dealResult($match);
}

Copy after login

In order to complete the task, the process ran all night. When I found out the next day that I had run for nearly ten hours, I was heartbroken. . . It was too slow and completely failed to meet the usage requirements. At this time, I had already begun to consider changing the method.

When the product changed its keyword strategy, replaced some keywords, asked to run it again, and said that it would continue to optimize keywords, I completely rejected the existing plan. You must not use keywords to match information. Using all the keywords to match one by one is really unbearable.

Evolution, Evolution of Needs and Realization

Awakening – Splitting Words

Design

I finally began to realize that to get information Go to keywords to compare. If I use keywords as keys to create a hash table, use the words in the information to search in the hash table, and if found, it will be considered a match. Wouldn't this achieve O(Optimize the matching of huge amounts of keywords) efficiency?

But a short message, how do I split it into just the right words to match? Word segmentation? Word segmentation also takes time, and my keywords are all meaningless words. Building a vocabulary and using word segmentation tools are big problems. Finally, I thought of splitting words .

为什么叫拆词呢，我考虑以蛮力将一句话拆分为所有可能的词。如（我是好人）就可以拆成（我是、是好、好人、我是好、是好人、我是好人）等词，我的关键词长度为 2-8，所以可拆词个数会随着句子长度迅速增加。不过，可以用标点符号、空格、语气词（如的、是等）作为分隔将句子拆成小短语再进行拆词，会大大减少拆出的词量。

其实分词并没有完整实现就被后一个方法替代了，只是一个极具实现可能的构想，写这篇文章时用伪代码实现了一下，供大家参考，即使不用在匹配关键词，用在其他地方也是有可能的。

代码

$str_list = getStrList($msg);
foreach ($str_list as $str) {
    $keywords = getKeywords($str);
    foreach ($keywords as $keyword) {
        // 直接通过PHP数组的哈希实现来进行快速查找
        if (isset($word_list[$keyword])) {
            record($keyword);
        }
    }
}
/**
* 从消息中拆出短句子
*/
function getStrList($msg) {
    $str_list = array();
    $seperators = array(&#39;，&#39;, &#39;。&#39;, &#39;的&#39;, ...);
 
    $words = preg_split(&#39;/(?<!^)(?!$)/u&#39;, $msg);
    $str = array();
    foreach ($words as $word) {
        if (in_array($word, $seperators)) {
            $str_list[] = $str;
            $str = array();
        } else {
            $str[] = $word;
        }
    }
 
    return array_filter($str_list);
}
 
/**
* 从短句中取出各个词
*/
function getKeywords($str) {
    if (count($str) < 2) {
        return array();
    }
 
    $keywords = array();
    for ($i = 0; $i < count($str); $i++) {
        for ($j = 2; $j < 9; $j++) {
            $keywords[] = array_slice($str, $i, $j); // todo 限制一下不要超过数组最大长度
        }
    }
 
    return $keywords;
}

Copy after login

结果

我们知道一个 utf-8 的中文字符要占用三个字节，为了拆分出包含中英文的每一个字符，使用简单的 split() 函数是做不到的。

这里使用了 preg_split('/(?<!^)(?!$)/u', $msg) 是通过正则匹配到两个字符之间的''来将两个字符拆散，而两个括号里的 (?<!^)(?!$) 是分别用来限定捕获组不是第一个，也不是最后一个（不使用这两个捕获组限定符也是可以的，直接使用//作为模式会导致拆分结果在前后各多出一个空字符串项）。捕获组的概念和用法可见我之前的博客 PHP正则中的捕获组与非捕获组

由于没有真正实现，也不知道效率如何。估算每个短句长度约为 Optimize the matching of huge amounts of keywords0 字左右时，每条短消息约50字左右，会拆出 200 个词。虽然它会拆出很多无意义的词，但我相信效率绝不会低，由于其 hash 的高效率，甚至我觉得会可能比终极方法效率要高。

最终没有使用此方案是因为它对句子要求较高，拆词时的分隔符也不好确定，最重要的是它不够优雅。。。这个方法我不太想去实现，统计标识和语气词等活显得略为笨重，而且感觉拆出很多无意义的词感觉效率浪费得厉害。

觉醒，意识和思路的觉醒

终级 – Trie树

trie树

于是我又来找谷哥帮忙了，搜索大量数据匹配，有人提出了使用 trie 树的方式，没想到刚学习的 trie 树的就派上了用场。我上上篇文章刚介绍了 trie 树，在空间索引 – 四叉树里字典树这一小节，大家可以查看一下。

当然也为懒人复制了一遍我当时的解释(看过的可以跳过这一小节了)。

字典树，又称前缀树或 trie 树，是一种有序树，用于保存关联数组，其中的键通常是字符串。与二叉查找树不同，键不是直接保存在节点中，而是由节点在树中的位置决定。一个节点的所有子孙都有相同的前缀，也就是这个节点对应的字符串，而根节点对应空字符串。

我们可以类比字典的特性：我们在字典里通过拼音查找晃(huang)这个字的时候，我们会发现它的附近都是读音为huang的，可能是声调有区别，再往前翻，我们会看到读音前缀为huan的字，再往前，是读音前缀为hua的字… 取它们的读音前缀分别为 h qu hua huan huang。我们在查找时，根据 abc...xyz 的顺序找到h前缀的部分，再根据 ha he hu找到 hu 前缀的部分…最后找到 huang，我们会发现，越往后其读音前缀越长，查找也越精确，这种类似于字典的树结构就是字典树，也是前缀树。

设计

那么 trie 树怎么实现关键字的匹配呢? 这里以一幅图来讲解 trie 树匹配的过程。

Optimize the matching of huge amounts of keywords

其中要点：

构造trie树

将关键词用上面介绍的preg_split()函数拆分为单个字符。如科学家就拆分为科、学、家三个字符。
在最后一个字符后添加一个特殊字符`，此字符作为一个关键词的结尾（图中的粉红三角），以此字符来标识查到了一个关键词（不然，我们不知道匹配到科、学两个字符时算不算匹配成功）。
检查根部是否有第一个字符(科)节点，如果有了此节点，到步骤4。如果还没有，在根部添加值为科的节点。
依次检查并添加学、家 两个节点。
在结尾添加 ` 节点，并继续下一个关键词的插入。

匹配

然后我们以 这位科学家很了不起！为例来发起匹配。

首先我们将句子拆分为单个字符 这、位、...；
从根查询第一个字符这，并没有以这个字符开头的关键词，将字符“指针”向后移，直到找到根下有的字符节点科;
接着在节点科下寻找值为 学节点，找到时，结果子树的深度已经到了2，关键词的最短长度是2，此时需要在学结点下查找是否有`，找到意味着匹配成功，返回关键词，并将字符“指针”后移，如果找不到则继续在此结点下寻找下一个字符。
如此遍历，直到最后，返回所有匹配结果。

代码

完整代码我已经放到了GitHub上：Trie-GitHub-zhenbianshu，这里放上核心。

首先是数据结构树结点的设计，当然它也是重中之重：

$node = array(
    &#39;depth&#39; => $depth, // 深度，用以判断已命中的字数
    &#39;next&#39; => array(
        $val => $node, // 这里借用php数组的哈希底层实现，加速子结点的查找
        ...
    ),
);

Copy after login

然后是树构建时子结点的插入：

// 这里要往节点内插入子节点，所以将它以引用方式传入
private function insert(&$node, $words) {
         if (empty($words)) {
            return;
        }
        $word = array_shift($words);
        // 如果子结点已存在，向子结点内继续插入
        if (isset($node[&#39;next&#39;][$word])) {
            $this->insert($node[&#39;next&#39;][$word], $words);
        } else {
            // 子结点不存在时，构造子结点插入结果
            $tmp_node = array(
                &#39;depth&#39; => $node[&#39;depth&#39;] + Optimize the matching of huge amounts of keywords,
                &#39;next&#39; => array(),
            );
            $node[&#39;next&#39;][$word] = $tmp_node;
            $this->insert($node[&#39;next&#39;][$word], $words);
        }
    }

Copy after login

最后是查询时的操作：

// 这里也可以使用一个全局变量来存储已匹配到的字符，以替换$matched
private function query($node, $words, &$matched) {
        $word = array_shift($words);
        if (isset($node[&#39;next&#39;][$word])) {
            // 如果存在对应子结点，将它放到结果集里
            array_push($matched, $word);
            // 深度到达最短关键词时，即可判断是否到词尾了
            if ($node[&#39;next&#39;] > Optimize the matching of huge amounts of keywords && isset($node[&#39;next&#39;][$word][&#39;next&#39;][&#39;`&#39;])) {
                return true;
            }
            return $this->query($node[&#39;next&#39;][$word], $words, $matched);
        } else {
            $matched = array();
            return false;
        }
    }

Copy after login

结果

结果当然是喜人的，如此匹配，处理一千条数据只需要3秒左右。找了 Java 的同事试了下，Java 处理一千条数据只需要Optimize the matching of huge amounts of keywords秒。

这里来分析一下为什么这种方法这么快：

正则匹配：要用所有的关键词去信息里匹配匹配次数是 key_len * msg_len，当然正则会进行优化，但基础这样，再优化效率可想而知。
而 trie 树效率最差的时候是 msg_len * 9(最长关键词长度 + Optimize the matching of huge amounts of keywords个特殊字符)次 hash 查找，即最长关键词类似 AAA，信息内容为 AAA...时，而这种情况的概率可想而知。

至此方法的优化到此结束，从每秒钟匹配 Optimize the matching of huge amounts of keywords0 个，到 300 个，30 倍的性能提升还是巨大的。

终级，却不一定是终极

他径 – 多进程

设计

匹配方法的优化结束了，开头说的优化到十分钟以内的目标还没有实现，这时候就要考虑一些其他方法了。

我们一提到高效，必然想到的是 并发，那么接下来的优化就要从并发说起。PHP 是单线程的（虽然也有不好用的多线程扩展），这没啥好的解决办法，并发方向只好从多进程进行了。

那么一个日志文件，用多个进程怎么读呢？这里当然也提供几个方案：

Add a log line counter in the process. Each process supports passing in parameter n. The process only processes the log with line number % n = n. I have become very proficient in using the reverse distribution of this hack, haha. This method requires the process to pass parameters, and each process needs to allocate memory to read the entire log, and it is not elegant enough.
Use the split -l n file.log output_pre command of Linux to split the file into files of n lines each, and then use Multiple processes to read multiple files. The disadvantage of this method is that it is inflexible. If you want to change the number of processes, you need to re-split the file.
Use Redis's list queue to temporarily store logs and enable multiple process consumption queues. This method requires additional writing of data into Redis, which is an extra step, but it is flexible in expansion and the code is simple and elegant.

In the end, the third method was used.

Result

Although this method will have bottlenecks, it should eventually fall on Redis's network IO. I didn't bother to open n processes to challenge the performance of the company's Redis. I completed the statistics after running Optimize the matching of huge amounts of keywords0 processes in three or four minutes. Even if you add in the time it takes to write to Redis, it should be within Optimize the matching of huge amounts of keywords0 minutes.

At the beginning, the product had already positioned the matching speed at an hourly level. When I took out the new log matching results in Optimize the matching of huge amounts of keywords0 minutes, I felt a little happy when I saw the surprised expression of the product, haha~

Other paths can also help you go further

Summary

There are many ways to solve problems. I think that in solving various Before asking questions, you need to understand many kinds of knowledge, even if you only know its function. Just like a tool rack, you have to put as many tools as possible before you can choose the most suitable one when you encounter a problem. Then of course you need to become proficient in using these tools, so that you can use them to solve some weird problems.

If you want to do your job well, you must first sharpen your tools. If you want to solve performance problems, mastering system-level methods is not enough. Sometimes, changing a data structure or algorithm may have better results. I feel that I am still a little weak in this aspect, so I will gradually strengthen it, and everyone will encourage me.

The above is the detailed content of Optimize the matching of huge amounts of keywords. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

3 weeks ago By DDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

2 weeks ago By DDD

Where to find the Crane Control Keycard in Atomfall

3 weeks ago By DDD

Assassin's Creed Shadows - How To Find The Blacksmith And Unlock Weapon And Armour Customisation

1 months ago By DDD

Roblox: Dead Rails - How To Complete Every Challenge

3 weeks ago By DDD

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7613

CakePHP Tutorial

1387

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

136

Related knowledge

Explain what the explorer.exe process is Feb 18, 2024 pm 12:11 PM

What process is explorer.exe? When we use the Windows operating system, we often hear the term "explorer.exe". So, are you curious about what this process is? In this article, we will explain in detail what process explorer.exe is and its functions and effects. First of all, explorer.exe is a key process of the Windows operating system. It is responsible for managing and controlling Windows Explorer (Window

How to adjust aperture on Xiaomi Mi 14 Ultra? Mar 19, 2024 am 09:01 AM

Adjusting the aperture size has a crucial impact on the photo effect. Xiaomi Mi 14 Ultra provides unprecedented flexibility in camera aperture adjustment. In order to allow everyone to adjust the aperture smoothly and realize the free adjustment of the aperture size, the editor here brings you a detailed tutorial on how to set the aperture on Xiaomi Mi 14Ultra. How to adjust the aperture on Xiaomi Mi 14Ultra? Start the camera, switch to "Professional Mode", and select the main camera - W lens. Click on the aperture, open the aperture dial, A is automatic, select f/1.9 or f/4.0 as needed.

What is the highest graphics card that r5 5600x can drive? The latest performance of using 5600X with RX6800XT Feb 25, 2024 am 10:34 AM

On October 29, AMD finally released a much-anticipated blockbuster product, the RX6000 series of gaming graphics cards based on the new RDNA2 architecture. This graphics card complements the previously launched Ryzen 5000 series processors based on the new ZEN3 architecture, forming a new double-A combination. This release not only eclipsed the competitor "Shuangying", but also had a major impact on the entire DIY hardware circle. Next, let’s use the combination of AMD Ryzen 5600X and RX6800XT in my hands as a test example to see how awesome AMD is today. Let’s talk about the CPU processor part first. The previous generation of AMD Ryzen 3000 series processors using ZEN2 architecture has actually been used.

How to set Chinese in Cheat Engine? How to set Chinese in ce modifier Mar 18, 2024 pm 01:20 PM

Ce Modifier (CheatEngine) is a game modification tool dedicated to modifying and editing game memory. So how to set Chinese in CheatEngine? Next, the editor will tell you how to set Chinese in Ce Modifier. I hope it can Help friends in need. In the new software we download, it can be confusing to find that the interface is not in Chinese. Even though this software was not developed in China, there are ways to convert it to the Chinese version. This problem can be solved by simply applying the Chinese patch. After downloading and installing the CheatEngine (ce modifier) software, open the installation location and find the folder named languages, as shown in the figure below

What does the 0x0000004e error mean? Feb 18, 2024 pm 01:54 PM

What is 0x0000004e failure? Failure is a common problem in computer systems. When a computer encounters a fault, the system usually shuts down, crashes, or displays error messages because it cannot run properly. In Windows systems, there is a specific fault code 0x0000004e, which is a blue screen error code indicating that the system has encountered a serious error. The 0x0000004e blue screen error is caused by system kernel or driver issues. This error usually causes the computer system to

Which has a greater impact on performance, memory frequency or timing? Feb 19, 2024 am 08:58 AM

Memory is one of the most important components in the computer, and it has a significant impact on the performance and stability of the computer. When choosing memory, people tend to focus on two important parameters, namely timing and frequency. So, for memory performance, which is more important, timing or frequency? First, let's understand the concepts of timing and frequency. Timing refers to the time interval required for a memory chip to receive and process data. It is usually represented by a CL value (CASLatency). The smaller the CL value, the faster the memory processing speed. The frequency is within

How to update Honor MagicOS 8.0 on Honor 90 GT? Mar 18, 2024 pm 06:46 PM

Honor 90GT is a cost-effective smartphone with excellent performance and excellent user experience. However, sometimes we may encounter some problems, such as how to update Honor MagicOS8.0 on Honor 90GT? This step may be different for different mobile phones and different models. So, let us discuss how to upgrade the system correctly. How to update Honor MagicOS 8.0 on Honor 90GT? According to news on February 28, Honor today pushed the MagicOS8.0 public beta update for its three mobile phones 90GT/100/100Pro. The package version number is 8.0.0.106 (C00E106R3P1) 1. Ensure your Honor The battery of the 90GT is fully charged;

Planet Mojo: Building a Web3 game metaverse from the auto-chess game Mojo Melee Mar 14, 2024 pm 05:55 PM

Popular Metaverse game projects founded in the last crypto cycle are accelerating their expansion. On March 4, PlanetMojo, the Web3 game metaverse platform, announced a number of important developments in its game ecology, including the announcement of the upcoming parkour game GoGoMojo, the launch of the new season "Way of War" in the flagship auto-chess game MojoMelee, and the celebration of the new The first ETH series "WarBannerNFT" launched this season in cooperation with MagicEden. In addition, PlanetMojo also revealed that they plan to launch Android and iOS mobile versions of MojoMelee later this year. This project will be launched at the end of 2021. After nearly two years of hard work in the bear market, it will soon be completed.

See all articles