This article mainly introduces you to the method of using PHP to implement the TopK-algorithm using a binary heap. The introduction in the article is very detailed and has certain reference and learning value for everyone. Friends who need it can follow the editor to learn together. Bar.
Preface
In the past, I often encountered a problem when working or interviewing. How to achieve massive TopN is to achieve a very large result. To quickly find the largest 10 or 100 numbers in a set, while ensuring memory and speed efficiency, our first idea may be to use sorting, and then intercept the top 10 or 100, and sorting is not suitable when the amount is not particularly large. There is no problem, but as long as the amount is extremely large, it is impossible to complete this task. For example, if there are hundreds of millions of numbers in an array or text file, it is impossible to read them all into the memory, so using sorting to solve this problem is not The best, so here we use PHP to implement a small top heap to solve this problem. A heap is a special kind of heap. A binary heap is a complete binary tree or an approximately complete binary tree. There are two types of binary heaps, the maximum heap and the minimum heap. The maximum heap: the key value of the parent node is always greater than or equal to any child. The key value of the node; the minimum heap: the key value of the parent node is always less than or equal to the key value of any child node
小头 heap-(picture from the Internet)
Binary heaps are generally represented by arrays (see the figure above). For example, the position of the root node in the array is 0, and the child nodes at the nth position are respectively 2n 1 and 2n 2. Therefore, the The child nodes of position 0 are at 1 and 2, the child nodes of 1 are at 3 and 4, and so on. This storage method makes it easy to find parent nodes and child nodes. I won’t go into details about the specific conceptual issues here. If you have any questions about the binary heap, you can have a good understanding of this data structure. Next, we will use PHP code to implement and solve the above topN problems. In order to see To find out the difference, let’s first use the sorting method to implement it and see how it works.//为了测试运行内存调大一点 ini_set('memory_limit', '2024M'); //实现一个快速排序函数 function quick_sort(array $array){ $length = count($array); $left_array = array(); $right_array = array(); if($length <= 1){ return $array; } $key = $array[0]; for($i=1;$i<$length;$i++){ if($array[$i] > $key){ $right_array[] = $array[$i]; }else{ $left_array[] = $array[$i]; } } $left_array = quick_sort($left_array); $right_array = quick_sort($right_array); return array_merge($right_array,array($key),$left_array); } //构造500w不重复数 for($i=0;$i<5000000;$i++){ $numArr[] = $i; } //打乱它们 shuffle($numArr); //现在我们从里面找到top10最大的数 var_dump(time()); print_r(array_slice(quick_sort($all),0,10)); var_dump(time());
You can see that the top10 results are printed above, and the running time is output, which is about 99s, but this is only a 5 million number and If everything can be loaded into the memory, if we have a file with 5kw or 500 million numbers, there will definitely be some problems.
Use the binary heap algorithm to implement TopNThe implementation process is: 1. First read 10 or 100 numbers into the array, this is Our topN number.
2. Call the generate small top heap function to generate a small top heap structure from this array. At this time, the top of the heap must be the smallest.
3. Traverse all the remaining numbers from the file or array in sequence.
4. Each time one is traversed, compare the size with the element on the top of the heap. If it is smaller than the element on the top of the heap, discard it. If it is larger than the element on the top of the heap, it will be discarded. Replace the top element.
5. After replacing the top element of the heap, call the generate small top heap function to continue generating the small top heap, because you need to find the smallest one.
6. Repeat the above steps 4~5, so that after all traversals are completed, the largest topN will be in our small top heap, because our small top heap will always exclude the smallest and leave the largest , and the speed of adjusting the small top heap is also very fast. It is just a relative adjustment, as long as the root node is smaller than the left and right nodes.
7. In terms of algorithm complexity, according to top10 in the worst case, That is, every time a number is traversed, if it is replaced with the top of the heap, it needs to be adjusted 10 times, which is faster than the sorting speed, and it does not read all the contents into the memory. It can be understood that the achievement is a linear traversal.
//生成小顶堆函数 function Heap(&$arr,$idx){ $left = ($idx << 1) + 1; $right = ($idx << 1) + 2; if (!$arr[$left]){ return; } if($arr[$right] && $arr[$right] < $arr[$left]){ $l = $right; }else{ $l = $left; } if ($arr[$idx] > $arr[$l]){ $tmp = $arr[$idx]; $arr[$idx] = $arr[$l]; $arr[$l] = $tmp; Heap($arr,$l); } } //这里为了保证跟上面一致,也构造500w不重复数 /* 当然这个数据集并不一定全放在内存,也可以在 文件里面,因为我们并不是全部加载到内存去进 行排序 */ for($i=0;$i<5000000;$i++){ $numArr[] = $i; } //打乱它们 shuffle($numArr); //先取出10个到数组 $topArr = array_slice($numArr,0,10); //获取最后一个有子节点的索引位置 //因为在构造小顶堆的时候是从最后一个有左或右节点的位置 //开始从下往上不断的进行移动构造(具体可看上面的图去理解) $idx = floor(count($topArr) / 2) - 1; //生成小顶堆 for($i=$idx;$i>=0;$i--){ Heap($topArr,$i); } var_dump(time()); //这里可以看到,就是开始遍历剩下的所有元素 for($i = count($topArr); $i < count($numArr); $i++){ //每遍历一个则跟堆顶元素进行比较大小 if ($numArr[$i] > $topArr[0]){ //如果大于堆顶元素则替换 $topArr[0] = $numArr[$i]; /* 重新调用生成小顶堆函数进行维护,只不过这次是从堆顶 的索引位置开始自上往下进行维护,因为我们只是把堆顶 的元素给替换掉了而其余的还是按照根节点小于左右节点 的顺序摆放这也就是我们上面说的,只是相对调整下,并 不是全部调整一遍 */ Heap($topArr,0); } } var_dump(time());
You can see that the final result is also top10, only However, it only took about 1 second, and both memory and time efficiency met our requirements. And the best thing about sorting is that we don’t need to read all the data sets into the memory, because we don’t need to sort. The above is for demonstration, so 5 million elements are constructed directly in the memory. However, we can transfer all of them to a file, and then read them line by line for comparison, because the core point of our data structure is linear traversal and there are many elements in the memory. Compare the small top heap structures and finally get TopN.
The above is the entire content of this article, I hope it will be helpful to everyone's study.
Related recommendations:
php Implement hexadecimal conversion_php skills
How to compile redis and phpredis under Linux_php skills
##PHPHow to use the Mysqli class library to achieve perfect paging effect_php skills
The above is the detailed content of PHP uses binary heap to implement TopK-algorithm. For more information, please follow other related articles on the PHP Chinese website!