Let's talk about the Top K algorithm first. The details are as follows.
Application scenarios:
For every search, the search engine records the query string the user entered in a log file. Each query string is 1 to 255 bytes long.
Assume there are currently 10 million records. The repetition among these query strings is fairly high: although the total is 10 million, no more than 3 million distinct queries remain after removing duplicates. The more often a query string repeats, the more users are searching for it, and therefore the more popular it is. Count the 10 most popular query strings, using no more than 1 GB of memory.
Essential knowledge:
What is a hash table?
A hash table (also known as a hash map) is a data structure that is accessed directly by key.
It accesses a record by mapping its key to a position in a table, which speeds up lookups. The mapping function is called the hash function, and the array that stores the records is called the hash table.
The idea behind a hash table is actually very simple: convert the key into an integer with a fixed algorithm, the so-called hash function, then take that integer modulo the length of the array. The remainder is used as the array subscript, and the value is stored in the array slot at that subscript.
When querying, the hash function converts the key into the corresponding array subscript again, and the value is read from that slot. In this way, the constant-time indexing of the array is fully exploited.
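To make this concrete, here is a minimal JavaScript sketch (the function simpleHash and the table layout are illustrative only, and collisions are ignored for brevity):

// A toy hash function: mix the character codes into an integer, then take
// it modulo the table length; the remainder is the array subscript.
function simpleHash(key, tableSize) {
  var h = 0;
  for (var i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) % tableSize;
  }
  return h;
}

var table = new Array(16);                      // the array storing the records
table[simpleHash("hello", table.length)] = 42;  // store under the hashed subscript
console.log(table[simpleHash("hello", table.length)]); // 42, located in O(1)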
Problem analysis:
To count the most popular Queries, first count the number of occurrences of each Query, and then find the Top 10 based on those counts. We can therefore design the algorithm in two steps.
That is, the solution to this problem is divided into the following two steps:
Step one: Query statistics (count the number of occurrences of each Query)
There are the following two methods to choose from for the query statistics:
1) Direct sorting method (the usual way to pull such statistics out of a log file: cat file | format_key | sort | uniq -c | sort -nr | head -n 10, where format_key stands for whatever step extracts the query field)
The algorithm we think of first is sorting: sort all the Queries in the log, then traverse the sorted Queries and count how many times each one appears.
But the problem has a clear requirement: memory must not exceed 1 GB. With 10 million records of up to 255 bytes each, the data can occupy 10,000,000 × 255 B ≈ 2.375 GiB, so it cannot all be held in memory.
Let us recall the data structures course: when the amount of data is too large to fit in memory, we can use an external sort. Merge sort is a good choice here, because it has a good time complexity of O(N log N).
After sorting, we traverse the now-ordered Query file, count the occurrences of each Query, and write the counts back to a file (a sketch of this pass follows below).
Overall, the sorting costs O(N log N) and the traversal costs O(N), so the total time complexity of this algorithm is O(N + N log N) = O(N log N).
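As a sketch of that counting pass, assuming the externally sorted queries arrive as an array of strings (a real implementation would stream lines from the sorted file instead):

// After sorting, identical queries are adjacent, so one linear pass suffices.
function countSorted(sortedQueries) {
  var counts = [];
  var i = 0;
  while (i < sortedQueries.length) {
    var q = sortedQueries[i];
    var c = 0;
    while (i < sortedQueries.length && sortedQueries[i] === q) { c++; i++; }
    counts.push([q, c]); // one [query, count] pair per distinct query
  }
  return counts;
}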
2) Hash Table method (this method is very good for counting the occurrences of strings)
In the first method we used sorting to count the occurrences of each Query, at a cost of O(N log N). Is there a way to do the counting with a lower time complexity?
The problem explains that although there are 10 million Queries, the repetition is so high that there are really only 3 million distinct Queries. At up to 255 bytes each, that is about 3,000,000 × 255 B ≈ 765 MB, so we can consider keeping them all in memory. Now we just need a suitable data structure, and here a Hash Table is definitely the first choice, because its lookups take nearly O(1) time.
So, we have our algorithm:
Maintain a HashTable whose key is the Query string and whose value is that Query's occurrence count. Each time a Query is read, if the string is not in the table, add it and set its value to 1; if the string is already in the table, add one to its count. In the end we finish processing this massive data set in O(N) time.
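A minimal sketch of this counting step in JavaScript, using the built-in Map as the hash table (the helper name countQueries is illustrative):

// One O(N) pass: look up each query in the hash table and bump its count.
function countQueries(queries) {
  var counts = new Map();
  for (var q of queries) {
    counts.set(q, (counts.get(q) || 0) + 1);
  }
  return counts;
}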
Compared with Algorithm 1, this method cuts the time complexity by a factor of log N, down to O(N). And the optimization is not only in time complexity: this method needs just one I/O pass over the data file, while Algorithm 1 performs many more I/O operations, so Algorithm 2 is also more workable in engineering terms than Algorithm 1.
Step two: Find the Top 10 (the 10 queries that appear most frequently)
Algorithm 1: Ordinary sorting (we only need the Top 10, so sorting everything is redundant work)
Everyone is familiar with sorting algorithms, so I won't repeat the details here. Just note that a comparison sort costs O(N log N), and that in this problem the 3 million distinct records do fit in 1 GB of memory.
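A sketch of this straightforward approach, assuming counts is the Map produced in step one (top10BySorting is an illustrative name):

// Sort all [query, count] pairs by count, descending, and take the first 10.
// O(N' log N'), where N' is the number of distinct queries.
function top10BySorting(counts) {
  return Array.from(counts.entries())
    .sort(function (a, b) { return b[1] - a[1]; })
    .slice(0, 10);
}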
Algorithm 2: Partial Sorting
The requirement is only the Top 10, so we do not need to sort all the Queries. We just maintain an array of size 10, initialize it with 10 Queries, and keep it sorted in descending order of count. Then we traverse the 3 million records. Each record read is compared with the last (smallest) Query in the array: if its count is smaller, we keep traversing; otherwise we evict the last element of the array and insert the current Query at the position that keeps the array ordered. When all the data has been traversed, the 10 Queries in the array are the Top 10 we are looking for.
It is not hard to see that the worst-case time complexity of this approach is O(N × K), where K is the number of top elements required (10 here); a JavaScript sketch follows.
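A sketch of the partial sorting idea, again assuming counts is the Map from step one (topKPartial is an illustrative name):

// Keep a size-K array of [query, count] pairs, sorted by count, descending.
function topKPartial(counts, K) {
  var top = [];
  counts.forEach(function (count, query) {
    if (top.length < K || count > top[top.length - 1][1]) {
      if (top.length === K) { top.pop(); } // evict the current smallest
      var i = top.length;                  // find the slot that keeps order
      while (i > 0 && top[i - 1][1] < count) { i--; }
      top.splice(i, 0, [query, count]);    // O(K) insertion
    }
  });
  return top;
}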
Algorithm 3: Heap
In Algorithm 2 we optimized the time complexity from O(N log N) to O(N × K), which is admittedly a fairly big improvement. But is there a better way?
Analyze it: in Algorithm 2, each time a comparison succeeds, the element has to be inserted into a linear, ordered array by sequential comparison, which costs O(K). Note, however, that the array is ordered, so the search for the insertion position can use binary search, which cuts the comparison cost down to O(log K). The remaining problem is data movement: the elements after the insertion point still have to be shifted, so the insertion as a whole stays O(K). Even so, this is an improvement over plain Algorithm 2, as the sketch below shows.
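A sketch of the binary-search refinement, for the same descending [query, count] array as above (insertPos is an illustrative helper):

// Find the insertion position in O(log K) comparisons. The splice that
// follows still shifts elements, so the insert itself remains O(K).
function insertPos(top, count) {
  var lo = 0, hi = top.length; // top is sorted by count, descending
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (top[mid][1] >= count) { lo = mid + 1; } else { hi = mid; }
  }
  return lo;
}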
Based on the above analysis, let's think: is there a data structure that can both find and move elements quickly?
The answer is yes: the heap.
With the help of the heap structure, we can find and adjust/move elements in O(log K) time. So at this point the algorithm improves to this: maintain a min-heap (a small-root heap) of size K (10 in this problem), traverse the 3 million Queries, and compare each one with the heap's root; if a Query's count is larger than the root's, replace the root with it and restore the heap order.
The idea is consistent with Algorithm 2 above; the only difference is that in Algorithm 3 we use a min-heap instead of an array, which reduces the cost of finding and updating the target element from O(K) to O(log K).
So, using the heap data structure, Algorithm 3 brings the final time complexity down to O(N × log K), a fairly large improvement over Algorithm 2.
At this point the algorithm is complete: in the first step we use a hash table to count the occurrences of each Query in O(N); in the second step we use the heap to find the Top 10 in N' × O(log K). The final time complexity is therefore O(N) + N' × O(log K), where N is 10 million and N' is 3 million. A sketch of the second step follows.
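Here is a minimal sketch of Algorithm 3, assuming counts is the Map from step one (topKWithHeap and its helpers are illustrative names, not a standard API):

// Min-heap of size K over [query, count] pairs: the root is always the
// smallest of the current Top K, so each query needs one comparison with
// the root plus, on replacement, an O(log K) sift-down.
function topKWithHeap(counts, K) {
  var heap = [];
  function siftDown(i) {
    while (true) {
      var l = 2 * i + 1, r = 2 * i + 2, s = i;
      if (l < heap.length && heap[l][1] < heap[s][1]) { s = l; }
      if (r < heap.length && heap[r][1] < heap[s][1]) { s = r; }
      if (s === i) { return; }
      var t = heap[i]; heap[i] = heap[s]; heap[s] = t;
      i = s;
    }
  }
  function siftUp(i) {
    while (i > 0) {
      var p = (i - 1) >> 1;
      if (heap[p][1] <= heap[i][1]) { return; }
      var t = heap[i]; heap[i] = heap[p]; heap[p] = t;
      i = p;
    }
  }
  counts.forEach(function (count, query) {
    if (heap.length < K) {
      heap.push([query, count]);
      siftUp(heap.length - 1);
    } else if (count > heap[0][1]) {
      heap[0] = [query, count]; // replace the current minimum
      siftDown(0);
    }
  });
  return heap.sort(function (a, b) { return b[1] - a[1]; }); // descending for display
}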
How do we implement the Top K algorithm with a heap in JavaScript? Note that the code below takes a slightly different route from Algorithm 3: instead of a size-K min-heap over counts, it builds a max-heap over the whole array and extracts the maximum K times.
1. Heap helpers: exchange two elements, and sift an element down until the heap property holds; a single sift-down costs O(log N)
function exch(arr, i, j) {
  var t = arr[i];
  arr[i] = arr[j];
  arr[j] = t;
}

// Sift arr[i] down until the heap property holds for the first `size`
// elements. comp(a, b) returns true when a should sit below b
// (for a max-heap, pass function (a, b) { return a < b; }).
function siftDown(arr, i, size, comp) {
  while (true) {
    var left = 2 * i + 1;  // children of i in a 0-based array
    var right = 2 * i + 2;
    var next = i;
    if (left < size && comp(arr[next], arr[left])) { next = left; }
    if (right < size && comp(arr[next], arr[right])) { next = right; }
    if (next === i) { return; }
    exch(arr, i, next);
    i = next;
  }
}
2. Build a max-heap over the whole array in O(N), then extract the maximum K times; each extraction costs O(log N), so the total is O(N + K × log N)
function topK(arr, n, comp) {
  if (!arr || arr.length === 0 || n <= 0 || n > arr.length) { return -1; }
  var size = arr.length;
  // Build the heap bottom-up: O(N).
  for (var i = (size >> 1) - 1; i >= 0; i--) {
    siftDown(arr, i, size, comp);
  }
  var ret = [];
  // Pop the root n times: O(K log N).
  for (var k = 0; k < n; k++) {
    ret.push(arr[0]);
    exch(arr, 0, --size); // move the last leaf to the root
    siftDown(arr, 0, size, comp);
  }
  return ret;
}
3. Test
var ret = topK([16, 22, 91, 0, 51, 44, 23], 3, function (a, b) { return a < b; });
console.log(ret); // [91, 51, 44]
The above is the heap-based implementation of the Top K algorithm and this article's answer to what the Top K algorithm is. I hope it is helpful to everyone's learning.