Tencent has a campus recruitment question that goes like this:
For a PC with only 2G of memory, write an algorithm to find the median of a file containing 10G of integers.
There are several solutions, such as bucket sort and radix sort, but I came across a method that uses a heap, and I don't quite understand it. The original description goes like this:
First find the 1G-th largest element, then use that element to find the 2G-th largest, then use the 2G-th largest to find the 3G-th largest... Although this does not require sorting, it involves more disk operations, and the specifics need to be analyzed: which requires more disk IO, this approach or external sorting?
Build a max-heap of 1G integers; if an element is smaller than the maximum, put it into the heap. In this way you can get the 1G-th largest element. Then use this element to rebuild the heap once more, this time adding a comparison against the 1G-th largest element to the condition for entering the heap, so that after the heap is built you get the 2G-th largest element.
I was confused when I read this. First, I don't understand what "1G" refers to. Does it mean dividing the 10G of data into 10 parts and then sorting 1G of data at a time to find the largest?
I hope you can give me some advice on this solution, thank you.
A heap is a data structure: a complete binary tree (so no pointers are needed to store its structure) in which each node is either no larger than or no smaller than its child nodes, corresponding to a max-heap ("large top heap") and a min-heap ("small top heap") respectively. In a min-heap, for example, the root element is the smallest, so the largest or smallest element can be found in O(1) time, while deleting and inserting an element costs O(log N) in the worst case.
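For concreteness, here is a tiny illustration (mine, not from the post) of those properties using Python's standard heapq module, which stores a binary min-heap in a plain list:

```python
# Min-heap ("small top heap") basics with heapq: the list itself is the tree,
# so no pointers are stored -- only array indices.
import heapq

h = []
for v in [7, 2, 9, 4]:
    heapq.heappush(h, v)    # insert: O(log N) in the worst case

print(h[0])                 # peek at the minimum: O(1)      -> 2
print(heapq.heappop(h))     # delete the minimum: O(log N)   -> 2
print(h[0])                 # new minimum                    -> 4
```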
What this passage means is: first create a min-heap that can hold (1G+1) elements, so the smallest element in it can be found quickly. Then traverse all 10G of data and insert each element; whenever the heap becomes full after an insertion (it holds 1G+1 elements), simply delete the smallest one.
When the traversal finishes, the heap contains the largest 1G elements, and the top of the heap is the smallest of them. In other words, if the data were sorted in descending order, this element would sit at position 1G; call this number x.
Then repeat the process, with one difference: only elements less than x are inserted into the heap, and elements greater than or equal to x are skipped. That is, ignoring the 1G elements just found, find the next largest 1G; the top of the resulting heap is the element at position 2G in the descending order. (Be careful with duplicate values, though.) Repeat until the median position is reached.
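Here is a minimal sketch of this repeated-pass idea, run on a small in-memory list purely for illustration (my own sketch, not code from the post): `chunk` stands in for the 1G heap capacity, each call to `kth_pass` stands for one full scan of the 10G file, the (1G+1)-then-delete-the-minimum step is replaced by the equivalent push-or-replace on a heap of size `chunk`, and the duplicate-value caveat is ignored (values are assumed distinct).

```python
import heapq

def kth_pass(data, chunk, upper_bound=None):
    """One pass: keep the largest `chunk` values strictly below `upper_bound`
    in a min-heap and return the smallest of them, i.e. the value sitting at
    the next `chunk` boundary of the descending order."""
    heap = []                               # min-heap of at most `chunk` values
    for v in data:                          # in reality: stream the file from disk
        if upper_bound is not None and v >= upper_bound:
            continue                        # skip values found in earlier passes
        if len(heap) < chunk:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)      # pop the current minimum, push v
    return heap[0]

def median_by_passes(data, chunk):
    """Walk down the descending order `chunk` values at a time until the
    median position is reached."""
    target = (len(data) + 1) // 2           # median position, counted from the top
    bound, passed = None, 0
    while passed + chunk < target:
        bound = kth_pass(data, chunk, bound)   # value at position passed + chunk
        passed += chunk
    # final (possibly partial) pass lands exactly on the median position
    return kth_pass(data, target - passed, bound)

data = [9, 1, 4, 7, 3, 8, 2, 6, 5, 0]
print(median_by_passes(data, chunk=3))      # -> 5 (the upper median of 0..9)
```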
I have a question: can't this kind of problem be solved with a bitmap?
1. Traverse the 10G numbers, mark in the bitmap which values have appeared, and count how many distinct numbers appear; record this as count.
2. Traverse the bitmap and find the (count/2)-th set bit; that value is the answer.
If the integers are 4-byte values in the range 0 ~ 2^32-1, then covering the whole value range only needs a 512MB bitmap, right?
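A rough sketch of those two steps (my own illustration, assuming 4-byte unsigned integers). One caveat the post does not address: a plain bitmap records only whether a value appears, not how many times, so this yields the median of the distinct values and matches the true median only when duplicates are absent or acceptable to ignore.

```python
def bitmap_median(numbers, value_bits=32):
    bitmap = bytearray(1 << (value_bits - 3))    # 2^32 bits = 512MB for 32-bit values
    count = 0
    # Step 1: mark every value that appears and count the distinct values.
    for v in numbers:                            # in reality: stream the 10G file
        byte, bit = v >> 3, v & 7
        if not (bitmap[byte] >> bit) & 1:
            bitmap[byte] |= 1 << bit
            count += 1
    # Step 2: scan the bitmap in value order and stop at the (count/2)-th set bit.
    target = (count + 1) // 2
    seen = 0
    for v in range(1 << value_bits):
        if (bitmap[v >> 3] >> (v & 7)) & 1:
            seen += 1
            if seen == target:
                return v
    return None

# Small demo with 8-bit values so the bitmap is only 32 bytes:
print(bitmap_median([5, 1, 9, 3, 7], value_bits=8))   # -> 5
```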
I still have a few follow-up questions about this method; if anyone finds them interesting, please take a look:
Since the data is divided into 10 heaps, what is the common boundary value between the heaps?
If it is divided into ten heaps, how many comparisons does quicksort need in the worst case?
Is it possible to reduce the amount of calculation?
@zonxin