It can be estimated that each file is about 5 billion × 64 bytes ≈ 320 GB, which is far larger than the 4 GB memory limit, so it is impossible to load a whole file into memory for processing. Consider a divide-and-conquer approach.
Traverse file a, compute hash(url) % 1000 for each url, and append the url to one of 1000 small files (call them a0, a1, ..., a999) according to that value. Each small file is then roughly 320 MB. Traverse file b and distribute its urls into 1000 small files (b0, b1, ..., b999) using the same hash. After this step, any url that appears in both files must land in a corresponding pair of small files (a0 vs b0, a1 vs b1, ..., a999 vs b999); non-corresponding pairs (such as a0 vs b99) cannot share any url, because the same url always hashes to the same bucket number. So we only need to find the common urls within each of the 1000 pairs of small files, for example as sketched below.
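As a rough illustration of the partitioning step, here is a minimal Python sketch. It assumes each input file stores one url per line and that the files are simply named a and b (both names are assumptions); zlib.crc32 stands in for the hash(url) mentioned above because it is deterministic, so the same url lands in the same-numbered bucket for both files.

```python
import zlib

def partition(input_path, prefix, buckets=1000):
    # Open all bucket files up front (this needs an OS file-descriptor
    # limit above 1000; raise it or process buckets in batches otherwise).
    outs = [open(f"{prefix}{i}", "w", encoding="utf-8") for i in range(buckets)]
    try:
        with open(input_path, encoding="utf-8") as f:
            for line in f:
                url = line.rstrip("\n")
                # Deterministic hash keeps corresponding buckets aligned
                # between file a and file b; memory use stays constant.
                i = zlib.crc32(url.encode("utf-8")) % buckets
                outs[i].write(url + "\n")
    finally:
        for out in outs:
            out.close()

partition("a", "a")   # writes a0, a1, ..., a999
partition("b", "b")   # writes b0, b1, ..., b999
```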
Take a0 vs b0 as an example: traverse a0 and insert each url into a hash_map; then traverse b0, and whenever a url is already in the hash_map, that url exists in both a and b, so write it to an output file.
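A minimal sketch of this per-pair step, assuming the bucket files produced above and using a Python set in place of the hash_map; the output names common0 ... common999 are hypothetical. Each ~320 MB bucket, even with per-string overhead, should still fit well within the 4 GB limit.

```python
def common_urls(path_a, path_b, out_path):
    # Load one bucket of a into an in-memory set (playing the hash_map role),
    # then stream the matching bucket of b and keep only urls seen in both.
    with open(path_a, encoding="utf-8") as fa:
        seen = {line.rstrip("\n") for line in fa}
    with open(path_b, encoding="utf-8") as fb, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in fb:
            url = line.rstrip("\n")
            if url in seen:
                out.write(url + "\n")

# Process the 1000 corresponding pairs one at a time.
for i in range(1000):
    common_urls(f"a{i}", f"b{i}", f"common{i}")
```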
If the split turns out uneven and some small files are too large (for example, larger than 2 GB), those oversized files can be split again in a similar way, using a different hash function so that their urls actually spread out across the sub-buckets; see the sketch below.
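Here is a minimal sketch of re-splitting one oversized pair, with hypothetical names a42/b42 and a hypothetical salt. The important detail is that the second-pass hash must differ from the first: re-using hash(url) % 1000 unchanged would send every url in this bucket to a single sub-bucket, since they all share the same first-pass value.

```python
import zlib

def repartition(bucket_path, prefix, sub_buckets=16, salt=b"level2:"):
    # Salting the hash makes it differ from the first-pass hash used above.
    outs = [open(f"{prefix}_{i}", "w", encoding="utf-8") for i in range(sub_buckets)]
    try:
        with open(bucket_path, encoding="utf-8") as f:
            for line in f:
                url = line.rstrip("\n")
                i = zlib.crc32(salt + url.encode("utf-8")) % sub_buckets
                outs[i].write(url + "\n")
    finally:
        for out in outs:
            out.close()

# Re-split both sides of the oversized pair with the same salted hash,
# then intersect a42_0 vs b42_0, a42_1 vs b42_1, ... as before.
repartition("a42", "a42")
repartition("b42", "b42")
```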
A Baidu interviewer asked this question yesterday, so let's study it today.
The above discusses the question: given two files a and b, each storing 5 billion urls, with each url occupying 64 bytes and a memory limit of 4 GB, how do we find the urls common to files a and b? I hope it will be helpful to interested friends.