So there is a log file full of visitor IPs (the data set is very, very large), one IP per line.
For example:
123.22.22.187
123.22.22.182
123.22.21.187
123.22.42.187
123.23.22.187
123.22.22.182
121.22.22.182
....
How do I find the 10 IPs with the most visits?
My approach:
Step 1:
Read the file as follows:
Declare a variable: dictionary<string, long> record
123.22.22.187
123.22.22.182
123.22.21.187
123.22.42.187
123.23.22.187
123.22.22.182
121.22.22.182
Take the blue part of each IP above (the first segment, i.e. everything before the first dot) as the key, and the number of times that key appears as the value, and store them in record.
That way record holds at most 255 entries, which is not many and hardly uses any memory.
Then do a cleanup pass
so that only the 10 entries with the largest counts are left in record.
Using these 10 keys as the filter, scan the log once more and write every matching line to another file, [File A].
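A minimal sketch of what this step 1 could look like in Java (the file names access.log and fileA.log and the record map are placeholders; this assumes the log can be streamed line by line):

```java
import java.io.*;
import java.util.*;

public class Step1 {
    public static void main(String[] args) throws IOException {
        Map<String, Long> record = new HashMap<>();
        // First pass: count how many times each first segment appears.
        try (BufferedReader in = new BufferedReader(new FileReader("access.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;
                String first = line.substring(0, line.indexOf('.'));
                record.merge(first, 1L, Long::sum);
            }
        }
        // "Cleanup": keep only the 10 first segments with the largest counts.
        Set<String> top = new HashSet<>();
        record.entrySet().stream()
              .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
              .limit(10)
              .forEach(e -> top.add(e.getKey()));
        // Second pass: write every line whose first segment survived into [File A].
        try (BufferedReader in = new BufferedReader(new FileReader("access.log"));
             PrintWriter out = new PrintWriter(new FileWriter("fileA.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty() && top.contains(line.substring(0, line.indexOf('.')))) {
                    out.println(line);
                }
            }
        }
    }
}
```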
Step 2:
Scan [File A], taking the blue part of the data below (now the second segment) as the key and its occurrence count as the value, and write them into the variable dictionary<string, long> record.
123.22.22.187
123.22.22.182
123.22.21.187
123.22.42.187
123.23.22.187
123.22.22.182
121.22.22.182
...
Step 3:
...
Step 4:
...
In this way, using the same method for 4 steps (one per segment), extract the most frequent data: the top 10 records.
Question: I feel like this approach doesn't work. Or am I misunderstanding it??
For example:
suppose 123.125.196.158 is the IP address that appears the most often, and the log looks like this:
121.123...
121.14...
121.56...
123.125.196.158
123.125.196.158
If the counts are compared segment by segment, wouldn't 123.125.196.158 already be eliminated in the first step?
The correct answer is this: sort|uniq -c
Write each IP as a.b.c.d.
Deal with a first.
Read the whole file in several passes (255, say),
and store the records in 255 files according to a (e.g. 1.b.c.d; 2.b.c.d; ...).
This way you can get the largest ten.
Next, deal with b.
. . .
Finally you get the answer.
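A sketch of that splitting pass, assuming one bucket file per value of a (the file names and the helper method are made up):

```java
import java.io.*;
import java.util.*;

public class SplitByA {
    public static void main(String[] args) throws IOException {
        Map<String, PrintWriter> buckets = new HashMap<>(); // one writer per value of a
        try (BufferedReader in = new BufferedReader(new FileReader("access.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;
                String a = line.substring(0, line.indexOf('.'));
                // Every occurrence of the same IP lands in the same bucket file.
                buckets.computeIfAbsent(a, k -> open("bucket-" + k + ".log")).println(line);
            }
        } finally {
            buckets.values().forEach(PrintWriter::close);
        }
    }

    private static PrintWriter open(String name) {
        try {
            return new PrintWriter(new BufferedWriter(new FileWriter(name)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```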
This kind of problem can use a max-min heap.
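For instance, a fixed-size min-heap over the counts keeps only the 10 largest; a sketch with java.util.PriorityQueue (the counts map is assumed to have been built already):

```java
import java.util.*;

public class Top10WithHeap {
    // Given ip -> count, keep a size-10 min-heap ordered by count;
    // whatever remains in the heap are the 10 most frequent IPs.
    static List<String> top10(Map<String, Long> counts) {
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.<String, Long>comparingByValue());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > 10) {
                heap.poll(); // evict the current minimum
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result); // most frequent first
        return result;
    }
}
```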
Basic idea:
Group all the IPs and count their occurrences, with the IP as the key and the occurrence count as the value, stored in an associative array.
Sort the array by value (occurrence count) in descending order and take the first 10 records.
Although my example is written in JavaScript rather than Java, you should be able to follow it; in any case I have tested it and it achieves the effect you want.
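A rough Java equivalent of that count-then-sort idea (this is not the poster's JavaScript code; the names are just for illustration):

```java
import java.util.*;
import java.util.stream.Collectors;

public class CountAndSort {
    // ips: every IP line from the log; returns the 10 most frequent ones.
    static List<String> top10(List<String> ips) {
        // Group the IPs: ip -> number of occurrences.
        Map<String, Long> counts = new HashMap<>();
        for (String ip : ips) {
            counts.merge(ip, 1L, Long::sum);
        }
        // Sort by count, descending, and take the first 10 keys.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```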
map-reduce or sort | uniq -c
Obviously use Hadoop (map-reduce)?
Sort it first; it will be easier to deal with after sorting.
This question should be understood in two parts:
1. Reading a huge file in one go is inefficient
2. Finding the top 10 records with the most occurrences
For problem 1: large files can be read asynchronously with Java AIO.
For problem 2: LinkedList needs to be extended. Each time a record is appended, first check whether the record already exists; if it does, move that record's position forward (this is similar to the Java memcached caching algorithm).
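A rough interpretation of that extended-list idea in Java (the Node class and method names are invented here; each hit moves a record forward so the most frequent IPs end up at the head):

```java
import java.util.*;

public class FrequencyList {
    private static class Node {
        final String ip;
        long count = 1;
        Node(String ip) { this.ip = ip; }
    }

    // Kept ordered so that more frequent IPs drift toward the front.
    private final LinkedList<Node> list = new LinkedList<>();

    void append(String ip) {
        ListIterator<Node> it = list.listIterator();
        int index = 0;
        while (it.hasNext()) {
            Node n = it.next();
            if (n.ip.equals(ip)) {
                n.count++;
                it.remove();
                // Push the record forward past every predecessor it now outranks.
                int pos = index;
                while (pos > 0 && list.get(pos - 1).count < n.count) {
                    pos--;
                }
                list.add(pos, n);
                return;
            }
            index++;
        }
        list.addLast(new Node(ip)); // first time we see this IP
    }

    List<String> top10() {
        List<String> result = new ArrayList<>();
        for (Node n : list.subList(0, Math.min(10, list.size()))) {
            result.add(n.ip);
        }
        return result;
    }
}
```

Note that the lookup is linear for every appended line, so a sketch like this is only practical once the data has already been reduced to a small candidate set.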
These types of problems all have the same idea, "divide and conquer".
Each IP is equivalent to a 32-bit integer. Traverse all the IPs in the log and split them into n files by value range (making sure each file is small enough); the same IP is guaranteed to land in the same file. Then process these n files one by one, counting the top 10 IPs in each file; finally, take the top 10 from those 10*n IPs.
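A condensed sketch of that divide-and-conquer pipeline (here the split uses a hash of the line rather than a value range, which preserves the same-IP-same-file property; n, the file names, and the helpers are all just illustrative, and one bucket is assumed to fit in memory):

```java
import java.io.*;
import java.util.*;
import java.util.stream.Collectors;

public class DivideAndConquer {
    static final int N = 256; // number of bucket files

    public static void main(String[] args) throws IOException {
        // Pass 1: route every IP into one of N bucket files. The same IP
        // always hashes to the same bucket, so its full count stays together.
        PrintWriter[] buckets = new PrintWriter[N];
        for (int i = 0; i < N; i++) {
            buckets[i] = new PrintWriter(new BufferedWriter(new FileWriter("bucket-" + i)));
        }
        try (BufferedReader in = new BufferedReader(new FileReader("access.log"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {
                    buckets[Math.floorMod(line.hashCode(), N)].println(line);
                }
            }
        }
        for (PrintWriter w : buckets) w.close();

        // Pass 2: count each bucket in memory and keep its top 10 as candidates.
        Map<String, Long> candidates = new HashMap<>();
        for (int i = 0; i < N; i++) {
            Map<String, Long> counts = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("bucket-" + i))) {
                String line;
                while ((line = in.readLine()) != null) {
                    counts.merge(line, 1L, Long::sum);
                }
            }
            counts.entrySet().stream()
                  .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                  .limit(10)
                  .forEach(e -> candidates.put(e.getKey(), e.getValue()));
        }

        // Final step: top 10 among the 10 * N candidates.
        List<String> top10 = candidates.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        System.out.println(top10);
    }
}
```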
Use regular expressions to determine the first number at the beginning of the IP:
The number at the beginning of the line that appears the most is the most frequent IP
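A small sketch of that regex extraction (the pattern assumes one dotted IPv4 address at the start of each line):

```java
import java.util.*;
import java.util.regex.*;

public class LeadingNumber {
    // Matches the digits before the first dot at the start of a line.
    private static final Pattern LEADING = Pattern.compile("^(\\d{1,3})\\.");

    static Map<String, Long> countLeadingNumbers(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = LEADING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1L, Long::sum);
            }
        }
        return counts;
    }
}
```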