source_ip = line.split('- -')[0].strip()
if re.match('[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}',source_ip):
if source_ip_dict.get(source_ip,'-')=='-':
source_ip_dict[source_ip]=1
else:
source_ip_dict[source_ip]=source_ip_dict[source_ip]+1
Extract the apache log IP through the above code, and perform statistical deduplication.
The extracted IP data is as follows:
So how to name and classify these IP addresses,
For example,
202.108.11.103 and 220.181.32.137 are Baidu Spider IPs
The effect you want to achieve is as follows
The two IPs are named Baidu Spider, and then add their statistics together, that is 4336 3411
Baidu Spider 7747
How to do this
You can try to build a large dictionary with the dictionary as the key and the crawler name as the value;
Pivot table using pandas
How tiring it is!
Why not create a separate table for this IP group, named IPGroup (id, ip, groupname)
After that, it can be done with just one SQL, how easy it is (let the poster use IPStastics)