Background
While analyzing logs, I found that some request parameters contained other URLs, for example one pointing to xss.ha.ckers.org.

The approach is to extract the URL (here xss.ha.ckers.org) from the request parameters and compare it with the threat intelligence database. If it hits the blacklist, it is treated as malicious. If it is in neither the blacklist nor the company's whitelist, you can mark it first and analyze it in depth later.

Extract URL
There are many articles on the Internet about URL extraction, and most of them use regular expressions. That method is simple but not very accurate. Here I offer another method: use lexical analysis to extract domain names and IPs. The idea is borrowed from this article: https://blog.csdn.net/breaksoftware/article/details/7009209. If you are interested, take a look. Following an expert's approach really does pay off.
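Before getting into extraction itself, the triage step described in the Background (blacklist hit, whitelist hit, or mark for later) can be sketched roughly as follows. This is only an illustration: `triage_host`, `BLACKLIST` and `WHITELIST` are hypothetical names, and the two sets stand in for a real threat intelligence database and company whitelist.

```python
# Hypothetical stand-ins for the threat intelligence database and the
# company's whitelist; real deployments would query an external source.
BLACKLIST = {"xss.ha.ckers.org"}
WHITELIST = {"www.example.com"}


def triage_host(host, blacklist=BLACKLIST, whitelist=WHITELIST):
    """Return 'block', 'allow' or 'review' for an extracted host name."""
    # Normalize case and a trailing dot so lookups are consistent.
    host = host.lower().rstrip(".")
    if host in blacklist:
        return "block"
    if host in whitelist:
        return "allow"
    # Unknown host: mark it now and analyze it later.
    return "review"


if __name__ == "__main__":
    for h in ("xss.ha.ckers.org", "www.example.com", "unknown.test"):
        print(h, triage_host(h))
```

Keeping the three outcomes separate makes it easy to route the "review" bucket to manual analysis without touching the blocking logic.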
The original article is written in C; I wrote a similar version in Python for your reference.

Common URL classification
Observation shows that the IP form of a URL has the simplest structure: four numbers, each no greater than 255, separated by dots. The domain form is more complex, but all such URLs have something in common: they end in a top-level domain name such as .com.

Define legal characters:
Top-level domain name list:
Domain name form extraction: such as www.baidu.com.
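The legal character set and the top-level domain list named above are not reproduced in this text. A minimal sketch of what such definitions might look like is given here; `LEGAL_CHARS`, `TOP_LEVEL_DOMAINS`, `is_ipv4` and `has_known_tld` are assumed names for illustration, and the TLD set is a tiny illustrative subset rather than the article's actual list.

```python
import string

# Assumption: host names are limited to ASCII letters, digits, '-' and '.'.
LEGAL_CHARS = set(string.ascii_letters + string.digits + "-.")

# Illustrative subset only; a real list would be far longer.
TOP_LEVEL_DOMAINS = {"com", "net", "org", "cn"}


def is_ipv4(text):
    """True if text is four dot-separated decimal numbers in 0..255."""
    parts = text.split(".")
    return len(parts) == 4 and all(
        p.isascii() and p.isdigit() and len(p) <= 3 and int(p) <= 255
        for p in parts
    )


def has_known_tld(text):
    """True if text uses only legal characters and ends in a listed TLD."""
    labels = text.lower().split(".")
    return (
        set(text) <= LEGAL_CHARS
        and len(labels) >= 2
        and all(labels)  # reject empty labels such as "a..com"
        and labels[-1] in TOP_LEVEL_DOMAINS
    )
```

With these two checks, www.baidu.com is recognized by its .com suffix, while 192.168.1.1 is recognized purely by its four-number shape.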
    # Fragment of the tokenizer method: z is the input string, i the scan position.
    # First number of the IP form.
    while (i < len(z) and z[i].isdigit()):
        i = i + 1
        ip_v1 = True
        reti = i
    if i < len(z) and z[i] == '.':
        i = i + 1
        reti = i
    else:
        tokenType = TK_OTHER
        reti = 1

    # Second number.
    while (i < len(z) and z[i].isdigit()):
        i = i + 1
        ip_v2 = True
    if i < len(z) and z[i] == '.':
        i = i + 1
    else:
        if tokenType != TK_DOMAIN:
            tokenType = TK_OTHER
            reti = 1

    # Third number.
    while (i < len(z) and z[i].isdigit()):
        i = i + 1
        ip_v3 = True
    if i < len(z) and z[i] == '.':
        i = i + 1
    else:
        if tokenType != TK_DOMAIN:
            tokenType = TK_OTHER
            reti = 1

    # Fourth number, optionally followed by ':' and a port.
    while (i < len(z) and z[i].isdigit()):
        i = i + 1
        ip_v4 = True
    if i < len(z) and z[i] == ':':
        i = i + 1
        while (i < len(z) and z[i].isdigit()):
            i = i + 1

    # All four numbers were seen: record the IP (with port, if any).
    if ip_v1 and ip_v2 and ip_v3 and ip_v4:
        self.urls.append(z[0:i])
        return reti, tokenType
    else:
        if tokenType != TK_DOMAIN:
            tokenType = TK_OTHER
            reti = 1
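The four near-identical scanning steps above can also be folded into a single loop. The following is a standalone sketch of that idea, not the author's code: `scan_ipv4` is a hypothetical helper, and the 0..255 range check is an addition that the fragment above does not perform.

```python
DIGITS = "0123456789"


def scan_ipv4(z, start=0):
    """Return the end index of an IPv4[:port] token at z[start:], or -1."""
    i = start
    for octet in range(4):
        begin = i
        while i < len(z) and z[i] in DIGITS:
            i += 1
        # Each number needs at least one digit and must not exceed 255.
        if i == begin or int(z[begin:i]) > 255:
            return -1
        # The first three numbers must be followed by a dot.
        if octet < 3:
            if i < len(z) and z[i] == ".":
                i += 1
            else:
                return -1
    # Optional port: consume ':' only when digits actually follow it.
    if i < len(z) and z[i] == ":":
        j = i + 1
        while j < len(z) and z[j] in DIGITS:
            j += 1
        if j > i + 1:
            i = j
    return i


if __name__ == "__main__":
    text = "10.0.0.1:8080/index"
    end = scan_ipv4(text)
    print(text[:end])
```

Returning an end index instead of appending to a shared list keeps the helper free of side effects, so the caller decides what to do with the matched span.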
When the scanner reads the first half of a name that begins with 1234, that part matches the IP form, but the code then raises an exception. The IP-handling code therefore needs an extra check that determines whether the suffix is a top-level domain name.
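One way to picture that check is a small scanner that collects a whole run of host characters first and only then decides between the domain form and the IP form, testing the suffix before the four-number shape. This is a simplified sketch under assumptions, not the linked implementation: `extract_hosts` and `classify` are hypothetical names, the TLD set is an illustrative subset, 1234.com is an assumed example of a digit-leading domain, and ports are dropped.

```python
import string

HOST_CHARS = set(string.ascii_letters + string.digits + "-.")
TOP_LEVEL_DOMAINS = {"com", "net", "org", "cn"}  # illustrative subset


def classify(token):
    """Return 'domain', 'ip' or None for a run of host characters."""
    labels = token.split(".")
    if len(labels) < 2 or not all(labels):
        return None
    # Check the suffix first, so a digit-leading name is not forced
    # down the IP path.
    if labels[-1].lower() in TOP_LEVEL_DOMAINS:
        return "domain"
    if len(labels) == 4 and all(
        l.isdigit() and len(l) <= 3 and int(l) <= 255 for l in labels
    ):
        return "ip"
    return None


def extract_hosts(text):
    """Scan text left to right and collect (kind, host) pairs."""
    hosts = []
    i = 0
    while i < len(text):
        if text[i] not in HOST_CHARS:
            i += 1
            continue
        start = i
        while i < len(text) and text[i] in HOST_CHARS:
            i += 1
        token = text[start:i].strip(".-")
        kind = classify(token)
        if kind:
            hosts.append((kind, token))
    return hosts


if __name__ == "__main__":
    line = "GET /q?u=http://xss.ha.ckers.org/x.js&ip=10.0.0.1:8080&d=1234.com"
    print(extract_hosts(line))
```

Because the suffix test runs first, a token such as 1234.com is classified as a domain instead of failing partway through IP parsing.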
Full code: https://github.com/skskevin/UrlDetect/blob/master/tool/domainExtract/domainExtract.py