Last time, many readers wrote in suggesting regular expressions for filtering text. It's not that I don't want to use them. (I don't use regular expressions very often; as anyone who has seen my earlier crawlers knows, I locate content directly through web page tags with BeautifulSoup, because that is easy to understand and convenient.) The problem is that regular expression syntax is hard to master: anyone who has looked at the syntax table knows that each symbol has its own rules and that they combine very flexibly. Readers who have not been programming for long are likely to lose a lot of time on it. Today I will briefly introduce the regular expressions that are used most often. Unless your case is very unusual, these are the ones you will need.
1. A brief introduction to regular expressions
First you have to import the regular expression module: import re. Regular expressions are a powerful tool for processing strings, with their own independent matching mechanism. They may not be as efficient as str's own methods, but they are very flexible and powerful. The workflow is: first define a matching rule (the content you want plus regular expression syntax), then pass in the string to be searched, and the regular expression engine retrieves the information you want.
2. Several common ways to use findall
The basic structure is roughly: nojoke = re.findall(r'matching rule', 'string to be searched'). nojoke is the result returned by the match, re is the regular expression module, and findall finds every match and returns them as a list. The r prefix marks a raw string, so backslashes in the rule reach the regular expression engine unchanged; it also makes patterns easy to spot when there is a lot of code. Let's look at a few examples to understand more.
A plain rule such as bi finds every occurrence of bi in the searched string and returns them as a list. This is often used to count how many times the same characters occur. On to the next one.
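A minimal sketch of this usage follows. The sample string is an assumption chosen for illustration.

```python
import re

# The sample string is an assumption made for illustration.
text = 'abi cbi dbi gbi'

# findall returns every non-overlapping match as a list.
nojoke = re.findall(r'bi', text)
print(nojoke)       # ['bi', 'bi', 'bi', 'bi']
print(len(nojoke))  # 4, the number of occurrences
```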
Adding the ^ symbol anchors the rule to the start of the string: ^abi only matches if the string starts with abi, and the matching text is returned. You can use this to check whether a string starts with abi.
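As a small sketch with an assumed sample string, the start anchor behaves like this:

```python
import re

# The sample string is an assumption made for illustration.
text = 'abi cbi abi gbi'

# ^ only lets the rule match at the very start of the string.
starts_with_abi = re.findall(r'^abi', text)
starts_with_cbi = re.findall(r'^cbi', text)
print(starts_with_abi)  # ['abi'] (the second abi is not at the start)
print(starts_with_cbi)  # []
```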
The $ symbol anchors the rule to the end of the string: gbi$ only matches if the string ends with gbi, so you can use it to check how a string ends.
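The end anchor can be sketched the same way, again with an assumed sample string:

```python
import re

# The sample string is an assumption made for illustration.
text = 'abi cbi abi gbi'

# $ only lets the rule match at the very end of the string.
ends_with_gbi = re.findall(r'gbi$', text)
ends_with_abi = re.findall(r'abi$', text)
print(ends_with_gbi)  # ['gbi']
print(ends_with_abi)  # []
```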
[...] matches any one of the characters inside the brackets. With a, b, and c in the brackets followed by f, the rule matches af, bf, or cf, and the matches are returned as a list.
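A short sketch of a character set follows. The pattern [abc]f and the sample string are assumptions that fit the description; the original example may have been written differently.

```python
import re

# The sample string and the exact pattern are assumptions for illustration.
text = 'af bf cf df'

# [abc] matches exactly one character: a, b, or c. It must be followed by f.
matches = re.findall(r'[abc]f', text)
print(matches)  # ['af', 'bf', 'cf'] (df is skipped because d is not in the set)
```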
"\d" is a regular syntax rule used to match numbers between 0 and 9 and return a list. It should be noted that 11 will be treated as the strings '1' and ' 1' returns instead of returning the string '11'. Remember to use it incorrectly and there will be a big pitfall.
The solution, of course, is to write as many \d as the number of digits you want; \d\d\d, for example, gets 3 consecutive digits from a string. This is one place where regular expressions show their flexibility.
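Here is a sketch of the three-digit case with an assumed sample string. The \d{3} form is standard regular expression shorthand for the same rule; it is added here for comparison and is not part of the original example.

```python
import re

# The sample string is an assumption made for illustration.
text = 'id 123 and 456'

# Three \d in a row match three consecutive digits.
three_digits = re.findall(r'\d\d\d', text)
print(three_digits)  # ['123', '456']

# \d{3} is shorthand for the same rule.
same_result = re.findall(r'\d{3}', text)
print(same_result)   # ['123', '456']
```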
Lowercase \d matches the digits 0-9, while uppercase \D matches everything that is not a digit, so the non-digit content is returned.
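The contrast between the two can be sketched as follows, with an assumed sample string:

```python
import re

# The sample string is an assumption made for illustration.
text = 'a1b22c'

only_digits = re.findall(r'\d', text)
non_digits = re.findall(r'\D', text)
print(only_digits)  # ['1', '2', '2']
print(non_digits)   # ['a', 'b', 'c']
```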
"\w" in the regular expression represents matching from lowercase a to z, uppercase A to Z, and the numbers 0 to 9 include the first three types, as printed above .
"\W" in regular expressions means matching special symbols other than letters and numbers, but when using \slashes here, you should pay attention to the fact that \ is escaped in the string Please go to Baidu to learn the specific symbols.
Parentheses () mean that only the content matched inside them is returned. .* is the greedy matching syntax: it matches as much as possible, so the match covers the widest range that still satisfies the rule.
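Greedy matching can be sketched like this. The HTML-like sample string and the p tags are assumptions chosen for illustration, not the original example.

```python
import re

# The sample string is an assumption made for illustration.
text = '<p>one</p><p>two</p>'

# (.*) is greedy: it runs from the first <p> to the LAST </p>,
# and only the part inside the parentheses is returned.
greedy = re.findall(r'<p>(.*)</p>', text)
print(greedy)  # ['one</p><p>two']
```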
Adding a question mark, .*?, stops it from matching the widest possible range; this is called non-greedy matching. Each match now ends at the first place the rule can be satisfied, so the contents of the two p's are matched and returned separately.
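The non-greedy version can be sketched on an assumed HTML-like sample string (the p tags are an illustration, not the original example):

```python
import re

# The sample string is an assumption made for illustration.
text = '<p>one</p><p>two</p>'

# (.*?) is non-greedy: each match stops at the nearest </p>.
non_greedy = re.findall(r'<p>(.*?)</p>', text)
print(non_greedy)  # ['one', 'two']
```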
Adding re.I (capital i) makes the match ignore upper and lower case. Without it, the rule only matches text with exactly the same case, and if nothing is found an empty list is returned.
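A sketch of the difference, using an assumed sample string:

```python
import re

# The sample string is an assumption made for illustration.
text = 'Python PYTHON python'

exact_case = re.findall(r'python', text)         # case-sensitive by default
any_case = re.findall(r'python', text, re.I)     # re.I ignores case
not_found = re.findall(r'python', 'PYTHON')      # no match without re.I

print(exact_case)  # ['python']
print(any_case)    # ['Python', 'PYTHON', 'python']
print(not_found)   # []
```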
The trouble here is \n, the line break. By default . does not match a line break, so a rule such as .* stops at the end of the line. Adding re.S (capital S) makes . match every character, including line breaks. Once you have learned the syntax and usage above, you can handle more than 70% of matching tasks. There are many more features that I won't list here; you can learn them yourself (I rarely use the rest).
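The effect of re.S can be sketched as follows. The sample string with p tags is an assumption chosen for illustration.

```python
import re

# The sample string is an assumption; note the line break in the middle.
text = '<p>line one\nline two</p>'

# Without re.S, . cannot cross the line break, so nothing matches.
without_flag = re.findall(r'<p>(.*)</p>', text)
# With re.S, . also matches \n.
with_flag = re.findall(r'<p>(.*)</p>', text, re.S)

print(without_flag)  # []
print(with_flag)     # ['line one\nline two']
```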
2. match and search: usage and differences
re.match tries to match a pattern at the starting position of the string. If the pattern matches there, a match object is returned; if it does not match at the starting position, match() returns None. re.search scans the entire string and returns the first successful match.
Appending .span() to a match result returns the position of the matched text as a tuple (start position, end position). Leave .span() off a call that finds nothing: it returns None, and calling .span() on None raises an error.
Is it clear at a glance? match only checks the beginning of the string and returns None if the pattern is not there. The same goes for .group(): do not call it on a None result, or an error is raised. search is not picky and scans the entire string. Of course, you can combine both with the regular expression syntax described earlier; I won't go into more detail here, so try it yourself.
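The difference between the two functions can be sketched like this. The sample string and variable names are assumptions made for illustration.

```python
import re

# The sample string is an assumption made for illustration.
text = 'hello world'

at_start = re.match(r'hello', text)      # pattern is at the start: match object
not_at_start = re.match(r'world', text)  # pattern is not at the start: None
found = re.search(r'world', text)        # search scans the whole string

print(at_start.span())  # (0, 5)
print(not_at_start)     # None
print(found.span())     # (6, 11)
print(found.group())    # world

# Check for None before calling .span() or .group() to avoid an AttributeError.
if not_at_start is None:
    print('no match at the start of the string')
```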
3. How to replace with sub
The basic structure is roughly: re.sub(r'matching rule', 'replacement string', the string to be searched)
The result is very intuitive: the # sign and the string that follows it are replaced with the string you supply.
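A minimal sketch of such a replacement follows. The sample string, the rule #.* and the replacement text are assumptions chosen to fit the description.

```python
import re

# The sample string and replacement are assumptions made for illustration.
text = 'hello #world'

# #.* matches the # sign and everything after it on the line.
replaced = re.sub(r'#.*', 'python', text)
print(replaced)  # hello python
```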
4. A final bonus
Before the final bonus, I hope everyone will practice the usage and rules above. Only by making mistakes and summarizing them do you build up experience. The final bonus is a set of commonly used email matching rules.
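As one illustration of an email matching rule, here is a deliberately simplified pattern. It is an assumption made for demonstration, not necessarily the rule the author had in mind, and it does not cover every valid address.

```python
import re

# Simplified, assumed pattern: local part, @, then a domain with at least one dot.
EMAIL_RULE = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'

# The sample string is an assumption made for illustration.
text = 'contact: alice@example.com, bob.smith@mail.example.org; bad@ address'

emails = re.findall(EMAIL_RULE, text)
print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
```

The group in the rule is written as (?:...) so that findall returns the whole address rather than only the content of the parentheses.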