A regular expression is a pattern language for string operations, and an important but complex tool for processing text data. So how can you master regular expressions quickly? This article recommends one learning method: learning through the AST. I hope it helps!
Regular expressions are mainly used to process strings. They make matching, extraction, replacement, and similar operations very convenient.
However, regular expressions are still somewhat difficult to learn. Concepts such as greedy matching, non-greedy matching, capturing subgroups, and non-capturing subgroups are hard not only for beginners but also for many people who have worked for several years.
So how can you learn regular expressions better and master them quickly?
I recommend a way to learn regular expressions that I think works very well: learn through the AST.
Regular expression matching works by parsing the pattern string into an AST and then using that AST to match the target string.
All the information in the pattern string is preserved in the AST after parsing. AST stands for abstract syntax tree: as the name suggests, it is a tree organized according to the grammatical structure. From the structure of the AST, you can easily see which syntax regular expressions support.
How to view the AST of a regular expression?
You can view it visually on the website astexplorer.net:
Switch the parse language to RegExp, and you can visualize the AST of a regular expression.
As mentioned before, AST is a tree organized according to grammar, so various grammars can be easily sorted out from its structure.
Then let’s learn various syntaxes from the perspective of AST:
Let’s start with a simple one. The regular expression /abc/ matches the string 'abc', and its AST is as follows:
There are three Char nodes with the values a, b, and c, each classified as simple. Matching then traverses the AST and matches these three characters one by one.
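To make the idea of "traverse the AST and match the characters" concrete, here is a minimal sketch. The node shape is hand-written and simplified for illustration (it is not the exact output of any parser), and the names abcAst, matchAt, and search are hypothetical.

```javascript
// Hand-written, simplified AST for /abc/ (hypothetical node shape, not real parser output).
const abcAst = {
  type: 'RegExp',
  body: [
    { type: 'Char', value: 'a' },
    { type: 'Char', value: 'b' },
    { type: 'Char', value: 'c' },
  ],
};

// Check whether every Char node matches, in order, starting at position `start`.
function matchAt(ast, input, start) {
  for (let i = 0; i < ast.body.length; i++) {
    if (input[start + i] !== ast.body[i].value) return false;
  }
  return true;
}

// Scan the input from left to right; return the first matching index, or -1.
function search(ast, input) {
  for (let start = 0; start + ast.body.length <= input.length; start++) {
    if (matchAt(ast, input, start)) return start;
  }
  return -1;
}

console.log(search(abcAst, 'xxabcxx')); // 2
console.log(search(abcAst, 'xxabxcx')); // -1
```

A real engine handles far more node types, but the principle of walking the tree and consuming input is the same.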
Let’s test it with the exec API:
The 0th element is the matched string, index is the position where the match starts, and input is the input string.
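A minimal sketch of such a test, with the input string 'xxabcxx' chosen as an assumption for illustration:

```javascript
// exec returns an array-like result with extra index and input properties.
const abcResult = /abc/.exec('xxabcxx');

console.log(abcResult[0]);    // 'abc'     (the matched string)
console.log(abcResult.index); // 2         (where the match starts)
console.log(abcResult.input); // 'xxabcxx' (the input string)
```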
Let’s try special characters next:
/\d\d\d/ matches three digits. \d is a metacharacter (meta char), a character with special meaning supported by regular expressions.
We can also see from the AST that although these are also Char nodes, they are classified as meta:
\d matches any digit:
Which characters are meta chars and which are simple chars can be seen at a glance in the AST.
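As a quick check of the \d metacharacter, here is a small sketch. The input strings are assumptions chosen for illustration.

```javascript
// \d matches any single digit, so /\d\d\d/ needs three digits in a row.
const digitsResult = /\d\d\d/.exec('abc123def');
console.log(digitsResult[0]);    // '123'
console.log(digitsResult.index); // 3

// Only two consecutive digits here, so there is no match and exec returns null.
const noDigitsResult = /\d\d\d/.exec('abc12def');
console.log(noDigitsResult);     // null
```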
Regular expressions support specifying a set of characters with [], which matches any one of the characters in the set.
We can also see from the AST that the characters are wrapped in a CharacterClass node, meaning a character class: it can match any one character it contains.
Testing confirms this:
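A minimal sketch of such a test, using /[abc]/ and input strings that are assumptions chosen for illustration:

```javascript
// A character class matches exactly one character out of the listed set.
const classResult = /[abc]/.exec('xbz');
console.log(classResult[0]);    // 'b'
console.log(classResult.index); // 1

// None of a, b, c appears in the input, so exec returns null.
const classMiss = /[abc]/.exec('xyz');
console.log(classMiss);         // null
```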
Regular expressions support specifying how many times a character is repeated, using the form {from,to}.
For example, /b{1,3}/ means the character b is repeated 1 to 3 times, and /[abc]{1,3}/ means the character class a/b/c is repeated 1 to 3 times.
As the AST shows, this syntax is called Repetition:
It has a quantifier attribute that represents the quantifier; here the quantifier is a range, from 1 to 3.
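The two patterns above can be tried out as follows. This is a small sketch; the input strings are assumptions chosen for illustration.

```javascript
// b{1,3} takes at most three b's, even though the input contains five.
const rangeResult = /b{1,3}/.exec('abbbbbc');
console.log(rangeResult[0]);      // 'bbb'
console.log(rangeResult.index);   // 1

// The range quantifier applies to the whole character class.
const classRangeResult = /[abc]{1,3}/.exec('xxcabbxx');
console.log(classRangeResult[0]); // 'cab'
```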
Regular expressions also support shorthand for some quantifiers: + for one or more times, * for zero or more times, and ? for zero or one time.
These are different kinds of quantifiers:
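The three shorthand quantifiers can be compared side by side. This is a minimal sketch with patterns and inputs chosen as assumptions for illustration.

```javascript
// + means one or more: at least one b is required after a.
const plusResult = /ab+/.exec('abbbc');
console.log(plusResult[0]);     // 'abbb'
console.log(/ab+/.exec('ac'));  // null (no b after a)

// * means zero or more: the b is optional and may repeat.
const starResult = /ab*/.exec('ac');
console.log(starResult[0]);     // 'a'

// ? means zero or one: at most a single b is consumed.
const optionalResult = /ab?/.exec('abbbc');
console.log(optionalResult[0]); // 'ab'
```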
Some readers may ask: what does the greedy attribute here mean?
greedy indicates whether this Repetition uses greedy matching or non-greedy matching.
If you add a ? after the quantifier, you will find that greedy becomes false, which means switching to non-greedy matching:
So what do greedy and non-greedy mean?
Let’s look at an example.
By default, Repetition matching is greedy: it keeps matching as long as the condition is met, so acbac can be matched here.
Add a ? after the quantifier to switch to non-greedy, and only the first character is matched:
That is the difference between greedy and non-greedy matching. From the AST we can clearly see that greedy and non-greedy apply to the repetition syntax. The default is greedy matching; adding a ? after the quantifier switches to non-greedy.
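Here is a small sketch of the greedy and non-greedy behavior. The pattern /[abc]+/ and the input 'acbacxxx' are assumptions chosen so that the greedy result is acbac, as in the text above.

```javascript
// Greedy (default): + consumes as many a/b/c characters as it can.
const greedyResult = /[abc]+/.exec('acbacxxx');
console.log(greedyResult[0]); // 'acbac'

// Non-greedy: +? stops as soon as the minimum (one character) is matched.
const lazyResult = /[abc]+?/.exec('acbacxxx');
console.log(lazyResult[0]);   // 'a'

// The same switch works for range quantifiers.
const lazyRangeResult = /[abc]{1,3}?/.exec('acbacxxx');
console.log(lazyRangeResult[0]); // 'a'
```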
Regular expressions support extracting part of the matched string as a subgroup using ().
Let’s look at the AST:
The corresponding AST node is called Group.
You will also find that it has a capturing attribute, which is true by default:
What does this mean?
This is the syntax for subgroup capture.
If you don’t want to capture the subgroup, you can write (?:aaa) instead.
Now capturing becomes false.
What is the difference between capturing and non-capturing?
Let’s try:
So the capturing attribute of a Group indicates whether its content is extracted.
We can see from the AST that capturing applies to subgroups. The default is capturing, which means the content of the subgroup is extracted. You can switch to non-capturing with ?:, and then the content of the subgroup is not extracted.
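The difference shows up directly in the exec result. This is a minimal sketch; the patterns /(aaa)bbb/ and /(?:aaa)bbb/ and the input 'aaabbb' are assumptions chosen for illustration.

```javascript
// Capturing group: the subgroup content is extracted as element 1.
const capturingResult = /(aaa)bbb/.exec('aaabbb');
console.log(Array.from(capturingResult));    // [ 'aaabbb', 'aaa' ]

// Non-capturing group: it still has to match, but nothing is extracted.
const nonCapturingResult = /(?:aaa)bbb/.exec('aaabbb');
console.log(Array.from(nonCapturingResult)); // [ 'aaabbb' ]
```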
Now that we are familiar with using the AST to understand regular expression syntax, let’s look at something a bit more difficult:
Regular expressions support the syntax (?=xxx) to express a lookahead assertion, which checks whether a string is followed by a certain string.
You can see in the AST that this syntax is called Assertion, and the assertion is a lookahead, that is, looking ahead: it matches only when the text that follows fits.
What does this mean? Why write it this way? How does it differ from /bbb(ccc)/ and /bbb(?:ccc)/?
Let’s try:
The results show:
/bbb(ccc)/ matches ccc as a subgroup and extracts it, because subgroups are captured by default.
/bbb(?:ccc)/ matches ccc as a subgroup but does not extract it, because ?: sets the subgroup to non-capturing.
/bbb(?=ccc)/ also does not extract ccc, so it is non-capturing as well. The difference from ?: is that ccc does not appear in the match result at all.
This is the nature of a lookahead assertion: it asserts that a string is followed by a certain string, the corresponding subgroup is not captured, and the asserted string does not appear in the match result.
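The three patterns can be compared on the same input. This is a minimal sketch; the input 'bbbccc' is an assumption chosen for illustration.

```javascript
const lookaheadInput = 'bbbccc';

// Capturing group: ccc is part of the match and is also extracted.
const withCapture = /bbb(ccc)/.exec(lookaheadInput);
console.log(Array.from(withCapture));    // [ 'bbbccc', 'ccc' ]

// Non-capturing group: ccc is part of the match but is not extracted.
const withoutCapture = /bbb(?:ccc)/.exec(lookaheadInput);
console.log(Array.from(withoutCapture)); // [ 'bbbccc' ]

// Lookahead: ccc must follow, but it is neither matched nor extracted.
const withLookahead = /bbb(?=ccc)/.exec(lookaheadInput);
console.log(Array.from(withLookahead));  // [ 'bbb' ]
```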
If it is not followed by that string, it will not match:
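A small sketch of the failing case, with the input 'bbbddd' chosen as an assumption for illustration:

```javascript
// bbb is present, but it is followed by ddd rather than ccc, so the assertion fails.
const lookaheadMiss = /bbb(?=ccc)/.exec('bbbddd');
console.log(lookaheadMiss); // null
```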
After changing ?= to ?!, the meaning changes. Take a look at the AST:
It is still a lookahead assertion, but with an additional negative attribute set to true.
The meaning is clear. Originally it asserted that what follows is a certain string; negated, it asserts that what follows is not that string.
The matching result is exactly the opposite:
Now it matches only if what follows is not that string. This is a negative lookahead assertion.
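The reversed behavior can be checked with a short sketch. The pattern /bbb(?!ccc)/ follows the earlier examples, and the input strings are assumptions chosen for illustration.

```javascript
// bbb is followed by ccc, which the negative lookahead forbids.
const negativeMiss = /bbb(?!ccc)/.exec('bbbccc');
console.log(negativeMiss);            // null

// bbb is followed by something other than ccc, so it matches.
const negativeHit = /bbb(?!ccc)/.exec('bbbddd');
console.log(Array.from(negativeHit)); // [ 'bbb' ]
```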
Where there is a lookahead assertion there is naturally also a lookbehind assertion, which matches only if the string is preceded by a certain string.
Similarly, it can also be negated. The lookbehind syntax is (?<=xxx), and its negated form is (?<!xxx). The AST for both forms can be inspected in the same way.
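Lookbehind assertions can be tried out the same way as lookahead. This is a minimal sketch; the patterns /(?<=aaa)bbb/ and /(?<!aaa)bbb/ and the input strings are assumptions chosen for illustration, and it assumes a JavaScript engine that supports lookbehind.

```javascript
// Lookbehind: bbb matches only when it is preceded by aaa; aaa is not in the result.
const lookbehindHit = /(?<=aaa)bbb/.exec('aaabbb');
console.log(Array.from(lookbehindHit));   // [ 'bbb' ]
console.log(lookbehindHit.index);         // 3
console.log(/(?<=aaa)bbb/.exec('cccbbb')); // null

// Negative lookbehind: bbb matches only when it is NOT preceded by aaa.
const negLookbehindHit = /(?<!aaa)bbb/.exec('cccbbb');
console.log(Array.from(negLookbehindHit)); // [ 'bbb' ]
console.log(/(?<!aaa)bbb/.exec('aaabbb')); // null
```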
Lookahead and lookbehind assertions are the hardest regular expression syntax to understand. Isn't it much easier to understand when you learn it through the AST?
Summary
Regular expressions are a very convenient tool for processing strings, but they are still somewhat difficult to learn. Many people are confused by syntax such as greedy matching, non-greedy matching, capturing subgroups, non-capturing subgroups, lookahead assertions, and lookbehind assertions.
I recommend learning regular expressions through the AST. The AST is an object tree organized according to the syntax structure, and the names and attributes of the AST nodes make each piece of syntax easy to understand.
For example, through the AST we have clarified:
Repetition syntax (Repetition) is a character followed by a quantifier. The default is greedy matching (greedy is true), which means matching continues until no more can be matched. Add a ? after the quantifier to switch to non-greedy matching, which matches as few repetitions as possible.
Subgroup syntax (Group) is used to extract part of a string. The default is capturing (capturing is true), which means the content is extracted. You can switch to non-capturing with (?:xxx), which only matches without extracting.
Assertion syntax (Assertion) asserts that a certain string appears before or after the current position. It is divided into lookahead and lookbehind assertions, with the syntax (?=xxx) and (?<=xxx), and the negated forms (?!xxx) and (?<!xxx).
Who understands the syntax more deeply, the various documents or the compiler? No need to ask: it must be the compiler! So learning syntax through the syntax tree that the compiler parses is naturally better than learning it from documentation. This holds for regular expressions, and for learning other syntax as well. If you can learn the syntax from the AST, you don’t need to read the documentation.