Matching mode
The JDK provides three matching modes: greedy, reluctant and possessive, which correspond to three kinds of quantifiers. Greedy mode is the default. Reluctant mode is indicated by appending a ? to the quantifier. Possessive mode is indicated by appending a + to the quantifier.
What do the three modes mean?
Greedy mode: match as much as possible, as long as the overall expression can still match. The engine backtracks and gives characters back when needed.
Reluctant mode: match as little as possible, as long as the overall expression can still match. The engine takes more characters only when needed.
Possessive mode: match as much as possible and never give anything back. If the rest of the expression cannot match because too much was consumed, there is no backtracking and the match fails.
For example, there is a string as follows:
/m/t/wd/nl/n/p/m/wd/nl/n/p/m/wd/nl/n/p/m/v/n
Expression matching in greedy mode:
/m/t.*/nl/n/p/m
The match result is /m/t/wd/nl/n/p/m/wd/nl/n/p/m/wd/nl/n/p/m
Expression matching in reluctant mode:
/m/t/.*?/nl/n/p/m
The match result is /m/t/wd/nl/n/p/m
Now consider this expression:
/m/t/wdx+?/nl/n/p/m
This one does not match at all. The + quantifier requires at least one occurrence, even in reluctant mode, and the string contains no x after wd, so the match fails.
Expression matching in possessive mode:
/m/t.*+/nl/n/p/m
This cannot match. The possessive .*+ consumes every remaining character and never gives any back, so nothing is left for /nl/n/p/m to match.
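The four expressions above can be tried against the sample string with a short program. This is only a minimal sketch: the class name MatchingModesDemo and the helper firstMatch are made up for illustration, and the helper returns null when Matcher.find() finds nothing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchingModesDemo {
    // The sample string from the text.
    public static final String INPUT =
            "/m/t/wd/nl/n/p/m/wd/nl/n/p/m/wd/nl/n/p/m/v/n";

    // Illustrative helper: first substring matched by regex, or null if none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: .* runs to the end, then backtracks to the last /nl/n/p/m.
        System.out.println(firstMatch("/m/t.*/nl/n/p/m", INPUT));
        // prints /m/t/wd/nl/n/p/m/wd/nl/n/p/m/wd/nl/n/p/m

        // Reluctant: .*? stops at the first /nl/n/p/m.
        System.out.println(firstMatch("/m/t/.*?/nl/n/p/m", INPUT));
        // prints /m/t/wd/nl/n/p/m

        // Reluctant +? still needs at least one x, and there is none.
        System.out.println(firstMatch("/m/t/wdx+?/nl/n/p/m", INPUT));
        // prints null

        // Possessive: .*+ keeps everything it consumed, so the tail cannot match.
        System.out.println(firstMatch("/m/t.*+/nl/n/p/m", INPUT));
        // prints null
    }
}
```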
Note: reluctant and possessive quantifiers only make a difference for quantifiers that can match a variable number of times. For example, X?? matches the character X as few times as possible (it prefers zero), while X? is the default greedy form and matches as many times as possible (it prefers one). By contrast, X{n} must match exactly n times, so appending ? or + to it does not change the result.
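The note can be checked with a few one-line experiments. This is a sketch under assumptions: the class name QuantifierModesDemo and the helper firstMatch are illustrative only, and the inputs XX, X and XXX were chosen just to make the differences visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierModesDemo {
    // Illustrative helper: first substring matched by regex, or null if none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy X? prefers one X; reluctant X?? prefers zero (an empty match).
        System.out.println("[" + firstMatch("X?", "XX") + "]");   // prints [X]
        System.out.println("[" + firstMatch("X??", "XX") + "]");  // prints []

        // Greedy X? gives its X back so the trailing X can match.
        System.out.println(Pattern.matches("X?X", "X"));   // prints true
        // Possessive X?+ keeps the only X, so the trailing X has nothing left.
        System.out.println(Pattern.matches("X?+X", "X"));  // prints false

        // A fixed count leaves no choice, so all three forms agree.
        System.out.println(firstMatch("X{2}", "XXX"));   // prints XX
        System.out.println(firstMatch("X{2}?", "XXX"));  // prints XX
        System.out.println(firstMatch("X{2}+", "XXX"));  // prints XX
    }
}
```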
Lookaround is suitable for this scenario: during matching, you need to know whether a specific expression appears before or after the matched part, without capturing (consuming) that expression.
If you do not use lookaround and put the expression directly into the pattern, the text it matches is inevitably consumed.
For example, suppose I want to split the string ILoveYou into words, where each capital letter starts a new word.
If you use this matching rule:
\p{Upper}\p{Lower}*[\p{Upper}]?
then the uppercase letter matched at the end is consumed as part of the current word. The matching result would be:
IL
You
This does not meet the requirements.
The solution is to use lookaround. The regular expression is:
\p{Upper}?\p{Lower}*(?=[\p{Upper}]?)
The output result is:
I
Love
You
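Both expressions can be compared with a small program. This is a rough sketch: the class name WordSplitDemo and the helper findAll are invented for illustration. One assumption to note: every part of the lookaround expression is optional, so it can also match the empty string at the end of the input, and the helper therefore skips empty matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordSplitDemo {
    // Illustrative helper: collects every non-empty match of regex in input.
    public static List<String> findAll(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Without lookaround: the trailing uppercase letter is consumed.
        System.out.println(
                findAll("\\p{Upper}\\p{Lower}*[\\p{Upper}]?", "ILoveYou"));
        // prints [IL, You]

        // With lookahead: the next uppercase letter is inspected, not consumed.
        System.out.println(
                findAll("\\p{Upper}?\\p{Lower}*(?=[\\p{Upper}]?)", "ILoveYou"));
        // prints [I, Love, You]
    }
}
```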
There are four types of lookaround:
(?=X) means that what follows must match the regular expression X. The X part is not consumed and not captured when the preceding part is matched. Zero-width positive lookahead.
(?<=X) means that what precedes must match the regular expression X. The X part is not consumed and not captured when the following part is matched. Zero-width positive lookbehind.
(?!X) means that what follows must not match the regular expression X. The X part is not consumed and not captured when the preceding part is matched. Zero-width negative lookahead.
(?<!X) means that what precedes must not match the regular expression X. The X part is not consumed and not captured when the following part is matched. Zero-width negative lookbehind.
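The four forms can be seen side by side in a short sketch. The class name LookaroundDemo, the helper findAll and the two sample strings are assumptions chosen for illustration, not part of the original example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Illustrative helper: collects every match of regex in input.
    public static List<String> findAll(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String words = "hello! world";
        String prices = "cost $42 and 17";

        // (?=X): a word that is followed by "!"; the "!" is not in the match.
        System.out.println(findAll("\\w+(?=!)", words));          // prints [hello]

        // (?!X): a whole word that is not followed by "!".
        System.out.println(findAll("\\b\\w+\\b(?!!)", words));    // prints [world]

        // (?<=X): digits preceded by "$"; the "$" is not in the match.
        System.out.println(findAll("(?<=\\$)\\d+", prices));      // prints [42]

        // (?<!X): a number that is not preceded by "$".
        System.out.println(findAll("(?<!\\$)\\b\\d+", prices));   // prints [17]
    }
}
```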
Non-capturing possessive matching
(?>X) The JDK documentation describes this as "X, as an independent, non-capturing group". I have not studied it in detail yet.