The example of this article analyzes the definition and usage of backtracking in regular expressions. I would like to share it with you for your reference. The details are as follows:
This is my first contact with "backtracking" and I don't know much about it. Now I will record what I understand as a mental virtue for future reference.
The matching basis of the regular expressions we use is roughly divided into: giving priority to the leftmost (most beginning) matching results and standard matching quantifiers (*, +, ? and {m, n}) Matches take precedence.
"Prefer the leftmost match", as the name suggests, is to match from the beginning of the string until the end of the match. This is the basis; "standard matching quantifier" is divided into "non-deterministic finite automaton (NFA)" )" can also be called "expression-led"; the other is "deterministic finite automaton (DFA)" which can also be called "text-led". The regular expressions we currently use in JavaScript are "expression-led". Expression dominance and text dominance are a bit troublesome to explain. It may be clearer if we look at an example first.
// 使用正则表达式匹配文本 var reg = /to(nite|knight|night)/; var str = 'doing tonight'; reg.test(str);
In the above example, the first element [t], it will try repeatedly until 't' is found in the target string . After that, it checks whether the following character can be matched by [o], and if so, it checks the following element (nite|knight|night). Its real meaning is "nite" or "knight" or "night". The engine will try these three possibilities in sequence. The process of trying [nite] is to try [n] first, then [i], then [t], and finally [e]. If this attempt fails, the engine tries another possibility, and so on, until a match is successful or a failure is reported. Control in an expression is transferred between different elements, hence the name "expression dominance".
The same is the above example. When scanning a string, "text-led" will record all currently valid matching options. When the engine moves to t, it adds a potential possibility to the match possibilities it is currently processing:
Each character scanned next updates the current possibility Match sequence. After continuing to scan two characters, the situation is:
The valid possible matches become two (knight is eliminated). When g is scanned, there is only one possible match left. When h and t are matched, the engine finds that the match has been completed and reports success. "Text-dominated" because every character in the string it scans controls the engine.
If you want to understand how "expression leadership" works, then you need to take a look at our topic today "backtracking". Backtracking is like taking a fork in the road. When you encounter a fork in the road, first make a mark at each intersection. If you hit a dead end, you can go back the same way you came until you come across a sign you made before, marking a road you haven't tried yet. If you can't go that way, you can continue to go back, find the next mark, and repeat until you find a way out, or until you complete all the untried roads.
In many cases, the regex engine must choose between two (or more) options. When encountering /…x?…/, the engine must try to match X or not. For the case of /... After the first X matches, this requirement has been met and a decision needs to be made whether to try the next X. If you decide to proceed, you also need to decide whether to match the third X, the fourth X, and so on. Every time you choose, you actually make a mark to remind you that there is another possible choice here, which is reserved for future use. There are two important points to consider during the backtracking process: Which branch should be chosen first? Which (or which) previously saved branches are used when backtracking?
The first question is chosen according to the following important principle:
If you need to choose between "make an attempt" and "pass through attempt", for the matching priority quantifier, the engine will "Make an attempt" is chosen in preference, whereas for ignoring the precedence quantifier "pass-by attempt" is chosen.
The second question is based on the following principle:
The most recently stored option is the one returned when local failure forces backtracking. The principle used is LIFO (last in first out).
Let’s first look at a few examples of marking roads:
1. Matching without backtracking
Use [ab?c] to match "abc" . [a]After matching, the current status of the match is as follows:
现在轮到[b?]了,正则引擎需要决定:是需要尝试[b]呢,还是跳过?因为[?]是匹配优先的,它会尝试匹配。但是,为了确保在这个尝试最终失败之后能够恢复,引擎会把:
添加到备用状态序列中。也就是说,稍后引擎可能从下面的位置继续匹配:从正则表达式中的[b?]之后,字符串的c之前(也就是说当前的位置)匹配。这实际上就是跳过[b]的匹配,而问题容许这样做。引擎做好标记后,就会继续向前检查[b]。在示例中,它能够匹配,所以新的当前状态变为:
最终的[c]也能成功匹配,所以整个匹配完成。备用状态不再需要了,所以不再保存它们。
2、进行了回溯的匹配
下面要匹配的文本是“ac”,在尝试[b]之前,一切都与之前的过程相同。显然,这次[b]无法匹配。也就是说,对[……?]进行尝试的路走不通了。因为有一个备用状态,这个“局部匹配失败”产工会导致整体匹配失败。引擎会进行回溯,也就是说,把“当前状态”切换为最近保存的状态。
在[b]尝试之前保存的尚未尝试的选项。这时候,[c]可以匹配c,所以整个匹配宣告完成。
3、不成功的匹配
现在要匹配的文本是“abx”。在尝试[b]以前,因为存在问号,保存了这个备用状态:
[b]能够匹配,但这条路往下却走不通了,因为[c]无法匹配x。于是引擎会回溯到之前的状态,“交还”b给[c]来匹配。显然,这次测试也失败了。如果还有其他保存的状态,回溯会继续进行,但是此时不存在其他状态,在字符串中当前位置开始的整个匹配也就宣告失败。
例子1: 提取字符串 提取 da12bka3434bdca4343bdca234bm 提取包含在字符a和b之间的数字,但是这个a之前的字符不能是c,b后面的字符必须是d才能提取。
例如这里就只有3434这个数字满足要求。那么我们怎么提取呢?
首先我们写出提取这个字符串的表达式: (?
Java代码片段如下:
Pattern p = Pattern.compile( "(?<!c)a(//d+)bd " ); Matcher m = p.matcher( "da12bka3434bdca4343bdca234bm" ); while (m.find()){ System.out.println(m.group( 1 )); //我们只要捕获组1的数字即可。结果 3434 System.out.println(m.group(0)); // 0组是整个表达式,看这里,并没有提炼出(?<!c)的字符 。结果 a3434bd }
例子2: 将一些多位的小数截短到三位小数:\d+\.\d\d[1-9]?\d+
在这种条件下 6.625 能进行匹配,这样做没有必要,因为它本身就是三位小数。最后一个“5”本来是给 [1-9] 匹配的,但是后面还有一个 \d+ 所以,[1-9] 由于是“?”可以不匹配所以只能放弃当前的匹配,将这个“5”送给 \d+ 去匹配,如果改为:
\d+\.\d\d[1-9]?+\d+
的侵占形式,在“5”匹配到 [1-9] 时,由于是侵占式的,所以不会进行回溯,后面的 \d+ 就匹配不到任东西了,所以导致 6.625 匹配失败。
这种情况,在替换时就有效了,比如把数字截短到小数点后三位,如果正好是三位小数的,就可以不用替换了,可以提高效率,侵占量词基本上就是用来提高匹配效率的。
把 \d+\.\d\d[1-9]?+\d+ 改为 \d+\.\d\d(?>[1-9]?)\d+ 这样是一样的。
希望本文所述对大家JavaScript程序设计有所帮助。
更多Backtracking definition and usage analysis in regular expressions [JS and java implementation]相关文章请关注PHP中文网!