Writing about the understanding of simple matching principles, I still focus on PHP.
First of all, regular engines can be mainly divided into two categories: DFA and NFA. Anyway, it is not surprising if you have seen many engines. A simple understanding is that there are different matching methods, just like when looking for data in an array, some start from the beginning. , search, some start from the middle, and use different methods. Relatively speaking, NFA has a longer history, and there are more tools or languages using NFA, but there are also mixed uses of the two engines. The example given in a certain book is very appropriate: NFA is like a gasoline engine, and DFA is like an electric motor. They can both make the car run but use different mechanisms. Since both NFA and DFA have been developed for many years, a standard called POSIX has been introduced, which stipulates the metacharacters and features that a regular engine should support, as well as the accurate results that end users want to get.
NFA matches based on (regular) expressions, while DFA matches based on the string to be matched. PHP uses the traditional NFA engine, and of course Perl does too. Regardless of the engine, there are two general principles: 1. The leftmost result is matched first; 2. Standard quantifiers ( , ?, *, {m,n}) are matched first.
The existing expression is as follows, to match the string 'tonight'
'/to(nite|knite|night)/'
NFA: Expression-led , starting from the first part of the expression, and checking whether the current text matches. If so, continue to the next part of the expression until all expressions match. , the entire match is successful. The first character of the expression is t. It will search repeatedly from left to right in the string until it finds a t character. If it cannot find it, it will fail. If it is found, the next character of the expression is o. Continue to search in the string to be matched, the first two can be matched, enter a selection branch grouped by brackets, match nite, knite or night, it tries these three possibilities in turn. The first branch fails when it tries to nig, the second branch fails when it tries to the first n in the expression, and the third branch happens to match exactly. For an engine dominated by regular expressions, the expression must be checked before the final conclusion can be drawn.
DFA: Unlike NFA, when scanning a string, DFA will record all currently valid matching possibilities. Starting from the very beginning t, it will add a possibility to the current matching possibilities. If it exists, it will keep scanning. When it reaches n in the string, it will record two possibilities, nite and night (it looks at the expression from the perspective of the string to be matched), continue scanning to i or nite and night, and then The only possibility until g is night. When h and t are matched, the engine finds that the scan string has been scanned and reports success (it seems to have a bit of depth and breadth).
Therefore, in general, the text-led DFA engine is faster, and the expression-led NFA must detect all patterns. Before reaching the end of the pattern, it is not known whether the matching is successful or not, even if a previous expression is If the formula matching is successful, you may continue to check it later, which may waste a lot of time. DFA is dominated by strings. It can record at most several possibilities in one place, and the characters in the target text will only be detected once at most.
But the order of the multi-select branches also has a great impact on different target strings. If the branch that happens to match is at the front, it can of course be found faster.
Since NFA is mainly expressions, the difference in expression writing will have a great impact, allowing us to control it more flexibly and with more variability. Among them, for NFA (originally also based on PHP), there is an important feature: Backtracking---When encountering the need to choose from two possible successful matches, choose one first , and remember another one for possible needs later. This situation mainly occurs with standard quantifiers and multiple-choice branches (|).
Stolen picture:
Starting from the expression ('/".*"!/'), first find the double quotation mark A, and then because the period matches any character (the default does not include line breaks, there is no need to consider it here), plus the element The character * means that it can match multiple. Due to the matching priority mechanism of the standard quantifier, it suddenly comes to the end of the string B, because there can be 0, 1 or more * in between, that is to say, this Several forms are likely to match successfully. Therefore, the engine will remember these two states, that is, it may match at a position or it may not match it. As long as it is a place where the * metacharacter passes, it will be recorded from M to the end. .
等到结尾发现没有",于是回溯,需要说明,引擎总是回溯到最近记录的状态(类似栈),一个个往前回溯,直到遇到一个双引号的地方(C),然后匹配匹配到双引号后面(D),发现不是感叹号!,失败,再次回溯(状态记录不为空),到E处又找到了一个双引号,与刚才情况相同,继续匹配(F)发现没有感叹号,失败,继续回溯到G,同样由于后边(H)不是感叹号,仍需要回溯,到I,这时记录的状态已经没有了,也无法继续回溯了,第一轮匹配失败告终,但还没完,引擎的传动装置继续从A处双引号的下一个位置开始继续寻找第一个符合条件的双引号,到J,然后类似上一轮的过程继续上演。最终也没找到 "..."! 这样的字符串,过程却很曲折。
从上面的例子可以看出:一是 .* 这种形式的效率非常低,尤其在失败的时候(当然平常我们那几行代码几乎忽略不计),而且很容易出错,比如用 /".*/" 匹配一对双引号包围的字符串来匹配 ab"cde"fgh""ijk"lmn,最终的结果是"cde"fgh""ijk",最开始的双引号和最末尾的双引号中间的全部内容;二是如果有类似 ((...)*)*、((...)+)*之类的,外面里面同时记录不定状态的,回溯次数呈指数级上升,甚至形成死循环,更加耗时。当然对于改进状态的引擎提前检测这种状况,并报告错误,如同浏览器在自己跳转自己的页面那样报错。
因此在用到量词时,比如+,它可以匹配一个到多个,大于一个时,不是必须的,有两种选择状态,可以有也可以没有,这两种状态就是日后可能会在回溯中用到的状态,称为备用状态,?也是如此。
对于匹配双引号及之间的字符,中间不包括转义后的双引号的情况,我们可以使用忽略优先,'/".*?"/' ,忽略优先按量词的最小限度匹配, *最小是没有,相当于先检测空串,没有先匹配一个字符马上检测双引号,这在一定只要检测到右边有一个双引号匹配即成功结束,它匹配上图中的 "McDonald's"也省去了很多回溯。
当然,对于这种需要检测两个字符间的其他字符,还有一种办法,如 '/"[^"]"/',排除型字符组,它也达到了相同的需求,但情况不总是这样。
比如,匹配与之间的字符,使用 '/[^<\/b>]*<\/b>/' ?注意字符组一次只能匹配一个字符,这里是匹配在和之间的,非<、/、b、>的任意一个字符,字符组不能把里边的字符当一个整体,对于整体、多个字符的检测,可以选择环视。环视与位置相关,生来就是限定周围的字符,一个或多个均可。这里需要否定型环视,因为我们需要的不能是的字符,为了好看中间隔开。
'/<b>( (?!<\/b>) . )*<\/b>/' <span>//</span><span> 否定型环视</span>
但是上面仍能匹配"abcdef",所以还要在其之间排除,中间环视检测?b>,/可以有或没有都行
'/<b> ( (?! <\/? b >) . )* <\/b>/'
上一篇写到了“交还”,在回溯过程中就伴随着交还,如上面的'/".*"!/',因发现双引号后面不是感叹号,而不得不回溯,此时选择另一种备用状态---不匹配刚才匹配到的字符,进行回退,是一个明显的交还。例子:
<span>$pattern_1</span> = '/(\w+)(\d?)/'<span>; </span><span>$pattern_2</span> = '/(\w+)(\d+)/'<span>; </span><span>$pattern_3</span> = '/(\w+)(\d*)/'<span>; </span><span>$subject</span> = 'abc12345'<span>; </span><span>preg_match</span>(<span>$pattern_1</span>, <span>$subject</span>, <span>$match_1</span><span>); </span><span>preg_match</span>(<span>$pattern_2</span>, <span>$subject</span>, <span>$match_2</span><span>); </span><span>preg_match</span>(<span>$pattern_3</span>, <span>$subject</span>, <span>$match_3</span><span>); </span><span>echo</span> 'match=><pre class="brush:php;toolbar:false">'<span>; </span><span>var_dump</span>(<span>$match_1</span>, <span>$match_2</span>, <span>$match_3</span>);
使用捕获型括号,分组,引擎记住两个括号中匹配的文本。结果如下:
从上到下依次为$match_1、$match_2、$match_3,由于\w与\d有公共部分,而且两个量词都是匹配优先,从结果来看,前一个+量词匹配得最多(键值为1的项),pattern_1中,\d?没有匹配,pattern_2中,\d+只匹配了一个,pattern_3中,\d*没有匹配,而它们讷的过程都类似于,先让\w+匹配到末尾,然后引擎看剩余的模式,\d?可有可无,那就无,因为无正则匹配也是成功,不交还。\d+必须匹配一个,否则将导致匹配失败,这里它会交还一个,因为它必须服从使得全局匹配的成功。对于\d*也是如此,可以不匹配,不交还。
假如这里的pattern是'/(\w+)(\d\d)/',那么它就必须交还两个数字,如果没有交还或是带匹配字符串中没有,就只能报告失败。所以有两个规律:
1. 先来先服务原则,匹配优先的量词在前,先尽量满足自己;
2. 必须服从全局匹配结果,有造成失败的可能,引擎会强迫匹配优先量词交还,否则整个匹配失败。
如果不交还呢?就会提前报告失败。必须谈到占有优先量词和固化分组,以+为例形式是 \w++、 (?>\w+)
<span>$pattern</span> = '/\w+:/'<span>; </span><span>$subject</span> = 'abcdefghijk';
如例,我们人当然能一眼看出来,字符串结尾没有冒号肯定失败,但是程序不会这样,它会一轮又一轮的匹配-回溯,因为它记住了一些可选择性的状态,但现在我们已经明确知道了这些状态是没用的,回溯也是白费力气,回溯前就已经失败了。所以可以 \w++: 或者 (?>\w+): ,有了占有优先或者固化分组,这些可选择性的状态都会被抛弃,\w+一直匹配到字符串结尾,单词没了,然后检测冒号存不存在,冒号不存在,但是现在有没有可供回溯的状态,立即报告失败,如果是几十万行的文本会节省很多时间。
需要注意的是,固话分组或是占有优先对嵌套在里面的量词也是有作用的,这点跟?:只分组不捕获不同
<span> (</span>?> (\d+)+ ) <span>//</span><span> 里面的\d+的状态也会被抛弃</span> (?: abc (\d\d) ) <span>//</span><span> 外层的括号不会被捕获,但里面的括号不受影响,\1仍记录着数字字符</span>
最后记录下选择分支的一个坑,例
<span>$pattern</span> = '/cat/'<span>; </span><span>$subject</span> = 'indicate big cat';
cat当然会匹配indicate中间的cat,因为它在前面。再看这个
<span>$pattern</span> = 'fat|cat|belly|your'<span>; </span><span>$subject</span> = 'The dragging belly indicates that your cat is too fat';
NFA以表达式为主导,从表达式的角度看字符串,因此先检测到的是fat,结果是fat吗?结果仍是cat!所以NFA的引擎始终优先匹配选择分支选择最左端的匹配结果,哪怕它位于选择分支的后边。
这也说明,NFA引擎的正则,只要表达式中还存在可能的多选分支,正则引擎会回溯到存在尚未尝试的多选分支的地方,这个过程不断重复,直到完成全局匹配,如果不是这样,上例中fat先匹配成功就已经作为结果返回了。多选状态不是匹配优先,也不是忽略优先,但是尝试是从左到右。而对于DFA和某些POSIX NFA,匹配的不是最靠字符串左端的文本,而是选择分支中最长的分支。
正则的细节太多,扯不清楚,还是得看书理解,尤其是对于正则的调校,以及某些常用的判断技巧,比如匹配" "包围的字符串,中间可以有\"和其他转义序列,还是很麻烦的,推荐《精通正则表达式》,中文翻译挺棒,读起来也不生硬,而且还有很多技巧性的东西,比如“消除循环”是一大利器。end