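A regular expression (abbreviated regexp, regex, or regxp) is a single string that describes or matches a set of strings conforming to some syntactic rule. Many text editors and other tools use regular expressions to search for and/or replace text that fits a pattern. Many programming languages support string manipulation with regular expressions; Perl, for example, has a powerful regular expression engine built in. The concept was first popularized by Unix utilities such as sed and grep. (Excerpted from Wikipedia)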
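PHP uses two sets of regular expression rules. One is the POSIX Extended 1003.2-compatible regex defined by the Institute of Electrical and Electronics Engineers (IEEE); in practice PHP's support for this standard is incomplete. The other is Perl-compatible regex provided by the PCRE (Perl Compatible Regular Expressions) library, an open-source package written by Philip Hazel. Note that the non-multibyte POSIX functions listed below (ereg(), eregi(), ereg_replace(), eregi_replace(), split(), spliti() and sql_regcase()) were deprecated in PHP 5.3.0 and removed in PHP 7.0.0, so current PHP code should rely on the preg_* functions.
Functions that use the POSIX-compatible rules: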
ereg_replace()
ereg()
eregi()
eregi_replace()
split()
spliti()
sql_regcase()
mb_ereg_match()
mb_ereg_replace()
mb_ereg_search_getpos()
mb_ereg_search_getregs()
mb_ereg_search_init()
mb_ereg_search_pos()
mb_ereg_search_regs()
mb_ereg_search_setpos()
mb_ereg_search()
mb_ereg()
mb_eregi_replace()
mb_eregi()
mb_regex_encoding()
mb_regex_set_options()
mb_split()
Functions that use the Perl-compatible rules:
preg_grep()
preg_replace_callback()
preg_match_all()
preg_match()
preg_quote()
preg_split()
preg_replace()
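Delimiters:
POSIX-compatible regex has no delimiters; the corresponding function argument is treated as the regular expression itself.
Perl-compatible regex can use any character that is not a letter, digit, backslash (\) or whitespace as the delimiter. If the delimiter character has to appear inside the expression itself, it must be escaped with a backslash. The bracket pairs (), {}, [] and <> can also be used as delimiters.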
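As a rough illustration of the delimiter rules above, the sketch below matches the same URL with three different delimiters. The subject string and the variable names ($url, $withSlash and so on) are made up for this example.

```php
<?php
// Illustrative subject string (not taken from any real application).
$url = 'http://www.163.com/news';

// "/" as delimiter: every literal slash in the pattern must be escaped.
$withSlash = preg_match('/^http:\/\/([^\/]+)\//', $url, $m1);

// "#" as delimiter: slashes no longer need escaping.
$withHash = preg_match('#^http://([^/]+)/#', $url, $m2);

// Bracket-style delimiters: the opening and closing characters differ.
$withBraces = preg_match('{^http://([^/]+)/}', $url, $m3);

// All three patterns capture the host name in group 1.
echo $m1[1], "\n";
echo $m2[1], "\n";
echo $m3[1], "\n";
```

Picking a delimiter that does not occur in the pattern is usually the easiest way to keep the expression readable.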
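Modifiers:
POSIX-compatible regex has no modifiers.
Modifiers available in Perl-compatible regex (spaces and newlines among the modifiers are ignored; any other character causes an error):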
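i (PCRE_CASELESS):
Matching ignores case.
m (PCRE_MULTILINE):
When this modifier is set, the start-of-line (^) and end-of-line ($) constructs match not only at the start and end of the whole string but also immediately after and immediately before each newline (\n) inside it.
s (PCRE_DOTALL):
When this modifier is set, the dot metacharacter (.) matches every character, including newlines. Without it, newlines are excluded.
x (PCRE_EXTENDED):
When this modifier is set, whitespace in the pattern is ignored entirely, except when it is escaped or inside a character class.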
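e:
When this modifier is set, preg_replace() performs the normal back-reference substitution in the replacement string, evaluates the result as PHP code, and uses the return value to replace the matched text. Only preg_replace() honors this modifier; the other PCRE functions ignore it. The modifier was deprecated in PHP 5.5.0 and removed in PHP 7.0.0; preg_replace_callback() covers the same use case.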
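A (PCRE_ANCHORED):
When this modifier is set, the pattern is forced to be "anchored", that is, it may only match starting at the beginning of the subject string.
D (PCRE_DOLLAR_ENDONLY):
When this modifier is set, the end-of-line construct ($) in the pattern matches only at the very end of the subject string. Without it, $ also matches just before a final newline if the last character is a newline. This option is ignored if the m modifier is set.
S:
When a pattern is going to be used several times, it is worth analyzing it first to speed up matching. Setting this modifier requests that extra analysis. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.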
U (PCRE_UNGREEDY):
Inverts the greediness of quantifiers: they are non-greedy by default and become greedy only when followed by "?".
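X (PCRE_EXTRA):
Any backslash in the pattern followed by a letter that has no special meaning causes an error, which reserves these combinations for future expansion. By default, a backslash followed by a letter with no special meaning is treated as that letter itself.
u (PCRE_UTF8):
The pattern and subject strings are treated as UTF-8.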
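To make the modifier list more concrete, here is a minimal sketch that exercises most of the modifiers with the preg_* functions. The subject strings and variable names are invented for illustration, and the last part shows preg_replace_callback() as one possible stand-in for the removed e modifier rather than the only way to do it.

```php
<?php
// Illustrative two-line subject.
$text = "Hello\nWORLD\n";

// i: case-insensitive matching.
$i = preg_match('/world/i', $text);                 // 1

// m: ^ and $ also match after / before each internal newline.
$m = preg_match_all('/^\w+$/m', $text, $lines);     // 2 matches: Hello, WORLD

// s: the dot also matches a newline.
$noS = preg_match('/Hello.WORLD/', $text);          // 0
$s   = preg_match('/Hello.WORLD/s', $text);         // 1

// x: unescaped whitespace in the pattern is ignored.
$x = preg_match('/ Hello \n WORLD /x', $text);      // 1

// U: quantifiers become lazy by default.
preg_match('/<.+>/', '<a><b>', $greedy);            // $greedy[0] is "<a><b>"
preg_match('/<.+>/U', '<a><b>', $lazy);             // $lazy[0] is "<a>"

// D: $ matches only at the very end, not before a trailing newline.
$noD = preg_match('/WORLD$/', $text);               // 1
$d   = preg_match('/WORLD$/D', $text);              // 0

// A: the match has to start at the beginning of the subject.
$a = preg_match('/WORLD/A', $text);                 // 0

// u: pattern and subject are read as UTF-8, so "é" (two bytes) is one character.
$noU = preg_match('/^.$/', "\xC3\xA9");             // 0
$u   = preg_match('/^.$/u', "\xC3\xA9");            // 1

// A callback in place of the removed e modifier: upper-case each word's first letter.
$title = preg_replace_callback(
    '/\b[a-z]/',
    function ($match) {
        return strtoupper($match[0]);
    },
    'hello world'
);
echo $title, "\n"; // Hello World
```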
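Logical separators:
The logical separator symbols work in exactly the same way in POSIX-compatible and Perl-compatible regex:
[]: encloses a set of characters, any one of which may match.
{}: encloses the repetition count.
(): encloses a logical group, which can also be referenced later.
|: means "or"; [ab] and a|b are equivalent.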
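The equivalence of [ab] and a|b can be checked with a short sketch like the one below. The word lists are hypothetical, and the alternation is wrapped in parentheses so that the anchors apply to both branches.

```php
<?php
// Hypothetical candidates to filter.
$words = ['a', 'b', 'c', 'ab'];

// A character class and an alternation of single characters accept the same input.
$byClass = preg_grep('/^[ab]$/', $words);
$byAlt   = preg_grep('/^(a|b)$/', $words);

var_dump($byClass === $byAlt); // bool(true)

// Unlike a class, alternation can also choose between longer branches.
$schemes = preg_grep('/^(http|ftp)$/', ['http', 'ftp', 'ssh']);
print_r($schemes);
```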
Metacharacters (related to "[]"):
There are two different sets of metacharacters: one is recognized anywhere in the pattern outside square brackets, and the other is recognized inside square brackets "[]".
Metacharacters outside "[]" that behave the same in POSIX-compatible and Perl-compatible regex:
\ General escape character with several uses
^ Matches the start of the string
$ Matches the end of the string
? Matches the preceding item 0 or 1 times
* Matches the preceding item 0 or more times
+ Matches the preceding item 1 or more times
Metacharacters outside "[]" that behave differently in POSIX-compatible and Perl-compatible regex:
. In Perl-compatible regex, matches any single character except a newline
. In POSIX-compatible regex, matches any single character
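The metacharacters listed above can be tried out with a few one-line checks. This is only a sketch using preg_match(), so it shows the Perl-compatible behavior; the sample strings are arbitrary.

```php
<?php
// ^ and $ anchor the match to the start and end of the subject.
$anchored   = preg_match('/^cat$/', 'cat');       // 1
$unanchored = preg_match('/^cat$/', 'concat');    // 0

// ? makes the preceding item optional.
$optional = preg_match('/^colou?r$/', 'color');   // 1

// * allows zero or more repetitions, + requires at least one.
$star = preg_match('/^ab*$/', 'a');               // 1
$plus = preg_match('/^ab+$/', 'a');               // 0

// \ escapes a metacharacter so that it matches literally.
$anyChar    = preg_match('/^1.5$/', '1x5');       // 1
$literalDot = preg_match('/^1\.5$/', '1x5');      // 0

// In PCRE the dot does not match a newline unless the s modifier is set.
$dotNewline = preg_match('/^a.b$/', "a\nb");      // 0

var_dump($anchored, $unanchored, $optional, $star, $plus, $anyChar, $literalDot, $dotNewline);
```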
Metacharacters inside "[]" that behave the same in POSIX-compatible and Perl-compatible regex:
\ General escape character with several uses
^ Negates the class, but only when it is the first character
- Specifies a range of characters by ASCII code; a careful look at the ASCII table shows that [W-c] is equivalent to [WXYZ[\\\]^_`abc]
Metacharacters inside "[]" that behave differently in POSIX-compatible and Perl-compatible regex:
- In POSIX-compatible regex, a specification such as [a-c-e] raises an error.
- In Perl-compatible regex, [a-c-e] is accepted: the hyphen that follows the range a-c is taken literally, so the class matches a, b, c, "-" and e (it is not the same as [a-e]).
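The range behavior inside "[]" is easy to verify empirically. The sketch below, which assumes nothing beyond preg_match() and chr(), collects every ASCII character accepted by [W-c] and then probes how PCRE treats [a-c-e]; the variable names are illustrative only.

```php
<?php
// Collect every 7-bit ASCII character accepted by the class [W-c].
$range = '';
for ($code = 0; $code < 128; $code++) {
    if (preg_match('/^[W-c]$/', chr($code)) === 1) {
        $range .= chr($code);
    }
}
echo $range, "\n"; // WXYZ[\]^_`abc

// ^ negates the class only when it comes first; elsewhere it is a literal caret.
$negated      = preg_match('/^[^0-9]$/', 'x');   // 1
$literalCaret = preg_match('/^[0-9^]$/', '^');   // 1

// In PCRE, a hyphen right after a range is taken literally,
// so [a-c-e] accepts "-" and "e" but not "d".
$matchesD      = preg_match('/^[a-c-e]$/', 'd');  // 0
$matchesHyphen = preg_match('/^[a-c-e]$/', '-');  // 1
$matchesE      = preg_match('/^[a-c-e]$/', 'e');  // 1
```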
Repetition counts (related to "{}"):
POSIX-compatible and Perl-compatible regex handle repetition counts in exactly the same way:
{2}: matches the preceding item exactly 2 times
{2,}: matches the preceding item 2 or more times; matching is greedy (as many as possible) by default
{2,4}: matches the preceding item at least 2 and at most 4 times
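A small sketch of the three repetition forms, plus the difference between greedy and lazy matching. The lazy form (a trailing "?") is a Perl-compatible feature; the sample strings are arbitrary.

```php
<?php
// {n}: exactly n repetitions.
$exact   = preg_match('/^a{2}$/', 'aa');        // 1
$tooMany = preg_match('/^a{2}$/', 'aaa');       // 0

// {n,}: n or more repetitions.
$atLeast = preg_match('/^a{2,}$/', 'aaaaa');    // 1

// {n,m}: anywhere from n to m repetitions, so three is accepted as well.
$between = preg_match('/^a{2,4}$/', 'aaa');     // 1
$over    = preg_match('/^a{2,4}$/', 'aaaaa');   // 0

// Greedy by default: take as many as possible. A trailing ? asks for as few as possible.
preg_match('/a{2,}/', 'aaaaa', $greedy);        // $greedy[0] is "aaaaa"
preg_match('/a{2,}?/', 'aaaaa', $lazy);         // $lazy[0] is "aa"

echo $greedy[0], ' vs ', $lazy[0], "\n";
```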
Logical groups (related to "()"):
The region enclosed by () is a logical group. Its main purpose is to express the logical order in which certain characters occur; its other use is referencing (the text matched by the group can be referred to later, much like a variable). The latter use looks a little odd at first:
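<?php
$str = "http://www.163.com/";
// POSIX-compatible regex (ereg_replace() was removed in PHP 7.0, so guard the call):
if (function_exists('ereg_replace')) {
    echo ereg_replace("(.+)", "\\1", $str);
}
// Perl-compatible regex:
echo preg_replace("/(.+)/", '$1', $str);
// Each call prints the original URL, because group 1 captures the whole string
?>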
When referencing, parentheses can be nested; groups are numbered in the order in which their opening "(" appears.
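The numbering of nested groups can be seen in a short sketch. The date string and the pattern are invented for illustration; the same numbering applies to references in a replacement string ($1, $2, ...) and to back-references inside the pattern (\1).

```php
<?php
$date = '2024-05-17';

// Groups are numbered by the position of their opening parenthesis:
// 1 = whole date, 2 = year, 3 = month-day, 4 = month, 5 = day.
preg_match('/((\d{4})-((\d{2})-(\d{2})))/', $date, $m);
print_r($m);

// References in the replacement string reorder the captured parts.
$reordered = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$3/$2/$1', $date);
echo $reordered, "\n"; // 17/05/2024

// A back-reference inside the pattern: \1 repeats whatever group 1 matched.
preg_match('/(\w)\1/', 'balloon', $doubled);
echo $doubled[0], "\n"; // ll
```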
Character type matching:
POSIX-compatible regex (each class name is used inside a bracket expression, for example [[:upper:]]):
[:upper:]: matches all uppercase letters
[:lower:]: matches all lowercase letters
[:alpha:]: matches all letters
[:alnum:]: matches all letters and digits
[:digit:]: matches all digits
[:xdigit:]: matches all hexadecimal digits, equivalent to [0-9A-Fa-f]
[:punct:]: matches all punctuation characters, such as . , " ' ? ! ; :
[:blank:]: matches space and tab, equivalent to [ \t]
[:space:]: matches all whitespace characters, equivalent to [ \t\n\r\f\v]
[:cntrl:]: matches all control characters, ASCII 0 through 31 (and 127)
[:graph:]: matches all printable characters except space, roughly [^ \t\n\r\f\v]
[:print:]: matches all printable characters including space, roughly [^\t\n\r\f\v]
[.c.]: collating symbol (the collating element c)
[=c=]: equivalence class (all characters equivalent to c in the current locale)
[[:<:]]: matches the beginning of a word
[[:>:]]: matches the end of a word
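PCRE also accepts the POSIX class names inside a bracket expression, so most of the classes above can still be tried with the preg_* functions. The following is a minimal sketch with made-up sample strings; it does not cover the collating, equivalence or word-boundary forms.

```php
<?php
// A POSIX class name only works inside a bracket expression: [[:digit:]], not [:digit:].
$digits = preg_match('/^[[:digit:]]+$/', '2024');                 // 1
$upper  = preg_match('/^[[:upper:]]+$/', 'ABC');                  // 1
$hex    = preg_match('/^[[:xdigit:]]+$/', 'c0ffee');              // 1

// Several classes and plain characters can share one bracket expression.
$mixed = preg_match('/^[[:alpha:][:digit:]_]+$/', 'user_42');     // 1

// [[:space:]] covers spaces, tabs and newlines alike.
$parts = preg_split('/[[:space:]]+/', "a \t b\nc");
print_r($parts); // a, b, c

// [[:punct:]] strips punctuation.
$noPunct = preg_replace('/[[:punct:]]/', '', 'Hi, there!');
echo $noPunct, "\n"; // Hi there
```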
Perl-compatible regex (this is where the power of Perl-style regex really shows):
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f form feed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\xhh the character with hexadecimal code hh
\ddd the character with octal code ddd, or a back-reference
\d any decimal digit
\D any character that is not a decimal digit
\s any whitespace character
\S any non-whitespace character
\w any "word" character
\W any "non-word" character
\b word boundary
\B not a word boundary
\A start of subject (independent of multiline mode)
\Z end of subject, or just before a final newline (independent of multiline mode)
\z end of subject (independent of multiline mode)
\G first matching position in subject
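Several of these escape sequences are shown at work in the sketch below. The log-style subject string and the variable names are hypothetical; patterns are written in single quotes so that PHP passes the backslashes through to PCRE unchanged.

```php
<?php
// Hypothetical two-line input.
$log = "id=42 user=alice\nid=7 user=bob";

// \d+ picks out runs of decimal digits.
preg_match_all('/\d+/', $log, $numbers);          // $numbers[0] is ['42', '7']

// \b restricts the match to whole words.
$wholeWord = preg_match_all('/\bid\b/', $log);    // 2

// \w+ captures the "word" characters after "user=".
preg_match_all('/user=(\w+)/', $log, $users);     // $users[1] is ['alice', 'bob']

// \s+ splits on any run of whitespace, newlines included.
$fields = preg_split('/\s+/', $log);

// \A ignores multiline mode, whereas ^ does not.
$caretCount = preg_match_all('/^id/m', $log);     // 2
$startCount = preg_match_all('/\Aid/m', $log);    // 1

// \Z tolerates one trailing newline, \z does not.
$upperZ = preg_match('/bob\Z/', "bob\n");         // 1
$lowerZ = preg_match('/bob\z/', "bob\n");         // 0

// \xhh matches a character by hexadecimal code (09 is the tab).
$tab = preg_match('/\x09/', "a\tb");              // 1

print_r($fields);
```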