The differences between POSIX and Perl-compatible regular expressions in PHP


WBOY

Jul 21, 2016, 03:05 PM


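A regular expression (abbreviated regexp, regex, or regxp) is a single string that describes or matches a set of strings conforming to some syntactic rule. In many text editors and other tools, regular expressions are used to search for and/or replace text that matches a pattern. Many programming languages support string manipulation with regular expressions; Perl, for example, has a powerful regular expression engine built in. The concept was first popularized by Unix utilities such as sed and grep. (Excerpted from Wikipedia.)

PHP uses two sets of regular expression rules. One is the POSIX Extended 1003.2 compatible syntax defined by the Institute of Electrical and Electronics Engineers (IEEE); in practice, PHP's support for this standard is incomplete. The other is the Perl-compatible syntax provided by the PCRE (Perl Compatible Regular Expressions) library, which is open-source software written by Philip Hazel.

Functions that use POSIX-compatible rules: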
ereg_replace()
ereg()
eregi()
eregi_replace()
split()
spliti()
sql_regcase()
mb_ereg_match()
mb_ereg_replace()
mb_ereg_search_getpos()
mb_ereg_search_getregs()
mb_ereg_search_init()
mb_ereg_search_pos()
mb_ereg_search_regs()
mb_ereg_search_setpos()
mb_ereg_search()
mb_ereg()
mb_eregi_replace()
mb_eregi()
mb_regex_encoding()
mb_regex_set_options()
mb_split()

Functions that use Perl-compatible rules:
preg_grep()
preg_replace_callback()
preg_match_all()
preg_match()
preg_quote()
preg_split()
preg_replace()

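Delimiters:

POSIX-compatible regular expressions have no delimiters; the corresponding function argument is treated as the pattern itself.

Perl-compatible regular expressions can use any character that is not a letter, a digit, or a backslash (\) as the delimiter. If the delimiter character has to appear inside the expression itself, it must be escaped with a backslash. The pairs (), {}, [] and <> can also be used as delimiters.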
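As a minimal sketch of the delimiter rule, the following matches the same URL with three different delimiters. The sample URL and the variable names are only illustrative.

```php
<?php
$url = 'http://www.163.com/';

// "/" as delimiter: every literal slash in the pattern must be escaped.
preg_match('/^http:\/\/([^\/]+)\//', $url, $m1);

// "#" as delimiter: slashes no longer need escaping.
preg_match('#^http://([^/]+)/#', $url, $m2);

// Bracket-style delimiters: the opening and closing characters differ.
preg_match('{^http://([^/]+)/}', $url, $m3);

// All three capture the host name in group 1.
var_dump($m1[1], $m2[1], $m3[1]);
```

Choosing a delimiter that does not occur in the pattern usually keeps it easier to read.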

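Modifiers:

POSIX-compatible regular expressions have no modifiers.

Modifiers available in Perl-compatible regular expressions (spaces and newlines among the modifiers are ignored; any other character causes an error):

i (PCRE_CASELESS):
Matching ignores letter case.

m (PCRE_MULTILINE):
When this modifier is set, the start-of-line (^) and end-of-line ($) constructs match not only at the start and end of the whole string, but also immediately after and immediately before each newline (\n) inside it.

s (PCRE_DOTALL):
When this modifier is set, the dot metacharacter (.) matches every character, including newlines. Without it, newlines are excluded.

x (PCRE_EXTENDED):
When this modifier is set, whitespace in the pattern is ignored entirely, except when it is escaped or inside a character class.

e:
When this modifier is set, preg_replace() performs the normal substitution of back references in the replacement string, evaluates the result as PHP code, and uses the value to replace the matched text. Only preg_replace() uses this modifier; other PCRE functions ignore it. (The modifier was deprecated in PHP 5.5 and removed in PHP 7.0.)

A (PCRE_ANCHORED):
When this modifier is set, the pattern is forced to be "anchored", that is, it may only match starting at the beginning of the subject string.

D (PCRE_DOLLAR_ENDONLY):
When this modifier is set, the end-of-line construct ($) matches only at the very end of the subject string. Without it, $ also matches immediately before the last character if that character is a newline. This option is ignored when the m modifier is set.

S:
When a pattern is going to be used several times, it is worth analysing it first to speed up matching. Setting this modifier performs that extra analysis. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY):
Inverts the greediness of quantifiers: they are not greedy by default and become greedy when followed by "?".

X (PCRE_EXTRA):
Any backslash in the pattern followed by a letter that has no special meaning causes an error, which reserves these combinations for future extension. By default, a backslash followed by a letter with no special meaning is treated as that letter itself.

u (PCRE_UTF8):
The pattern string is treated as UTF-8.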
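The effect of the more common modifiers can be checked with a short script. This is only a sketch; the sample strings and variable names are made up for illustration.

```php
<?php
$text = "First line\nsecond LINE";

// i: letter case is ignored, so "line" matches the trailing "LINE".
$i = preg_match('/line$/i', $text);

// m: ^ also matches right after the embedded newline.
$withoutM = preg_match_all('/^\w+/', $text);   // only "First"
$withM    = preg_match_all('/^\w+/m', $text);  // "First" and "second"

// s: the dot may cross the newline.
$withoutS = preg_match('/First.*LINE/', $text);
$withS    = preg_match('/First.*LINE/s', $text);

// U: quantifiers become non-greedy by default.
preg_match('/<.+>/', '<a><b>', $greedy);
preg_match('/<.+>/U', '<a><b>', $lazy);

// x: unescaped whitespace in the pattern is ignored.
$x = preg_match('/ \d{4} - \d{2} /x', '2016-07');

var_dump($i, $withoutM, $withM, $withoutS, $withS, $greedy[0], $lazy[0], $x);
```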

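Logical separators:
The logical separator symbols behave and are used in exactly the same way in POSIX-compatible and Perl-compatible regular expressions:
[]: encloses a set of characters, any one of which may match.
{}: encloses a repetition count.
(): encloses a logical group, which can also be used for back references.
|: means "or"; [ab] and a|b are equivalent.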

Metacharacters and "[]":

There are two different sets of metacharacters: those recognized anywhere in the pattern except within square brackets, and those recognized within square brackets "[]".

Metacharacters outside "[]" that behave the same in POSIX-compatible and Perl-compatible regular expressions:
\ General escape character with several uses
^ Matches the beginning of the string
$ Matches the end of the string
? Matches the preceding item 0 or 1 times
* Matches the preceding item 0 or more times
+ Matches the preceding item 1 or more times

Metacharacters outside "[]" that behave differently in POSIX-compatible and Perl-compatible regular expressions:
. In Perl-compatible regular expressions, matches any single character except a newline
. In POSIX-compatible regular expressions, matches any single character

Metacharacters inside "[]" that behave the same in POSIX-compatible and Perl-compatible regular expressions:
\ General escape character with several uses
^ Negates the class, but only when it is the first character
- Specifies a range of characters in ASCII order. Looking at the ASCII table shows that [W-c] is equivalent to [WXYZ[\\\]^_`abc]

Metacharacters inside "[]" that behave differently in POSIX-compatible and Perl-compatible regular expressions:
- In POSIX-compatible regular expressions, a class such as [a-c-e] raises an error.
- In Perl-compatible regular expressions, [a-c-e] is accepted: a hyphen that immediately follows a range is taken literally, so the class matches a, b, c, "-" or e.
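How PCRE treats a hyphen that follows a range can be checked directly with preg_match(). The sketch below covers only the Perl-compatible side, because the ereg family was removed in PHP 7.0; the variable names are illustrative.

```php
<?php
// Test single characters against the class [a-c-e].
$class = '/^[a-c-e]$/';
$results = [];
foreach (['a', 'c', 'd', '-', 'e'] as $ch) {
    $results[$ch] = preg_match($class, $ch);
}

// 1 means the character is in the class, 0 means it is not.
print_r($results);
```

Placing a literal hyphen first or last in the class, or escaping it, avoids the ambiguity altogether.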

Repetition counts and "{}":
POSIX-compatible and Perl-compatible regular expressions handle repetition counts in exactly the same way:
{2}: matches the preceding character exactly 2 times
{2,}: matches the preceding character 2 or more times; matching is greedy (as many as possible) by default
{2,4}: matches the preceding character at least 2 and at most 4 times
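A small sketch of the three counted forms, run against strings of one to five "a" characters. Only preg_match() is used, and the names are illustrative.

```php
<?php
$counts = [];
foreach (['a', 'aa', 'aaa', 'aaaa', 'aaaaa'] as $s) {
    $counts[$s] = [
        'exact' => preg_match('/^a{2}$/', $s),   // exactly 2
        'min'   => preg_match('/^a{2,}$/', $s),  // 2 or more
        'range' => preg_match('/^a{2,4}$/', $s), // between 2 and 4
    ];
}

// Greedy by default; a trailing "?" makes the quantifier lazy.
preg_match('/a{2,}/', 'aaaaa', $greedy);
preg_match('/a{2,}?/', 'aaaaa', $lazy);

print_r($counts);
var_dump($greedy[0], $lazy[0]);
```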

Logical groups and "()":
The region enclosed by () is a logical group. Its main purpose is to fix the logical order in which a run of characters appears. Its other use is for references: the text matched inside the group can be referred to like a variable. The second use is the more unusual one:
<?php
$str = "http://www.163.com/";
// POSIX-compatible (ereg_replace() was removed in PHP 7.0, so this line needs an older PHP):
echo ereg_replace("(.+)", "\\1", $str);
// Perl-compatible:
echo preg_replace("/(.+)/", "$1", $str);
// Both calls print the URL captured by group 1
?>

When referencing groups, parentheses can be nested; groups are numbered by the order in which their opening "(" appears.
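The numbering of nested groups can be seen by capturing parts of the same URL. This is a sketch using preg_match(); the pattern is an illustrative one, not taken from the original article.

```php
<?php
$str = 'http://www.163.com/';

// Opening parentheses, left to right:
//   1: scheme plus host   2: scheme   3: host   4: first host label
preg_match('#^((\w+)://(([^/.]+)\.[^/]+))/#', $str, $m);

print_r($m);
```

Group 1 wraps everything before the trailing slash, while group 4, although nested inside group 3, simply receives the next number because its "(" appears later.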

Type matching:
POSIX-compatible regular expressions:
[:upper:]: matches all uppercase letters
[:lower:]: matches all lowercase letters
[:alpha:]: matches all letters
[:alnum:]: matches all letters and digits
[:digit:]: matches all digits
[:xdigit:]: matches all hexadecimal digits, equivalent to [0-9A-Fa-f]
[:punct:]: matches all punctuation characters, such as . , " ' ? ! ; :
[:blank:]: matches space and TAB, equivalent to [ \t]
[:space:]: matches all whitespace characters, equivalent to [ \t\n\r\f\v]
[:cntrl:]: matches all control characters (ASCII 0 to 31, and 127)
[:graph:]: matches all printable characters except space
[:print:]: matches all printable characters, including space
[.c.]: collating element
[=c=]: equivalence class
[:<:]: matches the beginning of a word
[:>:]: matches the end of a word
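PCRE also accepts the named POSIX classes inside a bracket expression, so they can be tried with the preg_* functions even on PHP versions without ereg. A minimal sketch, with a made-up input string:

```php
<?php
$input = "Abc 123\tXYZ!";

// Keep only the characters belonging to each class.
$upper  = preg_replace('/[^[:upper:]]/', '', $input);
$digits = preg_replace('/[^[:digit:]]/', '', $input);
$punct  = preg_replace('/[^[:punct:]]/', '', $input);

// Count blanks: one space and one tab.
$blanks = preg_match_all('/[[:blank:]]/', $input);

var_dump($upper, $digits, $punct, $blanks);
```

Note the doubled brackets: the class name is written inside an ordinary bracket expression, as in [[:blank:]].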

Perl-compatible regular expressions (this is where the power of Perl regular expressions shows):
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\xhh the character with hexadecimal code hh
\ddd the character with octal code ddd, or a back reference
\d any decimal digit
\D any character that is not a decimal digit
\s any whitespace character
\S any non-whitespace character
\w any "word" character
\W any "non-word" character
\b word boundary
\B non-word boundary
\A start of subject (independent of multiline mode)
\Z end of subject, or before a newline at the end (independent of multiline mode)
\z end of subject (independent of multiline mode)
\G first matching position in subject
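A few of these escapes in action, as a sketch with an invented sample string. It contrasts \b with plain substring matching, \A with ^ under the m modifier, and \Z with \z.

```php
<?php
$text = "cat concat\ncat\n";

// \b: whole-word matches only, so the "cat" inside "concat" is skipped.
$words = preg_match_all('/\bcat\b/', $text);

// ^ follows multiline mode, \A does not.
$caret   = preg_match_all('/^cat/m', $text);
$anchorA = preg_match_all('/\Acat/m', $text);

// \Z tolerates a final newline, \z does not.
$bigZ   = preg_match('/cat\Z/', $text);
$smallZ = preg_match('/cat\z/', $text);

// \D removes everything that is not a decimal digit.
$digits = preg_replace('/\D/', '', 'Tel: 555-0100');

var_dump($words, $caret, $anchorA, $bigZ, $smallZ, $digits);
```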
