In PHP, a regular expression is a custom grammar rule that describes a pattern of characters. Its complete syntax for writing patterns provides a flexible and intuitive way to process strings. A regular expression describes a string-matching pattern that can be used to check whether a string contains a certain substring, to replace a matching substring, or to extract a substring that meets a certain condition from a string.
The operating environment of this tutorial: Windows 7, PHP 8, Dell G3 computer.
You may have heard of regular expressions before and formed the rough impression that they are difficult to learn, very complicated, and unfathomable. In fact, regular expressions are not that mysterious. A regular expression is a custom grammar rule that describes a pattern of characters.
What is a regular expression?
Regular expressions are also called pattern expressions. They have a very complete syntax for writing patterns and provide a flexible and intuitive way to process strings. A regular expression builds a pattern from specific rules, compares it with the input string, and is used in specific functions to carry out operations such as string matching, searching, replacement, and splitting.
To take an example from daily life: if you want to find all txt files in a certain directory on your computer, you can enter *.txt in that directory's search box and press Enter to list all txt files in the directory. The *.txt used here can be thought of as a very simple pattern in the spirit of a regular expression (strictly speaking, it is a file wildcard, in which * means something different from what it means in a regular expression).
The following two examples are constructed using the syntax of regular expressions, as shown below:
/http(s)?:\/\/[\w.]+[\w\/]*[\w.]*\??[\w=&\+\%]*/is // regular expression that matches a URL
/^\w{3,}@([a-z]{2,7}|[0-9]{3})\.(com|cn)$/ // regular expression that matches an email address
Don't be deterred by the seemingly garbled strings in the examples above. They are strings composed, according to regular expression syntax, of ordinary characters and characters with special functions. These strings only take effect when they are used in specific regular expression functions.
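As a minimal sketch of what "used in specific regular expression functions" can look like, the two patterns above can be passed to PHP's preg_match(). The sample URL and email addresses are hypothetical values chosen for illustration; they do not come from the tutorial.

```php
<?php
// The two patterns from the text, written as single-quoted PHP strings
// so that the backslashes reach the regex engine unchanged.
$urlPattern   = '/http(s)?:\/\/[\w.]+[\w\/]*[\w.]*\??[\w=&\+\%]*/is';
$emailPattern = '/^\w{3,}@([a-z]{2,7}|[0-9]{3})\.(com|cn)$/';

// Hypothetical sample inputs (assumptions for this sketch).
$url       = 'https://www.php.net/manual/en/index.php?x=1';
$goodEmail = 'alice@example.com';
$badEmail  = 'bob@x.org';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
$urlResult  = preg_match($urlPattern, $url);          // 1
$goodResult = preg_match($emailPattern, $goodEmail);  // 1
$badResult  = preg_match($emailPattern, $badEmail);   // 0: "x" is too short and ".org" is not allowed
```

Note that the email pattern is deliberately narrow (only .com and .cn domains), so it is better read as a syntax demonstration than as a general-purpose validator.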
The purpose of regular expressions
Regular expressions describe a string-matching pattern that can be used to check whether a string contains a certain substring, to replace a matching substring, or to extract a substring that meets a certain condition from a string. For example, when a user submits a form, ordinary literal character comparison is clearly not enough to determine whether the phone number, email address, and other values entered are valid.
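The three uses named above (checking, replacing, extracting) might be sketched in PHP as follows. The sample text and the date format are assumptions made for this example only.

```php
<?php
// Hypothetical sample text for illustration.
$text = 'Order 1234 shipped on 2024-05-17';

// 1) Check: does the string contain a run of digits?
$hasDigits = preg_match('/\d+/', $text) === 1;   // true

// 2) Replace: swap every run of digits for "#".
$masked = preg_replace('/\d+/', '#', $text);     // "Order # shipped on #-#-#"

// 3) Extract: pull out the first substring shaped like YYYY-MM-DD.
$date = null;
if (preg_match('/\d{4}-\d{2}-\d{2}/', $text, $m) === 1) {
    $date = $m[0];                               // "2024-05-17"
}
```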
Regular expressions are literal patterns composed of ordinary characters (such as the characters a through z) and special characters (called "metacharacters"). A regular expression acts as a template that matches a character pattern with a searched string. A regular expression pattern can be a single character, a collection of characters, a range of characters, a selection between characters, or any combination of all these components.
The purpose of using regular expressions is to achieve powerful functionality in a simple way. The price of being simple, effective, and powerful is that the rules themselves are complicated, and constructing correct and efficient regular expressions is harder still, so some effort is required. Once you have got started, with suitable reference material and plenty of practice, regular expressions are both effective and enjoyable to use in development.
Commonly used terms in regular expressions
Before learning regular expressions, let's first go over some easily confused terms. Understanding them is of great help in learning regular expressions.
1) grep
grep was originally a command in the ed editor, used to display specific content in a file. It later became the standalone tool grep.
2) egrep
Although grep is constantly updated and upgraded, it still cannot keep up with the pace of technology. For this reason, Bell Labs wrote egrep, which means "extended grep". This greatly enhances the power of regular expressions.
3) POSIX (Portable Operating System Interface of UNIX)
As grep evolved, other developers also created their own versions with distinct styles based on their own preferences. This caused problems: some programs supported certain metacharacters while others did not. Hence POSIX, a set of standards that ensures portability between operating systems. However, POSIX, like SQL, has not become the final standard and can only be used as a reference.
4) Perl (Practical Extraction and Reporting Language)
In 1987, Larry Wall released Perl. Over the following 7 years, from Perl 1 to the current Perl 5, it eventually became another standard after POSIX.
5) PCRE
The success of Perl led the developers of other languages, including C/C++, Java, and Python, to make their own regular expressions compatible with Perl to some extent. In 1997, Philip Hazel developed the PCRE library, a regular expression engine compatible with Perl regular expressions. Other developers can integrate PCRE into their own languages to provide users with rich regular expression functionality. PCRE is used by many software packages, including PHP.
Regular expression syntax rules
Before using regular expressions, we must first learn their syntax. The constituent elements of regular expressions generally include ordinary characters, metacharacters, quantifiers, anchors, non-printing characters and specified replacements.
1) Ordinary characters
Ordinary characters include all printable and non-printable characters that are not explicitly designated as metacharacters, including all uppercase and lowercase letters, digits, punctuation marks, and some other symbols. The simplest regular expression is a single ordinary character that is compared against the search string. For example, the single-character regular expression /A/ always matches the letter A.
You can also combine multiple single characters to form a longer expression. For example, the regular expression /the/ matches the "the" in "the", "there", "other", and "over the lazy dog" in the search string. No concatenation operator is needed; just write the characters one after another.
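A short sketch of this in PHP, using preg_match_all() to count how often /the/ occurs. The subject string simply joins the four examples from the paragraph above.

```php
<?php
// The four example strings from the text, joined into one subject.
$subject = 'the there other over the lazy dog';

// preg_match_all() returns the number of full matches and fills $matches.
$count = preg_match_all('/the/', $subject, $matches);

// $count is 4: one hit each in "the", "there", "other", and "over the lazy dog".
// $matches[0] holds the matched text of every hit.
```

Because /the/ consists only of ordinary characters, it also matches inside longer words such as "other".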
2) Metacharacters
In addition to ordinary characters, regular expressions can also contain "metacharacters". Metacharacters can be divided into single-character metacharacters and multi-character metacharacters. For example, the metacharacter \d, which matches numeric characters.
All single-character metacharacters are listed in the following table.
Metacharacter | Behavior | Example |
---|---|---|
* | Matches the preceding character or subexpression zero or more times, equivalent to {0,} | zo* matches "z" and "zoo" |
+ | Matches the preceding character or subexpression one or more times, equivalent to {1,} | zo+ matches "zo" and "zoo", but not "z" |
? | Matches the preceding character or subexpression zero or one time, equivalent to {0,1}. When ? follows any other quantifier (*, +, ?, {n}, {n,} or {n,m}), the match is non-greedy. A non-greedy pattern matches as little of the string as possible, while the default greedy pattern matches as much as possible | zo? matches "z" and "zo", but not "zoo"; o+? matches only a single "o" in "oooo", while o+ matches all the "o"s; do(es)? matches "do" or "does" |
^ | Matches the beginning of the search string. If the m (multiline search) flag is set, ^ also matches the position after \n or \r. If ^ is the first character in a bracket expression, the character set is negated | ^\d{3} matches 3 digits at the beginning of the search string; [^abc] matches any character except a, b, and c |
$ | Matches the end of the search string. If the m (multiline search) flag is set, $ also matches the position preceding \n or \r | \d{3}$ matches the 3 digits at the end of the search string |
. | Matches any single character except the newline character \n. To match any character including \n, use a pattern such as [\s\S] | a.c matches "abc", "a1c" and "a-c" |
[] | Marks the beginning and end of a bracket expression | [1-4] matches "1", "2", "3", or "4"; [^aAeEiIoOuU] matches any non-vowel character |
{} | Marks the beginning and end of a quantifier expression | a{2,3} matches "aa" and "aaa" |
() | Marks the beginning and end of a subexpression. The subexpression can be saved for later use | A(\d) matches "A0" through "A9" and saves the digit for later use |
\| | Indicates a choice between two or more items | z\|food matches "z" or "food"; (z\|f)ood matches "zood" or "food" |
/ | Marks the beginning and end of a regular expression literal in JavaScript. Adding a single-character flag after the second "/" specifies the search behavior | /abc/gi is a JavaScript regular expression literal that matches "abc". The g (global) flag finds all occurrences of the pattern, and the i (ignore case) flag makes the search case-insensitive |
\ | Marks the next character as a special character, a literal, a backreference, or an octal escape | \n matches a newline character; \( matches "("; \\ matches "\" |
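Several rows of the table can be tried out with preg_match(). The following is a small sketch; the sample strings are assumptions picked to make each behavior visible, and the anchors around do(es)? were added here so that only the whole string counts.

```php
<?php
// Greedy vs. non-greedy: o+ takes every "o", o+? takes as few as possible.
preg_match('/o+/', 'oooo', $greedy);    // $greedy[0] is "oooo"
preg_match('/o+?/', 'oooo', $lazy);     // $lazy[0] is "o"

// ? makes the preceding subexpression optional.
$do   = preg_match('/^do(es)?$/', 'do');    // 1
$does = preg_match('/^do(es)?$/', 'does');  // 1

// ^ and $ anchor the match to the start and end of the string.
$startsWithDigits = preg_match('/^\d{3}/', '123abc');  // 1
$endsWithDigits   = preg_match('/\d{3}$/', 'abc12');   // 0, only two digits at the end

// | chooses between alternatives; ( ) limits its scope and saves the choice.
preg_match('/(z|f)ood/', 'good food', $alt);  // $alt[0] is "food", $alt[1] is "f"
```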
The multi-character metacharacters are listed in the following table.
Metacharacter | Behavior | Example |
---|---|---|
\b | Matches a word boundary, that is, the position between a word and a space | er\b matches the "er" in "never", but not the "er" in "verb" |
\B | Matches a non-word boundary | er\B matches the "er" in "verb", but not the "er" in "never" |
\d | Matches a digit character, equivalent to [0-9] | In the search string "12 345", \d{2} matches "12" and "34"; \d matches "1", "2", "3", "4" and "5" |
\D | Matches a non-digit character, equivalent to [^0-9] | \D+ matches "abc" and " def" in "abc123 def" |
\w | Matches any of the characters A-Z, a-z, 0-9 and underscore, equivalent to [A-Za-z0-9_] | In the search string "The quick brown fox...", \w+ matches "The", "quick", "brown" and "fox" |
\W | Matches any character except A-Z, a-z, 0-9 and underscore, equivalent to [^A-Za-z0-9_] | In the search string "The quick brown fox...", \W+ matches "..." and all the spaces |
[xyz] | Character set. Matches any one of the specified characters | [abc] matches the "a" in "plain" |
[^xyz] | Negated character set. Matches any character not specified | [^abc] matches the "p", "l", "i" and "n" in "plain" |
[a-z] | Character range. Matches any character within the specified range | [a-z] matches any lowercase alphabetic character in the range "a" to "z" |
[^a-z] | Negated character range. Matches any character not in the specified range | [^a-z] matches any character not in the range "a" to "z" |
{n} | Matches exactly n times, where n is a non-negative integer | o{2} does not match the "o" in "Bob", but matches the first two "o"s in "fooood" |
{n,} | Matches at least n times, where n is a non-negative integer. * is equivalent to {0,} and + is equivalent to {1,} | o{2,} does not match the "o" in "Bob", but matches all the "o"s in "fooood" |
{n,m} | Matches at least n times and at most m times. n and m are non-negative integers with n <= m, and there cannot be a space between the comma and the numbers. ? is equivalent to {0,1} | In the search string "1234567", \d{1,3} matches "123", "456" and "7" |
(pattern) | Matches the pattern and saves the match. Saved matches can be retrieved from array elements returned by the exec method in JavaScript. To match the parenthesis characters themselves, use "\(" or "\)" | (Chapter\|Section) [1-9] matches "Chapter 5" and saves "Chapter" for later use |
(?:pattern) | Matches the pattern but does not save the match, i.e. the match is not stored for later use. This is useful when combining parts of a pattern with the "or" character (\|) | industr(?:y\|ies) is equivalent to industry\|industries |
(?=pattern) | Positive lookahead. Once a match is found, the search for the next match begins before the matched text. The match is not saved for later use | ^(?=.*\d).{4,8}$ applies the following restrictions to a password: it must be between 4 and 8 characters long and must contain at least one digit. In this pattern, .*\d looks for any number of characters followed by a digit; for the search string "abc3qr", this matches "abc3". Starting before this match (instead of after it), .{4,8} matches a string of 4 to 8 characters, which matches "abc3qr". ^ and $ specify the start and end positions of the search string and prevent a match if the search string contains any characters other than the matched ones |
(?!pattern) | Negative lookahead. Matches a search string that does not match the pattern. Once a match is found, the search for the next match begins before the matched text. The match is not saved for later use | \b(?!th)\w+\b matches words that do not begin with "th". In this pattern, \b matches a word boundary; for the search string " quick ", this matches the first space. (?!th) matches a string that is not "th", which matches "qu". Starting from that match, \w+ matches a word, i.e. "quick" |
\cx | Matches the control character indicated by x. The value of x must be in the range A-Z or a-z. If it is not, c is assumed to be the literal "c" character itself | \cM matches Ctrl+M, i.e. a carriage return character |
\xn | Matches n, where n is a hexadecimal escape code. The hexadecimal escape code must be exactly two digits long. ASCII codes are allowed in regular expressions | \x41 matches "A"; \x041 is equivalent to "\x04" followed by "1" (since n must be exactly two digits) |
\num | Matches num, where num is a positive integer. This is a backreference to a saved match | (.)\1 matches two consecutive identical characters |
\n | Identifies an octal escape code or a backreference. If \n is preceded by at least n capturing subexpressions, then n is a backreference; otherwise, if n is an octal digit (0-7), then n is an octal escape code | (\d)\1 matches two consecutive identical digits |
\nm | Identifies an octal escape code or a backreference. If \nm is preceded by at least nm capturing subexpressions, then nm is a backreference. If \nm is preceded by at least n capturing subexpressions, then n is a backreference followed by the literal m. If neither condition holds and n and m are octal digits (0-7), \nm matches the octal escape code nm | \11 matches the tab character |
\nml | When n is an octal digit (0-3) and m and l are octal digits (0-7), matches the octal escape code nml | \011 matches the tab character |
\un | Matches n, where n is a Unicode character represented as a 4-digit hexadecimal number. This is the JavaScript form; PHP's PCRE functions use \x{00A9} together with the u modifier instead | \u00A9 matches the copyright symbol (©) |
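A few of these entries can be checked directly in PHP. The sketch below reuses patterns from the table; the sentence used for the negative lookahead and the extra password strings are hypothetical inputs added for illustration.

```php
<?php
// \b: "er" only counts when it sits at the end of a word.
$never = preg_match('/er\b/', 'never');  // 1
$verb  = preg_match('/er\b/', 'verb');   // 0

// {n,m}: greedy chunks of up to three digits.
preg_match_all('/\d{1,3}/', '1234567', $chunks);  // $chunks[0] is ["123", "456", "7"]

// Backreference: \1 must repeat whatever group 1 captured.
preg_match('/(.)\1/', 'hello', $pair);  // $pair[0] is "ll"

// Positive lookahead: 4 to 8 characters, at least one digit.
$passwordRule = '/^(?=.*\d).{4,8}$/';
$ok      = preg_match($passwordRule, 'abc3qr');     // 1
$noDigit = preg_match($passwordRule, 'abcdef');     // 0, no digit
$tooLong = preg_match($passwordRule, 'abc3qr789');  // 0, nine characters

// Negative lookahead: whole words that do not start with "th".
preg_match_all('/\b(?!th)\w+\b/', 'the quick thin fox', $words);
// $words[0] is ["quick", "fox"]
```

In PHP, the saved matches that the table describes end up in the array passed as the third argument, with index 0 holding the full match and index 1 onward holding the captured groups.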
3) Non-printing characters
Non-printing characters are written as an escape character followed by an ordinary character. They are used in regular expressions to match characters such as line feeds, form feeds, and whitespace. The following table lists the non-printing characters.
Character | Matches | Equivalent to |
---|---|---|
\f | Form feed | \x0c and \cL |
\n | Newline (line feed) | \x0a and \cJ |
\r | Carriage return | \x0d and \cM |
\s | Any whitespace character, including space, tab, and form feed | [ \f\n\r\t\v] |
\S | Any non-whitespace character | [^ \f\n\r\t\v] |
\t | Tab character | \x09 and \cI |
\v | Vertical tab | \x0b and \cK |
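As an illustrative sketch, non-printing characters can be matched in PHP like this. The input string is a made-up example; it is written in double quotes so that PHP itself turns \t, \r and \n into real control characters, while the patterns stay in single quotes.

```php
<?php
// Hypothetical input containing tabs, a carriage return, and newlines.
$raw = "name\tage\r\nAlice\t30\n";

// \s+ collapses any run of whitespace into a single space.
$flat = trim(preg_replace('/\s+/', ' ', $raw));  // "name age Alice 30"

// \t in the pattern matches a tab; \x09 is the same character by hex code.
$tabs    = preg_match_all('/\t/', $raw);    // 2
$tabsHex = preg_match_all('/\x09/', $raw);  // 2

// \S+ matches each run of non-whitespace characters.
preg_match_all('/\S+/', $raw, $tokens);  // $tokens[0] is ["name", "age", "Alice", "30"]
```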
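4) Priority order
Regular expressions are evaluated from left to right and follow an order of precedence. The following table lists the operators from highest to lowest priority.
Order | Metacharacters | Description |
---|---|---|
1 | \ | Escape character |
2 | ( ), (?:), (?=), [ ] | Parentheses and brackets |
3 | *, +, ?, {n}, {n,}, {n,m} | Quantifiers |
4 | ^, $, \any metacharacter | Anchors and sequences |
5 | \| | Alternation |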
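The effect of this ordering can be seen in a short PHP sketch. The patterns and strings below are assumptions chosen to show that alternation binds most loosely and that quantifiers bind more tightly than plain sequences.

```php
<?php
// Alternation has the lowest priority: /^a|b$/ means "starts with a" OR "ends with b".
$loose = preg_match('/^a|b$/', 'apple');     // 1, because "apple" starts with "a"

// Parentheses have higher priority, so here the whole string must be "a" or "b".
$strict = preg_match('/^(a|b)$/', 'apple');  // 0
$exact  = preg_match('/^(a|b)$/', 'b');      // 1

// Quantifiers bind tighter than sequences: ab+ is a(b+), not (ab)+.
preg_match('/ab+/', 'ababbb', $single);      // $single[0] is "ab"
preg_match('/(ab)+/', 'ababbb', $grouped);   // $grouped[0] is "abab"
```

When a pattern does not behave as expected, adding explicit parentheses is usually the simplest way to make the intended grouping visible.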