Regular expressions are cumbersome but powerful. Once learned, they will not only improve your efficiency but also give you a real sense of accomplishment. If you read this material carefully and keep a reference at hand when applying it, mastering regular expressions is well within reach.
1. Introduction
Regular expressions are now widely used: in operating systems such as *nix (Linux, Unix, etc.) and HP, in development environments such as PHP, C#, and Java, and in many applications. Traces of them are everywhere.
Regular expressions achieve powerful results in a compact way. The price of that compactness is that the notation is dense and not easy to learn, so it takes some effort. Once you have the basics, using them with a reference at hand is relatively simple and effective.
Example: ^.+@.+\..+$
Code like this has scared me away many times, and it probably scares many other people away too. By the end of this article you should be able to read and write such code freely.
Note: Part 7 repeats some of the earlier content. Its purpose is to restate the material from the earlier tables so that it is easier to understand.
2. History of regular expressions
The "ancestors" of regular expressions can be traced all the way back to early research on how the human nervous system works. Two neurophysiologists, Warren McCulloch and Walter Pitts, developed a mathematical way to describe these neural networks.
In 1956, a mathematician named Stephen Kleene, building on the early work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets and Finite Automata" that introduced the concept of regular expressions. Regular expressions were the notation he used to describe what he called "the algebra of regular sets," hence the term "regular expression."
This work was later applied in early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was the qed editor in Unix.
The rest, as they say, is history. Regular expressions have been an important part of text-based editors and search tools ever since.
3. Regular expression definition
A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace matching substrings, or to extract the substrings that meet a certain condition.
When listing directories, *.txt in dir *.txt or ls *.txt is not a regular expression, because the meaning of * here is different from the * in regular expressions.
Regular expressions are text patterns composed of ordinary characters (such as the characters a to z) and special characters (called metacharacters). A regular expression acts as a template that matches a character pattern with a searched string.
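As a minimal sketch of the three uses named above (checking, replacing, and extracting), here is how they might look with Python's standard re module. The sample text and variable names are illustrative assumptions, not part of the original material.

```python
import re

text = "order 66 shipped, order 77 pending"  # hypothetical sample text

# Check whether the string contains a substring matching the pattern.
contains_order = re.search(r"order [0-9]+", text) is not None

# Replace every matching substring.
masked = re.sub(r"[0-9]+", "#", text)

# Extract all substrings that satisfy the pattern.
numbers = re.findall(r"[0-9]+", text)

print(contains_order, masked, numbers)
```

Other environments expose the same three operations under different names, so the patterns carry over even where the function calls do not.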
3.1 Common characters
Ordinary characters consist of all printing and non-printing characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.
3.2 Non-printing characters
Character    Meaning
\cx    Matches the control character specified by x. For example, \cM matches a Control-M or carriage return character. The value of x must be one of A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f    Matches a form feed. Equivalent to \x0c and \cL.
\n    Matches a newline character. Equivalent to \x0a and \cJ.
\r    Matches a carriage return character. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab character. Equivalent to \x09 and \cI.
\v    Matches a vertical tab character. Equivalent to \x0b and \cK.
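The equivalences in this table can be checked with a short sketch. It assumes Python's re module, which does not accept the \cx notation, so only the hexadecimal forms are compared; the sample string is made up for illustration.

```python
import re

# \x0a and \n denote the same newline character; likewise \x09 and \t.
same_newline = re.fullmatch(r"\x0a", "\n") is not None
same_tab = re.fullmatch(r"\x09", "\t") is not None

# \s matches each whitespace character; \S matches everything else.
sample = "a \t\n\r\f\vb"  # hypothetical sample: two letters around six whitespace characters
whitespace = re.findall(r"\s", sample)
non_whitespace = re.findall(r"\S", sample)

print(same_newline, same_tab, len(whitespace), non_whitespace)
```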
3.3 Special characters
Special characters are characters with a special meaning, such as the * in "*.txt" mentioned above, which stands for any string. To find files with a literal * in the file name, you need to escape the *, that is, put a backslash (\) in front of it: ls \*.txt. Regular expressions have the following special characters.
Special character    Description
$    Matches the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )    Mark the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*    Matches the preceding subexpression zero or more times. To match the * character, use \*.
+    Matches the preceding subexpression one or more times. To match the + character, use \+.
.    Matches any single character except the newline character \n. To match ., use \.
[    Marks the beginning of a bracket expression. To match [, use \[.
?    Matches the preceding subexpression zero or one time, or marks a qualifier as non-greedy. To match the ? character, use \?.
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', while '\n' matches a newline character. The sequence '\\' matches "\", and '\(' matches "(".
^    Matches the beginning of the input string, unless used in a bracket expression, in which case it negates the character set. To match the ^ character itself, use \^.
{    Marks the beginning of a qualifier expression. To match {, use \{.
|    Indicates a choice between two items. To match |, use \|.
Constructing regular expressions is the same as creating mathematical expressions. That is, using a variety of metacharacters and operators to combine small expressions to create larger expressions. The components of a regular expression can be a single character, a collection of characters, a range of characters, a selection between characters, or any combination of all of these components.
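The effect of escaping can be seen in a small sketch. It assumes Python's re module; the file names are hypothetical, and re.escape is a Python convenience that is not part of the original text.

```python
import re

filename = "notes*.txt"  # hypothetical file name containing a literal *

# Unescaped, * is a qualifier on the preceding "s" and . matches any character,
# so the pattern accepts a string that contains neither "*" nor ".".
loose_hit = re.fullmatch(r"notes*.txt", "noteXtxt") is not None

# Escaped, both characters are literal.
strict = re.compile(r"notes\*\.txt")
literal_hit = strict.fullmatch(filename) is not None
literal_miss = strict.fullmatch("noteXtxt") is None

# re.escape adds the needed backslashes automatically.
auto_hit = re.fullmatch(re.escape("1+1=2?"), "1+1=2?") is not None

print(loose_hit, literal_hit, literal_miss, auto_hit)
```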
3.4 Qualifier
A qualifier specifies how many times a given component of a regular expression must appear for a match to succeed. There are six: *, +, ?, {n}, {n,}, and {n,m}.
The *, + and ? qualifiers are all greedy: they match as much text as possible. Adding a ? after any of them makes the match non-greedy, or minimal.
The qualifiers of regular expressions are:
Character    Description
*    Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or "does". ? is equivalent to {0,1}.
{n}    n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but it does match both o's in "food".
{n,}    n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but it matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}    m and n are both non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there cannot be a space between the comma and the two numbers.
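The qualifier examples from this table, together with the greedy and non-greedy behavior described earlier, can be tried out as follows. This is a sketch using Python's re module; the variable names are illustrative only.

```python
import re

# * allows zero or more, + requires at least one, ? allows zero or one.
star = [bool(re.fullmatch(r"zo*", s)) for s in ("z", "zo", "zoo")]
plus = [bool(re.fullmatch(r"zo+", s)) for s in ("z", "zo", "zoo")]
optional = [bool(re.fullmatch(r"do(es)?", s)) for s in ("do", "does", "doeses")]

# {n}, {n,} and {n,m} give explicit counts.
exactly_two = re.findall(r"o{2}", "Bob food")
two_or_more = re.findall(r"o{2,}", "Bob f" + "o" * 5 + "d")
one_to_three = re.search(r"o{1,3}", "f" + "o" * 6 + "d").group()

# Greedy versus non-greedy on the same input.
greedy = re.search(r"o+", "oooo").group()
lazy = re.search(r"o+?", "oooo").group()

print(star, plus, optional, exactly_two, two_or_more, one_to_three, greedy, lazy)
```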
3.5 Locator
Locators describe the boundary of a string or a word. ^ and $ refer to the beginning and end of the string respectively, \b describes the leading or trailing boundary of a word, and \B represents a non-word boundary. Qualifiers cannot be applied to locators.
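A brief sketch of the two word-boundary locators, assuming Python's re module and using the words "never" and "verb" as test inputs:

```python
import re

words = ("never", "verb")

# \b after "er" requires a word boundary there: true at the end of "never",
# false inside "verb", where "b" follows.
at_boundary = [bool(re.search(r"er\b", w)) for w in words]

# \B requires the opposite: "er" must be followed by another word character.
inside_word = [bool(re.search(r"er\B", w)) for w in words]

print(at_boundary, inside_word)
```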
3.6 Select
Enclose the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the matched text is captured (cached). To avoid this, put ?: immediately after the opening parenthesis, before the first alternative.
?: is one of the non-capturing elements; the other two are ?= and ?!. These two carry additional meaning. The former is a positive lookahead: it matches the search string at any position where the text that follows matches the pattern in the parentheses. The latter is a negative lookahead: it matches at any position where the text that follows does not match that pattern.
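The difference between capturing groups, non-capturing groups, and the two lookaheads can be sketched as follows. The code assumes Python's re module, and the sample strings are chosen for illustration.

```python
import re

# A capturing group stores the chosen alternative...
capturing = re.search(r"industr(y|ies)", "heavy industries")
# ...while (?:...) groups the alternatives without storing anything.
non_capturing = re.search(r"industr(?:y|ies)", "heavy industries")
captured_groups = capturing.groups()
uncaptured_groups = non_capturing.groups()

# Positive lookahead: match "Windows " only when a listed version follows.
ahead = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
# Negative lookahead: match "Windows " only when none of them follows.
not_ahead_2000 = re.search(r"Windows (?!95|98|NT|2000)", "Windows 2000")
not_ahead_31 = re.search(r"Windows (?!95|98|NT|2000)", "Windows 3.1")

print(captured_groups, uncaptured_groups, repr(ahead.group()))
```

In both lookahead cases the matched text is only "Windows " with its trailing space; the version number is inspected but not consumed.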
3.7 Backreferences
Adding parentheses around a regular expression pattern, or part of one, causes the matched text to be stored in a temporary buffer. Each captured submatch is stored in the order it is encountered, from left to right, in the pattern. The buffers are numbered consecutively starting from 1, up to a maximum of 99 subexpressions. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number that identifies a particular buffer.
You can use the non-capturing metacharacters '?:', '?=', or '?!' to prevent the related matches from being stored.
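A typical use of a backreference is finding a word that appears twice in a row. The sketch below assumes Python's re module; the sentence is a made-up example.

```python
import re

sentence = "the cost of of gasoline is going up up"  # hypothetical input

# ([a-z]+) captures a word into buffer 1; \1 requires the same text again.
doubled = re.findall(r"\b([a-z]+) \1\b", sentence)

# The captured text can also be reused in the replacement string.
cleaned = re.sub(r"\b([a-z]+) \1\b", r"\1", sentence)

print(doubled)
print(cleaned)
```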
4. Operational precedence of various operators
Operations with the same priority are performed from left to right, and operations with different priorities are performed from high to low. The precedence of various operators from high to low is as follows:
Operator    Description
\    Escape character
(), (?:), (?=), []    Parentheses and square brackets
*, +, ?, {n}, {n,}, {n,m}    Qualifiers
^, $, \anymetacharacter    Position and sequence
|    "OR" operation
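Two consequences of this ordering are easy to verify: a qualifier applies only to the item immediately before it, and | has the lowest priority. The following is a sketch assuming Python's re module.

```python
import re

# A qualifier binds tighter than sequencing: in "ab*" the * applies to "b" only.
b_repeated = bool(re.fullmatch(r"ab*", "abbb"))
ab_repeated = bool(re.fullmatch(r"ab*", "abab"))
grouped = bool(re.fullmatch(r"(ab)*", "abab"))

# | binds loosest: "z|food" means "z" or "food", not "zood" or "food".
samples = ("z", "food", "zood")
loose = [bool(re.fullmatch(r"z|food", s)) for s in samples]
tight = [bool(re.fullmatch(r"(z|f)ood", s)) for s in samples]

print(b_repeated, ab_repeated, grouped, loose, tight)
```

Parentheses are the way to override the default order, just as in arithmetic.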
5. Explanation of all symbols
Character    Description
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^    Matches the beginning of the input string. If the RegExp object's Multiline property is set, ^ also matches the position after '\n' or '\r'.
$    Matches the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position before '\n' or '\r'.
*    Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or "does". ? is equivalent to {0,1}.
{n}    n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but it does match both o's in "food".
{n,}    n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob" but does match all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}    m and n are both non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there cannot be a space between the comma and the two numbers.
?    When this character immediately follows any other qualifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, for the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.    Matches any single character except "\n". To match any character including '\n', use a pattern like '[\s\S]'.
(pattern)    Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)    Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful when combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a shorter expression than 'industry|industries'.
(?=pattern)    Positive lookahead. Matches the search string at any position where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)    Negative lookahead. Matches the search string at any position where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but not "Windows" in "Windows 2000". Lookahead does not consume characters: after a match, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y    Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]    Character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]    Negated character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]    Character range. Matches any character within the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' through 'z'.
[^a-z]    Negated character range. Matches any character not within the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.
\b    Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B    Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\cx    Matches the control character specified by x. For example, \cM matches a Control-M or carriage return character. The value of x must be one of A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d    Matches a digit character. Equivalent to [0-9].
\D    Matches a non-digit character. Equivalent to [^0-9].
\f    Matches a form feed. Equivalent to \x0c and \cL.
\n    Matches a newline character. Equivalent to \x0a and \cJ.
\r    Matches a carriage return character. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab character. Equivalent to \x09 and \cI.
\v    Matches a vertical tab character. Equivalent to \x0b and \cK.
\w    Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W    Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn    Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num    Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n    Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm    Identifies an octal escape value or a backreference. \nm is a backreference if it is preceded by at least nm captured subexpressions. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds, and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml    If n is an octal digit (0-3), and m and l are both octal digits (0-7), matches the octal escape value nml.
\un    Matches n, where n is a Unicode character represented by four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
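A few entries from this reference can be spot-checked with a short sketch. It assumes Python's re module, whose handling of some escapes (for example \cx and octal values) differs from the JScript and VBScript behavior described above, so only entries that carry over are shown.

```python
import re

# \xnn and \unnnn name a character by its code.
hex_a = re.fullmatch(r"\x41", "A") is not None
copyright_sign = re.fullmatch(r"\u00A9", "\u00a9") is not None

# . skips the newline; [\s\S] matches any character including it.
dot_hits = re.findall(r".", "a\nb")
any_hits = re.findall(r"[\s\S]", "a\nb")

# Character sets and negated sets, using the word "plain" from the table.
first_in_set = re.search(r"[abc]", "plain").group()
first_not_in_set = re.search(r"[^abc]", "plain").group()

# (.)\1 finds two consecutive identical characters.
double_letter = re.search(r"(.)\1", "bookkeeper").group()

print(hex_a, copyright_sign, dot_hits, any_hits, first_in_set, first_not_in_set, double_letter)
```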
6. Some examples
Regular expression    Description
/\b([a-z]+) \1\b/gi    Finds places where a word occurs twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/    Splits a URL into protocol, domain, port, and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/    Locates chapter and section headings
/[-a-z]/    The 26 letters a to z, plus the - sign
/ter\b/    Matches chapter, but not terminal
/\Bapt/    Matches chapter, but not aptitude
/Windows(?=95 |98 |NT )/    Matches the Windows in Windows95, Windows98, or WindowsNT when followed by a space. After a match, the next search starts immediately after Windows.
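To see two of these examples at work, the sketch below applies the URL pattern and the chapter pattern using Python's re module. The URL and heading strings are hypothetical, and in Python the / character needs no escaping.

```python
import re

# URL pattern from the table, split into four capturing groups.
URL_PATTERN = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")

url = "http://www.example.com:80/docs/index.html"  # hypothetical URL
protocol, domain, port, path = URL_PATTERN.match(url).groups()

# Chapter pattern: one or two digits, the first of which is not zero.
CHAPTER_PATTERN = re.compile(r"^(?:Chapter|Section) [1-9][0-9]{0,1}$")
headings = ("Chapter 1", "Section 22", "Chapter 100")
chapter_hits = [bool(CHAPTER_PATTERN.search(h)) for h in headings]

print(protocol, domain, port, path, chapter_hits)
```

Note that the port group, when present, includes the leading colon, because the colon sits inside the parentheses.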
7. Regular expression matching rules
7.1 Basic pattern matching
Everything starts from the basics. Patterns are the most basic elements of regular expressions. They are a set of characters that describe the characteristics of a string. Patterns can be simple, consisting of ordinary strings, or very complex, often using special characters to represent a range of characters, recurrences, or to represent context. For example:
^once
This pattern contains a special character ^, which means that the pattern only matches those strings starting with once. For example, this pattern matches the string "once upon a time" but does not match "There once was a man from NewYork". Just like the ^ symbol indicates the beginning, the $ symbol is used to match strings that end with a given pattern.
bucket$
This pattern matches "Who kept all of his cash in a bucket" but does not match "buckets". When ^ and $ are used together, the pattern requires an exact match (the whole string must equal the pattern). For example:
^bucket$
Only the string "bucket" matches. If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern
once
matches the string
There once was a man from NewYork
Who kept all of his cash in a bucket.
The letters (o-n-c-e) in this pattern are literal characters, that is, they represent the letters themselves; the same goes for digits. Some slightly more complex characters, such as punctuation marks and whitespace characters (spaces, tabs, etc.), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for the tab character is \t. So to detect whether a string starts with a tab character, we can use this pattern:
^\t
Similarly, \n represents a new line and \r a carriage return. Other special symbols can be written with a backslash in front: the backslash itself is represented by \\, the period by \., and so on.
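The patterns in this section can be tried against the sample strings directly. This is a sketch assuming Python's re module, where re.search looks for the pattern anywhere in the string unless ^ or $ pins it down.

```python
import re

line1 = "There once was a man from NewYork"
line2 = "Who kept all of his cash in a bucket"

starts_with_once = [bool(re.search(r"^once", s)) for s in ("once upon a time", line1)]
ends_with_bucket = [bool(re.search(r"bucket$", s)) for s in (line2, "buckets")]
exactly_bucket = [bool(re.search(r"^bucket$", s)) for s in ("bucket", line2)]
contains_once = bool(re.search(r"once", line1))

# Escape sequences: \t is a tab, \. is a literal period.
starts_with_tab = bool(re.search(r"^\t", "\tindented"))
literal_dot = [bool(re.search(r"a\.b", s)) for s in ("a.b", "axb")]

print(starts_with_once, ends_with_bucket, exactly_bucket,
      contains_once, starts_with_tab, literal_dot)
```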
7.2 Character Clusters
In Internet programs, regular expressions are usually used to validate user input. When a user submits a form, ordinary literal characters are not enough to determine whether the entered phone number, address, email address, credit card number, and so on are valid.
So we need a more flexible way to describe the pattern we want: character clusters. To create a cluster representing all vowel characters, place all the vowels in square brackets:
[AaEeIiOoUu]
This pattern matches any vowel character, but can only represent one character. Use hyphens to represent a range of characters, such as:
[a-z] //Match all lowercase letters
[A-Z] //Match all uppercase letters
[a-zA-Z] //Match all letters
[0-9] //Match all numbers
[0-9.-] //Match all numbers, periods and minus signs
[ \f\r\t\n] //Match all whitespace characters
Again, each of these represents only one character, which is an important point. If you want to match a string consisting of one lowercase letter followed by one digit, such as "z2", "t6" or "g7", but not "ab2", "r2d3" or "b52", use this pattern:
^[a-z][0-9]$
Although [a-z] represents a range of 26 letters, here it can only match strings whose first character is a lowercase letter.
It was mentioned earlier that ^ represents the beginning of a string, but it also has another meaning. When ^ is used within a set of square brackets, it means "not" or "exclude" and is often used to eliminate a certain character. Using the previous example, we require that the first character cannot be a number:
^[^0-9][0-9]$
This pattern matches "&5", "g7" and "-2", but does not match "12" and "66". Here are a few examples of excluding specific characters:
[^a-z] //All characters except lowercase letters
[^\\\/\^] //All characters except (\)(/)(^)
[^"'] //All characters except double quotes (") and single quotes (')
The special characters "." (dot, period) are used in regular expressions to represent all characters except "new line". So the pattern "^.5$" matches any two-character string that ends with the number 5 and begins with some other non-"newline" character. The pattern "." can match any string, except empty strings and strings containing only a "new line".
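The accept and reject lists given above for the two anchored patterns can be reproduced with a short sketch. It assumes Python's re module; the vowel sample string is made up for illustration.

```python
import re

# One lowercase letter followed by one digit, and nothing else.
letter_digit = re.compile(r"^[a-z][0-9]$")
candidates = ("z2", "t6", "g7", "ab2", "r2d3", "b52")
accepted = [s for s in candidates if letter_digit.search(s)]

# Inside brackets, ^ negates: any non-digit followed by one digit.
non_digit_then_digit = re.compile(r"^[^0-9][0-9]$")
accepted_negated = [s for s in ("&5", "g7", "-2", "12", "66")
                    if non_digit_then_digit.search(s)]

# A cluster always stands for a single character, so findall returns
# the vowels one at a time.
vowels = re.findall(r"[AaEeIiOoUu]", "Regular Expression")

print(accepted, accepted_negated, vowels)
```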
PHP's regular expressions have some built-in character clusters for common cases:
Character cluster    Meaning
[[:alpha:]]    Any letter
[[:digit:]]    Any digit
[[:alnum:]]    Any letter or digit
[[:space:]]    Any whitespace character
[[:upper:]]    Any uppercase letter
[[:lower:]]    Any lowercase letter
[[:punct:]]    Any punctuation mark
[[:xdigit:]]    Any hexadecimal digit, equivalent to [0-9a-fA-F]
7.3 Determine recurrence
By now you know how to match a single letter or digit, but more often you will want to match a word or a group of digits. A word consists of several letters, and a number consists of several digits. Curly braces ({}) following a character or character cluster specify how many times the preceding content is repeated.
Pattern    Meaning
^[a-zA-Z_]$    Any single letter or underscore
^[[:alpha:]]{3}$    Any 3-letter word
^a$    The letter a
^a{4}$    aaaa
^a{2,4}$    aa, aaa, or aaaa
^a{1,3}$    a, aa, or aaa
^a{2,}$    A string of two or more a's and nothing else
^a{2,}    Such as aardvark and aaab, but not apple
a{2,}    Such as baad and aaa, but not Nantucket
\t{2}    Two tab characters
.{2}    Any two characters
These examples show three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster appears exactly x times"; a number plus a comma, {x,}, means "the preceding content appears x or more times"; two comma-separated numbers, {x,y}, mean "the preceding content appears at least x times, but not more than y times". We can extend the pattern to more words or numbers:
^[a-zA-Z0-9_]{1,}$ //All strings consisting of one or more letters, digits, or underscores
^[0-9]{1,}$ //All non-negative integers
^-{0,1}[0-9]{1,}$ //All integers
^-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$ //All decimals
The last example is not easy to understand, is it? Read it this way: starting at the beginning of the string (^), an optional minus sign (-{0,1}), followed by zero or more digits ([0-9]{0,}), an optional decimal point (\.{0,1}), zero or more further digits ([0-9]{0,}), and nothing else ($). Simpler ways to write this are described next.
The special character "?" is equivalent to {0,1}; both mean "zero or one of the preceding content", that is, "the preceding content is optional". So the example just given can be simplified to:
^-?[0-9]{0,}\.?[0-9]{0,}$
The special character "*" is equivalent to {0,}; both mean "zero or more of the preceding content". Finally, the character "+" is equivalent to {1,}, which means "one or more of the preceding content", so the four examples above can be written as:
^[a-zA-Z0-9_]+$ //All strings consisting of one or more letters, digits, or underscores
^[0-9]+$ //All non-negative integers
^-?[0-9]+$ //All integers
^-?[0-9]*\.?[0-9]*$ //All decimals
Of course this doesn’t technically reduce the complexity of regular expressions, but it makes them easier to read.
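One way to convince yourself that the shorthand forms are equivalent to the brace forms is to run both against the same inputs. The sketch below assumes Python's re module; the sample strings and the helper function matches are hypothetical.

```python
import re

# Each brace form from the text, paired with its shorthand.
PAIRS = [
    (r"^[a-zA-Z0-9_]{1,}$", r"^[a-zA-Z0-9_]+$"),
    (r"^[0-9]{1,}$", r"^[0-9]+$"),
    (r"^-{0,1}[0-9]{1,}$", r"^-?[0-9]+$"),
    (r"^-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$", r"^-?[0-9]*\.?[0-9]*$"),
]
SAMPLES = ["user_1", "42", "-7", "3.14", "-0.5", "abc!", "1.2.3", ""]


def matches(pattern, samples):
    """Return the samples that the pattern accepts, in order."""
    return [s for s in samples if re.search(pattern, s)]


# The long and short forms should accept exactly the same samples.
agree = all(matches(long_form, SAMPLES) == matches(short_form, SAMPLES)
            for long_form, short_form in PAIRS)
decimals = matches(PAIRS[3][1], SAMPLES)

print(agree, decimals)
```

Because every part of the decimal pattern is optional, it also accepts the empty string, so a stricter pattern may be needed when validating real input.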