A more detailed guide to Python regular expression operations

巴扎黑
Release: 2017-05-21 11:16:38

Python has included the re module since version 1.5; it provides Perl-style regular expression patterns. Versions before Python 1.5 provided Emacs-style patterns through the regex module. Emacs-style patterns are slightly less readable and less powerful, so avoid the regex module when writing new code, though you may occasionally still find traces of it in old code.

By its very nature, regular expressions (or RE) are a small, highly specialized programming language embedded in Python and implemented through the re module. Using this small language, you can specify rules for a corresponding set of strings that you want to match; that set of strings might contain English sentences, e-mail addresses, TeX commands, or whatever you want. You can then ask questions such as "Does this string match the pattern?" or "Does any part of this string match the pattern?". You can also use RE to modify or split strings in various ways.


Regular expression patterns are compiled into a series of bytecodes and then executed by a matching engine written in C. In advanced usage, one might also want to pay careful attention to how the engine executes a given RE, and how the RE is written in a specific way to make the produced bytecode run faster. This article does not cover optimization, as that requires you to fully understand the internal mechanisms of the matching engine.


The regular expression language is relatively small and restricted (limited functionality), so not all string processing can be done with regular expressions. Of course, there are some tasks that can be accomplished with regular expressions, but in the end the expressions will become extremely complex. When you encounter these situations, it may be better to write Python code to deal with it; although Python code is slower than a fancy regular expression, it is easier to understand.

Simple Patterns

We will start by learning about the simplest possible regular expressions. Since regular expressions are used to operate on strings, let's begin with the most common task: matching characters.


For a detailed explanation of the computer science underlying regular expressions (deterministic and non-deterministic finite automata), you can consult any textbook on writing compilers.

Character matching

Most letters and characters will generally match themselves. For example, the regular expression test will exactly match the string "test". (You can also use case-insensitive mode, which will also make this RE match "Test" or "TEST"; more on that later.)

There are exceptions to this rule; some characters are special metacharacters that do not match themselves. Instead, they signal that something out of the ordinary should be matched, or they affect other parts of the RE by repeating them or changing their meaning. Much of this article is devoted to discussing the various metacharacters and what they do.

Here is a complete list of metacharacters; their meanings are discussed in the remainder of this guide.

. ^ $ * + ? { [ ] \ | ( )

The first metacharacters we will look at are "[" and "]". They are used to specify a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters separated by a "-". For example, [abc] will match any of the characters "a", "b", or "c"; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

Metacharacters are not active inside classes. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; "$" is usually a metacharacter, but inside a character class it is stripped of its special nature and reverts to being an ordinary character.

You can match the characters not listed within a class by complementing the set. This is indicated by including a "^" as the first character of the class; "^" anywhere else in the class simply matches the "^" character itself. For example, [^5] will match any character except "5".
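The character classes described above can be tried out in a few lines. This is only a sketch: it is written for Python 3, the sample strings are made up for illustration, and it borrows re.findall(), which this guide introduces later, to list every match.

```python
import re

text = "back to $5 camp"   # invented sample text

listed = re.findall(r"[abc]", text)     # any one of "a", "b", "c"
ranged = re.findall(r"[a-c]", text)     # the same set written as a range
dollar = re.findall(r"[akm$]", text)    # "$" is an ordinary character inside a class
not_five = re.findall(r"[^5]", "525")   # complemented class: anything except "5"

print(listed)    # ['b', 'a', 'c', 'c', 'a']
print(dollar)    # ['a', 'k', '$', 'a', 'm']
print(not_five)  # ['2']
```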

Perhaps the most important metacharacter is the backslash, "\". As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It is also used to escape all the metacharacters so that you can still match them in patterns. For example, if you need to match a "[" or a "\", you can precede them with a backslash to remove their special meaning: \[ or \\.

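Some of the special sequences beginning with "\" represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn't whitespace. The following predefined special sequences are available:

\d Matches any decimal digit; equivalent to the class [0-9]
\D Matches any non-digit character; equivalent to [^0-9]
\s Matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S Matches any non-whitespace character; equivalent to [^ \t\n\r\f\v]
\w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
\W Matches any non-alphanumeric character; equivalent to [^a-zA-Z0-9_]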
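As a rough illustration of the predefined sequences \d, \D, \s and \w, the sketch below runs them over an invented sample string. It assumes Python 3 and ASCII-only input; with non-ASCII text, Python 3 string patterns treat these sequences more broadly than the bracketed classes suggest.

```python
import re

sample = "Room 42, floor 3"   # invented sample text

digits = re.findall(r"\d", sample)           # single decimal digits
words = re.findall(r"\w+", sample)           # runs of alphanumerics or "_"
spaces = re.findall(r"\s", sample)           # whitespace characters
non_digits = re.findall(r"\D+", "a1b22c")    # runs of anything except digits

print(digits)      # ['4', '2', '3']
print(words)       # ['Room', '42', 'floor', '3']
print(non_digits)  # ['a', 'b', 'c']
```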

#!python
>>> import re
>>> p = re.compile('ab*')
>>> print p
<re.RegexObject instance at 80b4150>

re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations. We'll go over the available settings later, but for now a single example will do:

#!python
>>> p = re.compile('ab*', re.IGNORECASE)

The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions aren't part of the core Python language, and no special syntax was created for expressing them. (There are applications that don't need REs at all, so there's no need to bloat the language specification by including them.) Instead, the re module is simply a C extension module included with Python, just like the socket or zlib modules.


Putting REs in strings keeps the Python language simpler, but it has one disadvantage, which is the topic of the next section.

The trouble with backslashes

As stated earlier, regular expressions use the backslash character ("\") to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python's usage of the same character for the same purpose in string literals.


Let's say you want to write an RE that matches the string "\section", which might be found in a LaTeX file. To figure out what to write in the program code, start with the string you want to match. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, which gives \\section. Finally, to express this as a Python string literal, both backslashes must be escaped again.

Characters       Stage
\section         Text string to be matched
\\section        Escaped backslash for re.compile
"\\\\section"    Escaped backslashes for a string literal


In short, to match a literal backslash, you have to write '\\\\' as the RE string, because the regular expression must be "\\", and each backslash must be expressed as "\\" inside a regular Python string literal. In REs that feature backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.


The solution is to use Python's raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with "r", so r"\n" is a two-character string containing "\" and "n", while "\n" is a one-character string containing a newline. Regular expressions are usually written in Python code using this raw string notation.

Regular string    Raw string
"ab*"             r"ab*"
"\\\\section"     r"\\section"
"\\w+\\s+\\1"     r"\w+\s+\1"
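The equivalence between the two notations can be checked directly. The following is a small sketch in Python 3; the sample text is invented, and the variable names are only illustrative.

```python
import re

text = r"\section{Intro}"            # a backslash followed by "section{Intro}"

plain = re.compile("\\\\section")    # regular string literal: four backslashes
raw = re.compile(r"\\section")       # raw string literal: two backslashes

same_pattern = plain.pattern == raw.pattern   # both hold the RE \\section
found = raw.search(text).group()              # the matched text, \section

raw_len = len(r"\n")     # 2: a backslash and the letter "n"
plain_len = len("\n")    # 1: a single newline character

print(same_pattern, found, raw_len, plain_len)
```

Both compiled objects hold exactly the same pattern; only the way it is spelled in the source code differs.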

Perform the match

Once you have an object representing a compiled regular expression, what do you do with it? `RegexObject` instances have several methods and attributes. Only the most significant ones are covered here; consult the Python Library Reference for a complete listing.

Method/Attribute   Purpose
match()            Determine if the RE matches at the beginning of the string
search()           Scan through a string, looking for any location where this RE matches
findall()          Find all substrings where the RE matches, and return them as a list
finditer()         Find all substrings where the RE matches, and return them as an iterator


If no match is found, match() and search() will return None. If successful, a `MatchObject` instance will be returned with information about the match: where it starts and ends, the substring it matched, etc.

You can learn about this by experimenting interactively with the re module. If you have Tkinter available, you may also want to look at Tools/scripts/redemo.py, a demonstration program included with the Python distribution.

First, run the Python interpreter, import the re module and compile a RE:

#!python
Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 80c3c28>

Now you can try matching various strings against the RE [a-z]+. An empty string shouldn't match at all, since + means "one or more repetitions". match() should return None in this case, which causes the interpreter to print no output. You can explicitly print the result of match() to make this clear.



#!python
>>> p.match("")
>>> print p.match("")
None

Now, let's try it on a string that it should match, such as "tempo". In this case, match() will return a MatchObject, so you should store the result in a variable for later use.

#!python
>>> m = p.match('tempo')
>>> print m
<_sre.SRE_Match object at 80c4f68>

Now you can query the `MatchObject` for information about the matching string. MatchObject instances also have several methods and attributes; the most important ones are:

Method/Attribute   Purpose
group()            Return the string matched by the RE
start()            Return the starting position of the match
end()              Return the ending position of the match
span()             Return a tuple containing the (start, end) positions of the match


Try these methods and you will soon know what they do:

#!python
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)

group() returns the substring that was matched by the RE. start() and end() return the starting and ending index of the match. span() returns both start and end indexes in a single tuple. Since the match method only checks whether the RE matches at the start of a string, start() will always be zero. However, the search method of `RegexObject` instances scans through the string, so the match may not start at zero in that case.


#!python
>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<re.MatchObject instance at 80c9650>
>>> m.group()
'message'
>>> m.span()
(4, 11)

In actual programs, the most common style is to store the `MatchObject` in a variable, and then check whether it is None. This usually looks like:

#!python
p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

Two `RegexObject` methods return all of the matches for a pattern. findall() returns a list of matching strings:


#!python
>>> p = re.compile('\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']

findall() has to create the entire list before it can be returned as the result. In Python 2.2, the finditer() method is also available; it returns a sequence of `MatchObject` instances as an iterator.

#!python
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
...     print match.span()
...
(0, 2)
(22, 24)
(29, 31)

Module-level functions

You don't have to create a `RegexObject` and call its methods; the re module also provides top-level functions called match(), search(), sub(), and so forth. These functions take the same arguments as the corresponding `RegexObject` method, with the RE string added as the first argument, and still return either None or a `MatchObject` instance.

#!python
>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<re.MatchObject instance at 80c5978>

Under the hood, these functions simply create a `RegexObject` for you and call the appropriate method on it. They also store the compiled object in a cache, so future calls using the same RE are faster.
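A quick way to convince yourself that the two styles agree is to run the same pattern both ways and compare the results. This is a sketch in Python 3, reusing the strings from the transcript above; the variable names are only illustrative.

```python
import re

line = "From amk Thu May 14 19:12:10 1998"

via_module = re.match(r"From\s+", line)               # module-level function
via_object = re.compile(r"From\s+").match(line)       # compiled object method

same_span = via_module.span() == via_object.span()    # both report (0, 5)
no_match = re.match(r"From\s+", "Fromage amk")        # None: no whitespace after "From"

print(same_span, via_module.span(), no_match)
```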


Should you use these module-level functions, or should you get the `RegexObject` and call its methods yourself? That choice depends on how frequently the RE will be used, and on your personal coding style. If an RE is only used at one point in the code, the module-level functions are probably more convenient. If a program contains a lot of regular expressions, or reuses the same ones in several locations, it may be worthwhile to collect all the definitions in one place, in a section of code that compiles all the REs ahead of time. To take an example from the standard library, here's an extract from xmllib.py:

#!python
ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )

I usually prefer to use compiled objects, even if it's only used once, but few people will be as much of a purist about this as I am.

Compilation flags

Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names: a long name such as IGNORECASE, and a short, one-letter form such as I. (If you're familiar with Perl's pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
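Combining flags with bitwise OR might look like the sketch below. It assumes Python 3 and uses an invented two-line string; the pattern only matches when both flags are active.

```python
import re

text = "first line\nSECOND line"   # invented sample text

# re.I | re.M is the short spelling of re.IGNORECASE | re.MULTILINE.
both = re.compile(r"^second", re.I | re.M).search(text)
only_i = re.compile(r"^second", re.I).search(text)    # "^" still means start of string

print(both.group())   # SECOND
print(only_i)         # None
```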

Here's a table of the available flags, followed by a more detailed explanation of each one.

Flag             Meaning
DOTALL, S        Make . match any character, including newlines
IGNORECASE, I    Do case-insensitive matches
LOCALE, L        Do a locale-aware match
MULTILINE, M     Multi-line matching, affecting ^ and $
VERBOSE, X       Enable verbose REs, which can be organized more cleanly and understandably

I
IGNORECASE

Perform case-insensitive matching; character classes and literal strings will match letters by ignoring case. For example, [A-Z] will match lowercase letters too, and Spam will match "Spam", "spam", or "spAM". This lowercasing doesn't take the current locale into account.
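A minimal sketch of IGNORECASE in Python 3, using the "Spam" spellings mentioned above plus one invented non-matching word:

```python
import re

p = re.compile(r"spam", re.IGNORECASE)

# Keep only the candidates that the case-insensitive pattern matches.
hits = [s for s in ("Spam", "spam", "spAM", "ham") if p.match(s)]

# With the flag set, an upper-case class also matches lower-case letters.
upper_class = re.match(r"[A-Z]+", "abc", re.IGNORECASE)

print(hits)                  # ['Spam', 'spam', 'spAM']
print(upper_class.group())   # abc
```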

L
LOCALE

Make \w, \W, \b, and \B dependent on the current locale.

Locales are a feature of the C library intended to help in writing programs that take account of language differences. For example, if you're processing French text, you'd want to be able to write \w+ to match words, but \w only matches the character class [A-Za-z]; it won't match "é" or "ç". If your system is configured properly and a French locale is selected, certain C functions will tell the program that "é" should also be considered a letter. Setting the LOCALE flag when compiling a regular expression will cause the resulting compiled object to use these C functions for \w; this is slower, but also enables \w+ to match French words as you'd expect.

M
MULTILINE


(^ and $ haven't been explained yet; they'll be introduced in the "More Metacharacters" section.)

Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches either at the end of the string or at the end of each line (immediately preceding each newline).
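The effect of MULTILINE on ^ and $ can be seen by running the same patterns with and without the flag. This is a sketch in Python 3 over an invented three-line string.

```python
import re

text = "one\ntwo\nthree\n"   # invented sample text

default_starts = re.findall(r"^\w+", text)                 # start of string only
multi_starts = re.findall(r"^\w+", text, re.MULTILINE)     # start of every line
default_ends = re.findall(r"\w+$", text)                   # just before the final newline
multi_ends = re.findall(r"\w+$", text, re.MULTILINE)       # end of every line

print(default_starts, multi_starts)   # ['one'] ['one', 'two', 'three']
print(default_ends, multi_ends)       # ['three'] ['one', 'two', 'three']
```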

S
DOTALL

Makes the "." special character match any character at all, including a newline; without this flag, "." will match anything except a newline.
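For comparison, here is a short Python 3 sketch of "." with and without DOTALL; the two-line string is invented.

```python
import re

text = "ab\ncd"   # invented sample text containing a newline

default = re.match(r".+", text).group()             # stops at the newline
dotall = re.match(r".+", text, re.DOTALL).group()   # runs across the newline

print(repr(default))   # 'ab'
print(repr(dotall))    # 'ab\ncd'
```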

X
VERBOSE


This flag allows you to write regular expressions that are more readable by granting you more flexibility in how you can format them. When this flag is specified, whitespace within the RE string is ignored, except when the whitespace is in a character class or preceded by an unescaped backslash; this lets you organize and indent the RE more clearly. It also enables you to put comments within a RE that will be ignored by the engine; comments are marked by a "#" that is neither in a character class nor preceded by an unescaped backslash.


For example, here is a RE using re.VERBOSE; see how much easier it is to read it?

#!python
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
   [0-9]+[^0-9]      # Decimal form
   | 0[0-7]+[^0-7]   # Octal form
   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
 )
""", re.VERBOSE)

Without the verbose setting, the RE would look like this:


#!python
charref = re.compile("&#([0-9]+[^0-9]"
                     "|0[0-7]+[^0-7]"
                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")

In the above example, Python's automatic concatenation of string literals has been used to break up the RE into smaller pieces, but it's still more difficult to understand than the version using re.VERBOSE.

More Pattern Power

So far we've only covered a part of the features of regular expressions. In this section, we'll cover some new metacharacters, and how to use groups to retrieve portions of the text that was matched.

More Metacharacters

There are some metacharacters we haven’t shown yet, most of which will be shown in this section.


Some of the remaining metacharacters to be discussed are zero-width assertions. They don't cause the engine to advance through the string; instead, they consume no characters at all and simply succeed or fail. For example, \b is an assertion that the current position is located at a word boundary; the position isn't changed by \b at all. This means that zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched an infinite number of times.

|

Alternation, or the "or" operator. If A and B are regular expressions, A|B will match any string that matches either A or B. | has very low precedence in order to make it work reasonably when you're alternating multi-character strings. Crow|Servo will match either "Crow" or "Servo", not "Cro", a "w" or an "S", and "ervo".

To match a literal "|", use \|, or enclose it inside a character class, as in [|].
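The low precedence of | can be checked with a short Python 3 sketch. The string "Croervo" is an invented example of what the pattern would match if | bound tightly to the neighbouring letters only.

```python
import re

p = re.compile(r"Crow|Servo")

crow = p.match("Crow").group()
servo = p.match("Servo").group()
partial = p.fullmatch("Croervo")   # None: the alternatives are whole words

# A literal "|" can be escaped or placed inside a character class.
escaped = re.findall(r"\|", "a|b|c")
in_class = re.findall(r"[|]", "a|b|c")

print(crow, servo, partial, escaped, in_class)
```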

^


Matches at the beginning of lines. Unless the MULTILINE flag has been set, this will only match at the beginning of the string. In MULTILINE mode, this also matches immediately after each newline within the string.

For example, if you wish to match the word "From" only at the beginning of a line, the RE to use is ^From.


#!python
>>> print re.search('^From', 'From Here to Eternity')
<re.MatchObject instance at 80c1520>
>>> print re.search('^From', 'Reciting From Memory')
None

$

Matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.

#!python
>>> print re.search('}$', '{block}')
<re.MatchObject instance at 80adfa8>
>>> print re.search('}$', '{block} ')
None
>>> print re.search('}$', '{block}\n')
<re.MatchObject instance at 80adfa8>

To match a literal "$", use \$ or enclose it inside a character class, as in [$].

\A

Matches only at the start of the string. When not in MULTILINE mode, \A and ^ are effectively the same. In MULTILINE mode, however, they're different; \A still matches only at the beginning of the string, but ^ may match at any location inside the string that follows a newline character.

\Z

Matches only at the end of the string.
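The difference between the string anchors \A and \Z and the line anchors ^ and $ shows up once newlines are involved. The sketch below assumes Python 3 and uses an invented two-line string.

```python
import re

text = "From A\nFrom B\n"   # invented sample text

caret = re.findall(r"^From", text, re.MULTILINE)      # once per line
anchor_a = re.findall(r"\AFrom", text, re.MULTILINE)  # only at the very start

dollar = re.search(r"B$", text)     # "$" also matches just before a final newline
anchor_z = re.search(r"B\Z", text)  # None: a newline still follows "B"

print(len(caret), len(anchor_a))        # 2 1
print(dollar is not None, anchor_z)     # True None
```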

\b


Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

The following example matches "class" only when it's a complete word; it won't match when it's contained inside another word.

#!python
>>> p = re.compile(r'\bclass\b')
>>> print p.search('no class at all')
<re.MatchObject instance at 80c8f28>
>>> print p.search('the declassified algorithm')
None
>>> print p.search('one subclass is')
None

There are two subtleties you should remember when using this special sequence. First, this is the worst collision between Python's string literals and regular expression sequences. In Python's string literals, "\b" is the backspace character, ASCII value 8. If you're not using raw strings, then Python will convert the "\b" to a backspace, and your RE won't match as you expect it to. The following example looks the same as our previous RE, but omits the "r" in front of the RE string.

#!python
>>> p = re.compile('\bclass\b')
>>> print p.search('no class at all')
None
>>> print p.search('\b' + 'class' + '\b')
<re.MatchObject instance at 80c3ee0>

Second, inside a character class, where there's no use for this assertion, \b represents the backspace character, for compatibility with Python's string literals.

\B

Another zero-width assertion, this is the opposite of \b, matching only when the current position is not at a word boundary.
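To see \b and \B side by side, the sketch below (Python 3, invented sample text) finds "class" as a whole word and then finds the positions where it begins in the middle of a longer word.

```python
import re

text = "class subclass declassified"   # invented sample text

whole = re.findall(r"\bclass\b", text)                        # complete word only
inside = [m.start() for m in re.finditer(r"\Bclass", text)]   # not at a word start

print(whole)    # ['class']
print(inside)   # [9, 17]
```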

Grouping

Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are often used to dissect strings by writing a RE divided into several subgroups which match different components of interest. For example, an RFC-822 header line is divided into a header name and a value, separated by a ":". This can be handled by writing a regular expression which matches an entire header line, and has one group which matches the header name, and another group which matches the header's value.

Groups are marked by the "(" and ")" metacharacters. "(" and ")" have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them. For example, you can repeat the contents of a group with a repeating qualifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of "ab".


#!python
>>> p = re.compile('(ab)*')
>>> print p.match('ababababab').span()
(0, 10)

Groups indicated with "(" and ")" also capture the starting and ending index of the text that they match; this can be retrieved by passing an argument to group(), start(), end(), and span(). Groups are numbered starting with 0. Group 0 is always present; it's the whole RE, so `MatchObject` methods all have group 0 as their default argument. Later we'll see how to express groups that don't capture the span of text that they match.

#!python
>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'

Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.


#!python
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

group() can be passed multiple group numbers at a time, in which case it will return a tuple containing the corresponding values for those groups.

#!python
>>> m.group(2,1,2)
('b', 'abc', 'b')

The groups() method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.

#!python
>>> m.groups()
('abc', 'b')

Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python's string literals also use a backslash followed by numbers to allow including arbitrary characters in a string, so be sure to use a raw string when incorporating backreferences in a RE.

For example, the following RE detects doubled words in a string.


#!python
>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'

Backreferences like this aren't often useful for just searching through a string (there are few text formats which repeat data in this way), but you'll soon find out that they're very useful when performing string substitutions.

Non-capturing and Named Groups

Elaborate REs may use many groups, both to capture substrings of interest, and to group and structure the RE itself. In complex REs, it becomes difficult to keep track of the group numbers. There are two features which help with this problem. Both of them use a common syntax for regular expression extensions, so we'll look at that first.


Perl 5 added several additional features to standard regular expressions, and the Python re module supports most of them. It would have been difficult to choose new single-keystroke metacharacters or new special sequences beginning with "\" to represent the new features without making Perl's regular expressions confusingly different from standard REs. If you chose "&" as a new metacharacter, for example, old expressions would be assuming that "&" was a regular character and wouldn't have escaped it by writing \& or [&].

The solution chosen by the Perl developers was to use (?...) as the extension syntax. "?" immediately after a parenthesis was a syntax error because the "?" would have nothing to repeat, so this didn't introduce any compatibility problems. The characters immediately after the "?" indicate what extension is being used, so (?=foo) is one thing (a positive lookahead assertion) and (?:foo) is something else (a non-capturing group containing the subexpression foo).

Python adds an extension syntax to Perl's extension syntax. If the first character after the question mark is a "P", you know that it's an extension that's specific to Python. Currently there are two such extensions: (?P<name>...) defines a named group, and (?P=name) is a backreference to a named group. If future versions of Perl 5 add similar features using a different syntax, the re module will be changed to support the new syntax, while preserving the Python-specific syntax for compatibility's sake.

Now that we've looked at the general extension syntax, we can return to the features that simplify working with groups in complex REs. Since groups are numbered from left to right and a complex expression may use many groups, it can become difficult to keep track of the correct numbering. Modifying such a complex RE is annoying, too: insert a new group near the beginning and you change the numbers of everything that follows it.

First, sometimes you'll want to use a group to collect a part of a regular expression, but aren't interested in retrieving the group's contents. You can make this fact explicit by using a non-capturing group: (?:...), where you can put any other regular expression inside the parentheses.

#!python
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()

Except for the fact that you can't retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group; you can put anything inside it, repeat it with a repetition metacharacter such as "*", and nest it within other groups (capturing or non-capturing). (?:...) is particularly useful when modifying an existing group, since you can add new groups without changing how all the other groups are numbered. There's also no performance difference in searching between capturing and non-capturing groups; neither form is any faster than the other.
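The point about group numbering can be illustrated with a hypothetical pattern for "user@host" strings (Python 3; the pattern and inputs are invented for this sketch). Adding an optional non-capturing prefix leaves the existing groups at numbers 1 and 2.

```python
import re

# Original pattern: group 1 is the user, group 2 is the host.
before = re.match(r"(\w+)@(\w+)", "user@host")

# Extended pattern: the new (?:...) group does not take a number.
after = re.match(r"(?:mailto:)?(\w+)@(\w+)", "mailto:user@host")

print(before.groups())   # ('user', 'host')
print(after.groups())    # ('user', 'host')
```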

The second, and more significant, feature is named groups; instead of referring to them by numbers, groups can be referenced by a name.

The syntax for a named group is one of the Python-specific extensions: (?P<name>...). name is, obviously, the name of the group. Except for associating a name with a group, named groups also behave identically to capturing groups. The `MatchObject` methods that deal with capturing groups all accept either integers, to refer to groups by number, or a string containing the group name. Named groups are still given numbers, so you can retrieve information about a group in two ways:

#!python
>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((Lots of punctuation)))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'

Named groups are handy because they let you use easily remembered names, instead of having to remember numbers. Here's an example RE from the imaplib module:


#!python
InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

It's obviously much easier to retrieve m.group('zonem'), instead of having to remember to retrieve group 9.


Since the syntax for backreferences, in an expression like (...)\1, refers to the number of the group, there's naturally a variant that uses the group name instead of the number. This is also a Python extension: (?P=name) indicates that the contents of the group called name should again be found at the current point. The regular expression for finding doubled words, (\b\w+)\s+\1, can also be written as (?P<word>\b\w+)\s+(?P=word):

#!python
>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'

Lookahead Assertions

Another zero-width assertion is the lookahead assertion. Lookahead assertions are available in both positive and negative form, and look like this:

(?=...)

Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But once the contained expression has been tried, the matching engine doesn't advance at all; the rest of the pattern is tried right where the assertion started.

(?!...)

Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression doesn't match at the current position in the string.
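Before the longer filename example, a tiny Python 3 sketch may help; "foobar" and "foobaz" are invented inputs. Note that the positive lookahead checks for "bar" but does not include it in the match.

```python
import re

pos = re.search(r"foo(?=bar)", "foobar")      # succeeds; "bar" is not consumed
neg = re.search(r"foo(?!bar)", "foobar")      # fails: "bar" does follow "foo"
neg_ok = re.search(r"foo(?!bar)", "foobaz")   # succeeds: "baz" follows instead

print(pos.group(), pos.end())   # foo 3
print(neg)                      # None
print(neg_ok.group())           # foo
```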

An example will help make this concrete by demonstrating a case where a lookahead is useful. Consider a simple pattern to match a filename and split it apart into a base name and an extension, separated by a ".". For example, in "news.rc", "news" is the base name and "rc" is the filename's extension.

The pattern to match this is quite simple:

.*[.].*$

Notice that the "." needs to be treated specially because it's a metacharacter; here it is put inside a character class. Also notice the trailing $; this is added to ensure that all the rest of the string must be included in the extension. This regular expression matches "foo.bar", "autoexec.bat", and "sendmail.cf".

Now, consider complicating the problem a bit: what if you want to match filenames where the extension is not "bat"? Some incorrect attempts:


.*[.][^b].*$

The first attempt above tries to exclude "bat" by requiring that the first character of the extension is not a "b". This is wrong, because the pattern also doesn't match "foo.bar".

.*[.]([^b]..|.[^a].|..[^t])$

The expression gets messier when you try to patch up the first solution by requiring one of the following cases to match: the first character of the extension isn't "b"; the second character isn't "a"; or the third character isn't "t". This accepts "foo.bar" and rejects "autoexec.bat", but it requires a three-letter extension and won't accept a filename with a two-letter extension such as "sendmail.cf". We'll complicate the pattern again in an effort to fix it.

.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$

In the third attempt, the second and third letters are all made optional in order to allow matching extensions shorter than three characters, such as "sendmail.cf".


The schema is now very complex, which makes it hard to read. Worse, if the problem changes and you want extensions other than "bat" and "exe", the pattern becomes even more complicated and confusing.


Forward negation clips all of this to:

.*[.](?!bat$).*$

The negative lookahead means: if the expression bat$ does not match at this point, try the rest of the pattern; if bat$ does match, the whole pattern fails. The trailing $ is required to ensure that a name such as "sample.batch", whose extension only starts with "bat", is still allowed.


Excluding another file extension is now easy; simply add it as an alternative inside the assertion. The pattern below excludes filenames that end in either "bat" or "exe".

.*[.](?!bat$|exe$).*$
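As a quick check of the lookahead pattern, here is a minimal sketch that filters a list of filenames. The list itself is made up for illustration; only the pattern comes from the text.

```python
import re

# Pattern from the text: reject extensions that are exactly "bat" or "exe".
pattern = re.compile(r'.*[.](?!bat$|exe$).*$')

# Hypothetical sample filenames, chosen only for this illustration.
filenames = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'sample.batch', 'setup.exe']

accepted = [name for name in filenames if pattern.match(name)]
print(accepted)
# ['foo.bar', 'sendmail.cf', 'sample.batch']
```

One caveat: with a name containing several dots, such as "archive.tar.bat", the leading .* can backtrack to the earlier dot, so this pattern still matches. Writing the extension part as [^.]*$ instead of .*$ is one way to avoid that.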

Modifying Strings

Up to this point, we have simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following `RegexObject` methods.

Method/Attribute    Purpose
split()    Split the string into a list, splitting it wherever the RE matches
sub()    Find all substrings where the RE matches, and replace them with a different string
subn()    Same as sub(), but returns the new string and the number of replacements

Splitting Strings

The split() method of a `RegexObject` splits a string apart wherever the RE matches and returns a list of the pieces. It is similar to the string split() method but is much more general in the delimiters it can split on; the string split() only supports splitting on whitespace or on a fixed string. As you would expect, there is also a module-level re.split() function.

split(string[, maxsplit=0])

Split string by the matches of the regular expression. If capturing parentheses are used in the RE, their contents are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed.


You can limit the number of splits by passing a value for maxsplit. When maxsplit is nonzero, at most maxsplit splits are made, and the remainder of the string is returned as the final element of the list. In the following example, the delimiter is any sequence of non-alphanumeric characters.

#!python
>>> import re
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
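A typical use of maxsplit is cutting a line at its first delimiter only. The sketch below assumes a made-up mail-header line and splits it at the first colon, leaving later colons in the value untouched.

```python
import re

# Hypothetical header line; the value itself contains more colons.
line = 'Subject: Re: meeting at 10:30'

# Split at most once, on a colon with optional surrounding whitespace.
name, value = re.split(r'\s*:\s*', line, maxsplit=1)

print(name)   # Subject
print(value)  # Re: meeting at 10:30
```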

Sometimes you are not only interested in the text between the delimiters, but also need to know what the delimiter was. If capturing parentheses are used in the RE, their values are also returned as part of the list. Compare the following calls:

#!python
>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This is a test.')
['This', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']

The module-level function re.split() takes the RE as its first argument, but is otherwise the same.

#!python
>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
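To illustrate the difference between the string split() method and the RE version described in this section, here is a small sketch. The sample text and the choice of delimiters are assumptions made for the example.

```python
import re

# Hypothetical input mixing several kinds of delimiter.
text = 'alpha, beta;gamma   delta'

# str.split() with a fixed separator only splits on that exact string.
by_fixed_string = text.split(', ')

# re.split() can treat any run of commas, semicolons or whitespace as one delimiter.
by_pattern = re.split(r'[,;\s]+', text)

print(by_fixed_string)  # ['alpha', 'beta;gamma   delta']
print(by_pattern)       # ['alpha', 'beta', 'gamma', 'delta']
```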

Search and Replace

Another common task is to find all the matches for a pattern and replace them with a different string. The sub() method takes a replacement value, which can be either a string or a function, and the string to be processed.

sub(replacement, string[, count=0])

Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string with replacement. If the pattern is not found, string is returned unchanged.


The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. The default value of 0 means to replace all occurrences.


Here is a simple example of using the sub() method. It replaces colour names with the word "colour".

#!python
>>> p = re.compile('(blue|white|red)')
>>> p.sub('colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub('colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'

The subn() method does the same work, but returns a 2-tuple containing the new string and the number of replacements that were performed.

#!python
>>> p = re.compile('(blue|white|red)')
>>> p.subn('colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn('colour', 'no colors at all')
('no colors at all', 0)
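Because subn() returns a tuple, the result is usually unpacked into two names. The following sketch, with an invented input string, collapses runs of two or more whitespace characters and reports how many runs were changed.

```python
import re

# Match runs of two or more whitespace characters.
p = re.compile(r'\s{2,}')

# Hypothetical input with irregular spacing.
new_text, count = p.subn(' ', 'too   many    spaces here')

print(new_text)  # too many spaces here
print(count)     # 2
```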

Empty matches are replaced only when they are not adjacent to a previous match. (This is the behaviour of the Python versions this article describes. Since Python 3.7, an empty match adjacent to a previous non-empty match is replaced as well, so the call below returns '-a-b--d-'.)

#!python
>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b-d-'

If the replacement is a string, any backslash escapes in it are processed. That is, "\n" is converted to a single newline character, "\r" is converted to a carriage return, and so forth. Unknown escapes such as "\&" are left alone. Backreferences, such as "\6", are replaced with the substring matched by the corresponding group in the RE. This lets you incorporate portions of the original text in the resulting replacement string.

This example matches the word "section" followed by a string enclosed in "{" and "}", and changes "section" to "subsection".

#!python
>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}', 'section{First} section{second}')
'subsection{First} subsection{second}'
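Backreferences can also reorder parts of the matched text. As a minimal sketch, assuming dates written as YYYY-MM-DD in a made-up sentence, the replacement string below emits the three groups in reverse order.

```python
import re

# Three numbered groups: year, month, day.
date_re = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# Hypothetical input text.
text = 'Released 2017-05-21, revised 2018-01-09.'

# \3, \2 and \1 refer back to the day, month and year groups.
reordered = date_re.sub(r'\3/\2/\1', text)
print(reordered)  # Released 21/05/2017, revised 09/01/2018.
```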

You can also refer to named groups defined with the (?P<name>...) syntax. "\g<name>" uses the substring matched by the group named "name", and "\g<number>" uses the corresponding group number. "\g<2>" is therefore equivalent to "\2", but is not ambiguous in a replacement string such as "\g<2>0". ("\20" would be interpreted as a reference to group 20, not as a reference to group 2 followed by the literal character "0".) The following substitutions are all equivalent:

#!python
>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}', 'section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}', 'section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}', 'section{First}')
'subsection{First}'

The replacement can also be a function, which gives you even more control. If the replacement is a function, it is called for every non-overlapping occurrence of the pattern. On each call, the function is passed a `MatchObject` argument for the match and can use this information to compute the desired replacement string and return it.

In the following example, the replacement function translates decimal numbers into hexadecimal:

#!python
>>> def hexrepl(match):
...     "Return the hex string for a decimal number"
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'

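A replacement function is also handy when each match needs a different substitution. The sketch below looks the matched word up in a dictionary; the `BRITISH` mapping and the sample sentence are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical mapping used only for this illustration.
BRITISH = {'color': 'colour', 'center': 'centre', 'gray': 'grey'}

# Build one alternation from the dictionary keys, bounded by word boundaries.
word_re = re.compile(r'\b(' + '|'.join(BRITISH) + r')\b')

def to_british(match):
    """Return the British spelling for the matched word."""
    return BRITISH[match.group(1)]

converted = word_re.sub(to_british, 'The color of the center tile is gray.')
print(converted)  # The colour of the centre tile is grey.
```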
When using the module-level re.sub() function, the pattern is passed as the first argument. The pattern may be a string or a `RegexObject`; if you need to specify regular expression flags, you must either use a `RegexObject` as the first argument, or use an embedded modifier in the pattern string; for example, sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'.
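The two options just described can be compared side by side. This is a minimal sketch using the same input as the example in the text; the variable names are arbitrary.

```python
import re

text = 'bbbb BBBB'

# Option 1: an inline modifier inside the pattern string.
inline = re.sub('(?i)b+', 'x', text)

# Option 2: a pattern compiled with the flag, passed as the first argument.
compiled = re.sub(re.compile('b+', re.IGNORECASE), 'x', text)

print(inline)    # x x
print(compiled)  # x x
```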

Common Problems

Regular expressions are a powerful tool for some applications, but in some ways their behaviour is not intuitive, and at times they do not behave the way you might expect. This section points out some of the most common pitfalls.

Use String Methods

Sometimes using the re module is a mistake. If you are matching a fixed string or a single character class, and you are not using any re features such as the IGNORECASE flag, then the full power of regular expressions may not be required. Strings have several methods for performing operations with fixed strings, and they are usually much faster, because the implementation is a single small C loop optimized for the purpose instead of the larger, more general regular expression engine.

One example is replacing a single fixed string with another one; for example, you might replace "word" with "deed". re.sub() seems like the function to use for this, but consider the replace() method instead. Note that replace() will also replace "word" inside words, turning "swordfish" into "sdeedfish", but the naive RE word would have done that, too. (To avoid performing the substitution on parts of words, the pattern would have to be \bword\b, which requires a word boundary on either side of "word". This takes the job beyond the abilities of replace().)
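The following sketch contrasts the three approaches mentioned above on a made-up sentence: the replace() method, the naive RE, and the RE with word boundaries.

```python
import re

# Hypothetical input containing "word" both alone and inside another word.
text = 'A word about swordfish.'

plain = text.replace('word', 'deed')          # string method
naive = re.sub('word', 'deed', text)          # same result as replace()
bounded = re.sub(r'\bword\b', 'deed', text)   # whole words only

print(plain)    # A deed about sdeedfish.
print(naive)    # A deed about sdeedfish.
print(bounded)  # A deed about swordfish.
```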

Another common task is deleting every occurrence of a single character from a string, or replacing it with another single character. You might do this with something like re.sub('\n', ' ', S), but translate() is capable of doing both tasks and will be faster than any regular expression operation.
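Here is a small sketch of both tasks with translate(). It assumes Python 3, where translation tables are built with str.maketrans(); the sample text is invented for the example.

```python
import re

# Hypothetical multi-line input.
text = 'line one\nline two\nline three'

# Replace each newline with a space using a translation table.
to_space = str.maketrans('\n', ' ')
replaced = text.translate(to_space)

# Delete newlines outright: the third argument lists characters to remove.
drop_newlines = str.maketrans('', '', '\n')
deleted = text.translate(drop_newlines)

# The regular expression equivalent of the first call, for comparison.
via_re = re.sub('\n', ' ', text)

print(replaced)  # line one line two line three
print(deleted)   # line oneline twoline three
```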

In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.

match() vs search()

The match() function only checks whether the RE matches at the beginning of the string, while search() scans forward through the string for a match. It is important to keep this distinction in mind. Remember, match() only reports a successful match that starts at position 0; if the match would not start at 0, match() does not report it.

#!python
>>> print(re.match('super', 'superstition').span())
(0, 5)
>>> print(re.match('super', 'insuperable'))
None

search(), on the other hand, scans forward through the string and reports the first match it finds.

#!python
>>> print(re.search('super', 'superstition').span())
(0, 5)
>>> print(re.search('super', 'insuperable').span())
(2, 7)

Sometimes you will be tempted to keep using re.match() and just add .* to the front of your RE. Resist this temptation and use re.search() instead. The regular expression compiler does some analysis of REs in order to speed up the process of looking for a match. One such analysis works out what the first character of a match must be; for example, a pattern starting with Crow must match starting with a "C". The analysis lets the engine quickly scan through the string looking for the starting character, and only try the full match once a "C" is found.

Adding .* defeats this optimization, requiring the engine to scan to the end of the string and then backtrack to find a match for the rest of the RE. Use re.search() instead.
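Apart from the performance cost, prefixing .* also changes what the match object reports. In this sketch, with an invented input string, both calls succeed, but only search() tells you where "super" actually starts.

```python
import re

# Hypothetical input; "super" begins at index 5.
text = 'an insuperable obstacle'

prefixed = re.match('.*super', text)   # match() with .* in front
searched = re.search('super', text)    # the recommended form

print(prefixed.span())  # (0, 10): the span includes everything before "super"
print(searched.span())  # (5, 10): the span covers just "super"
```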

Greedy versus Non-Greedy

When repeating a regular expression, as in a*, the resulting action is to consume as much of the pattern as possible. This fact often bites you when you try to match a pair of balanced delimiters, such as the angle brackets surrounding an HTML tag. The naive pattern for matching a single HTML tag does not work because of the greedy nature of .*.

#!python
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>

The RE matches the "<" in "<html>", and the .* consumes the rest of the string. There is still more left in the RE, though, and the > cannot match at the end of the string, so the regular expression engine has to backtrack character by character until it finds a match for the >. The final match extends from the "<" in "<html>" to the ">" in "</title>", which is not what you want.

The solution in this case is to use the non-greedy qualifiers *?, +?, ?? or {m,n}?, which match as little text as possible. In the example above, the ">" is tried immediately after the first "<" matches, and when it fails, the engine advances one character at a time, retrying the ">" at every step. This produces just the right result.
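The difference between the greedy and non-greedy qualifiers can be seen by running both against the same string used earlier in this section. This is a minimal sketch; the variable names are arbitrary.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: .* runs to the end of the string, then backtracks to the last ">".
greedy = re.match('<.*>', s).group()

# Non-greedy: .*? stops at the first ">" it can reach.
lazy = re.match('<.*?>', s).group()

print(greedy)  # <html><head><title>Title</title>
print(lazy)    # <html>
```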


Using re.VERBOSE

Regular expressions are a very compact notation, which does not make them easy to read. Compiling a pattern with the re.VERBOSE flag lets you lay it out across several lines and comment each part:

#!python
pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

The same pattern written without re.VERBOSE is much harder to read:

#!python
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

Feedback

Regular expressions are a complicated topic. Did this article help you understand them? Were there parts that were unclear, or problems you encountered that were not covered here? If so, please send suggestions for improvements to the author.

The most complete book on regular expressions is almost certainly "Mastering Regular Expressions" by Jeffrey Friedl, published by O'Reilly. Unfortunately, it concentrates exclusively on Perl and Java flavours of regular expressions and contains no Python material at all, so it is not useful as a reference for programming in Python. (The first edition covered Python's now-obsolete regex module, which will not help you much.)

source:php.cn