Regular expressions as a concept are not unique to Python, but Python's way of using them differs in a few small details.
This is the first article in a series about Python regular expressions. Here we focus on how to use regular expressions in Python and highlight some features that are specific to Python.
We will introduce several ways of searching strings in Python, and then discuss how to use grouping to work with the sub-parts of the match objects we find.
Python's regular expression support lives in the standard library module 're'.
>>> import re
1. Raw strings in Python
The Python compiler uses '\' (backslash) to introduce escape sequences in string constants.
If the backslash is followed by a character sequence that the compiler recognizes, the entire escape sequence is replaced by the corresponding special character (for example, '\n' is replaced by a newline character).
But this poses a problem for using regular expressions in Python, because backslashes are also used in the 're' module to escape special characters (such as * and +) in regular expressions.
Mixing the two means that sometimes you have to escape the escape character itself (when the sequence is recognized by both the Python compiler and the regular expression compiler), and other times you do not (when the sequence is recognized by only one of the two).
Rather than working out how many backslashes are needed, we can use raw strings.
A raw string is created simply by adding the character 'r' before the opening quote of an ordinary string. When a string is raw, the Python compiler does not attempt any substitutions in it. Essentially, you are telling the compiler not to interfere with your string at all.
>>> string = 'This is a\nnormal string'
>>> rawString = r'and this is a\nraw string'
>>> print(string)
This is a
normal string
>>> print(rawString)
and this is a\nraw string
That is what a raw string looks like.
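To make the difference concrete, here is a minimal sketch that compares the two kinds of string literal and then matches a single literal backslash both ways. The variable names and the sample path are made up for this illustration.

```python
import re

normal = 'a\nb'    # the compiler turns \n into one newline character
raw = r'a\nb'      # the backslash and the 'n' stay as two characters

print(len(normal))  # 3
print(len(raw))     # 4

# To match one literal backslash, the regex itself must be \\ .
# A normal string literal needs four backslashes to spell that,
# a raw string literal needs only two.
path = r'C:\temp'   # hypothetical sample text containing one backslash
with_normal = re.search('\\\\', path)
with_raw = re.search(r'\\', path)
print(with_normal.group(0) == with_raw.group(0))  # True
```

Both patterns are the same regular expression; the raw string is just easier to read and harder to get wrong.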
Search using regular expressions in Python
The 're' module provides several methods to perform exact queries on the input string. The methods we will discuss are:
• re.match()
• re.search()
• re.findall()
Each method receives a regular expression and a string to be matched. Let's look at each of these methods in more detail to understand how they work and how they differ.
2. Using re.match: matching at the start
Let's take a look at the match() method first. match() finds a match only if the pattern matches at the beginning of the string being searched.
For example, calling match() on the string 'dog cat dog' with the search pattern 'dog' will match:
>>> re.match(r'dog', 'dog cat dog')
<_sre.SRE_Match object at 0xb743e720>
>>> match = re.match(r'dog', 'dog cat dog')
>>> match.group(0)
'dog'
We will discuss the group() method in more detail later. For now, we just need to know that we called it with 0 as its argument, and that it returns the text that the pattern matched.
I have also skipped over the returned SRE_Match object for now; we will discuss it soon.
However, if we call match() on the same string looking for the pattern 'cat', no match will be found:
>>> re.match(r'cat', 'dog cat dog')
>>>
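The empty prompt above means that match() returned None. In a script it is worth checking for that before calling group(), since None has no such method. A minimal sketch, where the helper name starts_with is hypothetical and not part of the 're' module:

```python
import re

def starts_with(pattern, text):
    """Return the matched text if `text` begins with `pattern`, else None."""
    match = re.match(pattern, text)
    if match is None:          # re.match returns None when nothing matches
        return None
    return match.group(0)

print(starts_with(r'dog', 'dog cat dog'))  # dog
print(starts_with(r'cat', 'dog cat dog'))  # None
```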
3. Using re.search: matching at any position
The search() method is similar to match(), but it does not restrict us to matches at the beginning of the string, so searching for 'cat' in our example string finds a match:
>>> match = re.search(r'cat', 'dog cat dog')
>>> match.group(0)
'cat'
However, search() stops after it finds the first match, so searching our example string for 'dog' finds only the first occurrence:
>>> match = re.search(r'dog', 'dog cat dog')
>>> match.group(0)
'dog'
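Putting the two methods side by side on the same input may make the difference easier to remember. This is only an illustrative sketch; the variable names are made up.

```python
import re

text = 'dog cat dog'

at_start = re.match(r'cat', text)    # None: 'cat' is not at the beginning
anywhere = re.search(r'cat', text)   # finds the 'cat' in the middle

print(at_start)            # None
print(anywhere.group(0))   # cat
```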
4. Using re.findall: all matches
The search method I have used most in Python so far is findall(). When we call findall(), we simply get a list of all the matched strings instead of a match object (we will discuss match objects in more detail next). For me that is simpler. Calling findall() on the example string, we get:
>>> re.findall(r'dog', 'dog cat dog')
['dog', 'dog']
>>> re.findall(r'cat', 'dog cat dog')
['cat']
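Because findall() returns an ordinary list, counting matches is just a call to len(), and a pattern that never matches gives an empty list rather than None. A short sketch, using 'bird' as an arbitrary pattern that does not occur in the example string:

```python
import re

text = 'dog cat dog'

dogs = re.findall(r'dog', text)
birds = re.findall(r'bird', text)   # 'bird' never appears in the text

print(dogs)        # ['dog', 'dog']
print(len(dogs))   # 2
print(birds)       # []
```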
5. Using the match.start and match.end methods
So what exactly is the 'match object' that search() and match() returned to us earlier?
Rather than simply returning the matched part of the string, search() and match() return a 'match object', which is a wrapper class around the matched substring.
Earlier you saw that I could get the matched substring by calling the group() method (as we will see in the next section, match objects are very useful when working with groups), but the match object also contains more information about the matched substring.
For example, the match object can tell us where the matched content starts and ends in the original string:
>>> match = re.search(r'dog', 'dog cat dog')
>>> match.start()
0
>>> match.end()
3
Knowing this information is sometimes very useful.
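One use for these positions is slicing the original string. The sketch below assumes Python 3 and also calls span(), a standard match-object method that this article does not otherwise cover; it returns the start and end together as a tuple.

```python
import re

text = 'dog cat dog'
match = re.search(r'cat', text)

start, end = match.start(), match.end()
print(start, end)          # 4 7
print(text[start:end])     # cat
print(match.span())        # (4, 7)
```

Note that end() is the index just past the last matched character, which is why text[start:end] gives back exactly the matched text.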
6. Using match.group: grouping by number
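As I mentioned before, match objects come in very handy when working with groups.
Grouping is the ability to address specific sub-parts of the overall regular expression match. We can define a group as part of the whole regular expression, and then individually address the text matched by that part.
Let's see how it works:
>>> contactInfo = 'Doe, John: 555-1212'
The string I just created resembles a snippet taken from someone's address book. We can match this line with a regular expression like this:
>>> re.search(r'\w+, \w+: \S+', contactInfo)
<_sre.SRE_Match object at 0xb74e1ad8>
By surrounding specific parts of the regular expression with parentheses (the characters '(' and ')'), we can group the content and then handle each subgroup individually.
>>> match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)
These groups can be retrieved with the match object's group() method. They are addressed by number, in the order in which they appear in the regular expression from left to right (starting from 1):
>>> match.group(1)
'Doe'
>>> match.group(2)
'John'
>>> match.group(3)
'555-1212'
Group numbering starts at 1 because group 0 is reserved for the entire match (we saw this earlier when we looked at the match() and search() methods).
>>> match.group(0)
'Doe, John: 555-1212'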
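When all of the groups are needed at once, the match object's groups() method returns them together as a tuple, and group() also accepts several numbers in one call. A small sketch assuming Python 3; the variable names are made up, and groups() goes slightly beyond what the article itself covers.

```python
import re

contact_info = 'Doe, John: 555-1212'
match = re.search(r'(\w+), (\w+): (\S+)', contact_info)

# groups() returns every numbered group, starting from group 1.
last, first, phone = match.groups()
print(match.groups())      # ('Doe', 'John', '555-1212')
print(first, last, phone)  # John Doe 555-1212

# group() with several arguments returns a tuple of just those groups.
print(match.group(1, 3))   # ('Doe', '555-1212')
```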
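7. Using match.group: grouping by name
Sometimes, especially when a regular expression has many groups, addressing them by their order of appearance becomes impractical. Python also lets you assign a name to a group using the following syntax:
>>> match = re.search(r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)', contactInfo)
We can still use the group() method to get the contents of a group, but this time we pass the name we assigned instead of the group's position.
>>> match.group('last')
'Doe'
>>> match.group('first')
'John'
>>> match.group('phone')
'555-1212'
This greatly improves the clarity and readability of the code. As you can imagine, as a regular expression grows more complex, it becomes harder and harder to work out what each group captures. Naming your groups states your intent explicitly, both to you and to your readers.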
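Named groups can also be collected in one step with the match object's groupdict() method, which is not covered above but is part of the standard 're' module. The sketch below also shows that a named group keeps its number, so both styles of access work on the same match.

```python
import re

contact_info = 'Doe, John: 555-1212'
pattern = r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)'
match = re.search(pattern, contact_info)

# groupdict() maps each group name to the text it captured.
fields = match.groupdict()
print(fields['first'])   # John
print(fields['phone'])   # 555-1212

# A named group is still reachable by its position.
print(match.group(1) == match.group('last'))  # True
```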
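Although findall() does not return match objects, it can make use of groups too. In that case, findall() returns a list of tuples, where the Nth element of each tuple corresponds to the Nth group in the regular expression.
>>> re.findall(r'(\w+), (\w+): (\S+)', contactInfo)
[('Doe', 'John', '555-1212')]
However, naming groups is of no help with findall(): the pattern still works, but the results are plain tuples that can only be indexed by position.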
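To illustrate, here is a sketch that runs findall() with a named-group pattern over two address book lines. The second entry is made-up sample data added only for this example.

```python
import re

# Made-up sample data: two address book lines in one string.
address_book = 'Doe, John: 555-1212\nRoe, Jane: 555-3434'
pattern = r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)'

rows = re.findall(pattern, address_book)
print(rows)
# [('Doe', 'John', '555-1212'), ('Roe', 'Jane', '555-3434')]

# The names are not available on the results, only positions are.
for last, first, phone in rows:
    print(first, last, phone)
```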
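In this article we covered some of the basics of using regular expressions in Python. We learned about raw strings (and the headaches they can save you when working with regular expressions). We also learned how to use match(), search(), and findall() for basic searches, and how to use grouping to work with the sub-parts of a match object.
As always, if you want to read more about this topic, the official Python documentation for the re module is a great resource.
In future articles, we will discuss the use of regular expressions in Python in more depth. We will take a more thorough look at match objects, learn how to use them to do substitutions in strings, and even use them to parse Python data structures from text files.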