Crawler parsing method three: regular expressions-Python Tutorial-php.cn

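Many languages can be used to write crawlers, but Python crawlers tend to be more concise and convenient, and crawling has become an essential part of the Python ecosystem. There are also many ways to parse the pages a crawler fetches.

The previous article covered parsing method two, BeautifulSoup; this article covers regular expressions.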

Crawler parsing method three: regular expressions

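Regular expressions

A regular expression is a special sequence of characters that lets you check whether a string matches a given pattern. It is built from predefined special characters and combinations of them, which together form a "rule string" that expresses filtering logic over text.

Regular expressions are not unique to Python; other languages have them too.

In Python, regular expressions are provided by the re module.

Common regular expression functions in Python

The re.match function

re.match tries to match a pattern at the start of the string; if the pattern does not match at the start, match() returns None.

Function syntax: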

re.match(pattern, string, flags=0)


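Parameters:

Parameter  Description

pattern  The regular expression to match.

string  The string to match against.

flags  Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.

On success, re.match returns a match object; otherwise it returns None.

The match object's group(num) and groups() methods retrieve the matched text.

Match object method  Description

group(num=0)  The string matched by the whole expression. group() can take several group numbers at once, in which case it returns a tuple of the corresponding group values.

groups()  Returns a tuple of all subgroup strings, from group 1 up to the last group.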

import re
print(re.match('www', 'www.baidu.com').span())  # matches at the start
print(re.match('com', 'www.baidu.com'))         # does not match at the start


Running the example above prints:

(0, 3)
None


import re
content = "Cats are smarter than dogs"
result = re.match(r'(.*) are (.*?) .*', content)
print(result.group())
print(result.group(1))
print(result.group(2))


The example above produces the following result:

Cats are smarter than dogs
Cats
smarter

result.group() returns the matched text.
result.span() returns the (start, end) index range of the match.
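
The match-object methods described above can be sketched together as follows. This is a minimal illustration, not part of the original article; the variable names are arbitrary. It shows `group()` with several group numbers, `groups()`, `span()`, and the `None` check that guards against a failed match.

```python
import re

content = "Cats are smarter than dogs"
result = re.match(r'(.*) are (.*?) .*', content)

# re.match returns None on failure, so check before calling group().
if result is not None:
    whole = result.group()        # the entire match
    pair = result.group(1, 2)     # several group numbers -> a tuple
    all_groups = result.groups()  # every captured group, starting at group 1
    span = result.span()          # (start, end) indices of the match
    print(whole)       # Cats are smarter than dogs
    print(pair)        # ('Cats', 'smarter')
    print(all_groups)  # ('Cats', 'smarter')
    print(span)        # (0, 26)

# A pattern that does not match at the start yields None.
no_match = re.match(r'dogs', content)
print(no_match)        # None
```

Calling `.group()` directly on the result of a failed match raises `AttributeError`, which is why the `None` check is worth keeping in crawler code.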


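Generic matching

The approach above is not especially convenient. The pattern can be simplified by letting .* stand for everything that does not need to be captured: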

import re
content = "Cats are smarter than dogs"
result = re.match(r'Cats.*dogs$', content)
print(result)
print(result.group())
print(result.span())


Capturing targets

To capture specific parts of the string, wrap them in parentheses (), as in the following example:

import re
content = "Cats are 1234567 smarter than dogs"
result = re.match(r'(.*)\sare\s(\d+)\s(.*?)\s.*', content)  # \s matches whitespace, \d+ matches digits
print(result.group())
print(result.group(1))
print(result.group(2))


The result is:

Cats are 1234567 smarter than dogs

Cats

1234567

Greedy matching

Consider the following code:

import re
content = "Cats are 1234567 smarter than dogs"
result = re.match(r'Cats.*(\d+).*dogs', content)
print(result.group())
print(result.group(1))


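The output shows that group 1 captured only 7, not 1234567. The reason is that the preceding .* consumed the other digits: .* matches as much as it can, which is what we call greedy matching.

To capture 1234567, change the regular expression to the non-greedy form:

result = re.match(r'Cats.*?(\d+).*dogs', content)

Now the result captures 1234567.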
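
The difference between the greedy and non-greedy forms can be checked side by side. The sketch below is a minimal illustration using the same sample string; the variable names `greedy` and `lazy` are chosen here for clarity and do not come from the original article.

```python
import re

content = "Cats are 1234567 smarter than dogs"

# Greedy: the first .* takes as much as it can, leaving one digit for \d+.
greedy = re.match(r'Cats.*(\d+).*dogs', content)
# Non-greedy: .*? stops as early as possible, so \d+ captures the whole number.
lazy = re.match(r'Cats.*?(\d+).*dogs', content)

print(greedy.group(1))  # 7
print(lazy.group(1))    # 1234567
```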

Match modes

The content to match often contains line breaks. In that case, pass the flag re.S so that . also matches newline characters:

import re
content = """Cats are 1234567 smarter than dogs
dogs are wangwangwang"""
result = re.match(r'Cats.*(\d+).*wangwangwang', content, re.S)
print(result.group())
print(result.group(1))
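
To see what re.S changes, the same pattern can be run with and without the flag. This is a small sketch rather than part of the original example; it uses the non-greedy `.*?` so that the whole number is captured.

```python
import re

content = """Cats are 1234567 smarter than dogs
dogs are wangwangwang"""

pattern = r'Cats.*?(\d+).*wangwangwang'

without_flag = re.match(pattern, content)      # '.' stops at the newline
with_flag = re.match(pattern, content, re.S)   # '.' also matches '\n'

print(without_flag)         # None
print(with_flag.group(1))   # 1234567
```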


Escaping

When the content to match contains special characters, escape them with a backslash \, as in the following example:

import re
content = "price is $5.00"
result = re.match(r'price is \$5\.00', content)
print(result.group())
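
When the literal text comes from a variable, escaping by hand is error-prone. One option, shown here as a sketch and not taken from the original article, is the standard library's `re.escape`, which adds the backslashes automatically. The variable names are illustrative.

```python
import re

content = "price is $5.00"

# re.escape backslash-escapes the regex metacharacters in a literal string.
literal = "$5.00"
pattern = "price is " + re.escape(literal)

result = re.match(pattern, content)
print(result.group())   # price is $5.00

# Unescaped, '$' is an end-of-string anchor, so the same text fails to match.
unescaped = re.match("price is $5.00", content)
print(unescaped)        # None
```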


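Note:

A short summary of the above:

Prefer generic matching, use parentheses to capture targets, prefer non-greedy mode, and use re.S when the text contains line breaks.

Remember that re.match matches a pattern only at the start of the string.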

The re.search function

re.search scans the whole string and returns the first successful match.

Function syntax:

re.search(pattern, string, flags=0)


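Parameters:

Parameter  Description

pattern  The regular expression to match.

string  The string to match against.

flags  Flags that control how the regular expression is matched, such as case-insensitive or multi-line matching.

On success, re.search returns a match object; otherwise it returns None.

The match object's group(num) and groups() methods retrieve the matched text.

Match object method  Description

group(num=0)  The string matched by the whole expression. group() can take several group numbers at once, in which case it returns a tuple of the corresponding group values.

groups()  Returns a tuple of all subgroup strings, from group 1 up to the last group.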

import re
content = "extra things hello 123455 world_this is a Re Extra things"
result = re.search(r"hello.*?(\d+).*?Re", content)
print(result.group())
print(result.group(1))


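Here there is no need to write ^ and $, because search scans the whole string.

Note: for convenience we therefore use search more often than match. match has to match from the head of the string, which is often not very convenient.

The difference between re.match and re.search

re.match only matches at the start of the string; if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.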
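
The contrast can be sketched on a single string. The example below is illustrative only; `re.fullmatch` is added for comparison and is not discussed in the original article.

```python
import re

content = "www.baidu.com"

at_start = re.match(r'com', content)         # 'com' is not at the start
found = re.search(r'com', content)           # found later in the string
anchored = re.search(r'^com', content)       # '^' pins search to the start
whole = re.fullmatch(r'www\..*', content)    # the entire string must match

print(at_start)        # None
print(found.span())    # (10, 13)
print(anchored)        # None
print(whole.group())   # www.baidu.com
```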

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>'''


import re
 
result = re.search('<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
print(result.group(1), result.group(2))


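Looking at the HTML, there is a <ul> node containing several <li> nodes. Some <li> nodes contain an <a> node and some do not, and each <a> node carries attributes for the link and the singer's name.

First, we try to extract the singer and song title from the link inside the <li> node whose class is active, that is, the singer attribute and the text of the <a> node under the third <li> node.

The regular expression can start with <li, then look for the marker active, with .*? matching whatever lies in between. Next we want the value of the singer attribute, so we write singer="(.*?)"; the part to extract goes inside parentheses so that group() can retrieve it, and its boundaries on both sides are double quotes. We also need the text of the <a> node, whose left boundary is > and whose right boundary is </a>, and we again capture the content with (.*?). The final regular expression is therefore <li.*?active.*?singer="(.*?)">(.*?)</a>. Calling search() with it scans the whole HTML text and returns the first content that matches.

Because the HTML contains line breaks, the third argument must be re.S.

Note: passing re.S as the third argument of search() lets .*? match across line breaks, which is why the <li> node that spans several lines is matched. Without re.S, . cannot cross a line break, so only content on a single line can match, and this particular pattern finds nothing.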
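
The effect of dropping re.S can be sketched as follows. To keep the block self-contained it uses an abbreviated copy of the song list (two <li> nodes only); the variable names are illustrative and not from the original article.

```python
import re

# Abbreviated copy of the song list above (two <li> nodes only).
html = '''<ul id="list" class="list-group">
    <li data-view="4" class="active">
        <a href="/3.mp3" singer="齐秦">往事随风</a>
    </li>
    <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
</ul>'''

pattern = '<li.*?active.*?singer="(.*?)">(.*?)</a>'

with_flag = re.search(pattern, html, re.S)
print(with_flag.group(1), with_flag.group(2))       # 齐秦 往事随风

# Without re.S, '.' cannot cross the line break after class="active".
without_flag = re.search(pattern, html)
print(without_flag)                                  # None

# Dropping "active" lets the single-line <li> match even without re.S.
single_line = re.search('<li.*?singer="(.*?)">(.*?)</a>', html)
print(single_line.group(1), single_line.group(2))   # beyond 光辉岁月
```
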
re.findall
Searches the whole string and returns everything that matches the regular expression.
```
html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>'''
```
```
import re
results = re.findall('<li.*?href="/(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
for result in results:
    print(result)
    print(result[0], result[1], result[2])
```
Output:
('2.mp3', '任贤齐', '沧海一声笑')
2.mp3 任贤齐 沧海一声笑
('3.mp3', '齐秦', '往事随风')
3.mp3 齐秦 往事随风
('4.mp3', 'beyond', '光辉岁月')
4.mp3 beyond 光辉岁月
('5.mp3', '陈慧琳', '记事本')
5.mp3 陈慧琳 记事本
('6.mp3', '邓丽君', '但愿人长久')
6.mp3 邓丽君 但愿人长久
```
results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
for result in results:
    # print(result)
    print(result[1])  # group 2 holds the song title
```
Output:
一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
但愿人长久
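\s*? handles the fact that some <li> nodes contain line breaks and others do not.
(...)? is used because some items in the HTML have an <a> tag and others do not; ? means zero or one occurrence, so it fits both cases.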
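
With several capturing groups, findall returns tuples, and an optional group that did not take part in a match shows up as an empty string. The sketch below illustrates this on an abbreviated two-item list. The non-capturing `(?:...)` variant is an alternative added here for comparison, not the article's own pattern.

```python
import re

# Abbreviated list: one <li> without an <a> tag, one with.
html = '''<ul>
    <li data-view="2">一路上有你</li>
    <li data-view="7">
        <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
    </li>
</ul>'''

# Three capturing groups: findall returns 3-tuples, '' for groups that did not match.
with_groups = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
print(with_groups[0])   # ('', '一路上有你', '')

# (?:...) groups without capturing, so findall returns only the titles.
titles = re.findall(r'<li.*?>\s*(?:<a.*?>)?(\w+)(?:</a>)?\s*</li>', html, re.S)
print(titles)           # ['一路上有你', '沧海一声笑']
```
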
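Search and replace
Python's re module provides re.sub for replacing matches in a string.
Syntax:
re.sub(pattern, repl, string, count=0)
Parameters:
pattern: the regular expression pattern.
repl: the replacement string; it can also be a function.
string: the original string to search and replace in.
count: the maximum number of replacements; the default 0 replaces all matches.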
```
import re
phone = "2004-959-559 # this is a phone number"
# Remove the comment
num = re.sub(r'#.*$', "", phone)
print("Phone number: ", num)

# Remove everything that is not a digit
num = re.sub(r'\D', "", phone)
print("Phone number: ", num)
```
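In the second call, the first argument \D matches every non-digit character, the second argument "" is the replacement string (an empty string removes the matches), and the third argument phone is the original string.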
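
The parameter list above notes that repl can also be a function and that count limits the number of replacements. A minimal sketch of both follows; the helper `double` is hypothetical and exists only for this illustration.

```python
import re

phone = "2004-959-559"

def double(match):
    """Hypothetical helper: replace each run of digits with twice its value."""
    return str(int(match.group()) * 2)

# repl as a function: it receives each match object and returns the replacement.
doubled = re.sub(r'\d+', double, phone)
print(doubled)       # 4008-1918-1118

# count=1 replaces only the first match.
first_only = re.sub(r'-', '', phone, count=1)
print(first_only)    # 2004959-559
```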

re.compile
Compiles a regular expression into a pattern object so that it can be reused.
```
import re
content = "hello world fan"

pattern = re.compile("hello.*fan", re.S)

result1 = re.match(pattern, content)
result2 = re.search(pattern, content)
result3 = re.sub(pattern, '', content)
print(result1, result2, result3)
```
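compile() also accepts flags such as re.S, so they no longer need to be passed separately to search(), findall(), and the other functions. In effect, compile() wraps a regular expression so that it is easier to reuse.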
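
A compiled pattern also exposes the same operations as methods, which is the more common way to use it. The following sketch reuses the sample string from the example above. The flag passed to compile() applies to every call.

```python
import re

content = "hello world fan"
pattern = re.compile(r'hello.*fan', re.S)

matched = pattern.match(content).group()
span = pattern.search(content).span()
removed = pattern.sub('', content)
across_lines = pattern.findall("hello\nfan")   # re.S lets '.' cross the newline

print(matched)        # hello world fan
print(span)           # (0, 15)
print(repr(removed))  # ''
print(across_lines)   # ['hello\nfan']
```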

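Regular expression modifiers - optional flags

A regular expression can take optional flags that control how it matches. Several flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags:

Flag  Description

re.I  Makes matching case-insensitive.

re.L  Makes matching locale-aware (in Python 3, only usable with bytes patterns).

re.M  Multi-line matching; affects ^ and $.

re.S  Makes . match any character, including a newline.

re.U  Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, \B (in Python 3, this is already the default for str patterns).

re.X  Allows a more flexible layout (whitespace and comments inside the pattern) so that the regular expression is easier to read.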
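
As a short sketch of how flags combine (the sample strings and names below are illustrative, not from the original article), re.I | re.M changes what ^hello matches, and re.X allows a commented, multi-line pattern:

```python
import re

text = "Hello\nhello world"

# Without flags, ^ matches only at the very start and case must agree.
plain = re.findall(r'^hello', text)
# re.I ignores case; re.M lets ^ match at the start of every line.
flagged = re.findall(r'^hello', text, re.I | re.M)

print(plain)     # []
print(flagged)   # ['Hello', 'hello']

# re.X: whitespace in the pattern is ignored and '#' starts a comment.
number = re.compile(r'''
    (\d+)     # integer part
    \.        # literal dot
    (\d+)     # fractional part
''', re.X)
parts = number.search("price is 5.00").groups()
print(parts)     # ('5', '00')
```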

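Regular expression patterns

A pattern string uses a special syntax to represent a regular expression:

Letters and digits stand for themselves: a letter or digit in a pattern matches the same character in the string.

Most letters and digits take on a different meaning when preceded by a backslash.

Punctuation characters that have a special meaning (such as . * ? $) match themselves only when escaped.

A backslash itself must be escaped with a backslash.

Because regular expressions usually contain backslashes, it is best to write them as raw strings. Pattern elements (such as r'\t', equivalent to '\\t') match the corresponding special characters.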
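
The raw-string point can be sketched in a few lines. The sample path below is made up for the illustration.

```python
import re

# r'\t' is two characters (backslash, t); the regex engine reads it as a tab.
same = (r'\t' == '\\t')
tab_span = re.search(r'\t', "a\tb").span()
print(same, tab_span)   # True (1, 2)

# Matching one literal backslash needs r'\\' (or '\\\\' without the r prefix).
path = r'C:\temp'
parts = re.split(r'\\', path)
print(parts)            # ['C:', 'temp']
```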

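The table below lists the special elements of regular expression pattern syntax. If you supply optional flags along with a pattern, the meaning of some pattern elements changes.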

Pattern  Description

^  Matches the start of the string.

$  Matches the end of the string.

.  Matches any character except a newline; when the re.DOTALL flag is specified, it matches any character including a newline.

[...]  Matches any one of the characters listed: [amk] matches 'a', 'm', or 'k'.

[^...]  Matches any character not in the brackets: [^abc] matches any character except a, b, c.

re*  Matches 0 or more occurrences of the preceding expression.

re+  Matches 1 or more occurrences of the preceding expression.

re?  Matches 0 or 1 occurrence of the preceding expression. Placed after another quantifier (as in .*?), it makes that quantifier non-greedy.

re{n}  Matches exactly n occurrences of the preceding expression.

re{n,}  Matches n or more occurrences of the preceding expression.

re{n,m}  Matches between n and m occurrences of the preceding expression, greedily.

a|b  Matches a or b.

(re)  Matches the expression in parentheses and also captures it as a group.

(?imx)  Turns on the optional i, m, or x flag. In Python's re, these inline flags belong at the start of the pattern and apply to the whole pattern.

(?-imx)  Turns off the optional i, m, or x flag. In Python's re, flags can only be turned off inside a group, written as (?-imx:re).

(?:re)  Like (re), but does not capture a group.

(?imx:re)  Uses the optional i, m, or x flag inside the parentheses only.

(?-imx:re)  Turns off the optional i, m, or x flag inside the parentheses only.

(?#...)  Comment.

(?=re)  Positive lookahead. Succeeds if the contained expression matches at the current position, and fails otherwise. Once the contained expression has been tried, the matching engine does not advance; the rest of the pattern is tried from the same position.

(?!re)  Negative lookahead. The opposite of the positive lookahead: succeeds if the contained expression does not match at the current position.

(?>re)  Independent (atomic) sub-pattern that is matched without backtracking (supported by Python's re from Python 3.11).

\w  Matches alphanumeric characters and the underscore.

\W  Matches non-alphanumeric characters.

\s  Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v].

\S  Matches any non-whitespace character.

\d  Matches any digit; for ASCII text, equivalent to [0-9].

\D  Matches any non-digit.

\A  Matches the start of the string.

\Z  Matches the end of the string. In Python's re, \Z matches only at the very end; it is $ that also matches just before a trailing newline.

\z  Matches the end of the string. Older versions of Python's re accept only \Z.

\G  Matches the position where the last match finished (not supported by Python's re).

\b  Matches a word boundary, that is, the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".

\B  Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".

\n, \t, etc.  Match a newline, a tab, and so on.

\1...\9  Matches the contents of the nth group.

\10  Matches the contents of the nth group if that group has been matched; otherwise it refers to the octal character code.
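
A few of the less obvious entries in the table (lookahead, backreferences, word boundaries, and non-capturing groups) can be sketched as follows. The sample strings and variable names are made up for this illustration.

```python
import re

# (?=...) positive lookahead: digits that are followed by "px" (not consumed).
px_values = re.findall(r'\d+(?=px)', "10px 20em 30px")
print(px_values)        # ['10', '30']

# (?!...) negative lookahead: "foo" only when "bar" does not follow.
foo_start = re.search(r'foo(?!bar)', "foobar foobaz").start()
print(foo_start)        # 7

# \1 refers back to whatever group 1 captured: find a doubled word.
doubled = re.search(r'\b(\w+) \1\b', "this is is a test").group()
print(doubled)          # is is

# \b is a word boundary, \B a non-boundary.
text = "never verb"
at_boundary = re.search(r'er\b', text).start()   # the "er" ending "never"
inside_word = re.search(r'er\B', text).start()   # the "er" inside "verb"
print(at_boundary, inside_word)                  # 3 7

# (?:...) groups without capturing, so only the name is returned.
name = re.match(r'(?:Mr|Ms)\. (\w+)', "Ms. Lovelace").groups()
print(name)             # ('Lovelace',)
```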

Regular expression examples

Character matching

Example  Description

python  Matches "python".

Character classes

Example  Description

rub[ye]  Matches "ruby" or "rube".

[aeiou]  Matches any one of the letters in the square brackets.

[0-9]  Matches any digit; the same as [0123456789].

[a-z]  Matches any lowercase letter.

[A-Z]  Matches any uppercase letter.

[a-zA-Z0-9]  Matches any letter or digit.

[^aeiou]  Matches any character other than the letters a, e, i, o, u.

[^0-9]  Matches any character other than a digit.

Special character classes

Example  Description

.  Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]' or pass re.S; inside square brackets a dot is only a literal dot.

\d  Matches a digit character. For ASCII text, equivalent to [0-9].

\D  Matches a non-digit character. For ASCII text, equivalent to [^0-9].

\s  Matches any whitespace character, including spaces, tabs, form feeds, and so on. For ASCII text, equivalent to [ \f\n\r\t\v].

\S  Matches any non-whitespace character. For ASCII text, equivalent to [^ \f\n\r\t\v].

\w  Matches any word character, including the underscore. For ASCII text, equivalent to '[A-Za-z0-9_]'.

\W  Matches any non-word character. For ASCII text, equivalent to '[^A-Za-z0-9_]'.
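
The character classes above can be tried out directly. The sketch below uses made-up sample strings. It also shows that in Python 3 \d is Unicode-aware for str patterns unless re.ASCII is given, and that a dot inside square brackets is only a literal dot.

```python
import re

words = re.findall(r'rub[ye]', "ruby rube rubx")
letters = re.findall(r'[^0-9\s]+', "a1 b22 c333")
tokens = re.findall(r'\w+', "snake_case, kebab-case")
print(words)     # ['ruby', 'rube']
print(letters)   # ['a', 'b', 'c']
print(tokens)    # ['snake_case', 'kebab', 'case']

# "\uff15" is a full-width digit five: matched by \d unless re.ASCII is set.
text = "5\uff15"
unicode_digits = re.findall(r'\d', text)
ascii_digits = re.findall(r'\d', text, re.ASCII)
print(len(unicode_digits), len(ascii_digits))   # 2 1

# Inside [], '.' is a literal dot; [\s\S] matches any character including '\n'.
dot_in_class = re.findall(r'[.\n]', "a.b\nc")
any_char = re.fullmatch(r'[\s\S]+', "a.b\nc")
print(dot_in_class)           # ['.', '\n']
print(any_char is not None)   # True
```
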