html - regular expression python crawler

Question

import urllib.request
req = urllib.request.urlopen('http://search.jd.com/Search?k...')
req
Out[3]: <http.client.HTTPResponse at 0x52bf6d8>
buf = req.read()
buf = buf.decode('utf-8')
urlli...

阿神 · Answer

The cause is how re.findall treats groups. When the pattern has no (), re.findall returns every full match. When the pattern contains a (), it returns only the text captured by that group, which is why you see a list of jpg/png. To get the complete links, wrap the whole pattern in () so that the entire link is captured. The jpg-or-png alternative still has to be grouped, so write it as (?:jpg|png): the non-capturing group syntax groups the alternatives without adding a capture of its own.

# Modified code (the dot before the extension is escaped so it matches a literal ".")
urllist = re.findall(r'(//img.+?\.(?:png|jpg))', buf)

For more about capturing and non-capturing groups, you can refer to the Python documentation for the re module.
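To make the difference concrete, here is a minimal sketch that runs both kinds of pattern against a short string. The sample text and its example.com hosts are assumptions made for illustration; they stand in for the page content held in buf.

```python
import re

# Hypothetical sample text standing in for the downloaded page
sample = ('<img src="//img10.example.com/a/1.jpg"> '
          '<img src="//img11.example.com/b/2.png">')

# Capturing group: findall returns only the text captured by the group
captured = re.findall(r'//img.+?\.(png|jpg)', sample)

# Non-capturing group: findall returns each full match
full = re.findall(r'//img.+?\.(?:png|jpg)', sample)

print(captured)
print(full)
```

The first call yields only the extensions, while the second yields the complete links, which is the behaviour described above.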

代言 · Answer

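(png|jpg) is a capturing group, so re.findall returns only what the group matched.

[png|jpg] does not avoid this: it is a character class that matches a single character (p, n, g, | or j), not the word png or jpg. Use the non-capturing group (?:png|jpg) instead.

import re
import requests

r = requests.get('http://search.jd.com/Search?keyword=%E6%96%87%E8%83%B8&enc=utf-8&wq=%E6%96%87%E8%83%B8&pvid=4anf50si.fbrh68')
# Lazy quantifier, escaped dot, non-capturing group: findall returns the full links
print(re.findall(r'//img.+?\.(?:png|jpg)', r.text))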
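As a rough illustration of why a character class cannot replace the alternation, the sketch below applies the pattern '//img.+.[png|jpg]' and a non-capturing alternation to the same string. The sample string is hypothetical and is not taken from the real page.

```python
import re

# Hypothetical sample string, not taken from the real page
sample = '//img.example.com/x.jpg and more'

# Character class: the final [png|jpg] matches ONE character out of
# p, n, g, |, j, and the greedy .+ stretches the match as far as it can
char_class = re.findall(r'//img.+.[png|jpg]', sample)

# Non-capturing alternation with a lazy quantifier and an escaped dot
alternation = re.findall(r'//img.+?\.(?:png|jpg)', sample)

print(char_class)
print(alternation)
```

With the character class, the match runs past the end of the link into the following text, because the "n" in "and" also satisfies the class. The alternation stops right after the extension.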