import re
import urllib.request
req = urllib.request.urlopen('http://search.jd.com/Search?k...')
req
Out[3]: <http.client.HTTPResponse at 0x52bf6d8>
buf = req.read()
buf = buf.decode('utf-8')
urllist = re.findall(r'//img.+?\.png', buf)
This correctly returns the image URLs ending in .png.
urllist = re.findall(r'//img.+?\.jpg', buf)
This also works as expected.
urllist = re.findall(r'//img.+?\.(png|jpg)', buf)
But this returns only the file extension of each image, like this:
'.jpg',
'.jpg',
'.png',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
'.jpg',
Why is this?
This happens because of how re.findall treats groups. When the pattern contains no (), re.findall returns the full text of every match. When the pattern contains a (), it returns only what the group captured, which is why you see a list of jpg/png.
To get the complete links back, wrap the whole pattern in () so that the entire link is captured, and write the alternation as (?:jpg|png). That part only needs to match jpg or png, not capture it, so it should use the non-capturing group syntax.
Capturing and non-capturing groups are covered in the documentation for Python's re module.
Note that [png|jpg] is not a substitute: it is a character class that matches a single character, whereas (png|jpg) is a group and will be captured.
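The difference between the grouping forms can be sketched with a small self-contained example. The sample string and the example.com URLs are made up for illustration and only stand in for the downloaded page; the patterns assume a lazy .+? between //img and the extension.

```python
import re

# Hypothetical sample text standing in for the decoded page (not real data).
buf = ('<img src="//img10.example.com/a/1.png"> '
       '<img src="//img11.example.com/b/2.jpg">')

# No group: findall returns the full text of each match.
no_group = re.findall(r'//img.+?\.png', buf)

# Capturing group: findall returns only what the group captured.
captured = re.findall(r'//img.+?\.(png|jpg)', buf)

# Non-capturing group: the alternation still applies,
# but findall goes back to returning the full match.
full_links = re.findall(r'//img.+?\.(?:png|jpg)', buf)

# Outer capturing group around the whole pattern, inner non-capturing group:
# the single captured group is the entire link.
outer = re.findall(r'(//img.+?\.(?:png|jpg))', buf)

print(no_group)    # ['//img10.example.com/a/1.png']
print(captured)    # ['png', 'jpg']
print(full_links)  # ['//img10.example.com/a/1.png', '//img11.example.com/b/2.jpg']
print(outer)       # same list as full_links
```

With a single capturing group, findall returns a list of that group's text; with no capturing groups at all, it returns the whole matches, which is why switching to (?:...) is enough here.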