
Getting started with Python crawler: crawling web images

WBOY
Release: 2022-07-11 12:06:36

This article introduces the basics of web crawling with Python by walking through a small program that downloads the images on a web page. Crawlers are a practical way to collect data efficiently, and Python makes them simple to write. I hope it will be helpful to everyone.


In this era of information explosion, crawlers are a very useful way to obtain data efficiently, and writing one in Python is simple and convenient. Let's go through the basic process of writing a crawler with a small example program:

Preparation

Language: Python

IDE: PyCharm

First, the libraries. Because this is a simple program for beginners, we mainly use the following two:

import requests  # used to request web pages
import re  # regular expressions, used to parse and filter the page content

Of these, re ships with Python, while requests has to be installed separately: run pip install requests on the command line.

Then pick a website to crawl. Be careful not to crawl privacy-sensitive information. Here we use an emoticon website, https://qq.yh31.com/zjbq/:

Note: The images on this emoticon website can be downloaded for free, so the crawler only saves us from downloading them one by one. Be careful not to crawl paid resources.

What we want to do is download these emoticons to our computer with a crawler.

Writing a crawler program

First, request the page from Python. The code is as follows:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0'
}
response = requests.get('https://qq.yh31.com/zjbq/', headers=headers)  # request the page

The headers argument is needed because some sites recognize that a request comes from a Python script and reject it, so we replace the default request header with one a normal browser would send. You can copy a User-Agent string from the Network tab of your browser's developer tools (F12).
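To see what the site would receive without the custom header, the following sketch prepares the same request twice without sending anything over the network. It assumes requests is installed; the URL and User-Agent string are the ones used in this article, and the variable names are only illustrative.

```python
import requests

URL = 'https://qq.yh31.com/zjbq/'
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0'
}

# The User-Agent that requests sends when none is supplied.
default_agent = requests.utils.default_user_agent()

# prepare_request merges the session's default headers with the per-request
# headers; nothing is sent to the server here.
session = requests.Session()
plain = session.prepare_request(requests.Request('GET', URL))
custom = session.prepare_request(requests.Request('GET', URL, headers=BROWSER_HEADERS))

print(plain.headers['User-Agent'])   # the library's own identifier
print(custom.headers['User-Agent'])  # the browser-style string
```

The first printed value identifies the client as the requests library, which is the signal a site can use to refuse the request.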

Next, find where the images we want sit in the page's HTML. Inspect the source with F12 and locate the img tags of the emoticons.

Then write a matching rule: copy one of those tags and replace the parts that vary with a regular-expression group. The simplest one is (.*?), which matches any characters non-greedily.

t = '<img src="(.*?)" alt="(.*?)" width="160" height="120">'

Like this.
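The question mark in .*? matters. The sketch below compares the non-greedy pattern with a greedy variant on a made-up line of HTML that holds two img tags; the sample markup and file names are hypothetical and are not taken from the real page.

```python
import re

# Hypothetical sample markup: two <img> tags on one line.
html = ('<img src="/a.gif" alt="smile" width="160" height="120">'
        '<img src="/b.jpg" alt="wave" width="160" height="120">')

lazy = r'<img src="(.*?)" alt="(.*?)" width="160" height="120">'
greedy = r'<img src="(.*)" alt="(.*)" width="160" height="120">'

# Non-greedy groups stop at the first closing quote, so each tag matches separately.
lazy_matches = re.findall(lazy, html)

# Greedy groups grab as much as they can, so both tags collapse into one match.
greedy_matches = re.findall(greedy, html)

print(lazy_matches)
print(len(greedy_matches))
```

With the non-greedy pattern each tag yields its own (src, alt) pair, while the greedy one swallows everything between the first src and the last alt.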

Then call the findall function of the re module to extract the matching content:

result = re.findall(t, response.text)

Because the pattern contains two groups, the result is a list of tuples, each holding the image address (src) and its description (alt). Finally, we request each address from Python, download the image, and save it to a folder.
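Turning one (src, alt) pair into a download address and a local file name can be sketched as follows. The helper plan_download and the sample src value are hypothetical; the sketch also assumes that src may be a relative path, which urljoin resolves against the page URL while leaving absolute addresses unchanged.

```python
import os
from urllib.parse import urljoin, urlparse

PAGE_URL = 'https://qq.yh31.com/zjbq/'

def plan_download(page_url, src, alt, folder):
    """Return (absolute image URL, local file path) for one (src, alt) match."""
    image_url = urljoin(page_url, src)
    # Read the extension from the URL path only, so a query string such as
    # "?v=2" is not mistaken for part of the file type.
    extension = os.path.splitext(urlparse(image_url).path)[1]
    return image_url, os.path.join(folder, alt + extension)

# Hypothetical match, shaped like the tuples returned by re.findall.
url, path = plan_download(PAGE_URL, '/tp/zjbq/smile.gif', 'smile', 'images')
print(url)
print(path)
```

Using os.path.splitext instead of splitting on dots keeps the leading dot with the extension and returns an empty string when the address has no extension at all.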

Program code

import os
import re
from urllib.parse import urljoin

import requests

url = 'https://qq.yh31.com/zjbq/'
image = '表情包'  # output folder name ("emoticon packs")
os.makedirs(image, exist_ok=True)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0'
}
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'  # if the text comes out garbled, try 'GBK' instead
print(response.request.headers)
print(response.status_code)
t = r'<img src="(.*?)" alt="(.*?)" width="160" height="120">'
result = re.findall(t, response.text)
for img in result:  # each img is a (src, alt) tuple
    print(img)
    # urljoin leaves absolute addresses unchanged and resolves relative ones
    res = requests.get(urljoin(url, img[0]), headers=headers)
    print(res.status_code)
    s = img[0].split('.')[-1]  # take the file extension to get the image format, e.g. jpg or gif
    with open(os.path.join(image, img[1] + '.' + s), mode='wb') as file:
        file.write(res.content)
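The program uses the alt text directly as the file name, so a description containing a slash, a colon, or a question mark could make open() fail. One possible safeguard is sketched below; the helper safe_filename and its fallback name are hypothetical additions, not part of the original program.

```python
import re

def safe_filename(name, fallback='image'):
    """Replace characters that are not allowed in file names on common systems."""
    # Runs of reserved characters and whitespace each become one underscore.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', name).strip('_')
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or fallback

print(safe_filename('ha/ha: what?'))
print(safe_filename('???'))
```

In the loop above, this would mean writing safe_filename(img[1]) wherever img[1] is used to build the path.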

After the program finishes, the downloaded emoticons are in the 表情包 folder.

source:csdn.net