
How to obtain network data using Python web crawler


Using Python to obtain network data

Obtaining data from the Internet with Python is a very common task. Python's requests library is an HTTP client used to send HTTP requests to web servers.

We can use the requests library to send an HTTP GET request to a specified URL:

import requests

response = requests.get('http://www.example.com')

The response object holds the reply returned by the server; response.text gives the body of the response as text.
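
As a minimal sketch, it is usually worth checking the status code before reading the body (status_code and encoding are standard attributes of a requests response):

import requests

response = requests.get('http://www.example.com')
# 200 means the request succeeded
if response.status_code == 200:
    print(response.encoding)  # the encoding requests inferred from the headers
    print(response.text)      # response body decoded as text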

We can also download binary resources:

import requests

response = requests.get('http://www.example.com/image.png')
with open('image.png', 'wb') as f:
    f.write(response.content)

Use response.content to obtain the binary data returned by the server.
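
For large binary files, a hedged sketch using requests' streaming mode (stream=True together with iter_content, both part of the requests API; the URL below is a placeholder) writes the download in chunks instead of holding the whole body in memory:

import requests

# Stream the response so a large file is written chunk by chunk
response = requests.get('http://www.example.com/large-file.zip', stream=True)
with open('large-file.zip', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)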

Writing crawler code

A crawler is an automated program that fetches web page data over the network and stores it in a database or file. Crawlers are widely used in data collection, information monitoring, content analysis, and other fields. Python is a common language for writing crawlers because it is easy to learn, keeps code short, and has a rich library ecosystem.

We take "Douban Movie" as an example to introduce how to use Python to write crawler code. First, we use the requests library to get the HTML code of the web page, then treat the entire code as a long string, and use the capture group of the regular expression to extract the required content from the string.

The Douban Movie Top250 page is at https://movie.douban.com/top250?start=0, where the start parameter is the index of the first movie shown on the page. Each page displays 25 movies, so fetching the full Top250 takes 10 requests to https://movie.douban.com/top250?start=xxx: start=0 gives the first page, start=100 the fifth, and so on.
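
The pagination arithmetic is just (page - 1) * 25; a quick sketch that prints all 10 page URLs:

# page 1 -> start=0, page 2 -> start=25, ..., page 10 -> start=225
for page in range(1, 11):
    print(f'https://movie.douban.com/top250?start={(page - 1) * 25}')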

We take getting the title and rating of a movie as an example. The code is as follows:

import re
import requests
import time
import random

for page in range(1, 11):
    resp = requests.get(
        url=f'https://movie.douban.com/top250?start={(page - 1) * 25}',
        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    )
    # Match span tags whose class attribute is "title" and whose body does
    # not start with "&", capturing the tag content (the movie title)
    pattern1 = re.compile(r'<span class="title">([^&]*?)</span>')
    titles = pattern1.findall(resp.text)
    # Match span tags whose class attribute is "rating_num", capturing the rating
    pattern2 = re.compile(r'<span class="rating_num".*?>(.*?)</span>')
    ranks = pattern2.findall(resp.text)
    # Zip the two lists together and loop over all title/rating pairs
    for title, rank in zip(titles, ranks):
        print(title, rank)
    # Sleep a random 1-5 seconds to avoid crawling the site too frequently
    time.sleep(random.random() * 4 + 1)

In the code above, regular expressions match the span tags that carry each movie's title and rating, and capture groups extract the tag contents. zip pairs the two lists so we can iterate over all title/rating pairs together.
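
Since crawlers normally persist what they collect, here is a minimal sketch (the filename douban_top250.csv and the sample rows are assumptions) that writes the pairs to a CSV file with Python's standard csv module:

import csv

# Hypothetical data: in the crawler above, collect the (title, rank) pairs here
movies = [('肖申克的救赎', '9.7'), ('霸王别姬', '9.6')]

with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'rating'])  # header row
    writer.writerows(movies)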

Using an IP proxy

Many websites resent crawlers because they consume a lot of bandwidth and generate a lot of invalid traffic. To hide its identity, a crawler usually accesses a website through an IP proxy. Commercial IP proxy services (such as Mushroom Proxy, Sesame Proxy, and Fast Proxy) are a good choice: they keep the crawled website from learning the crawler's real IP address, so the crawler cannot be blocked simply by its IP.

Taking Mushroom Proxy as an example, you register an account on its website and purchase a package to obtain commercial proxy access. Mushroom Proxy offers two access modes: an API private proxy, where you obtain a proxy server address by calling Mushroom Proxy's API, and an HTTP tunnel proxy, where you use a fixed proxy server IP and port directly.
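
A hedged sketch of the API private-proxy mode (the endpoint URL and the plain-text response format below are assumptions; consult your provider's documentation for the real interface):

import requests

# Assumed endpoint that returns one 'ip:port' line per proxy
API_URL = 'https://proxy-provider.example.com/api/get_proxy'

resp = requests.get(API_URL, params={'count': 1})
proxy_address = resp.text.strip()  # e.g. '123.45.67.89:4242'

proxies = {
    'http': f'http://{proxy_address}',
    'https': f'http://{proxy_address}',
}
response = requests.get('http://www.example.com', proxies=proxies)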

The code for using IP proxy is as follows:

import requests

proxies = {
    'http': 'http://username:password@ip:port',
    'https': 'https://username:password@ip:port'
}
response = requests.get('http://www.example.com', proxies=proxies)

Here username and password are the credentials of your Mushroom Proxy account, and ip and port are the proxy server's IP address and port number. Note that different proxy providers use different access methods, so adjust the code to match your provider.
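
When a provider hands out several proxy servers, a simple sketch (the addresses below are placeholders) rotates among them so consecutive requests come from different IPs:

import random
import requests

# Placeholder proxy pool; replace with addresses from your provider
proxy_pool = [
    'http://username:password@10.0.0.1:3128',
    'http://username:password@10.0.0.2:3128',
    'http://username:password@10.0.0.3:3128',
]

for url in ['http://www.example.com/a', 'http://www.example.com/b']:
    proxy = random.choice(proxy_pool)  # pick a proxy at random per request
    resp = requests.get(url, proxies={'http': proxy, 'https': proxy})
    print(url, resp.status_code)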
