
How to obtain network data using Python web crawler


Using Python to obtain network data

Obtaining data from the Internet with Python is a very common task. Python's requests library is an HTTP client used to send HTTP requests to web servers.

We can use the requests library to send an HTTP GET request to a specified URL:

import requests

response = requests.get('http://www.example.com')

The response object holds the reply returned by the server; response.text gives the body of the response as text.
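
As a minimal sketch, it is usually worth checking the status code before reading the body (status_code and encoding are standard attributes of a requests response):

import requests

response = requests.get('http://www.example.com')
# 200 means the request succeeded
if response.status_code == 200:
    print(response.encoding)  # the encoding requests inferred from the headers
    print(response.text)      # response body decoded as text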

We can also download binary resources:

import requests

response = requests.get('http://www.example.com/image.png')
with open('image.png', 'wb') as f:
    f.write(response.content)

Use response.content to obtain the binary data returned by the server.
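
For large binary files, a hedged sketch using requests' streaming mode (stream=True together with iter_content, both part of the requests API; the URL below is a placeholder) writes the download in chunks instead of holding the whole body in memory:

import requests

# Stream the response so a large file is written chunk by chunk
response = requests.get('http://www.example.com/large-file.zip', stream=True)
with open('large-file.zip', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)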

Writing crawler code

A crawler is an automated program that fetches web page data over the network and stores it in a database or file. Crawlers are widely used in data collection, information monitoring, content analysis, and other fields. Python is a common language for writing crawlers because it is easy to learn, keeps code short, and has a rich library ecosystem.

We take "Douban Movie" as an example to introduce how to use Python to write crawler code. First, we use the requests library to get the HTML code of the web page, then treat the entire code as a long string, and use the capture group of the regular expression to extract the required content from the string.

The Douban Movie Top250 page is at https://movie.douban.com/top250?start=0, where the start parameter is the index of the first movie shown on the page. Each page displays 25 movies, so fetching the full Top250 takes 10 requests to https://movie.douban.com/top250?start=xxx: start=0 gives the first page, start=100 the fifth, and so on.
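
The pagination arithmetic is just (page - 1) * 25; a quick sketch that prints all 10 page URLs:

# page 1 -> start=0, page 2 -> start=25, ..., page 10 -> start=225
for page in range(1, 11):
    print(f'https://movie.douban.com/top250?start={(page - 1) * 25}')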

We take getting the title and rating of a movie as an example. The code is as follows:

import re
import requests
import time
import random

for page in range(1, 11):
    resp = requests.get(
        url=f'https://movie.douban.com/top250?start={(page - 1) * 25}',
        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    )
    # Match span tags whose class attribute is "title" and whose body does
    # not start with "&", capturing the tag content (the movie title)
    pattern1 = re.compile(r'<span class="title">([^&]*?)</span>')
    titles = pattern1.findall(resp.text)
    # Match span tags whose class attribute is "rating_num", capturing the rating
    pattern2 = re.compile(r'<span class="rating_num".*?>(.*?)</span>')
    ranks = pattern2.findall(resp.text)
    # Zip the two lists together and loop over all title/rating pairs
    for title, rank in zip(titles, ranks):
        print(title, rank)
    # Sleep a random 1-5 seconds to avoid crawling the site too frequently
    time.sleep(random.random() * 4 + 1)

In the code above, regular expressions match the span tags that carry each movie's title and rating, and capture groups extract the tag contents. zip pairs the two lists so we can iterate over all title/rating pairs together.
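
Since crawlers normally persist what they collect, here is a minimal sketch (the filename douban_top250.csv and the sample rows are assumptions) that writes the pairs to a CSV file with Python's standard csv module:

import csv

# Hypothetical data: in the crawler above, collect the (title, rank) pairs here
movies = [('肖申克的救赎', '9.7'), ('霸王别姬', '9.6')]

with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'rating'])  # header row
    writer.writerows(movies)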

Using an IP proxy

Many websites resent crawlers because they consume a lot of bandwidth and generate a lot of invalid traffic. To hide its identity, a crawler usually accesses a website through an IP proxy. Commercial IP proxy services (such as Mushroom Proxy, Sesame Proxy, and Fast Proxy) are a good choice: they keep the crawled website from learning the crawler's real IP address, so the crawler cannot be blocked simply by its IP.

Taking Mushroom Proxy as an example, you register an account on its website and purchase a package to obtain commercial proxy access. Mushroom Proxy offers two access modes: an API private proxy, where you obtain a proxy server address by calling Mushroom Proxy's API, and an HTTP tunnel proxy, where you use a fixed proxy server IP and port directly.
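
A hedged sketch of the API private-proxy mode (the endpoint URL and the plain-text response format below are assumptions; consult your provider's documentation for the real interface):

import requests

# Assumed endpoint that returns one 'ip:port' line per proxy
API_URL = 'https://proxy-provider.example.com/api/get_proxy'

resp = requests.get(API_URL, params={'count': 1})
proxy_address = resp.text.strip()  # e.g. '123.45.67.89:4242'

proxies = {
    'http': f'http://{proxy_address}',
    'https': f'http://{proxy_address}',
}
response = requests.get('http://www.example.com', proxies=proxies)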

The code for using IP proxy is as follows:

import requests

proxies = {
    'http': 'http://username:password@ip:port',
    'https': 'https://username:password@ip:port'
}
response = requests.get('http://www.example.com', proxies=proxies)

Here username and password are the credentials of your Mushroom Proxy account, and ip and port are the proxy server's IP address and port number. Note that different proxy providers use different access methods, so adjust the code to match your provider.
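
When a provider hands out several proxy servers, a simple sketch (the addresses below are placeholders) rotates among them so consecutive requests come from different IPs:

import random
import requests

# Placeholder proxy pool; replace with addresses from your provider
proxy_pool = [
    'http://username:password@10.0.0.1:3128',
    'http://username:password@10.0.0.2:3128',
    'http://username:password@10.0.0.3:3128',
]

for url in ['http://www.example.com/a', 'http://www.example.com/b']:
    proxy = random.choice(proxy_pool)  # pick a proxy at random per request
    resp = requests.get(url, proxies={'http': proxy, 'https': proxy})
    print(url, resp.status_code)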
