
Scrapy in action: crawling Baidu news data


With the development of the Internet, people's primary way of obtaining information has shifted from traditional media to the web, and news is increasingly consumed online. Researchers and analysts need large amounts of data for analysis, so this article introduces how to use Scrapy to crawl Baidu News data.

Scrapy is an open source Python crawler framework that can crawl website data quickly and efficiently. Scrapy provides powerful web page parsing and crawling functions, as well as good scalability and a high degree of customization.

Step 1: Install Scrapy

Before you start, you need to install Scrapy and BeautifulSoup. Installation can be completed with the following commands:

pip install scrapy
pip install beautifulsoup4
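To confirm the installation succeeded, you can print the installed Scrapy version (the exact output depends on your environment):

scrapy version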

Step 2: Create a Scrapy project

Create a Scrapy project through the following command:

scrapy startproject baiduNews

After the command executes, a folder named baiduNews is created in the current directory, containing the initial structure of a Scrapy project.
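The generated project typically has the following layout (details can vary slightly between Scrapy versions):

baiduNews/
    scrapy.cfg            # deployment configuration
    baiduNews/            # the project's Python package
        __init__.py
        items.py          # item definitions (Step 4)
        middlewares.py
        pipelines.py
        settings.py       # project settings
        spiders/          # spiders live here (Step 3)
            __init__.py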

Step 3: Write Spider

In Scrapy, a Spider is the component that crawls web content. We need to write a Spider to fetch data from the Baidu News site. The startproject command already created a spiders folder inside the project package (baiduNews/baiduNews/spiders); create a Python file there (for example, baidu_spider.py) following the spider template below.

import scrapy

class BaiduSpider(scrapy.Spider):
    name = "baidu"                    # spider name used by "scrapy crawl"
    start_urls = [
        "http://news.baidu.com/"      # page the spider starts crawling from
    ]

    def parse(self, response):
        pass                          # extraction logic is filled in below

In the code above, we import the Scrapy library and create a class named BaiduSpider. In the class, we define start_urls, a list containing the Baidu News URL to crawl. The parse method is the callback where data extraction happens; for now it is just an empty stub. Next, we fill it in to extract the news data.

import scrapy
from baiduNews.items import BaidunewsItem
from bs4 import BeautifulSoup

class BaiduSpider(scrapy.Spider):
    name = "baidu"
    start_urls = [
        "http://news.baidu.com/"
    ]

    def parse(self, response):
        # Parse the downloaded page with BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")

        # Each headline sits in a div with the class hdline_article_tit
        results = soup.find_all("div", class_="hdline_article_tit")
        for res in results:
            item = BaidunewsItem()
            # get_text() is safer than .string, which returns None
            # when the <a> tag contains nested elements
            item["title"] = res.a.get_text(strip=True)
            item["url"] = res.a.get("href", "").strip()
            item["source"] = "百度新闻"  # "Baidu News"
            yield item

In the code above, we first parse the page with BeautifulSoup and then find all elements with the class hdline_article_tit, which contain Baidu News headlines. For each match we create a BaidunewsItem object, fill in its title, URL, and source, and hand it back to Scrapy with the yield statement. Note that Baidu may change its page markup over time, so the class name may need to be adjusted against the live site.
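As an aside, the same extraction can be written without BeautifulSoup using Scrapy's built-in CSS selectors; here is a minimal sketch of the parse method, assuming the same hdline_article_tit markup:

    def parse(self, response):
        # Select every headline link directly with a CSS selector
        for link in response.css("div.hdline_article_tit a"):
            item = BaidunewsItem()
            item["title"] = link.css("::text").get(default="").strip()
            item["url"] = link.attrib.get("href", "").strip()
            item["source"] = "百度新闻"  # "Baidu News"
            yield item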

Step 4: Define Item

In Scrapy, an Item defines the structure of the crawled data. We define an Item in the project's items.py file (baiduNews/baiduNews/items.py).

import scrapy

class BaidunewsItem(scrapy.Item):
    title = scrapy.Field()    # headline text
    url = scrapy.Field()      # link to the article
    source = scrapy.Field()   # fixed source label

Step 5: Start Spider and output data

We only need to run the following command to start the Spider and output data:

scrapy crawl baidu -o baiduNews.csv

After the command finishes, a file named baiduNews.csv is created in the project root directory, containing all the crawled news data.
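Scrapy's feed exports also support other formats, for example -o baiduNews.json or -o baiduNews.xml. As a quick sanity check, here is a minimal sketch that reads the exported CSV back in Python (assuming it is run from the project root):

import csv

# Print each crawled headline and its URL
with open("baiduNews.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["title"], row["url"])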

Summary

With Scrapy, we can quickly and efficiently fetch Baidu News data and save it locally. Scrapy scales well and supports output in multiple data formats. This article introduced only a simple Scrapy use case; the framework has many more powerful features to explore.

