Analysis of issues related to Scrapy crawling Mafengwo data


As the Internet continues to grow, acquiring and processing data has become increasingly important, and crawler technology is one of the key tools for doing so. Scrapy, a popular Python crawler framework, makes data crawling convenient while offering efficiency, stability, and scalability. Below, we use crawling the Mafengwo travel website as an example to analyze common issues in Scrapy crawler development.

1. How to circumvent the anti-crawler mechanism?

In the process of crawling website data, we often run into anti-crawler restrictions. Here are several ways to work around them (a combined configuration sketch follows the list):

(1) Set request headers: add header fields such as User-Agent to each request so the crawler looks like an ordinary browser user.

(2) Set a request delay: throttle requests to a reasonable rate instead of firing them as fast as possible.

(3) Use proxy IPs: routing requests through proxy IPs can effectively sidestep per-IP rate limits and bans.
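
A minimal configuration sketch combining all three techniques; the User-Agent string and proxy address are placeholder assumptions:

# settings.py -- the values below are illustrative placeholders
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'   # (1) browser-like header
DOWNLOAD_DELAY = 2                                         # (2) seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True                            # jitter the delay slightly

# (3) A proxy can be attached per request in the spider; the address is a placeholder:
#     yield scrapy.Request(url, meta={'proxy': 'http://proxy.example.com:8080'})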

2. How to parse web page data?

Parsing web page data is a key step in Scrapy crawler technology. Scrapy has two built-in parsing mechanisms: XPath and CSS selectors.

(1) XPath: XPath is a query language for XML (and HTML) documents. Scrapy's XPath selectors make it easy to extract text, attributes, and other content.

For example, to get all the tourist destination names and links on Mafengwo's homepage, you can use code like the following (the two methods are assumed to live inside a scrapy.Spider subclass that defines self.url and self.headers):

import scrapy
from scrapy.selector import Selector

def start_requests(self):
    yield scrapy.Request(url=self.url, callback=self.parse, headers=self.headers)

def parse(self, response):
    sel = Selector(response)
    # Each <a> inside the homepage hot list holds one destination
    items = sel.xpath('//div[@class="hot-list cf"]/div[@class="hot-list-item"]/a')
    for item in items:
        destination = item.xpath('./text()').extract_first()  # safer than extract()[0]
        link = item.xpath('./@href').extract_first()
        yield {
            'destination': destination,
            'link': link
        }

(2) CSS selectors: CSS selectors are a more intuitive way to select elements. Scrapy's CSS selectors make it just as easy to extract tags, attributes, and other content.

Taking the same example of extracting destination names and links from Mafengwo's homepage, the CSS-selector version is as follows:

def start_requests(self):
    yield scrapy.Request(url=self.url, callback=self.parse, headers=self.headers)

def parse(self, response):
    # response.css() can be called directly; no explicit Selector is needed
    items = response.css('.hot-list-item > a')
    for item in items:
        destination = item.css('::text').extract_first()   # ::text grabs the node's text
        link = item.css('::attr(href)').extract_first()    # ::attr(href) grabs the attribute
        yield {
            'destination': destination,
            'link': link
        }

3. How to achieve data persistence?

In the process of crawling website data, we usually save the data for subsequent analysis and use. There are two common approaches to persistent storage: file storage and database storage.

(1) File storage: use Python's built-in file operations to save the crawled data to a local file.

For example, an item pipeline like the following saves data to a .csv file (the class name CsvWriterPipeline is illustrative):

import csv

class CsvWriterPipeline:  # illustrative class name
    # Scrapy calls process_item() once for every scraped item
    def process_item(self, item, spider):
        # Append one row per item; utf-8 keeps Chinese text intact
        with open('data.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow([item['destination'], item['link']])
        return item
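
For the pipeline to actually run, it must be registered in the project's settings.py. A minimal sketch, assuming the project module is named mafengwo and the pipeline class above is CsvWriterPipeline:

# settings.py -- the module path is an assumption about the project layout
ITEM_PIPELINES = {
    'mafengwo.pipelines.CsvWriterPipeline': 300,  # lower number = earlier in the chain
}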

(2) Database storage: use a relational database such as MySQL or SQLite to store the data persistently.

For example, an item pipeline like the following saves data to a MySQL database (the class name MySQLPipeline is illustrative):

import pymysql

class MySQLPipeline:  # illustrative class name
    # db_settings is expected to provide HOST, PORT, USER, PASSWORD and DB
    def __init__(self, db_settings):
        self.host = db_settings['HOST']
        self.port = db_settings['PORT']
        self.user = db_settings['USER']
        self.password = db_settings['PASSWORD']
        self.db = db_settings['DB']
        try:
            self.conn = pymysql.connect(
                host=self.host,
                port=self.port,
                user=self.user,
                password=self.password,
                db=self.db,
                charset='utf8'
            )
            self.cursor = self.conn.cursor()
        except Exception as e:
            print(e)

    def process_item(self, item, spider):
        # Parameterised query; pymysql handles the escaping
        sql = "INSERT INTO destination(name, link) VALUES(%s, %s)"
        self.cursor.execute(sql, (item['destination'], item['link']))
        self.conn.commit()
        return item
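
Because the constructor above expects a db_settings dict, Scrapy needs to be told how to build the pipeline. A minimal sketch using Scrapy's standard from_crawler hook; the DB_SETTINGS setting name is an assumption:

# Add inside the MySQLPipeline class shown above
@classmethod
def from_crawler(cls, crawler):
    # DB_SETTINGS is a hypothetical dict in settings.py, e.g.:
    # DB_SETTINGS = {'HOST': 'localhost', 'PORT': 3306, 'USER': 'root',
    #                'PASSWORD': 'secret', 'DB': 'mafengwo'}
    return cls(crawler.settings.get('DB_SETTINGS'))

Like the CSV pipeline, this class must also be registered under ITEM_PIPELINES in settings.py.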

Summary

Through the analysis of the three questions above, we can see that Scrapy is a powerful, easy-to-use crawler framework that lets us implement data crawling, data parsing, and data persistence with little effort. Of course, real projects bring all kinds of problems and challenges, which require continued learning and improvement.
