Scrape but Validate: Data Scraping with Pydantic Validation

Susan Sarandon

Nov 22, 2024, 07:40 AM

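Note: this is not the output of ChatGPT or any other LLM.

Data scraping is the process of collecting data from public web sources, mostly in an automated way using scripts. Because of the automation, the collected data often contains errors and needs to be filtered and cleaned before use. It is better still if the scraped data can be validated during scraping itself.

Considering the need for data validation, most scraping frameworks, such as Scrapy, have built-in patterns that can be used for data validation. During data scraping, however, we often just use general-purpose modules such as requests and beautifulsoup. In that case it is hard to validate the collected data, so this blog post explains a simple approach to data scraping with validation using Pydantic.

Pydantic (https://docs.pydantic.dev/latest/) is a Python data validation module. It is also the backbone of the popular API framework FastAPI. Like Pydantic, there are other Python modules that can be used for validation during data scraping. This blog explores Pydantic; here is a link to an alternative package (as a learning exercise, you can try replacing Pydantic with any other module):

Cerberus is a lightweight and extensible data validation library for Python. https://pypi.org/project/Cerberus/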

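Scraping plan:

In this blog, we will scrape quotes from a quotes site. We will:
1) use requests and beautifulsoup to fetch the data
2) create a pydantic data class to validate each scraped record
3) save the filtered and validated data in a json file

For better organization and understanding, each step is implemented as a Python method that can be used under the main section.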

Basic imports

import requests  # for web requests
from bs4 import BeautifulSoup  # parsing html content

# pydantic for validation
from pydantic import BaseModel, field_validator, ValidationError

import json

1. Target site and getting the quotes

We are using http://quotes.toscrape.com/ to scrape the quotes. Each quote has three fields: quote_text, author, and tags.

[Figure: an example quote from the site, showing its text, author, and tags]
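To make the three fields concrete, here is a minimal sketch of what the markup of one quote might look like and how the fields map onto it. The HTML string is an assumption reconstructed from the selectors used later in this post (a div with class quote, a span with class text, a small with class author, and a div with class tags containing links). The quote text, author, and tags are made-up placeholders, not content copied from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for a single quote; the real page may differ in detail.
SAMPLE_QUOTE_HTML = """
<div class="quote">
  <span class="text">“An example quote.”</span>
  <span>by <small class="author">Example Author</small></span>
  <div class="tags">
    <a class="tag" href="#">first</a>
    <a class="tag" href="#">second</a>
    <a class="tag" href="#">third</a>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_QUOTE_HTML, "html.parser")
quote = soup.find("div", class_="quote")

# Map the three fields onto the markup
sample_fields = {
    "quote_text": quote.find("span", class_="text").get_text(),
    "author": quote.find("small", class_="author").get_text(),
    "tags": [a.get_text() for a in quote.find("div", class_="tags").find_all("a")],
}
print(sample_fields)
```

Note that the quote text still carries its curly quotation marks at this point; removing them is left to the validation step.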

The method below is a generic script to get the HTML content of a given URL.

def get_html_content(page_url: str) -> str:
    page_content = ""
    # Send a GET request to the website
    response = requests.get(page_url)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # response.text is the decoded body (str), matching the return type
        page_content = response.text
    else:
        page_content = f'Failed to retrieve the webpage. Status code: {response.status_code}'
    return page_content
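One caveat with the method above: on a non-200 response it returns the error message as if it were page content, so the parser downstream receives a sentence instead of HTML. A hedged alternative, not the approach used in this post, is to raise an exception instead. The sketch below relies on raise_for_status() and the timeout argument from requests; the function name fetch_html and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(page_url: str, timeout: float = 10.0) -> str:
    """Return the decoded HTML of page_url, raising on HTTP errors.

    fetch_html is a hypothetical name; the timeout value is an assumption.
    """
    response = requests.get(page_url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example call (requires network access):
#     html = fetch_html("http://quotes.toscrape.com/")
# Connection problems and HTTP errors both surface as
# requests.RequestException subclasses, so one except clause covers them.
```

With this variant the caller decides what to do on failure, rather than discovering it later as an empty list of quotes.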

2. Scraping the quote data

We will use requests and beautifulsoup to scrape data from the given URL. The process has three parts: 1) get the HTML content from the web, 2) extract the required HTML tags for each target field, 3) get the values from each tag.

def get_quotes_div(html_content: str) -> list:
    # Parse the page content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all the quotes on the page
    quotes = soup.find_all('div', class_='quote')

    return quotes

The script below gets the data points from each quote's div.

    # Loop through each quote and extract the text, author, and tags
    for quote in quotes_div:
        quote_text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        tags = get_tags(quote.find('div', class_='tags'))

        # collect the data in a dictionary
        quote_temp = {'quote_text': quote_text,
                      'author': author,
                      'tags': tags
        }

def get_tags(tags_div):
    # Return the text of every link inside the tags div
    return [tag.get_text() for tag in tags_div.find_all('a')]

3. Creating a Pydantic data class and validating each quote's data

Based on the fields of each quote, create a pydantic class and use the same class for data validation during scraping.

pydantic model Quote

Below is the Quote class, extended from BaseModel, with three fields: quote_text, author, and tags. Of these, quote_text and author are of string (str) type, and tags is a list.

We have two validator methods (with decorators):

1) tags_more_than_two(): checks that the quote has more than two tags. (This is just an example; you can have any rule here.)

2) check_quote_text(): removes the surrounding “ and ” from the quote text and returns the cleaned text.

class Quote(BaseModel):
    quote_text: str
    author: str
    tags: list

    @field_validator('tags')
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        if len(tags_list) <= 2:
            raise ValueError("There should be more than two tags.")
        return tags_list

    @field_validator('quote_text')
    @classmethod
    def check_quote_text(cls, quote_text: str) -> str:
        # str.removeprefix / str.removesuffix require Python 3.9+
        return quote_text.removeprefix('“').removesuffix('”')

Fetching and validating data

Data validation with pydantic is very simple. For example, the code below passes the scraped data to the pydantic class Quote.

quote_data = Quote(**quote_temp)
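To see what the validators do in practice, here is a small sketch that feeds the Quote model one record that passes and one that fails. The class is repeated so the block runs on its own; the sample records are made-up values, not data from the site.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Quote(BaseModel):
    quote_text: str
    author: str
    tags: list

    @field_validator('tags')
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        if len(tags_list) <= 2:
            raise ValueError("There should be more than two tags.")
        return tags_list

    @field_validator('quote_text')
    @classmethod
    def check_quote_text(cls, quote_text: str) -> str:
        return quote_text.removeprefix('“').removesuffix('”')


# Hypothetical records for illustration only
good = {'quote_text': '“An example quote.”', 'author': 'Example Author',
        'tags': ['first', 'second', 'third']}
bad = {'quote_text': '“Another example.”', 'author': 'Example Author',
       'tags': ['only-one']}

validated = Quote(**good)
print(validated.quote_text)  # curly quotes stripped by check_quote_text

try:
    Quote(**bad)
    rejected = False
except ValidationError as e:
    # The tags rule fails because the record has only one tag
    rejected = True
    print(e.error_count(), "validation error(s)")
```

The failing record never becomes a Quote instance, which is what lets the scraper skip it instead of saving it.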

4. Storing the data

Once the data is validated, it is saved to a json file. (A generic method is written that converts a Python dictionary to a json file.)
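The generic save method itself is not shown here, so the following is only a minimal sketch of what such a helper could look like. The function name save_to_json, its parameters, and the file name quotes.json are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_to_json(data: list, file_path: str) -> None:
    """Write a list of dictionaries to file_path as JSON (hypothetical helper)."""
    with open(file_path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(data, f, ensure_ascii=False, indent=4)


# Usage with a made-up record, written to the system temporary directory
sample = [{'quote_text': 'An example quote.', 'author': 'Example Author',
           'tags': ['first', 'second', 'third']}]
out_path = os.path.join(tempfile.gettempdir(), 'quotes.json')
save_to_json(sample, out_path)

# Read the file back to confirm the round trip
with open(out_path, encoding='utf-8') as f:
    loaded = json.load(f)
```

Because the validated records are plain dictionaries produced by model_dump(), they can be passed to json.dump without any custom encoder.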

Putting it all together

Now that each piece of the scraping is understood, you can put everything together and run the scraping to collect data.

def get_quotes_data(quotes_div: list) -> list:
    quotes_data = []

    # Loop through each quote and extract the text, author, and tags
    for quote in quotes_div:
        quote_text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        tags = get_tags(quote.find('div', class_='tags'))

        # collect the data in a dictionary
        quote_temp = {'quote_text': quote_text,
                      'author': author,
                      'tags': tags
        }

        # validate data with Pydantic model
        try:
            quote_data = Quote(**quote_temp)
            quotes_data.append(quote_data.model_dump())
        except ValidationError as e:
            # report the invalid record and skip it
            print(e.json())
    return quotes_data
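A main section tying the steps together might look like the sketch below. It is an assumption about how the pieces could be wired up, not the author's exact script. The compact helpers mirror the methods described in this post so that the block runs on its own, and the names scrape_quotes, save_to_json, and main, the output file quotes_data.json, and the timeout are hypothetical.

```python
import json

import requests
from bs4 import BeautifulSoup
from pydantic import BaseModel, ValidationError, field_validator


class Quote(BaseModel):
    quote_text: str
    author: str
    tags: list

    @field_validator('tags')
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        if len(tags_list) <= 2:
            raise ValueError("There should be more than two tags.")
        return tags_list

    @field_validator('quote_text')
    @classmethod
    def check_quote_text(cls, quote_text: str) -> str:
        return quote_text.removeprefix('“').removesuffix('”')


def scrape_quotes(html_content: str) -> list:
    """Extract and validate every quote found in html_content."""
    soup = BeautifulSoup(html_content, 'html.parser')
    quotes_data = []
    for quote in soup.find_all('div', class_='quote'):
        tags_div = quote.find('div', class_='tags')
        quote_temp = {
            'quote_text': quote.find('span', class_='text').get_text(),
            'author': quote.find('small', class_='author').get_text(),
            'tags': [a.get_text() for a in tags_div.find_all('a')],
        }
        try:
            quotes_data.append(Quote(**quote_temp).model_dump())
        except ValidationError as e:
            # Invalid records are reported and skipped
            print(e.json())
    return quotes_data


def save_to_json(data: list, file_path: str) -> None:
    """Write the validated records to a JSON file."""
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


def main(page_url: str = 'http://quotes.toscrape.com/') -> None:
    """Fetch one page, validate its quotes, and save them."""
    response = requests.get(page_url, timeout=10)
    response.raise_for_status()
    quotes_data = scrape_quotes(response.text)
    save_to_json(quotes_data, 'quotes_data.json')
    print(f'Saved {len(quotes_data)} validated quotes')


# Call main() to run against the live site (requires network access).
```

Keeping the network call inside main() means scrape_quotes can be exercised on any HTML string, which makes the validation rules easy to try out offline.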

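Note: a revision is planned; please let me know your thoughts or suggestions so they can be included in the revised version.

Links and resources: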

https://pypi.org/project/parsel/
https://docs.pydantic.dev/latest/
