Scrape but Validate: Data scraping with Pydantic Validation

Susan Sarandon

Nov 22, 2024, 07:40 AM

Note: Not an output of ChatGPT/LLM

Data scraping is the process of collecting data from public web sources, mostly done with a script in an automated way. Because of the automation, the collected data often has errors and needs to be filtered and cleaned before use. However, it is better if the scraped data can be validated during scraping.

Considering the data validation requirement, most scraping frameworks like Scrapy have built-in patterns that can be used for data validation. However, during the data scraping process we often just use general-purpose modules like requests and beautifulsoup. In such cases it is hard to validate the collected data, so this blog post explains a simple approach for data scraping with validation using Pydantic.

Pydantic (https://docs.pydantic.dev/latest/) is a data validation Python module. It is also the backbone of the popular API module FastAPI. Like Pydantic, there are other Python modules that can be used for validation during data scraping. However, this blog explores Pydantic; here is a link to an alternative package (you can try swapping Pydantic for any other module as a learning exercise):

Cerberus is a lightweight and extensible data validation library for Python. https://pypi.org/project/Cerberus/
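Before the scraping steps, it may help to see what "validation" means in practice. The following is only a minimal sketch, assuming Pydantic v2 is installed; the Book model and its values are made up for illustration and are not part of the scraper.

```python
from pydantic import BaseModel, ValidationError


class Book(BaseModel):
    # Hypothetical model, used only to illustrate the idea
    title: str
    pages: int


# A well-formed record becomes a typed object
book = Book(title="Example Book", pages=120)
print(book.pages)

# A malformed record raises ValidationError instead of slipping through
try:
    Book(title="Example Book", pages="many")
except ValidationError as exc:
    print(exc.error_count(), "validation error")
```

The same pattern is used for the quotes later on: describe the expected fields once, then let the model accept or reject each scraped record.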

Plan of scraping:

In this blog, we will scrape quotes from the quotes site.
1) Use requests and beautifulsoup to get the data
2) Create a pydantic data class to validate each scraped record
3) Save the filtered and validated data in a JSON file

For better arrangement and understanding, each step is implemented as a Python method that can be used under the main section.

Basic import

import requests # for web request
from bs4 import BeautifulSoup # cleaning html content

# pydantic for validation

from pydantic import BaseModel, field_validator, ValidationError

import json

1. Target site and getting quotes

We are using http://quotes.toscrape.com/ to scrape the quotes. Each quote has three fields: quote_text, author, and tags.
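To make the three fields concrete, here is a small sketch that pulls them out of a piece of markup. The HTML below is made up and only modelled on the selectors used in this post (a div with class quote, a span with class text, a small with class author, and a div with class tags); the real page may differ in detail.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, modelled on the selectors used in this post
sample_html = """
<div class="quote">
  <span class="text">“An example quote.”</span>
  <span>by <small class="author">Jane Doe</small></span>
  <div class="tags">
    <a class="tag" href="#">example</a>
    <a class="tag" href="#">sample</a>
    <a class="tag" href="#">demo</a>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
quote = soup.find("div", class_="quote")

# Build one record with the three fields described above
record = {
    "quote_text": quote.find("span", class_="text").get_text(),
    "author": quote.find("small", class_="author").get_text(),
    "tags": [a.get_text() for a in quote.find("div", class_="tags").find_all("a")],
}
print(record)
```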

The method below is a general script to get the HTML content for a given URL.

def get_html_content(page_url: str) -> str:
    page_content = ""
    # Send a GET request to the website
    response = requests.get(page_url)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # response.text is the decoded body (str), matching the return type
        page_content = response.text
    else:
        page_content = f'Failed to retrieve the webpage. Status code: {response.status_code}'
    return page_content
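One consequence of the method above is that a failed request returns an error message as if it were page content, and the parser will then treat that message as HTML. A possible variation, sketched here under the assumption that raising an exception is acceptable for your script, is to let requests report the failure. The name fetch_html and the 10-second timeout are assumptions made for this sketch.

```python
import requests


def fetch_html(page_url: str, timeout: float = 10.0) -> str:
    # Hypothetical variant of get_html_content, not part of the original script
    response = requests.get(page_url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this version the caller decides what to do on failure, for example by wrapping the call in try/except requests.RequestException.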

2. Get the quote data from scraping

We will use requests and beautifulsoup to scrape the data from the given URLs. The process is broken into three parts: 1) get the HTML content from the web, 2) extract the desired HTML tags for each targeted field, and 3) get the values from each tag.

The method below parses the HTML content and returns the div of every quote on the page.

def get_quotes_div(html_content: str) -> list:
    # Parse the page content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all the quotes on the page
    quotes = soup.find_all('div', class_='quote')

    return quotes

The script below gets the data points from each quote's div.

def get_tags(tags):
    # Collect the text of every <a> element inside the tags div
    tags = [tag.get_text() for tag in tags.find_all('a')]
    return tags

# Loop through each quote and extract the text, author and tags
for quote in quotes_div:
    quote_text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    tags = get_tags(quote.find('div', class_='tags'))

    # Collect the data in a dictionary
    quote_temp = {'quote_text': quote_text,
                  'author': author,
                  'tags': tags}

3. Create Pydantic dataclass and validate the data for each quote

Based on the fields of each quote, create a pydantic class and use the same class for data validation during scraping.

The pydantic model Quote

Below is the Quote class, which extends BaseModel and has three fields: quote_text, author, and tags. Of these three, quote_text and author are strings (str) and tags is a list.

We have two validator methods (with decorators):

1) tags_more_than_two(): checks that a quote has more than two tags. (This is just an example; you can have any rule here.)

2) check_quote_text(): removes the surrounding quotation marks (“ ”) from the quote text.

class Quote(BaseModel):
    quote_text: str
    author: str
    tags: list

    @field_validator('tags')
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        # Reject quotes that have two tags or fewer
        if len(tags_list) <= 2:
            raise ValueError("There should be more than two tags.")
        return tags_list

    @field_validator('quote_text')
    @classmethod
    def check_quote_text(cls, quote_text: str) -> str:
        # Strip the opening and closing curly quotes
        return quote_text.removeprefix('“').removesuffix('”')

Getting and validating data

Data validation is very easy with pydantic. For example, the code below passes the scraped data to the pydantic class Quote.

quote_data = Quote(**quote_temp)

4. Store the data

Once the data is validated, it is saved to a JSON file. (A general-purpose method is written that converts a Python dictionary to a JSON file.)
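The helper for this step is not listed here, so the following is only a minimal sketch of what it could look like. The name save_to_json and its parameters are assumptions, and it expects the list of dictionaries produced by model_dump().

```python
import json


def save_to_json(data: list, file_name: str) -> None:
    # Hypothetical helper: write validated records to a JSON file
    with open(file_name, "w", encoding="utf-8") as json_file:
        # ensure_ascii=False keeps characters such as curly quotes readable
        json.dump(data, json_file, ensure_ascii=False, indent=4)
```

A call such as save_to_json(quotes_data, "quotes.json") would then write the validated quotes to disk; the file name is only an example.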

Putting all together

After understanding each piece of scraping, you can now put it all together and run the scraping for data collection.

def get_quotes_data(quotes_div: list) -> list:
    quotes_data = []

    # Loop through each quote and extract the text, author and tags
    for quote in quotes_div:
        quote_text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        tags = get_tags(quote.find('div', class_='tags'))

        # Collect the data in a dictionary
        quote_temp = {'quote_text': quote_text,
                      'author': author,
                      'tags': tags}

        # Validate the data with the Pydantic model
        try:
            quote_data = Quote(**quote_temp)
            quotes_data.append(quote_data.model_dump())
        except ValidationError as e:
            # Report the invalid quote and skip it
            print(e.json())
    return quotes_data
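get_quotes_data prints the validation error for a rejected quote and moves on. If you also want to keep track of what was rejected and why, one possible variation is sketched below. The function name split_valid_invalid and the sample records are made up for illustration, and only the tags rule from the Quote model is repeated so that the sketch stays self-contained.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Quote(BaseModel):
    quote_text: str
    author: str
    tags: list

    @field_validator("tags")
    @classmethod
    def tags_more_than_two(cls, tags_list: list) -> list:
        if len(tags_list) <= 2:
            raise ValueError("There should be more than two tags.")
        return tags_list


def split_valid_invalid(records: list) -> tuple:
    # Hypothetical helper: separate records that pass validation from those that fail
    valid, rejected = [], []
    for record in records:
        try:
            valid.append(Quote(**record).model_dump())
        except ValidationError as exc:
            # exc.errors() returns a list of dicts; "loc" names the failing field
            failed = [".".join(str(part) for part in err["loc"]) for err in exc.errors()]
            rejected.append({"record": record, "failed_fields": failed})
    return valid, rejected


# Made-up records: the second one has too few tags
records = [
    {"quote_text": "First made-up quote.", "author": "Jane Doe", "tags": ["a", "b", "c"]},
    {"quote_text": "Second made-up quote.", "author": "John Roe", "tags": ["a"]},
]
valid, rejected = split_valid_invalid(records)
print(len(valid), "valid,", len(rejected), "rejected")
```

Keeping the rejected records next to the names of the failing fields makes it easier to decide later whether a rule is too strict or the page layout has changed.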

Note: A revision is planned; let me know your ideas or suggestions to include in the revised version.

Links and resources:

https://pypi.org/project/parsel/
https://docs.pydantic.dev/latest/
