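Refactoring Based on Design Principles: A Data Collection Crawler Example
Introduction
Improving code quality is a constant concern in software development. This article takes a data collection crawler as an example and walks through how design principles and best practices can be applied by refactoring step by step.
Code before improvement
We start with a very simple web scraper that packs all of its functionality into a single class.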
project_root/
├── web_scraper.py
├── main.py
└── requirements.txt
web_scraper.py
import json
import sqlite3

import requests


class WebScraper:
    def __init__(self, url):
        self.url = url

    def fetch_data(self):
        response = requests.get(self.url)
        data = response.text
        parsed_data = self.parse_data(data)
        enriched_data = self.enrich_data(parsed_data)
        self.save_data(enriched_data)
        return enriched_data

    def parse_data(self, data):
        return json.loads(data)

    def enrich_data(self, data):
        # Apply business logic here
        # Example: extract only data containing specific keywords
        return {k: v for k, v in data.items() if 'important' in v.lower()}

    def save_data(self, data):
        conn = sqlite3.connect('test.db')
        cursor = conn.cursor()
        cursor.execute('INSERT INTO data (json_data) VALUES (?)', (json.dumps(data),))
        conn.commit()
        conn.close()
main.py
from web_scraper import WebScraper


def main():
    scraper = WebScraper('https://example.com/api/data')
    data = scraper.fetch_data()
    print(data)


if __name__ == "__main__":
    main()
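Areas for improvement
- Violates the single responsibility principle: one class handles fetching, parsing, enriching, and storing the data
- Unclear business logic: the business logic lives in the enrich_data method but is mixed in with the other processing
- Poor reusability: the functions are tightly coupled, which makes them hard to reuse on their own
- Hard to test: individual functions are difficult to test in isolation
- Rigid configuration: the database path and other settings are hard-coded
Refactoring stages
1. Separating responsibilities: data fetching, parsing, and storage
- Main change: split the responsibilities for fetching, parsing, and storing data into separate classes
- Goal: apply the single responsibility principle and introduce environment variables
Directory structure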
project_root/
├── data_fetcher.py
├── data_parser.py
├── data_saver.py
├── data_enricher.py
├── web_scraper.py
├── main.py
└── requirements.txt
data_enricher.py
class DataEnricher:
    def enrich(self, data):
        return {k: v for k, v in data.items() if 'important' in v.lower()}
web_scraper.py
from data_fetcher import DataFetcher
from data_parser import DataParser
from data_enricher import DataEnricher
from data_saver import DataSaver


class WebScraper:
    def __init__(self, url):
        self.url = url
        self.fetcher = DataFetcher()
        self.parser = DataParser()
        self.enricher = DataEnricher()
        self.saver = DataSaver()

    def fetch_data(self):
        raw_data = self.fetcher.fetch(self.url)
        parsed_data = self.parser.parse(raw_data)
        enriched_data = self.enricher.enrich(parsed_data)
        self.saver.save(enriched_data)
        return enriched_data
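This change makes each class's responsibility explicit and improves reusability and testability. However, the business logic is still embedded in the DataEnricher class.
2. Introducing interfaces and dependency injection
- Main change: introduce interfaces and implement dependency injection
- Goal: increase flexibility and extensibility, extend the use of environment variables, and abstract the business logic
Directory structure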
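The directory tree lists data_parser.py and data_saver.py, but the article does not show them. As a minimal sketch, assuming they simply carry over the matching methods of the original WebScraper, they might look like this. The `DB_PATH` environment variable name and the `CREATE TABLE IF NOT EXISTS` statement are assumptions added for illustration; the original code hard-codes `test.db` and expects the `data` table to exist already.

```python
import json
import os
import sqlite3


class DataParser:
    """Parsing only: turns a raw JSON string into a dictionary."""

    def parse(self, raw_data):
        return json.loads(raw_data)


class DataSaver:
    """Storage only: writes one JSON document per row into SQLite."""

    def __init__(self, db_path=None):
        # DB_PATH is a hypothetical variable name; 'test.db' is the path
        # that was hard-coded in the original class.
        self.db_path = db_path or os.getenv('DB_PATH', 'test.db')

    def save(self, data):
        conn = sqlite3.connect(self.db_path)
        try:
            # Assumption: create the table on first use so the sketch
            # also runs against an empty database file.
            conn.execute('CREATE TABLE IF NOT EXISTS data (json_data TEXT)')
            conn.execute('INSERT INTO data (json_data) VALUES (?)',
                         (json.dumps(data),))
            conn.commit()
        finally:
            # Close the connection even if the insert fails.
            conn.close()
```

A DataFetcher in the same style would presumably wrap the `requests.get(url).text` call and nothing else.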
project_root/
├── interfaces/
│   ├── __init__.py
│   ├── data_fetcher_interface.py
│   ├── data_parser_interface.py
│   ├── data_enricher_interface.py
│   └── data_saver_interface.py
├── implementations/
│   ├── __init__.py
│   ├── http_data_fetcher.py
│   ├── json_data_parser.py
│   ├── keyword_data_enricher.py
│   └── sqlite_data_saver.py
├── web_scraper.py
├── main.py
└── requirements.txt
interfaces/data_fetcher_interface.py
from abc import ABC, abstractmethod


class DataFetcherInterface(ABC):
    @abstractmethod
    def fetch(self, url: str) -> str:
        pass

interfaces/data_parser_interface.py
from abc import ABC, abstractmethod
from typing import Dict, Any


class DataParserInterface(ABC):
    @abstractmethod
    def parse(self, raw_data: str) -> Dict[str, Any]:
        pass

interfaces/data_enricher_interface.py
from abc import ABC, abstractmethod
from typing import Dict, Any


class DataEnricherInterface(ABC):
    @abstractmethod
    def enrich(self, data: Dict[str, Any]) -> Dict[str, Any]:
        pass

interfaces/data_saver_interface.py
from abc import ABC, abstractmethod
from typing import Dict, Any


class DataSaverInterface(ABC):
    @abstractmethod
    def save(self, data: Dict[str, Any]) -> None:
        pass
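What `ABC` and `@abstractmethod` actually enforce can be shown in a few lines. The sketch below repeats DataFetcherInterface and adds two hypothetical subclasses, `StaticFetcher` and `IncompleteFetcher`, which are not part of the article's code.

```python
from abc import ABC, abstractmethod


class DataFetcherInterface(ABC):
    @abstractmethod
    def fetch(self, url: str) -> str:
        pass


class StaticFetcher(DataFetcherInterface):
    """Hypothetical implementation that returns a canned response."""

    def fetch(self, url: str) -> str:
        return '{"title": "important update"}'


class IncompleteFetcher(DataFetcherInterface):
    """Hypothetical subclass that forgets to implement fetch()."""


def try_instantiate(cls):
    """Return an instance, or the TypeError raised for an abstract class."""
    try:
        return cls()
    except TypeError as exc:
        return exc


ok = try_instantiate(StaticFetcher)
failed = try_instantiate(IncompleteFetcher)
print(type(ok).__name__)      # StaticFetcher
print(type(failed).__name__)  # TypeError
```

A class that leaves an abstract method unimplemented fails when it is instantiated, so a missing method surfaces at construction time rather than in the middle of a scraping run.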
implementations/keyword_data_enricher.py
import os

from interfaces.data_enricher_interface import DataEnricherInterface


class KeywordDataEnricher(DataEnricherInterface):
    def __init__(self):
        # The keyword is read once, when the enricher is created.
        self.keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data):
        # str(v) lets non-string values pass through the filter safely.
        return {k: v for k, v in data.items() if self.keyword in str(v).lower()}
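To see how the environment variable changes the filter, the sketch below repeats the enricher without its interface base class so that it runs on its own. The sample dictionary and the keyword `urgent` are made up for illustration.

```python
import os


class KeywordDataEnricher:
    def __init__(self):
        self.keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data):
        return {k: v for k, v in data.items()
                if self.keyword in str(v).lower()}


sample = {'a': 'Important notice', 'b': 'URGENT: disk full', 'c': 42}

# Without the variable set, the default keyword 'important' applies.
os.environ.pop('IMPORTANT_KEYWORD', None)
default_result = KeywordDataEnricher().enrich(sample)

# The variable is read in __init__, so a new instance picks up the change.
os.environ['IMPORTANT_KEYWORD'] = 'urgent'
urgent_result = KeywordDataEnricher().enrich(sample)

print(default_result)  # {'a': 'Important notice'}
print(urgent_result)   # {'b': 'URGENT: disk full'}
```

Because each value is lowercased before the comparison while the keyword is used as given, the keyword has to be written in lowercase: setting `IMPORTANT_KEYWORD=Urgent` would match nothing.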
web_scraper.py
from interfaces.data_fetcher_interface import DataFetcherInterface
from interfaces.data_parser_interface import DataParserInterface
from interfaces.data_enricher_interface import DataEnricherInterface
from interfaces.data_saver_interface import DataSaverInterface


class WebScraper:
    def __init__(self, fetcher: DataFetcherInterface, parser: DataParserInterface,
                 enricher: DataEnricherInterface, saver: DataSaverInterface):
        self.fetcher = fetcher
        self.parser = parser
        self.enricher = enricher
        self.saver = saver

    def fetch_data(self, url):
        raw_data = self.fetcher.fetch(url)
        parsed_data = self.parser.parse(raw_data)
        enriched_data = self.enricher.enrich(parsed_data)
        self.saver.save(enriched_data)
        return enriched_data
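The main changes at this stage are:
- Interfaces were introduced, making it easy to switch to different implementations
- Dependency injection makes the WebScraper class more flexible
- The fetch_data method now takes the URL as a parameter, so the URL no longer has to be fixed when the scraper is constructed
- The business logic has been abstracted as DataEnricherInterface and implemented as KeywordDataEnricher
- The keyword can be set through an environment variable, which makes the business logic more flexible
These changes greatly improve the flexibility and extensibility of the system. However, the business logic is still embedded in DataEnricherInterface and its implementation. The next step is to separate this business logic further and define it explicitly as a domain layer.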
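One practical payoff of constructor injection is that WebScraper can be exercised without a network connection or a database. The following is a minimal sketch: `FakeFetcher`, `JsonParser`, `KeywordEnricher` and `InMemorySaver` are hypothetical stand-ins written for this example, and they rely on duck typing instead of subclassing the interfaces so that the block runs by itself.

```python
import json


class WebScraper:
    """Same shape as the step 2 class, without the type annotations."""

    def __init__(self, fetcher, parser, enricher, saver):
        self.fetcher = fetcher
        self.parser = parser
        self.enricher = enricher
        self.saver = saver

    def fetch_data(self, url):
        raw_data = self.fetcher.fetch(url)
        parsed_data = self.parser.parse(raw_data)
        enriched_data = self.enricher.enrich(parsed_data)
        self.saver.save(enriched_data)
        return enriched_data


class FakeFetcher:
    """Returns a canned payload and records which URLs were requested."""

    def __init__(self, payload):
        self.payload = payload
        self.requested_urls = []

    def fetch(self, url):
        self.requested_urls.append(url)
        return self.payload


class JsonParser:
    def parse(self, raw_data):
        return json.loads(raw_data)


class KeywordEnricher:
    def enrich(self, data):
        return {k: v for k, v in data.items() if 'important' in str(v).lower()}


class InMemorySaver:
    """Collects saved items in a list instead of writing to SQLite."""

    def __init__(self):
        self.saved = []

    def save(self, data):
        self.saved.append(data)


fetcher = FakeFetcher('{"a": "Important news", "b": "noise"}')
saver = InMemorySaver()
scraper = WebScraper(fetcher, JsonParser(), KeywordEnricher(), saver)
result = scraper.fetch_data('https://example.com/api/data')

print(result)       # {'a': 'Important news'}
print(saver.saved)  # [{'a': 'Important news'}]
```

The same constructor accepts the real HTTP and SQLite implementations in production code, so the pipeline itself does not change between tests and deployment.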
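3. Introducing a domain layer and separating the business logic
In the previous step, introducing interfaces made the system more flexible. However, the business logic (here, deciding whether data is important and filtering it) is still treated as part of the data layer. Following the ideas of domain-driven design, treating this business logic as the central concept of the system and implementing it as a separate domain layer brings the following benefits:
- Business logic is managed in one place
- The domain model makes the code more expressive
- Business rules can be changed more flexibly
- Testing becomes easier
Updated directory structure: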
project_root/
├── domain/
│   ├── __init__.py
│   ├── scraped_data.py
│   └── data_enrichment_service.py
├── data/
│   ├── __init__.py
│   ├── interfaces/
│   │   ├── __init__.py
│   │   ├── data_fetcher_interface.py
│   │   ├── data_parser_interface.py
│   │   └── data_saver_interface.py
│   └── implementations/
│       ├── __init__.py
│       ├── http_data_fetcher.py
│       ├── json_data_parser.py
│       └── sqlite_data_saver.py
├── application/
│   ├── __init__.py
│   └── web_scraper.py
├── main.py
└── requirements.txt
At this stage, the roles of DataEnricherInterface and KeywordDataEnricher move to the ScrapedData model and the DataEnrichmentService in the domain layer. The details of this change are given below.
Before (step 2)
import os
from abc import ABC, abstractmethod
from typing import Any, Dict


class DataEnricherInterface(ABC):
    @abstractmethod
    def enrich(self, data: Dict[str, Any]) -> Dict[str, Any]:
        pass


class KeywordDataEnricher(DataEnricherInterface):
    def __init__(self):
        self.keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data):
        return {k: v for k, v in data.items() if self.keyword in str(v).lower()}
After (step 3)
import os
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ScrapedData:
    content: Dict[str, Any]
    source_url: str

    def is_important(self) -> bool:
        # The record is important if any value contains the keyword.
        important_keyword = os.getenv('IMPORTANT_KEYWORD', 'important')
        return any(important_keyword in str(v).lower() for v in self.content.values())


class DataEnrichmentService:
    def __init__(self):
        self.important_keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data: ScrapedData) -> ScrapedData:
        if data.is_important():
            # Keep only the entries whose value contains the keyword.
            enriched_content = {k: v for k, v in data.content.items()
                                if self.important_keyword in str(v).lower()}
            return ScrapedData(content=enriched_content, source_url=data.source_url)
        return data
This change brings the following improvements:
- The business logic has moved to the domain layer, which removes the need for DataEnricherInterface.
- The KeywordDataEnricher functionality has been merged into DataEnrichmentService, centralizing the business logic in one place.
- The is_important method has been added to the ScrapedData model. This makes the domain model itself responsible for determining the importance of data and makes the domain concept clearer.
- DataEnrichmentService now handles ScrapedData objects directly, improving type safety.
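The behaviour of the two domain classes can be checked in isolation. The sketch below repeats them with their imports and runs them on two made-up records; the sample contents and URLs are illustrative only.

```python
import os
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ScrapedData:
    content: Dict[str, Any]
    source_url: str

    def is_important(self) -> bool:
        keyword = os.getenv('IMPORTANT_KEYWORD', 'important')
        return any(keyword in str(v).lower() for v in self.content.values())


class DataEnrichmentService:
    def __init__(self):
        self.important_keyword = os.getenv('IMPORTANT_KEYWORD', 'important')

    def enrich(self, data: ScrapedData) -> ScrapedData:
        if data.is_important():
            kept = {k: v for k, v in data.content.items()
                    if self.important_keyword in str(v).lower()}
            return ScrapedData(content=kept, source_url=data.source_url)
        return data


# Make sure the default keyword 'important' is in effect for the demo.
os.environ.pop('IMPORTANT_KEYWORD', None)
service = DataEnrichmentService()

important = ScrapedData({'title': 'Important release', 'body': 'details'},
                        'https://example.com/a')
routine = ScrapedData({'title': 'Weekly digest'}, 'https://example.com/b')

filtered = service.enrich(important)
untouched = service.enrich(routine)

print(filtered.content)      # {'title': 'Important release'}
print(untouched is routine)  # True
```

Note that the keyword is looked up in two places: `is_important` reads the environment on every call, while the service reads it once in `__init__`. If the variable changed between those two moments the classes would disagree, so a further refinement might pass the keyword in explicitly.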
The WebScraper class will also be updated to reflect this change.
from data.interfaces.data_fetcher_interface import DataFetcherInterface
from data.interfaces.data_parser_interface import DataParserInterface
from data.interfaces.data_saver_interface import DataSaverInterface
from domain.scraped_data import ScrapedData
from domain.data_enrichment_service import DataEnrichmentService


class WebScraper:
    def __init__(self, fetcher: DataFetcherInterface, parser: DataParserInterface,
                 saver: DataSaverInterface, enrichment_service: DataEnrichmentService):
        self.fetcher = fetcher
        self.parser = parser
        self.saver = saver
        self.enrichment_service = enrichment_service

    def fetch_data(self, url: str) -> ScrapedData:
        raw_data = self.fetcher.fetch(url)
        parsed_data = self.parser.parse(raw_data)
        scraped_data = ScrapedData(content=parsed_data, source_url=url)
        enriched_data = self.enrichment_service.enrich(scraped_data)
        self.saver.save(enriched_data)
        return enriched_data
This change completely shifts the business logic from the data layer to the domain layer, giving the system a clearer structure. The removal of the DataEnricherInterface and the introduction of the DataEnrichmentService are not just interface replacements, but fundamental changes in the way business logic is handled.
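In this version the saver receives a ScrapedData object rather than a plain dictionary, and the article does not show how sqlite_data_saver.py adapts to that. One possible shape is sketched below; it is not the author's implementation. The two-column table layout, the `DB_PATH` variable name, and the `CREATE TABLE IF NOT EXISTS` statement are all assumptions, and ScrapedData is trimmed to its fields to keep the block self-contained.

```python
import json
import os
import sqlite3
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ScrapedData:
    """Trimmed copy of the domain model, enough for this sketch."""
    content: Dict[str, Any]
    source_url: str


class SqliteDataSaver:
    """Hypothetical saver that persists a ScrapedData object."""

    def __init__(self, db_path: Optional[str] = None):
        # DB_PATH is an assumed variable name, not taken from the article.
        self.db_path = db_path or os.getenv('DB_PATH', 'test.db')

    def save(self, data: ScrapedData) -> None:
        conn = sqlite3.connect(self.db_path)
        try:
            # Assumed schema: keep the source URL next to the JSON payload.
            conn.execute('CREATE TABLE IF NOT EXISTS data '
                         '(source_url TEXT, json_data TEXT)')
            conn.execute('INSERT INTO data (source_url, json_data) VALUES (?, ?)',
                         (data.source_url, json.dumps(data.content)))
            conn.commit()
        finally:
            conn.close()
```

Storing `source_url` in its own column is one way to make use of the extra information the domain model now carries. The `save` signature in DataSaverInterface would need to change from `Dict[str, Any]` to `ScrapedData` to match.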
Summary
This article has shown, through a step-by-step refactoring of a data collection crawler, how to apply design principles and improve code quality. The main improvements are as follows.
- Separation of Responsibility: Applying the principle of single responsibility, we separated data acquisition, parsing, enrichment, and storage into separate classes.
- Introduction of interfaces and dependency injection: greatly increased the flexibility and scalability of the system, making it easier to switch to different implementations.
- Introduction of domain model and services: clearly separated the business logic and defined the core concepts of the system.
- Adoption of Layered Architecture: Clearly separated the domain, data, and application layers and defined the responsibilities of each layer.
- Maintaining interfaces: Maintained abstraction at the data layer to ensure flexibility in implementation.
These improvements have greatly enhanced the system's modularity, reusability, testability, maintainability, and scalability. In particular, by applying some concepts of domain-driven design, the business logic became clearer and the structure was more flexible to accommodate future changes in requirements. At the same time, by maintaining the interfaces, we ensured the flexibility to easily change and extend the data layer implementation.
It is important to note that this refactoring process is not a one-time event, but part of a continuous improvement process. Depending on the size and complexity of the project, it is important to adopt design principles and DDD concepts at the appropriate level and to make incremental improvements.
Finally, the approach presented in this article can be applied to a wide variety of software projects, not just data collection crawlers. We encourage you to use them as a reference as you work to improve code quality and design.