


A brief introduction to the crawler framework (talonspider) in Python
This article gives a short introduction to talonspider, a lightweight crawler framework written in Python, and shows how to use it.
1.Why write this?
Simple pages don't call for a relatively heavyweight framework, but writing such crawlers entirely by hand is still tedious.
So I wrote talonspider to fill that gap:
•1. Item extraction for single pages
•2. A spider module for multi-page crawls
2. Introduction and usage
2.1. item
This module can be used on its own. For sites with relatively simple requests (for example, GET requests only), you can quickly write the crawler you want with just this module. The examples below use Python 3; Python 2 versions are in the examples directory.
2.1.1. Single page single target
For example, to grab the book information, cover and so on from http://book.qidian.com/info/1004608738, you can write:
from talonspider import Item, TextField, AttrField
from pprint import pprint


class TestSpider(Item):
    title = TextField(css_select='.book-info>h1>em')
    author = TextField(css_select='a.writer')
    cover = AttrField(css_select='a#bookImg>img', attr='src')

    def tal_title(self, title):
        return title

    def tal_cover(self, cover):
        return 'http:' + cover


if __name__ == '__main__':
    item_data = TestSpider.get_item(url='http://book.qidian.com/info/1004608738')
    pprint(item_data)
For details, see qidian_details_by_item.py
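To make the idea behind the fields concrete, here is a hypothetical, standard-library-only sketch of what an Item declaration boils down to: a selector picks a node, TextField takes its text, AttrField takes an attribute, and a tal_<field> hook post-processes the value. The sample HTML, the `extract` function and its names are illustrative assumptions, not talonspider's actual implementation (talonspider uses CSS selectors rather than ElementTree paths):

```python
import xml.etree.ElementTree as ET

# Well-formed sample snippet standing in for the book page (an assumption)
SAMPLE = """
<div class="book-info">
  <h1><em>Some Book Title</em></h1>
  <a class="writer">Some Author</a>
  <a id="bookImg"><img src="//example.com/cover.jpg" /></a>
</div>
"""

def extract(html):
    root = ET.fromstring(html)
    # Plays the role of TextField(css_select='.book-info>h1>em')
    title = root.find(".//h1/em").text
    # Plays the role of TextField(css_select='a.writer')
    author = root.find(".//a[@class='writer']").text
    # Plays the role of AttrField(css_select='a#bookImg>img', attr='src')
    cover = root.find(".//a[@id='bookImg']/img").get("src")
    # Plays the role of the tal_cover() hook: prepend the scheme
    cover = "http:" + cover
    return {"title": title, "author": author, "cover": cover}
```

The tal_cover hook exists because Qidian serves the cover as a protocol-relative URL (`//...`), so the scheme has to be added before the URL is usable on its own.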
2.1.2. Multiple targets on a single page
For example, the Douban Top 250 movies homepage displays 25 movies, i.e. 25 targets on one page. You can write:
from talonspider import Item, TextField, AttrField
from pprint import pprint


# Define a crawler class that inherits from Item
class DoubanSpider(Item):
    target_item = TextField(css_select='p.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='p.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq')

    def tal_title(self, title):
        if isinstance(title, str):
            return title
        else:
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])


if __name__ == '__main__':
    items_data = DoubanSpider.get_items(url='https://movie.douban.com/top250')
    result = []
    for item in items_data:
        result.append({
            'title': item.title,
            'cover': item.cover,
            'abstract': item.abstract,
        })
    pprint(result)
For details, see douban_page_by_item.py
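The key difference from the single-target case is target_item: one selector matches every repeated container on the page, and the other fields are then evaluated relative to each match. A hypothetical stdlib-only sketch of that iteration (the sample HTML and `get_items` helper are assumptions for illustration, not talonspider code):

```python
import xml.etree.ElementTree as ET

# Two repeated containers standing in for two movie entries (an assumption)
SAMPLE = """
<ol>
  <p class="item"><span class="title">Movie A</span><span class="inq">Quote A</span></p>
  <p class="item"><span class="title">Movie B</span><span class="inq">Quote B</span></p>
</ol>
"""

def get_items(html):
    root = ET.fromstring(html)
    results = []
    # Plays the role of target_item: one node per repeated container
    for node in root.findall(".//p[@class='item']"):
        # The remaining fields are resolved relative to each container
        results.append({
            "title": node.find("span[@class='title']").text,
            "abstract": node.find("span[@class='inq']").text,
        })
    return results
```

This is also why DoubanSpider above defines tal_title with a non-string branch: when a selector matches several nodes inside one container, the raw value is a list of elements rather than a single string.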
2.2. spider
When you need to crawl pages hierarchically, for example all of the Douban Top 250 movies rather than just the first page, the spider module comes in handy. Usage:
#!/usr/bin/env python
from talonspider import Spider, Item, TextField, AttrField, Request
from talonspider.utils import get_random_user_agent


# Define an item class that inherits from Item
class DoubanItem(Item):
    target_item = TextField(css_select='p.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='p.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq')

    def tal_title(self, title):
        if isinstance(title, str):
            return title
        else:
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])


class DoubanSpider(Spider):
    # Start URLs (required)
    start_urls = ['https://movie.douban.com/top250']
    # Request configuration
    request_config = {
        'RETRIES': 3,
        'DELAY': 0,
        'TIMEOUT': 20
    }

    # Parse function (required)
    def parse(self, html):
        # Turn the HTML into an etree
        etree = self.e_html(html)
        # Extract the pagination links and build the follow-up URLs
        pages = [i.get('href') for i in etree.cssselect('.paginator>a')]
        pages.insert(0, '?start=0&filter=')
        headers = {
            "User-Agent": get_random_user_agent()
        }
        for page in pages:
            url = self.start_urls[0] + page
            yield Request(url, request_config=self.request_config,
                          headers=headers, callback=self.parse_item)

    def parse_item(self, html):
        items_data = DoubanItem.get_items(html=html)
        for item in items_data:
            # Append each title to the output file
            with open('douban250.txt', 'a+') as f:
                f.writelines(item.title + '\n')


if __name__ == '__main__':
    DoubanSpider.start()
Console:
/Users/howie/anaconda3/envs/work3/bin/python /Users/howie/Documents/programming/python/git/talonspider/examples/douban_page_by_spider.py
2017-06-07 23:17:30,346 - talonspider - INFO: talonspider started
2017-06-07 23:17:30,693 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250
2017-06-07 23:17:31,074 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=25&filter=
2017-06-07 23:17:31,416 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=50&filter=
2017-06-07 23:17:31,853 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=75&filter=
2017-06-07 23:17:32,523 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=100&filter=
2017-06-07 23:17:33,032 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=125&filter=
2017-06-07 23:17:33,537 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=150&filter=
2017-06-07 23:17:33,990 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=175&filter=
2017-06-07 23:17:34,406 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=200&filter=
2017-06-07 23:17:34,787 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=225&filter=
2017-06-07 23:17:34,809 - talonspider - INFO: Time usage:0:00:04.462108

Process finished with exit code 0
At this point douban250.txt has been generated in the current directory; see douban_page_by_spider.py for details.
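The URL-building step inside parse() is plain Python and easy to check in isolation: the first page (`?start=0&filter=`) is prepended by hand because the paginator only links to the other pages. A small self-contained sketch (the `build_page_urls` helper is a name chosen for illustration, not part of talonspider):

```python
def build_page_urls(base, hrefs):
    """Mirror parse(): prepend the first page, then join each href to the base URL."""
    pages = list(hrefs)
    pages.insert(0, '?start=0&filter=')
    return [base + page for page in pages]

# The Douban paginator links to pages 2..10, i.e. start=25 .. start=225
hrefs = ['?start=%d&filter=' % n for n in range(25, 250, 25)]
urls = build_page_urls('https://movie.douban.com/top250', hrefs)
```

Each of the ten resulting URLs is then wrapped in a Request whose callback is parse_item, which is how the two-level crawl (index pages, then per-page items) fits together.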
3. Description
This is a learning project and there is still plenty of room for improvement; comments and suggestions are welcome. The project address is talonspider.

