
Example tutorial of implementing crawler with requests and lxml

Jun 20, 2017 pm 02:46 PM

# The requests module fetches the page.
# The html module from lxml builds a selector from the response body.
# from lxml import html
# import requests

response = requests.get(url).content

selector = html.fromstring(response)

hrefs = selector.xpath("/html/body//div[@class='feed-item _j_feed_item']/a/@href")
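
The selector pattern above can be tried offline before pointing it at a live page. The sketch below is only an illustration: the HTML fragment and its links are made up, and the only assumption is that the real page wraps each link in a div whose class attribute is exactly 'feed-item _j_feed_item'.

```python
from lxml import html

# Hypothetical fragment that mimics the structure the XPath expects.
SAMPLE = """
<html><body>
  <div class="feed-item _j_feed_item"><a href="/gonglve/ziyouxing/1.html">one</a></div>
  <div class="feed-item"><a href="/skip.html">not matched</a></div>
  <div class="feed-item _j_feed_item"><a href="/gonglve/ziyouxing/2.html">two</a></div>
</body></html>
"""

selector = html.fromstring(SAMPLE)
# @class='...' compares the whole attribute string, so the second div is skipped.
hrefs = selector.xpath("/html/body//div[@class='feed-item _j_feed_item']/a/@href")
print(hrefs)
```

Because the comparison is on the full attribute string, a div that carries only part of the class list is not selected.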

# Take url = 'https://www.mafengwo.cn/gonglve/ziyouxing/2033.html' as an example

# python 2.7
import requests
from lxml import html
import os


# Get the URLs of the sub-pages linked from the home page
def get_page_urls(url):
    response = requests.get(url).content
    # Build the selector with lxml's html module
    selector = html.fromstring(response)
    urls = []
    for i in selector.xpath("/html/body//div[@class='feed-item _j_feed_item']/a/@href"):
        urls.append(i)
    return urls

# get title from a child's html (div[@class='title'])
def get_page_a_title(url):
    '''url is ziyouxing's a@href'''
    response = requests.get(url).content
    selector = html.fromstring(response)
    # get xpath by chrome's tool  -->  /html/body//div[@class='title']/text()
    a_title = selector.xpath("/html/body//div[@class='title']/text()")
    return a_title

# Get a selector for the page (built with lxml's html module)
def get_selector(url):
    response = requests.get(url).content
    selector = html.fromstring(response)
    return selector

# Analyzing the page's HTML structure with Chrome's developer tools shows that the text we need is mainly in div[@class='l-topic'] and div[@class='p-section']

# Get the required text content
def get_page_content(selector):
    # /html/body/div[2]/div[2]/div[1]/div[@class='l-topic']/p/text()
    page_title = selector.xpath("//div[@class='l-topic']/p/text()")
    # /html/body/div[2]/div[2]/div[1]/div[2]/div[15]/div[@class='p-section']/text()
    page_content = selector.xpath("//div[@class='p-section']/text()")
    return page_title, page_content

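An XPath that ends in text() returns a list of string fragments rather than one string, so the results usually need cleaning before they are written to a file. The following is a minimal sketch on a made-up fragment; clean_join is a hypothetical helper, not part of the original crawler.

```python
from lxml import html

# Hypothetical fragment using the two class names discussed above.
SAMPLE = """
<html><body>
  <div class="l-topic"><p> A sample title </p></div>
  <div class="p-section">First paragraph.
  </div>
  <div class="p-section">   </div>
  <div class="p-section">Second paragraph.</div>
</body></html>
"""

def clean_join(fragments):
    """Strip each text fragment, drop blank ones, and join with newlines."""
    return "\n".join(s.strip() for s in fragments if s.strip())

selector = html.fromstring(SAMPLE)
page_title = selector.xpath("//div[@class='l-topic']/p/text()")
page_content = selector.xpath("//div[@class='p-section']/text()")

title_text = clean_join(page_title)
content_text = clean_join(page_content)
print(title_text)
print(content_text)
```

Joining the fragments first avoids trying to concatenate a list with a string when the result is assembled.
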
# Get the URLs of the images on the page
def get_image_urls(selector):
    imagesrcs = selector.xpath("//img[@class='_j_lazyload']/@src")
    return imagesrcs

# Get the title of an image
def get_image_title(selector, num):
    # num starts at 2
    url = "/html/body/div[2]/div[2]/div[1]/div[2]/div[" + str(num) + "]/span[@class='img-an']/text()"
    # xpath() returns a list, which is empty (never None) when nothing matches
    titles = [t.strip() for t in selector.xpath(url) if t.strip()]
    if titles:
        image_title = titles[0]
    else:
        image_title = "map" + str(num)  # make up a name if there is none
    return image_title

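One detail to keep in mind when writing a fallback like the one in get_image_title: xpath() returns an empty list, not None, when nothing matches, so the fallback has to test for emptiness. The sketch below shows this on a made-up fragment; first_text_or and the sample titles are hypothetical.

```python
from lxml import html

# Hypothetical fragment: the first div has a caption, the second does not.
SAMPLE = "<html><body><div><span class='img-an'>West Lake</span></div><div></div></body></html>"

def first_text_or(selector, xpath, default):
    """Return the first non-blank text matched by xpath, else default."""
    for text in selector.xpath(xpath):
        if text.strip():
            return text.strip()
    return default

selector = html.fromstring(SAMPLE)
found = first_text_or(selector, "/html/body/div[1]/span[@class='img-an']/text()", "map1")
missing = first_text_or(selector, "/html/body/div[2]/span[@class='img-an']/text()", "map2")
# The raw result for the second div is an empty list, which "is not None".
no_match = selector.xpath("/html/body/div[2]/span[@class='img-an']/text()")
print(found, missing, no_match)
```
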
# Download the images
def downloadimages(selector, number):
    '''number is used for counting'''
    urls = get_image_urls(selector)
    num = 2
    amount = len(urls)
    # create the result directory, not the image file path itself
    path = "/home/WorkSpace/tour/words/result" + str(number) + "/"
    if not os.path.exists(path):
        os.makedirs(path)
    for url in urls:
        image_title = get_image_title(selector, num)
        filename = path + image_title + ".jpg"
        print('downloading %s image %s' % (number, image_title))
        with open(filename, 'wb') as f:
            f.write(requests.get(url).content)
        num += 1
    print("%s images downloaded" % amount)

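When saving downloaded files, the directory has to exist before the file is opened, and os.makedirs should be given the directory rather than the full file path. The sketch below illustrates that order with the standard library only; save_bytes is a hypothetical helper, and a temporary folder stands in for the real result directory.

```python
import os
import tempfile

def save_bytes(directory, name, data):
    """Create directory if needed, then write data to directory/name."""
    if not os.path.exists(directory):
        os.makedirs(directory)  # create the folder, not the file path
    filename = os.path.join(directory, name)
    with open(filename, "wb") as f:
        f.write(data)
    return filename

base = tempfile.mkdtemp()  # temporary stand-in for the real result folder
saved = save_bytes(os.path.join(base, "result1"), "map2.jpg", b"fake image bytes")
print(saved)
```

In the crawler, the data argument would be the bytes returned by requests.get(url).content.
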

# Entry point: run the crawler and save the fetched data to files
if __name__ == '__main__':
    url = ''  # set the start URL here, e.g. the one given above
    urls = get_page_urls(url)
    # turn to get response from html
    number = 1
    for i in urls:
        selector = get_selector(i)
        # download images
        downloadimages(selector, number)
        # get text and write into a file
        page_title, page_content = get_page_content(selector)
        # xpath() returns lists of strings, so join them before writing
        result = '\n'.join(page_title) + '\n' + '\n'.join(page_content) + '\n\n'
        path = "/home/WorkSpace/tour/words/result" + str(number) + "/"
        if not os.path.exists(path):
            os.makedirs(path)
        filename = path + "num" + ".txt"
        with open(filename, 'wb') as f:
            f.write(result.encode('utf-8'))
        print(result)
        number += 1

That completes the crawler. Before crawling a page, analyze its HTML structure carefully. Some pages are generated by JavaScript; this one is relatively simple and involves no JavaScript processing. Future posts will cover that case.
