Web Scraping with Python深入HTML解析_html/css_WEB-ITnose
有人问米开朗基罗:"您是如何创造出《大卫》这样的巨作的?"他答道:"很简单,我去采石场,看见一块巨大的大理石 ,我要做的只是凿去那些不该有的大理石,大卫就诞生了。
同样我们在抓取网页的时候,需要去掉我们不需要的,提取出需要的信息,只不过技术相当复杂。这篇文章将介绍HTML解析技术
在上篇文章( Web Scraping with Python--第一个网页抓取实例)中,我们初步接触了BeutifulSoup库, 这里我们将通过属性来查找标签tags。
几乎所有的网站都包含CSS,对我们抓取网页很有利,CSS依赖于不同的HTML元素有不同的标记,比如:
来看一个网站-http://www.pythonscraping.com/pages/warandpeace.html,里面是一篇文章,口语是红色的字体,而讲话者是绿色的字体,选取其中一个源代码片段:
"Heavens! what a virulent attack!" replied the prince, not in the least disconcerted by this reception.
可以使用上一篇文章中使用的程序来创建一个BeautifulSoup对象来获取整个网页:
from urllib.requestimport urlopenfrom bs4import BeautifulSouphtml = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")bsObj = BeautifulSoup(html)
使用BeautifulSoup对象的findAll方法来提取出一个指定要求的列表
nameList = bsObj.findAll("span", {"class":"green"})for namein nameList: print(name.get_text())
将上面的代码证整理一下:
from urllib.requestimport urlopenfrom bs4import BeautifulSoup html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")bsObj = BeautifulSoup(html, "html.parser")nameList = bsObj.findAll("span", {"class": "green"})for namein nameList: print(name.get_text())
运行结果:
Anna
Pavlovna Scherer
Empress Marya
……
解释一下上面的代码:
bsObj.findAll(tagName, tagAttributes) 获取整个页面上的标签的列表,然后通过迭代列表,获取相应的标签的内容
find() 和 findAll()
这两个方法很相似,它们的声明如下:
findAll(tag, attributes, recursive, text, limit, keywords)find(tag, attributes, recursive, text, keywords)
tag参数就像之前见到的那样,你可以传递一个字符串或者一个字符串列表:.findAll({"h1","h2","h3","h4","h5","h6"})
attributes参数传递一个属性和tags相匹配的字典,例如:.findAll("span", {"class":"green", "class":"red"})
recursive参数用于设置是否设置递归
keywor参数允许你包含一个特别的属性,例如:
from urllib.requestimport urlopenfrom bs4import BeautifulSoup html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")bsObj = BeautifulSoup(html, "html.parser")allText = bsObj.findAll(id="text")#也可以换为:allText = bsObj.findAll("",{"id":"text"})print(allText[0].get_text())
如果你想查找子标签,可以使用children:
from urllib.requestimport urlopenfrom bs4import BeautifulSoup html = urlopen("http://www.pythonscraping.com/pages/page3.html")bsObj = BeautifulSoup(html, "html.parser")for childin bsObj.find("table", {"id": "giftList"}).children: print(child)
如果想去掉第一行的
from urllib.requestimport urlopenfrom bs4import BeautifulSoup html = urlopen("http://www.pythonscraping.com/pages/page3.html")bsObj = BeautifulSoup(html, "html.parser")for siblingin bsObj.find("table", {"id":"giftList"}).tr.next_siblings: print(sibling)
如果你想查找父标签,可以使用 previous_siblings:
from urllib.requestimport urlopenfrom bs4import BeautifulSoup html = urlopen("http://www.pythonscraping.com/pages/page3.html")bsObj = BeautifulSoup(html, "html.parser")print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
从下面的html结构一目了然
—
—
—
— “$15.00” (4)
— s
— (1)
正则表达式与 BeautifulSoup
python中的正则可以参照我的另一篇《 Python基础(9)--正则表达式》
注意到上面的实例网页中有如下结构:
假如有个需求是提取所有的img标签,按照之前的说法,可以考虑 findAll("img")来解决这个问题,但是现代网站有的隐藏img……等不确定因素,这时候才有正则表达式来解决:
from urllib.requestimport urlopenfrom bs4import BeautifulSoupimport re html = urlopen("http://www.pythonscraping.com/pages/page3.html")bsObj = BeautifulSoup(html, "html.parser")images = bsObj.findAll("img", {"src":re.compile("\.\.\/img\/gifts/img.*\.jpg")})for imagein images: print(image["src"])
运行结果如下:
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
作者:工学1号馆
出处: http://wuyudong.com/1842.html
本文版权归作者所有,欢迎转载,在文章页面明显位置给出原文链接,否则保留追究法律责任的权利.
如果觉得本文对您有帮助,可以对作者进行小额【赞助】

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

The article discusses the HTML <datalist> element, which enhances forms by providing autocomplete suggestions, improving user experience and reducing errors.Character count: 159

The article discusses using HTML5 form validation attributes like required, pattern, min, max, and length limits to validate user input directly in the browser.

The article discusses the HTML <progress> element, its purpose, styling, and differences from the <meter> element. The main focus is on using <progress> for task completion and <meter> for stati

The article discusses the <iframe> tag's purpose in embedding external content into webpages, its common uses, security risks, and alternatives like object tags and APIs.

Article discusses best practices for ensuring HTML5 cross-browser compatibility, focusing on feature detection, progressive enhancement, and testing methods.

The article discusses the HTML <meter> element, used for displaying scalar or fractional values within a range, and its common applications in web development. It differentiates <meter> from <progress> and ex

The article discusses the viewport meta tag, essential for responsive web design on mobile devices. It explains how proper use ensures optimal content scaling and user interaction, while misuse can lead to design and accessibility issues.

This article explains the HTML5 <time> element for semantic date/time representation. It emphasizes the importance of the datetime attribute for machine readability (ISO 8601 format) alongside human-readable text, boosting accessibilit
