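Implementing a web crawler in Python with BeautifulSoup
I previously covered scraping web pages with phantomjs (www.jb51.net/article/55789.htm), which relied on selectors.
With BeautifulSoup (documentation: www.crummy.com/software/BeautifulSoup/bs4/doc/), a Python module, scraping page content is straightforward.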
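    # coding=utf-8
    # Python 2: urlencode and urlopen live directly in the urllib module
    import urllib
    from bs4 import BeautifulSoup

    url = 'http://www.baidu.com/s'
    values = {'wd': '网球'}  # search keyword ("tennis")
    encoded_param = urllib.urlencode(values)
    full_url = url + '?' + encoded_param
    response = urllib.urlopen(full_url)
    soup = BeautifulSoup(response)
    alinks = soup.find_all('a')  # every <a> tag on the results page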
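The code above fetches Baidu's search results for 网球 (tennis) and collects every link on the page.
BeautifulSoup ships with many useful built-in methods.
A few of the handier features:
Constructing a node element
The code is as follows: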
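The snippet above is written for Python 2. As a rough sketch, and assuming Python 3 with the standard-library "html.parser" backend, the same steps could look like the following. The helper names build_search_url and fetch_links are made up for this illustration and are not part of BeautifulSoup.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

from bs4 import BeautifulSoup


def build_search_url(base_url, params):
    """Return base_url with params appended as a percent-encoded query string."""
    return base_url + "?" + urlencode(params)


def fetch_links(full_url):
    """Download a page and return every <a> tag found in it (hypothetical helper)."""
    with urlopen(full_url) as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    return soup.find_all("a")


# Building the URL does not touch the network; fetch_links(full_url) would.
full_url = build_search_url("http://www.baidu.com/s", {"wd": "网球"})
print(full_url)  # http://www.baidu.com/s?wd=%E7%BD%91%E7%90%83
```

Calling fetch_links(full_url) performs the actual request, so whether it succeeds depends on the network and on how the site responds to scripted clients.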
    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
    tag = soup.b
    type(tag)
    # <class 'bs4.element.Tag'>
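Attributes are available through .attrs, which returns a dictionary. Because class can hold several values, its value is a list.
The code is as follows:
    tag.attrs
    # {u'class': [u'boldest']}
A single attribute can also be read directly with tag['class'].
Attributes can be modified freely as well: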
    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <b class="verybold" id="1">Extremely bold</b>

    del tag['class']
    del tag['id']
    tag
    # <b>Extremely bold</b>

    tag['class']
    # KeyError: 'class'
    print(tag.get('class'))
    # None
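Since indexing a missing attribute raises KeyError while .get() returns None, crawler code that walks arbitrary pages tends to read attributes with .get(). A minimal sketch, assuming Python 3 and the standard-library "html.parser" backend; the sample markup and the helper name link_targets are invented for this example.

```python
from bs4 import BeautifulSoup


def link_targets(html):
    """Return the href of every <a> tag that has one, skipping those without."""
    soup = BeautifulSoup(html, "html.parser")
    targets = []
    for a in soup.find_all("a"):
        href = a.get("href")  # None when the attribute is missing, no KeyError
        if href is not None:
            targets.append(href)
    return targets


sample = '<a href="/one">One</a><a>No target</a><a href="/two">Two</a>'
print(link_targets(sample))  # ['/one', '/two']
```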
You can also search and navigate the DOM freely, as in the following example.
1. Build a document
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p><b>The Dormouse's story</b></p>
    <p>Once upon a time there were three little sisters; and their names were
    <a>Elsie</a>,
    <a>Lacie</a> and
    <a>Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p>...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc)
2. Try out the various accessors
    soup.head
    # <head><title>The Dormouse's story</title></head>
    soup.title
    # <title>The Dormouse's story</title>
    soup.body.b
    # <b>The Dormouse's story</b>
    soup.a
    # <a>Elsie</a>
    soup.find_all('a')
    # [<a>Elsie</a>,
    #  <a>Lacie</a>,
    #  <a>Tillie</a>]

    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>
    head_tag.contents
    # [<title>The Dormouse's story</title>]
    title_tag = head_tag.contents[0]
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.contents
    # [u"The Dormouse's story"]
    len(soup.contents)
    # 1
    soup.contents[0].name
    # u'html'

    text = title_tag.contents[0]
    text.contents
    # AttributeError: a string node has no .contents
    for child in title_tag.children:
        print(child)
    # The Dormouse's story

    head_tag.contents
    # [<title>The Dormouse's story</title>]
    for child in head_tag.descendants:
        print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

    len(list(soup.children))
    # 1
    len(list(soup.descendants))
    # 25
    title_tag.string
    # u"The Dormouse's story"
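The difference between .contents, .descendants and .string is easier to see on a very small document. The following is only a sketch, assuming Python 3 and the standard-library "html.parser" backend; the markup is invented for this example and is not the document used above.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p><b>Bold</b> text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p

# .contents lists direct children only: the <b> tag and the trailing text node.
direct_children = len(p.contents)

# .descendants walks the whole subtree, so the string inside <b> is counted too.
all_descendants = len(list(p.descendants))

# .string works when a tag has exactly one child string ...
title_text = soup.title.string

# ... and is None when the tag has several children, as <p> does here.
p_text = p.string

print(direct_children, all_descendants, title_text, p_text)
```

When .string comes back as None, p.get_text() is the usual way to get the concatenated text of all descendants.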
