How to correctly extract a web page's title in Python

Question

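I fetch the page content with urllib and then use the re module to match the title with a regular expression, but some titles come out wrong. For example, the code below, which fetches Apple's page, prints App. Testing shows that the match result m is fine; the problem is in the strip() call...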

伊谢尔伦 · Answer

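There is a simple mistake here. HTML cannot be parsed reliably with regular expressions, because its grammar is more expressive than what a regular expression can describe; see here for the details.
For parsing HTML like this I recommend a third-party library, for example mechanize.
My code is as follows: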

import mechanize
import cookielib
if __name__=='__main__':
    br = mechanize.Browser()
    br.set_cookiejar(cookielib.LWPCookieJar()) # Cookie jar
    
    br.set_handle_equiv(True) # Browser Option
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] 
    br.open("http://apple.com")
    print br.title()

The output is Apple.
For detailed usage of mechanize, see here.

To install mechanize, just run easy_install mechanize.

PHP中文网 · Answer

The general approach is to use an HTML parser.

For example, with the lxml package:

from lxml import html
doc = html.parse('http://www.apple.com/')
title = doc.find('.//title').text
print title

Or with BeautifulSoup:

import urllib
from BeautifulSoup import BeautifulSoup
content = urllib.urlopen('http://www.apple.com/').read()
soup = BeautifulSoup(content)
print soup.find('title').string  # .string gives the text; the tag alone prints with its markup
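
If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library alone. This is only a minimal sketch: it is written for Python 3 (unlike the Python 2 snippets in this thread), the names TitleParser and extract_title are made up for illustration, and it works on an HTML string you have already downloaded rather than fetching anything itself.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()


def extract_title(html_text):
    """Return the page title, or None if the document has no <title>."""
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title


sample = "<html><head><TITLE>\n Apple \n</TITLE></head><body><p>hi</p></body></html>"
print(extract_title(sample))
```

The parser does not care about the case of the tag or about newlines inside the title, which are exactly the cases where a hand-written regex tends to go wrong.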

怪我咯 · Answer

re.findall(r"<title>(.*)</title>", "<title>Apple</title>")

Regular expressions have a grouping feature...

PHPz · Answer

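The key is to use () to capture the part you want as a group. Using .* will not necessarily match, because .* stands for any run of characters except newlines.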
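
The point about .* and newlines can be seen in a small sketch. It is written for Python 3, the html string is a made-up sample rather than a real page, and the re.S (DOTALL) flag and the non-greedy .*? are one possible way to handle a title that spans several lines, not the only one.

```python
import re

# Made-up sample: the title text sits on its own line.
html = "<html><head><title>\n  Apple\n</title></head></html>"

# Without re.S, "." does not match "\n", so the multi-line title is missed.
no_dotall = re.search(r"<title>(.*)</title>", html, re.I)

# With re.S, "." also matches newlines; "*?" keeps the match non-greedy.
with_dotall = re.search(r"<title>(.*?)</title>", html, re.I | re.S)

# group(1) is the captured group; strip() with no argument removes whitespace only.
title = with_dotall.group(1).strip() if with_dotall else ""

print(no_dotall)
print(title)
```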

黄舟 · Answer

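pattern = re.compile(r'(?<=<title>)[\w\W]*(?=</title>)')
pattern.search("<title>Apple</title>")  # .group() on the result gives 'Apple'

The important parts are the two expressions (?<=...) and (?=...), a lookbehind and a lookahead: they must match around the title but are not included in the result.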

ringa_lee · Answer

Here is the help text for strip:

`Help on method_descriptor:

strip(...)
S.strip([chars]) -> string or unicode

Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping`

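strip treats its argument as a set of characters, not as a substring. "title" contains the characters l and e, so the trailing "le" of "Apple" gets stripped as well.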
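
The question's code is not shown in full, so the following is only a guess at what happened, written for Python 3. It assumes the match was the whole `<title>Apple</title>` string and that strip() was called with the closing tag as its argument; remove_affixes is a hypothetical helper, not something from the original code.

```python
matched = "<title>Apple</title>"  # assumed regex match result

# strip() removes any leading/trailing characters found in the argument,
# so "l" and "e" at the end of "Apple" are removed along with the tags.
wrong = matched.strip("</title>")


def remove_affixes(text, prefix, suffix):
    """Remove an exact prefix and suffix if present (hypothetical helper)."""
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text


right = remove_affixes(matched, "<title>", "</title>")

print(wrong)
print(right)
```

Capturing the title with a group in the regex avoids the problem entirely, since there are no tags left to strip.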

PHP中文网 · Answer

If you do want to parse it with a regex, you can do it like this:

import re
import urllib
html = urllib.urlopen('http://apple.com').read()
m = re.search(r'<title>(.*)</title>', html, flags=re.I)
print 'Title: ', m and m.group(1) or ''

Or you can use pyquery:

# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq

d = pq(url='http://apple.com')
print 'Title: ', d('title').text()

阿神 · Answer

strip removes the matching characters from both the head and the tail, doesn't it?
