This plug-in makes it easy to inspect many kinds of page content, including HTML.
Open the Douban Top 250 movie ranking page. It lists 25 movies per page across 10 pages, and the page URLs follow this pattern:
http://movie.douban.com/top250?start=0
http://movie.douban.com/top250?start=25
http://movie.douban.com/top250?start=50
http://movie.douban.com/top250?start=75
...
and so on, so a single loop over the start values 0, 25, ..., 225 covers every page.
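The URL pattern above can be sketched as a short loop. This is a minimal illustration, not part of the final script; the names BASE_URL, PAGE_SIZE and PAGE_COUNT are chosen here for readability and do not appear in the original code.

```python
# Build the ten page URLs from the pattern start=0, 25, ..., 225.
# BASE_URL, PAGE_SIZE and PAGE_COUNT are illustrative names (assumptions).
BASE_URL = 'http://movie.douban.com/top250?start='
PAGE_SIZE = 25    # movies shown on each page
PAGE_COUNT = 10   # total number of pages

urls = [BASE_URL + str(page * PAGE_SIZE) for page in range(PAGE_COUNT)]

for url in urls:
    print(url)
```

Note that the offset is page * 25 with page counted from 0, so the first request asks for start=0 and the last for start=225.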
Right-click any Chinese movie title on the page and choose "Inspect Element" to view its HTML source:
You will find that the Chinese movie name sits inside a <span class="title"> element, and the English name sits inside another <span class="title"> element.
The regular expression <span class="title">(.*)</span> therefore matches both the Chinese and the English names. Only the Chinese name is wanted here, so the English name has to be filtered out.
The filtering can be done with the string method find(str, pos_start, pos_end), which returns -1 when the substring is absent. English-name entries contain characters that the Chinese names do not, namely ‘ ’ and ‘/’, so any match containing either one is dropped. See the code for details.
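To see what the regular expression captures, here is a small offline sketch. The HTML fragment is a hypothetical sample modeled on the structure described above, not a copy of the live page, and the pattern uses the non-greedy form (.*?) as a precaution in case several elements share one line.

```python
import re

# Hypothetical fragment (assumption): one Chinese title and one
# English title per movie, both inside <span class="title"> elements.
sample_html = """
<span class="title">肖申克的救赎</span>
<span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
<span class="title">霸王别姬</span>
"""

# Capture the text between the opening and closing span tags.
title_pattern = re.compile(r'<span class="title">(.*?)</span>')
names = title_pattern.findall(sample_html)

for name in names:
    print(name)
```

Both the Chinese and the English entries come back from findall, which is why a separate filtering step is needed.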
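The filter can be sketched on its own as follows. The helper name is_chinese_title and the sample list are hypothetical. Checking for the non-breaking space '\xa0' in addition to a plain space is an assumption, made because an HTML parser typically turns &nbsp; into that character.

```python
# Characters treated as markers of an English-name entry.
# ' ' and '/' come from the description above; '\xa0' is an added assumption.
ENGLISH_MARKERS = (' ', '/', '\xa0')


def is_chinese_title(name):
    """Return True when none of the marker characters occur in name."""
    # str.find returns -1 when the substring is not present.
    return all(name.find(marker) == -1 for marker in ENGLISH_MARKERS)


# Hypothetical matches, as they might look after parsing.
sample_names = ['肖申克的救赎', '\xa0/\xa0The Shawshank Redemption', '霸王别姬']
chinese_names = [name for name in sample_names if is_chinese_title(name)]

for name in chinese_names:
    print(name)
```

One limitation of this approach: a Chinese title that itself contains a space or a slash would also be dropped.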
3. Code implementation
The code here is relatively simple, so there is no need to define functions.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup

print('Fetching data from Douban Movie Top250......')

# Capture the text inside each <span class="title"> element.
title = re.compile(r'<span class="title">(.*?)</span>')

for page in range(10):
    # Pages start at 0, 25, ..., 225, so the offset is page * 25.
    url = 'https://movie.douban.com/top250?start=' + str(page * 25)
    print('---------------------------Crawling page ' + str(page + 1) + '......--------------------------------')
    try:
        html = requests.get(url)
        html.raise_for_status()  # raise on HTTP error status codes
        soup = BeautifulSoup(html.text, 'html.parser')
        soup = str(soup)  # the regular expression needs the page as a string
        names = re.findall(title, soup)
        for name in names:
            # Drop English names (they contain ' ' and '/').
            if name.find(' ') == -1 and name.find('/') == -1:
                print(name)  # Chinese title only
    except Exception as e:
        print(e)

print('Crawling finished!')