Learning BeautifulSoup: crawling a page and parsing its HTML/CSS


WBOY

Jun 24, 2016 am 11:49 AM

Using Autohome (汽车之家) as an example, this script fetches a page and parses it.


# -*- coding: utf-8 -*-
# Python 2 script (urllib2 and BeautifulSoup 3)
import urllib2
from BeautifulSoup import BeautifulSoup as bs3
import json
import codecs
# Character detection, used to find the real encoding of a byte string
import chardet


# Save content to a file
def save_to_file(filename, content):
    f = open(filename, 'w+')
    assert(f)
    f.write(content)
    f.close()


def parse_json_data(content):
    print(chardet.detect(content[0]))

    name_list = ['keyLink', 'config', 'option', 'color', 'innerColor']
    print(json.dumps(content[0].decode('GB2312')))


def parse_content(content):
    # content is GB2312-encoded
    soup = bs3(content)

    # Find the <script> text that contains the marker variable
    key_text = 'var levelId'
    elem_lib = soup.find('script', text=lambda x: key_text in x)

    # str_script is UTF-8 encoded
    str_script = str(elem_lib.string)

    #print(chardet.detect(str_script))

    # The console uses cp936 (GBK); text in any other encoding cannot be printed
    strGBK = str_script.decode('utf-8').encode('gb2312')
    #print(strGBK)

    # Remove the HTML escape sequence
    strGBK = strGBK.replace('&nbsp;', '')

    d = strGBK.splitlines()
    list_data = []

    for i in d:
        if i.isspace():
            continue

        # Filter out variables that are not needed
        if len(i) < 100:
            continue

        # Extract the JSON data
        idx = i.find('{')
        if idx == -1:
            continue

        # Drop the trailing ';'
        k = i[idx:-1]
        list_data.append(k)

    parse_json_data(list_data)

    '''
    print('json.count=', len(list_data))
    for i in list_data:
        if len(i) > 200:
            print(i[0:200])
        else:
            print(i)

    parse_json_data(list_data)
    '''

    # exec cannot be used directly inside this function, but eval can
    '''
    strSentence = ''
    for i in d:
        if i.isspace():
            continue

        # Strip the leading 'var ': JavaScript needs it, Python does not
        j = i[4:]

        if 'null' in j:
            continue

        strSentence += j

    # The JSON assignment statements can be executed directly in Python,
    # much like dict assignments
    exec(strSentence)

    # Print the variables
    var_list = ['keyLink', 'config', 'option', 'color', 'innerColor']
    for i in var_list:
        exec('print %s' % (i,))
    '''


def crawler_4_autohome():
    autohome_url = 'http://car.autohome.com.cn/config/series/657.html'

    # utf-8
    content = urllib2.urlopen(url=autohome_url).read()
    #print(chardet.detect(content))
    parse_content(content)


if __name__ == '__main__':
    crawler_4_autohome()
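The JSON-extraction step inside parse_content can be tried on its own, without the network or BeautifulSoup. The following is a minimal Python 3 sketch, assuming the script text holds one `var name = {...};` assignment per line. The helper name `extract_json_vars` and the sample text are hypothetical and are not taken from the Autohome page; only the variable names `config` and `color` come from the script above. It uses `json.loads` to turn each payload into Python objects, rather than the exec approach in the commented-out code.

```python
import json


def extract_json_vars(script_text, min_length=0):
    """Return {variable name: parsed value} for lines like `var name = {...};`."""
    result = {}
    for line in script_text.splitlines():
        line = line.strip()
        # Skip short lines and anything that is not a `var` assignment
        if len(line) < min_length or not line.startswith('var '):
            continue
        start = line.find('{')
        if start == -1:
            continue
        # The variable name sits between 'var ' and the first '='
        name = line[len('var '):line.index('=')].strip()
        # Keep the object literal and drop the trailing ';'
        payload = line[start:].rstrip(';')
        try:
            result[name] = json.loads(payload)
        except ValueError:
            # Not valid JSON (for example, a JavaScript-only literal): skip it
            continue
    return result


# Hypothetical sample text, made up for illustration
sample = '''
var levelId = 3;
var config = {"result": {"seriesid": 657, "items": ["a", "b"]}};
var color = {"result": null};
'''

data = extract_json_vars(sample)
print(sorted(data))
```

Parsing with `json.loads` maps a JavaScript `null` to Python's `None`, so lines containing `null` do not have to be skipped the way the exec variant skips them.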
