


BeautifulSoup learning: crawl a page and parse it
Using Autohome (汽车之家) as an example, fetch a page and parse it.
# -*- coding=utf-8 -*-
import urllib2
from BeautifulSoup import BeautifulSoup as bs3
import json
import codecs
# Character detection, used to find the real encoding of the data
import chardet


# Save content to a file
def save_to_file(filename, content):
    f = open(filename, 'w+')
    assert(f)
    f.write(content)
    f.close()


def parse_json_data(content):
    print(chardet.detect(content[0]))
    name_list = ['keyLink', 'config', 'option', 'color', 'innerColor']
    print(json.dumps(content[0].decode('GB2312')))


def parse_content(content):
    # content is GB2312-encoded
    soup = bs3(content)
    key_text = 'var levelId'
    elem_lib = soup.find('script', text=lambda(x): key_text in x)
    # str_script is UTF-8-encoded
    str_script = str(elem_lib.string)
    #print(chardet.detect(str_script))
    # The console uses cp936 (GBK); text in a different encoding cannot be printed
    strGBK = str_script.decode('utf-8').encode('gb2312')
    #print(strGBK)
    # Remove HTML escape characters
    strGBK = strGBK.replace('&nbsp;', '')
    d = strGBK.splitlines()
    list_data = []
    for i in d:
        if i.isspace():
            continue
        # Filter out the variables that are not needed
        if len(i) < 100:
            continue
        # Extract the JSON data
        idx = i.find('{')
        if idx == -1:
            continue
        # Drop the trailing ;
        k = i[idx:-1]
        list_data.append(k)
    parse_json_data(list_data)
    '''
    print('json.count=', len(list_data))
    for i in list_data:
        if len(i) > 200:
            print(i[0:200])
        else:
            print(i)
    parse_json_data(list_data)
    '''
    # exec cannot be used directly inside a function, but eval can
    '''
    strSentece = ''
    for i in d:
        if i.isspace():
            continue
        if 'null' in i:
            continue
        # Strip the var declaration: JavaScript needs it, Python does not
        j = i[4:]
        strSentece += j
    # The JSON assignment statements can be executed directly in Python,
    # much like dict assignments
    exec(strSentece)
    # Print the variable data
    var_list = ['keyLink', 'config', 'option', 'color', 'innerColor']
    for i in var_list:
        exec('print %s' % (i,))
    '''


def crawler_4_autohome():
    autohome_url = 'http://car.autohome.com.cn/config/series/657.html'
    # utf-8
    content = urllib2.urlopen(url=autohome_url).read()
    #print(chardet.detect(content))
    parse_content(content)


if __name__ == '__main__':
    crawler_4_autohome()
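The JSON-extraction step in parse_content can also be tried in isolation, without fetching the page. The following is a minimal sketch, written for Python 3 with only the standard library (the listing above targets Python 2). It assumes the script text has already been decoded to a string. SAMPLE_SCRIPT and extract_json_vars are hypothetical names introduced here for illustration, and the sample data is made up rather than copied from the Autohome page. Unlike the original, it goes one step further and parses each payload with json.loads.

```python
import json

# Hypothetical stand-in for the text of the inline <script> element.
# The real page's variables are much larger, which is why the original
# code skips lines shorter than 100 characters.
SAMPLE_SCRIPT = """
var levelId = 3;
var keyLink = {"result": {"items": [{"id": 1, "name": "engine"}]}};
var config = {"result": {"seriesid": 657}};
"""


def extract_json_vars(script_text):
    """Map each `var name = {...};` line to its parsed JSON object."""
    parsed = {}
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("var "):
            continue
        # Split "name = value" at the first "=".
        decl, sep, value = line[len("var "):].partition("=")
        value = value.strip().rstrip(";").rstrip()
        if not sep or not value.startswith("{"):
            continue  # not an object literal, e.g. `var levelId = 3;`
        try:
            parsed[decl.strip()] = json.loads(value)
        except ValueError:
            # The payload is a JavaScript literal that is not valid JSON.
            continue
    return parsed


data = extract_json_vars(SAMPLE_SCRIPT)
print(sorted(data))                          # ['config', 'keyLink']
print(data["config"]["result"]["seriesid"])  # 657
```

Parsing with json.loads avoids running page-supplied text through exec or eval, at the cost of skipping any assignment whose right-hand side is not strict JSON.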
