Example of data capture from a Sina News detail page
The previous article, "Python Crawler: Capturing Sina News Data", explained in detail how to crawl the data of a Sina News detail page, but the way that code was written does not lend itself to reuse: every time a new detail page is scraped, the code has to be rewritten. In this article we organize it into functions that can be called directly.
Six pieces of data are captured from the detail page: the news title, number of comments, time, source, body text, and editor in charge.
First, we wrap the comment-count logic in a function:
import requests
import json
import re

# Comment-count API URL template from the previous article; the news ID fills the {} placeholder
comments_url = '{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'

def getCommentsCount(newsURL):
    # Extract the news ID from the detail-page URL, e.g. doc-ifyfeius7904403.shtml -> fyfeius7904403
    ID = re.search('doc-i(.+).shtml', newsURL)
    newsID = ID.group(1)
    # Request the comment API and strip the JavaScript 'var data=' wrapper to get the JSON payload
    commentsURL = requests.get(comments_url.format(newsID))
    commentsTotal = json.loads(commentsURL.text.strip('var data='))
    return commentsTotal['result']['count']['total']

news = 'http://news.sina.com.cn/c/nd/2017-05-14/doc-ifyfeius7904403.shtml'
print(getCommentsCount(news))
The comments_url template at the top of the script comes from the previous article: we know that the comment link contains a news ID, and the comment counts of different news items are reached by changing that ID, so we turn the link into a format string and replace the news ID with the {} placeholder;
The function getCommentsCount then obtains the number of comments: it uses a regular expression to find the news ID, stores the response for the formatted comment link in the variable commentsURL, and decodes the JavaScript-wrapped JSON into commentsTotal to read the final comment count;
After that, for any new news link we only need to call getCommentsCount directly to get its number of comments.
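To make the two less obvious steps concrete, here is a minimal sketch of the ID substitution and of the JavaScript decoding, run in isolation. It assumes the comments_url template defined above; the link is the sample article used later in this post, and the response string is a fabricated stand-in for what the comment API actually returns.

import json
import re

sample_link = 'http://news.sina.com.cn/c/nd/2017-05-14/doc-ifyfeius7904403.shtml'

# 1) Pull the news ID out of the detail-page URL and drop it into the template
news_id = re.search('doc-i(.+).shtml', sample_link).group(1)
print(news_id)                           # fyfeius7904403
print(comments_url.format(news_id))      # the {} placeholder is replaced by the ID

# 2) The comment API answers with JavaScript of the form "var data={...}";
#    stripping that wrapper leaves plain JSON. This payload is a made-up example.
fake_response = 'var data={"result": {"count": {"total": 618}}}'
payload = json.loads(fake_response.strip('var data='))
print(payload['result']['count']['total'])   # 618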
Finally, we organize the capture of all six pieces of data into a function getNewsDetail, as follows:
from bs4 import BeautifulSoup
import requests
from datetime import datetime
import json
import re

# Comment-count API URL template from the previous article; the news ID fills the {} placeholder
comments_url = '{}&group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20'

def getCommentsCount(newsURL):
    ID = re.search('doc-i(.+).shtml', newsURL)
    newsID = ID.group(1)
    commentsURL = requests.get(comments_url.format(newsID))
    commentsTotal = json.loads(commentsURL.text.strip('var data='))
    return commentsTotal['result']['count']['total']

# news = 'http://news.sina.com.cn/c/nd/2017-05-14/doc-ifyfeius7904403.shtml'
# print(getCommentsCount(news))

def getNewsDetail(news_url):
    result = {}
    web_data = requests.get(news_url)
    web_data.encoding = 'utf-8'
    soup = BeautifulSoup(web_data.text, 'lxml')
    result['title'] = soup.select('#artibodyTitle')[0].text
    result['comments'] = getCommentsCount(news_url)
    # Publication time sits at the start of the .time-source element, e.g. 2017年05月14日07:22
    time = soup.select('.time-source')[0].contents[0].strip()
    result['dt'] = datetime.strptime(time, '%Y年%m月%d日%H:%M')
    result['source'] = soup.select('.time-source span span a')[0].text
    # Join the body paragraphs; the last <p> is the editor line, so it is dropped here
    result['article'] = ' '.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
    result['editor'] = soup.select('.article-editor')[0].text.lstrip('责任编辑:')
    return result

print(getNewsDetail('http://news.sina.com.cn/c/nd/2017-05-14/doc-ifyfeius7904403.shtml'))
Inside getNewsDetail, the six pieces of data are captured and stored in result:
result['title'] gets the news title;
result['comments'] gets the number of comments, by directly calling the getCommentsCount function defined at the beginning;
result['dt'] gets the publication time and result['source'] gets the source (see the parsing sketch after this list);
result['article'] gets the body text;
result['editor'] gets the editor in charge.
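Since the time and editor fields need a little cleanup before they are usable, here is a minimal parsing sketch using sample strings in the format the detail page delivers; the date matches the sample article below, while the Chinese editor string is an assumed example value.

from datetime import datetime

raw_time = '2017年05月14日07:22'      # time string as it appears on the page
raw_editor = '责任编辑:张迪'           # assumed example of the editor line

# The Chinese date/time string is parsed with literal 年/月/日 in the format
dt = datetime.strptime(raw_time, '%Y年%m月%d日%H:%M')
print(dt)                             # 2017-05-14 07:22:00

# lstrip removes a set of characters, which here peels off the "责任编辑:" label;
# note it would also strip a name that happened to start with one of those characters
editor = raw_editor.lstrip('责任编辑:')
print(editor)                         # 张迪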
Then pass the news link you want data from to getNewsDetail.
Part of the results:
{'title': 'The "instructor" who started teaching Wing Chun at the High School Affiliated to Zhejiang University is a third-generation disciple of Ip Man', 'comments': 618, 'dt': datetime.datetime(2017, 5, 14, 7, 22), 'source': 'China News Network', 'article': 'Original title: Zhejiang University Affiliated High School starts teaching Wing Chun, the "instructor" is Ip Man... Source: Qianjiang Evening News', 'editor': 'Zhang Di'}
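With the logic now packaged in functions, scraping more articles is just a matter of calling getNewsDetail in a loop. A minimal sketch, assuming a hypothetical list of Sina News detail-page links of the same doc-i....shtml form:

# news_links is a placeholder list; fill it with real detail-page URLs
news_links = [
    'http://news.sina.com.cn/c/nd/2017-05-14/doc-ifyfeius7904403.shtml',
    # add further detail-page links here
]

for link in news_links:
    detail = getNewsDetail(link)
    print(detail['title'], detail['comments'], detail['dt'], detail['source'])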