


Crawling Ximalaya (Himalaya) radio with the Python libraries Beautiful Soup and pymongo
Beautiful Soup is a third-party Python library for extracting data from HTML and XML files. Working with the parser of your choice, it provides the usual ways of navigating, searching and modifying a document, and it can save hours of work. pymongo is the bridge between the MongoDB NoSQL database and the Python language: data is saved to MongoDB through pymongo. This article uses the two together to crawl Ximalaya radio data.
Beautiful Soup supports the HTML parser in the Python standard library and also supports some third-party parsers, one of which is lxml. This article uses lxml. For installation, see the related article on installing lxml for Python 3.6 and using etree.
This article also uses XPath to select the parts of the page we want. For an introduction to Beautiful Soup and XPath and how to use them, see the Beautiful Soup 4.4.0 documentation and an XPath introduction.
The Beautiful Soup and XPath knowledge involved here is not very deep. You can follow it with the official documentation at hand, and I have also added comments to the code.
I will not say much about pymongo. For details, see the related article on getting started with the pymongo module.
Sometimes a server needs to determine what kind of client is making a request. That client is the user agent, or UA for short; the browser we use to view web pages is one kind of UA. In HTTP, the User-Agent request header identifies the client's browser type and version, operating system, rendering engine and similar information. With this identifier a website can serve different versions of a page to give users a better experience, or gather statistics. Some websites also use the UA precisely to stop hackers, or idle people like us, from crawling their data.
For that reason, the code in this article first defines a pool of browser User-Agent strings and picks one at random for the request headers, which makes the later crawling easier.
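As a rough sketch of that idea, the helper below builds a fresh header dictionary with a randomly chosen User-Agent on every call. The name `build_headers` and the shortened two-entry pool `UA_POOL` are assumptions made for illustration; the program in this article instead chooses a User-Agent once, when its header dictionaries are defined.

```python
import random

# Shortened pool for illustration only; both strings come from this article's list.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def build_headers(referer=None):
    """Return request headers with a User-Agent drawn at random on each call."""
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
        "User-Agent": random.choice(UA_POOL),
    }
    # Only send a Referer when the caller supplies one.
    if referer is not None:
        headers["Referer"] = referer
    return headers


if __name__ == "__main__":
    print(build_headers()["User-Agent"])
```

A call such as `requests.get(url, headers=build_headers())` would then present a different browser identity from one request to the next, rather than one fixed for the whole run.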
Okay, let's clarify what data we want to crawl.
From each album entry on the listing pages we need the album link (href), the image's alt text (used as the title) and the image URL (src).
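The record kept for each album can be sketched as follows. To stay dependency-free, this sketch uses the standard library's `xml.etree.ElementTree` on a small well-formed sample rather than Beautiful Soup, so it is a stand-in for the real parsing step. The sample markup and its example.com URLs are assumptions; only the `albumfaceOutter` class name and the three field names are taken from the program in this article.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup for two album entries (placeholder URLs).
SAMPLE = """
<div>
  <div class="albumfaceOutter">
    <a href="http://example.com/album/1"><img alt="Album one" src="http://example.com/1.jpg"/></a>
  </div>
  <div class="albumfaceOutter">
    <a href="http://example.com/album/2"><img alt="Album two" src="http://example.com/2.jpg"/></a>
  </div>
</div>
"""


def extract_albums(markup):
    """Collect href, title (the image alt text) and img_url for each album entry."""
    root = ET.fromstring(markup)
    albums = []
    for item in root.findall(".//div[@class='albumfaceOutter']"):
        link = item.find("a")          # first <a> directly inside the entry
        image = item.find(".//img")    # first <img> anywhere inside the entry
        albums.append({
            "href": link.get("href"),
            "title": image.get("alt"),
            "img_url": image.get("src"),
        })
    return albums


if __name__ == "__main__":
    for album in extract_albums(SAMPLE):
        print(album)
```

Each dictionary has the same shape as the document written to the album collection, one per entry found on a listing page.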
Next, we follow each album link to its detail page. If an album's recordings are split across several pages, we use XPath to read the page count and visit the pages one by one. On each page we also read the sound_ids attribute of the album's sound module, which lists the IDs of the recordings on that page.
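The paging and sound_ids steps can be sketched without any network access. The sample markup below is an assumption about the page layout (a paging bar whose second-to-last link carries the final page number, and a `personal_body` element with a comma-separated `sound_ids` attribute), and the helper names `page_urls` and `track_urls` are hypothetical. The sketch uses the standard library's `xml.etree.ElementTree`, whose limited XPath support includes the `[last()-1]` position predicate, in place of lxml.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified detail-page markup.
PAGING_SAMPLE = """
<div>
  <div class="pagingBar_wrapper">
    <a data-page="1">1</a>
    <a data-page="2">2</a>
    <a data-page="3">3</a>
    <a data-page="2">next</a>
  </div>
  <div class="personal_body" sound_ids="101,102,103"></div>
</div>
"""


def page_urls(album_url, markup):
    """Return one URL per page of the album, or just album_url if there is no paging bar."""
    root = ET.fromstring(markup)
    links = root.findall(".//div[@class='pagingBar_wrapper']/a[last()-1]")
    if not links:
        return [album_url]
    last_page = int(links[0].get("data-page"))
    # range() excludes its upper bound, so add 1 to include the final page.
    return ["{}?page={}".format(album_url, n) for n in range(1, last_page + 1)]


def track_urls(markup):
    """Turn the comma-separated sound_ids attribute into track JSON URLs."""
    root = ET.fromstring(markup)
    body = root.find(".//div[@class='personal_body']")
    if body is None:
        return []
    ids = [s for s in body.get("sound_ids", "").split(",") if s]
    return ["http://www.ximalaya.com/tracks/{}.json".format(i) for i in ids]


if __name__ == "__main__":
    print(page_urls("http://www.ximalaya.com/15836959/album/303085", PAGING_SAMPLE))
    print(track_urls(PAGING_SAMPLE))
```

Keeping URL construction in pure functions like these makes the off-by-one risk in the page range easy to check before any request is sent.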
The program is as follows:
import json
import random

import pymongo
import requests
from bs4 import BeautifulSoup
from lxml import etree

clients = pymongo.MongoClient("localhost", 27017)
db = clients["XiMaLaYa"]
collection_1 = db["album"]   # album entries from the listing pages
collection_2 = db["detail"]  # per-track JSON from the detail pages

# Pool of browser User-Agent strings (entries repeated in the original list are removed).
UA_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
]

headers1 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Proxy-Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(UA_LIST)  # User-Agent identifies the client; chosen once here
}

headers2 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Proxy-Connection': 'keep-alive',
    'Referer': 'http://www.ximalaya.com/dq/all/2',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(UA_LIST)
}

# Beautiful Soup handles XML and HTML.
# The main idea: BeautifulSoup processes the HTML source fetched by requests,
# while lxml parses the HTML into a tree and XPath selects nodes from that tree.


def get_url():
    start_urls = ["http://www.ximalaya.com/dq/all/{}".format(num) for num in range(1, 85)]
    # start_urls = ["http://www.ximalaya.com/dq/all/1"]
    for start_url in start_urls:
        html = requests.get(start_url, headers=headers1).text
        soup = BeautifulSoup(html, "lxml")  # parse with lxml
        for item in soup.find_all(class_="albumfaceOutter"):  # find each album node
            content = {
                'href': item.a["href"],
                'title': item.img['alt'],
                'img_url': item.img['src']
            }
            collection_1.insert_one(content)  # Collection.insert() was removed in PyMongo 4
            # another(item.a["href"])  # uncomment to crawl each album's detail pages as well
    print('Write complete...')


# Open an album's page, e.g. http://www.ximalaya.com/15836959/album/303085,
# and handle recordings that are split across several pages.
def another(url):
    html = requests.get(url, headers=headers1).text
    # /  selects from the root node.
    # // selects matching nodes anywhere in the document, regardless of position.
    # xpath() returns a list.
    ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
    if len(ifanother):  # the album's recordings are split across several pages
        num = ifanother[0]  # page count, assuming the second-to-last link is the final page
        print('This channel spans ' + num + ' pages')
        for n in range(1, int(num) + 1):  # + 1 so that the last page is included
            url2 = url + '?page={}'.format(n)
            get_m4a(url2)
    else:
        get_m4a(url)


# Fetch the detailed data for the recordings listed on one page.
def get_m4a(url):
    html = requests.get(url, headers=headers2).text
    sound_ids = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')
    if not sound_ids:  # nothing to fetch on this page
        return
    numlist = sound_ids[0].split(',')
    for i in numlist:
        murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
        html = requests.get(murl, headers=headers1).text
        dic = json.loads(html)
        collection_2.insert_one(dic)


if __name__ == "__main__":
    get_url()
That covers crawling Ximalaya radio with Beautiful Soup and MongoDB. For more information, see the related articles on the PHP Chinese website.
