


Crawling Anjuke Second-Hand Housing Data with Python
This article walks through a Python crawler for second-hand housing data on the Anjuke website, with worked examples. I hope it serves as a useful reference.
Let's start writing the crawler. First, we need to analyze the structure of the website to be crawled. As a student in Henan, I will use the second-hand housing listings for Zhengzhou.
The listing page shows the properties one by one, and clicking a property opens a page with its details. So what are we going to do? Get all the second-hand housing listings in Zhengzhou and save them in a database. What is that good for? As a geographer, I find it somewhat useful, but I won't go into that here. Let's get started. I use the requests and BeautifulSoup modules under Python 3.6 to crawl the pages. First, the requests module makes the request:
    import requests

    # Request headers for the page
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
    }

    # URL to crawl
    url = 'https://zhengzhou.anjuke.com/sale/'
    response = requests.get(url, headers=header)
    print(response.text)
After running this, you get the HTML source of the page.
Inspecting it shows that each listing sits in an li tag with class="list-item", so we can extract those tags with the BeautifulSoup package:
    from bs4 import BeautifulSoup

    # Parse the page with BeautifulSoup and print each listing item
    soup = BeautifulSoup(response.text, 'html.parser')
    result_li = soup.find_all('li', {'class': 'list-item'})
    for i in result_li:
        print(i)
Printing the result shows that the amount of HTML to deal with has been narrowed down further. Next, we continue extracting:
    # Parse the page with BeautifulSoup to get the list of listing items
    soup = BeautifulSoup(response.text, 'html.parser')
    result_li = soup.find_all('li', {'class': 'list-item'})

    # Loop over the listing items
    for i in result_li:
        # BeautifulSoup() expects markup as a string, so convert the tag first
        page_url = str(i)
        soup = BeautifulSoup(page_url, 'html.parser')
        # find_all returns a list, so take the first element
        result_href = soup.find_all('a', {'class': 'houseListTitle'})[0]
        print(result_href.attrs['href'])
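One caveat about the `[0]` index used above: if a listing item contains no matching link, `find_all` returns an empty list and indexing it raises an IndexError. Below is a minimal sketch of a guard. The helper name `first_or_none` is hypothetical and not part of the original code, and plain Python lists stand in for `find_all` results so the snippet runs without network access.

```python
def first_or_none(results):
    """Return the first element of a find_all-style result list, or None if it is empty."""
    return results[0] if results else None


# Plain lists stand in for the lists returned by soup.find_all(...)
links = ["https://example.com/listing/1", "https://example.com/listing/2"]
print(first_or_none(links))  # the first link
print(first_or_none([]))     # None instead of an IndexError
```

BeautifulSoup's own `find` method behaves the same way: it returns the first match, or `None` when nothing matches, so the caller can skip the item instead of crashing.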
This way, the listing URLs are printed one by one.
Following the normal flow, we would now open each detail page and start parsing it. But how do we move on to the next page once the current one has been crawled? We first need to check whether the page has a next-page link.
The same kind of inspection shows that the next-page link is just as easy to find, so we can stick with the same recipe:
    # Look for the next-page link
    result_next_page = soup.find_all('a', {'class': 'aNxt'})
    if len(result_next_page) != 0:
        print(result_next_page[0].attrs['href'])
    else:
        print('No next page')
This works because when a next page exists, the page contains an a tag with this class, and when there is none, the element becomes an i tag instead. Now we can tidy things up and wrap the code above in a function:
    import requests
    from bs4 import BeautifulSoup

    # Request headers for the page
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
    }


    def get_page(url):
        response = requests.get(url, headers=header)

        # Parse the page with BeautifulSoup to get the list of listing items
        soup = BeautifulSoup(response.text, 'html.parser')
        result_li = soup.find_all('li', {'class': 'list-item'})

        # Look for the next-page link
        result_next_page = soup.find_all('a', {'class': 'aNxt'})
        if len(result_next_page) != 0:
            # Recurse into the next page
            get_page(result_next_page[0].attrs['href'])
        else:
            print('No next page')

        # Loop over the listing items
        for i in result_li:
            # BeautifulSoup() expects markup as a string, so convert the tag first
            page_url = str(i)
            soup = BeautifulSoup(page_url, 'html.parser')
            # find_all returns a list, so take the first element
            result_href = soup.find_all('a', {'class': 'houseListTitle'})[0]
            # Just print for now; the detail-page function will be called here once it exists
            print(result_href.attrs['href'])


    if __name__ == '__main__':
        # URL to crawl
        url = 'https://zhengzhou.anjuke.com/sale/'
        # Start crawling
        get_page(url)
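The recursive version adds one stack frame per results page, and Python limits recursion depth (1000 frames by default). As a hedged alternative, here is a minimal sketch that follows next-page links in a loop. The names `crawl_all_pages`, `fetch_links` and `FAKE_SITE` are assumptions made for illustration; in real use, `fetch_links` would wrap the requests and BeautifulSoup code above and return the listing URLs together with the next-page URL.

```python
def crawl_all_pages(start_url, fetch_links, max_pages=1000):
    """Follow next-page links in a loop instead of recursing.

    fetch_links(url) is a caller-supplied function (hypothetical here) that
    returns (listing_urls, next_page_url), with next_page_url set to None
    on the last page.
    """
    all_listings = []
    seen = set()  # pages already visited, to avoid looping forever
    url = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        listing_urls, next_url = fetch_links(url)
        all_listings.extend(listing_urls)
        url = next_url
    return all_listings


# Fake three-page site standing in for the real requests/BeautifulSoup code
FAKE_SITE = {
    "p1": (["a", "b"], "p2"),
    "p2": (["c"], "p3"),
    "p3": (["d"], None),
}

print(crawl_all_pages("p1", FAKE_SITE.get))
```

The `seen` set and the `max_pages` cap are safety nets in case a page links back to itself or the site has more pages than expected.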
Next, let's crawl the detail pages.
The power keeps getting cut off here at the university, which is a real pain, so I will attach the finished code first and add more explanation when I have free time:
    import requests
    from bs4 import BeautifulSoup

    # Request headers for the page
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
    }


    def get_page(url):
        response = requests.get(url, headers=header)

        # Parse the page with BeautifulSoup to get the list of listing items
        soup_idex = BeautifulSoup(response.text, 'html.parser')
        result_li = soup_idex.find_all('li', {'class': 'list-item'})

        # Loop over the listing items
        for i in result_li:
            # BeautifulSoup() expects markup as a string, so convert the tag first
            page_url = str(i)
            soup = BeautifulSoup(page_url, 'html.parser')
            # find_all returns a list, so take the first element
            result_href = soup.find_all('a', {'class': 'houseListTitle'})[0]
            # Crawl the detail page
            get_page_detail(result_href.attrs['href'])

        # Look for the next-page link
        result_next_page = soup_idex.find_all('a', {'class': 'aNxt'})
        if len(result_next_page) != 0:
            # Recurse into the next page
            get_page(result_next_page[0].attrs['href'])
        else:
            print('No next page')


    # Remove spaces, newlines and tabs from the string, then strip both ends
    def my_strip(s):
        return str(s).replace(" ", "").replace("\n", "").replace("\t", "").strip()


    # BeautifulSoup is used so often that it is wrapped here; of marginal value
    def my_Beautifulsoup(response):
        return BeautifulSoup(str(response), 'html.parser')


    # Crawl a detail page
    def get_page_detail(url):
        response = requests.get(url, headers=header)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            # Title and a whole pile of other fields
            result_title = soup.find_all('h3', {'class': 'long-title'})[0]
            result_price = soup.find_all('span', {'class': 'light info-tag'})[0]
            result_house_1 = soup.find_all('p', {'class': 'first-col detail-col'})
            result_house_2 = soup.find_all('p', {'class': 'second-col detail-col'})
            result_house_3 = soup.find_all('p', {'class': 'third-col detail-col'})
            soup_1 = my_Beautifulsoup(result_house_1)
            soup_2 = my_Beautifulsoup(result_house_2)
            soup_3 = my_Beautifulsoup(result_house_3)
            result_house_tar_1 = soup_1.find_all('dd')
            result_house_tar_2 = soup_2.find_all('dd')
            result_house_tar_3 = soup_3.find_all('dd')
            '''
            文博公寓,省实验中学,首付只需70万,大三房,诚心卖,价可谈 270万
            宇泰文博公寓 金水-花园路-文博东路4号 2010年 普通住宅
            3室2厅2卫 140平方米 南北 中层(共32层)
            精装修 19285元/m² 81.00万
            '''
            print(my_strip(result_title.text), my_strip(result_price.text))
            print(my_strip(result_house_tar_1[0].text),
                  my_strip(my_Beautifulsoup(result_house_tar_1[1]).find_all('p')[0].text),
                  my_strip(result_house_tar_1[2].text),
                  my_strip(result_house_tar_1[3].text))
            print(my_strip(result_house_tar_2[0].text),
                  my_strip(result_house_tar_2[1].text),
                  my_strip(result_house_tar_2[2].text),
                  my_strip(result_house_tar_2[3].text))
            print(my_strip(result_house_tar_3[0].text),
                  my_strip(result_house_tar_3[1].text),
                  my_strip(result_house_tar_3[2].text))


    if __name__ == '__main__':
        # URL to crawl
        url = 'https://zhengzhou.anjuke.com/sale/'
        # Start crawling
        get_page(url)
Since I wrote the code while writing this post, I made some changes to the get_page function: the recursive call for the next page now comes after the loop over the listings. I also added two helper functions, my_strip and my_Beautifulsoup, without introducing them.
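To make the `my_strip` helper a little less mysterious, here is a small standalone sketch of what it does. `my_strip` is copied from the code above; `strip_all_whitespace` is a hypothetical regex-based variant that is not part of the original crawler, and the sample strings are made up.

```python
import re


def my_strip(s):
    # Same behavior as the helper in the article: removes ASCII spaces,
    # newlines and tabs, then trims remaining whitespace at both ends.
    return str(s).replace(" ", "").replace("\n", "").replace("\t", "").strip()


def strip_all_whitespace(s):
    # Hypothetical alternative: drop every whitespace character anywhere in
    # the string, including "\r" and the full-width space "\u3000".
    return re.sub(r"\s+", "", str(s))


raw = "  3室 2厅\n\t2卫  "
print(my_strip(raw))
print(strip_all_whitespace("140\u3000平方米\r\n"))
```

The difference only matters if the scraped text contains whitespace other than plain spaces, tabs and newlines in the middle of a value; `my_strip` leaves those characters in place.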
The data is not written to MySQL yet, so I will follow up on that later. Thank you!
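Until the MySQL part is written, the storage step could look roughly like the sketch below. It uses the standard library's sqlite3 module as a stand-in for MySQL so that it runs without a database server. The table name, the column names and the `save_listings` helper are all assumptions made for illustration, not the author's schema; a MySQL driver would need its own connection call and placeholder style.

```python
import sqlite3


def save_listings(conn, listings):
    """Create the table if needed and insert (title, price, community) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "title TEXT, price TEXT, community TEXT)"
    )
    # Parameterized placeholders keep scraped text from being run as SQL
    conn.executemany(
        "INSERT INTO listings (title, price, community) VALUES (?, ?, ?)",
        listings,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
save_listings(conn, [("Sample three-bedroom listing", "270万", "宇泰文博公寓")])
print(conn.execute("SELECT title, price, community FROM listings").fetchall())
```

In the crawler, `get_page_detail` would collect the cleaned fields into a tuple and hand them to a function like this instead of printing them.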