There are many languages you can write a web crawler in, such as Node.js, Go, or even PHP. I chose Python because there are plenty of tutorials and you can learn it systematically; just knowing how to use an HTML selector to scrape pages is not enough. I also wanted to learn about the common pitfalls of the crawling process and a few precautions, such as small tips like modifying the browser headers.
The code comments are quite detailed, so in practice you can just read the source code directly.
The purpose of this crawler is very simple: it scrapes the property name, the price, and one downloaded picture (simply to test the file-download function) from a real estate website, for later analysis of housing price trends. In order not to put too much stress on the other site's server, I only chose to crawl 3 pages.
Let me go over a few points that need attention:
# Remember to modify the request headers you send
By default, the headers Python sends identify the request as coming from Python, which makes it easy for the target website to detect your crawler and block your IP, so it is best to make your crawler program look more like a human. Keep in mind that this code only provides basic concealment: if a site really wants to block crawlers, this won't fool it. Here is the code:
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"},
# For the HTML selector, I use pyquery instead of BeautifulSoup
Many books recommend BeautifulSoup, but as someone used to jQuery, I find BeautifulSoup's syntax a bit awkward, and it doesn't seem to support more advanced selectors such as :first-child. Its CSS selector mode may support them, but I couldn't find it and didn't read the documentation very carefully.
Then I searched online and found that many people recommend the pyquery library. I tried it myself, found it really comfortable, and adopted it without hesitation.
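To give you a taste of why it feels so familiar, here is a tiny sketch of pyquery's jQuery-like syntax, using markup modelled on the listing page I crawl later (treat the class names and values as placeholders):

from pyquery import PyQuery as pq

html = '<div class="item-mod"><a class="items-name" href="/loupan/123/">Some Estate</a></div>'
d = pq(html)

# jQuery-style chaining and CSS selectors, including pseudo-classes such as :first-child
print(d('.item-mod .items-name').attr('href'))  # /loupan/123/
print(d('.item-mod a:first-child').text())      # Some Estate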
# Crawler Idea
The idea is actually very simple:
1. Find the listing page of a property site and work out the URL structure of the second and third pages;
2. Grab the URL of every listing entry on each list page and save them in a Python set(); the set is used to remove duplicate URLs.
3. Visit each detail page via the collected URLs and scrape the valuable fields, such as images and text.
4. For now I simply print the data and do not save it locally as JSON or CSV; that is still to be done (a rough sketch of what saving could look like follows this list).
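Since step 4 is still a TODO, here is a minimal sketch of how the results could be saved to CSV with the standard csv module; the field names and values are just placeholders:

import csv

# Hypothetical structure: one dict per crawled detail page.
records = [
    {'title': 'Some Estate', 'price': '45000', 'image': 'images/pic1.jpg'},
]

with open('houses.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'image'])
    writer.writeheader()
    writer.writerows(records)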
The following is the full code:
# Page-fetching utilities
from urllib.request import urlopen
from urllib.request import urlretrieve
from pyquery import PyQuery as pq
# requests is used so we can send modified headers and look more human
import requests
import time
# OS module for handling file paths
import os
# Your own config file: rename config-sample.py to config.py and fill in the values
import config

# Set of collected links, used to avoid duplicate URLs
pages = set()
session = requests.Session()
baseUrl = 'http://pic1.ajkimg.com'
downLoadDir = 'images'

# Build the URLs of all list pages (3 pages, as mentioned above)
def getAllPages():
    pageList = []
    i = 1
    while i <= 3:
        newLink = 'http://sh.fang.anjuke.com/loupan/all/p' + str(i) + '/'
        pageList.append(newLink)
        i = i + 1
    return pageList

def getAbsoluteURL(baseUrl, source):
    if source.startswith("http://www."):
        url = "http://" + source[11:]
    elif source.startswith("http://"):
        url = source
    elif source.startswith("www."):
        url = "http://" + source[4:]
    else:
        url = baseUrl + "/" + source
    if baseUrl not in url:
        return None
    return url

# Adjust the paths inside this function to your own setup to make later data import easier
def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
    path = absoluteUrl.replace("www.", "")
    path = path.replace(baseUrl, "")
    path = downloadDirectory + path
    directory = os.path.dirname(path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    return path

# Collect all listing links on the current list page
def getItemLinks(url):
    global pages
    # First check whether the page can be fetched at all
    try:
        req = session.get(url, headers=config.value['headers'])
    # This only catches 404/500-style errors; a DNS failure cannot be detected here
    except IOError as e:
        print('cannot reach the page.')
        print(e)
    else:
        h = pq(req.text)
        # All house blocks on this list page
        houseItems = h('.item-mod')
        # Extract what we need from each block, e.g. the detail-page URL, price, thumbnail.
        # I prefer to only grab the detail-page URL here and collect the rest on the detail page.
        for houseItem in houseItems.items():
            houseUrl = houseItem.find('.items-name').attr('href')
            # print(houseUrl)
            pages.add(houseUrl)

# Extract the fields of a detail page; edit this part to suit your own needs
def getItemDetails(url):
    # First check whether the page can be fetched at all
    try:
        req = session.get(url, headers=config.value['headers'])
    # This only catches 404/500-style errors; a DNS failure cannot be detected here
    except IOError as e:
        print('cannot reach the page.')
        print(e)
    else:
        time.sleep(1)
        h = pq(req.text)
        # get title
        houseTitle = h('h1').text() if h('h1') != None else 'none'
        # get price
        housePrice = h('.sp-price').text() if h('.sp-price') != None else 'none'
        # get image url
        houseImage = h('.con a:first-child img').attr('src')
        houseImageUrl = getAbsoluteURL(baseUrl, houseImage) if houseImage else None
        if houseImageUrl != None:
            urlretrieve(houseImageUrl, getDownloadPath(baseUrl, houseImageUrl, downLoadDir))

# start to run the code
allPages = getAllPages()
for i in allPages:
    getItemLinks(i)
# At this point, pages should be filled with plenty of URLs
for i in pages:
    getItemDetails(i)
# print(pages)
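One note on config.py: the code expects it to expose config.value['headers']. As far as the crawler is concerned, config-sample.py only needs to provide the headers dict shown earlier wrapped in a dict named value, roughly like this:

# config-sample.py: rename to config.py and adjust the values to your liking
value = {
    'headers': {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    },
}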