Most of the Python tutorials on the Internet target version 2.X. Python 2.X differs quite a bit from 3.X, and many libraries are used differently. I have Python 3.X installed, so let's walk through the details with an example.
0x01
I had nothing to do during the Spring Festival (yes, I'm that free), so I wrote a simple program to scrape some jokes, and recorded the process of writing it along the way. The first time I came across crawlers was a post about scraping girls' photos from the Jandan site; it looked like a lot of fun, so I followed that example and wrote one of my own.
Technology inspires the future. As a programmer, how could you do such a thing? Better to scrape jokes, which are good for your physical and mental health.
0x02
Before we roll up our sleeves and get started, let's go over a bit of theory.
To put it simply, we need to pull down the content at a specific location on a web page. How? First we analyze the page to see which piece of content we need. For example, what we crawl this time are jokes from Pengfu (pengfu.com). Opening Pengfu's jokes page, we can see a lot of jokes, and our goal is to grab that content. Come back and calm down after reading them; if you keep laughing like that, we can't write code. In Chrome, open Inspect Element, then expand the HTML tags level by level, or use the little element-picker cursor to locate the element we need.
Finally, we find that the content inside <p class="content-img clearfix pt10 relative"> is the joke we need. Looking at the second joke confirms the same structure. So, we can find all the <p> tags with this class in the page, extract the content inside, and we're done.
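To make this concrete, here is a minimal, self-contained sketch of the idea. The markup below is a simplified reconstruction based on the class name used later in this post; the live page's HTML will differ:

from bs4 import BeautifulSoup

# Simplified stand-in for the real page -- the class name matches
# what the crawler below searches for, but the rest is made up
html = '''
<div class="list-item">
<p class="content-img clearfix pt10 relative">First joke text here.</p>
<p class="content-img clearfix pt10 relative">Second joke text here.</p>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
for tag in soup.find_all('p', {'class': 'content-img clearfix pt10 relative'}):
    print(tag.string)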
0x03
Okay, now that we know our goal, we can roll up our sleeves and get started. I use Python 3 here; the choice between Python 2 and 3 is up to you, and everything here can be done in either, with some differences. Python 3 is still the recommended choice.
We want to pull down the content we need, but first we have to pull down the web page itself. How? Here we use a library called urllib, whose methods let us fetch the entire page.
First, we import urllib. The code is as follows:

import urllib.request as request

Then, we can use request to get the web page. The code is as follows:

def getHTML(url):
    return request.urlopen(url).read()
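A quick sanity check of getHTML: note that in Python 3, urlopen(url).read() returns bytes, not str, so decode it before treating it as text (UTF-8 is my assumption about the page's encoding, not something the post states):

import urllib.request as request

def getHTML(url):
    return request.urlopen(url).read()

html = getHTML("http://www.pengfu.com/xiaohua_1.html")
print(len(html))  # size of the raw response in bytes
print(html[:100].decode('utf-8', errors='replace'))  # first characters of the page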
After downloading the web page, we have to parse it to extract the elements we need. For parsing we use another tool called Beautiful Soup, which can quickly parse HTML and XML and get the elements we want (it is imported with from bs4 import BeautifulSoup, as the complete code at the end shows).
The code is as follows:

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"))

Parsing a web page with BeautifulSoup is just one line, but when you run the code, a warning like this appears, prompting you to specify a parser; otherwise the code may behave differently or report an error on another platform or system:

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 64 of the file joke.py. To get rid of this warning, change code that looks like this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "lxml")

The types of parsers and the differences between them are explained in detail in the official documentation; at present, lxml is the more reliable choice. After the modification:
The code is as follows:

soup = BeautifulSoup(getHTML("http://www.pengfu.com/xiaohua_1.html"), 'lxml')

This way, the warning above no longer appears.
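As an aside, lxml is a third-party package (installed with pip install lxml); if it is not available, the standard library's built-in parser can be used instead. This fallback is my own suggestion, not part of the original post:

from bs4 import BeautifulSoup

# 'html.parser' ships with Python itself -- slower than lxml and
# it may handle malformed HTML slightly differently
soup = BeautifulSoup("<p class='content'>hello</p>", 'html.parser')
print(soup.p.string)  # prints: hello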
The code is as follows:

p_array = soup.find_all('p', {'class': "content-img clearfix pt10 relative"})

We use the find_all function to find all <p> tags with class="content-img clearfix pt10 relative", and then traverse this array.
The code is as follows:

for x in p_array:
    content = x.string
In this way, we get the content of each target <p>. At this point, we have achieved our goal and scraped our jokes.
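One caveat: x.string returns None when the tag has more than one child (for example, when the joke text contains a <br/> or a nested tag), which is why the complete script at the end wraps the access in a try/except. A small guard, my own addition, looks like this:

for x in p_array:
    content = x.string
    if content is None:
        # the tag has nested children; fall back to all of its text
        content = x.get_text()
    print(content.strip() + '\n')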
But when crawling Qiushibaike (qiushibaike.com) in the same way, an error like this is reported:
The error is as follows:

    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
It says the remote end closed the connection without responding. I checked the network and it was fine, so what is causing this? Am I doing something wrong?
Opening Charles to capture the traffic didn't reveal any obvious problem either. Strange: the site opens fine in a browser, so why can't Python fetch it? Is it a UA issue? Looking at Charles, I found that requests sent via urllib carry the default UA Python-urllib/3.5, while a request from Chrome carries User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36. Could the server be rejecting the Python crawler based on its UA? Let's disguise it and see whether that works.
The code is as follows:

def getHTML(url):
    # Pretend to be Chrome instead of Python-urllib
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    req = request.Request(url, headers=headers)
    return request.urlopen(req).read()
This way, Python is disguised as Chrome to fetch Qiushibaike's pages, and the data comes back smoothly.
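If you are curious where the default UA comes from, urllib attaches it at the opener level; this little check is my own illustration:

import urllib.request

# build_opener() returns an OpenerDirector whose default headers
# include ('User-agent', 'Python-urllib/3.x')
opener = urllib.request.build_opener()
print(opener.addheaders)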
At this point, scraping the jokes from Pengfu and Qiushibaike with Python is done. All we need to do is analyze the corresponding pages, find the elements we are interested in, and use Python's powerful features to reach our goal. Whether it's girl pictures or naughty jokes, it's all just a click away. That's all from me; I'm off to look for some pictures of girls. The complete code is as follows:
# -*- coding: utf-8 -*-
import urllib.request as request
from bs4 import BeautifulSoup

def getHTML(url):
    # Disguise the request as Chrome so the server does not reject urllib's default UA
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    req = request.Request(url, headers=headers)
    return request.urlopen(req).read()

def get_pengfu_results(url):
    soup = BeautifulSoup(getHTML(url), 'lxml')
    return soup.find_all('p', {'class': "content-img clearfix pt10 relative"})

def get_pengfu_joke():
    for page in range(1, 2):
        url = 'http://www.pengfu.com/xiaohua_%d.html' % page
        for item in get_pengfu_results(url):
            content = item.string
            try:
                print(content.lstrip() + '\n\n')
            except AttributeError:
                # content is None when the tag has nested children
                continue

def get_qiubai_results(url):
    soup = BeautifulSoup(getHTML(url), 'lxml')
    contents = soup.find_all('p', {'class': 'content'})
    results = []
    for item in contents:
        # join the text nodes of the joke with newlines
        results.append(item.find('span').get_text('\n', strip=True))
    return results

def get_qiubai_joke():
    for page in range(1, 2):
        url = 'http://www.qiushibaike.com/8hr/page/%d/?s=4952526' % page
        for joke in get_qiubai_results(url):
            print(joke + '\n\n')

if __name__ == '__main__':
    get_pengfu_joke()
    get_qiubai_joke()
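Assuming the script is saved as joke.py (the file name that showed up in the parser warning earlier), run it with:

python3 joke.py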