Full record of writing Python crawlers from scratch

PHP中文网

Jun 27, 2017 am 10:54 AM

python crawler

The first nine articles covered everything from the basics to writing crawlers in detail. This tenth article rounds things off by recording, step by step, how a complete crawler program is written. Please read it carefully.

First let’s talk about our school’s website:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html

To check your grades you need to log in, after which the score for each course is displayed. However, only the individual scores are shown, not the grade point, which is the credit-weighted average of the scores.

Calculating the grade point by hand is obviously tedious, so we can write a crawler in Python to solve the problem.
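To make the target concrete, here is the weighted average written out. The symbols are introduced only for illustration: c_i stands for the credits of course i and s_i for its score, over n courses with a numeric score.

```latex
\[
\text{grade point} = \frac{\sum_{i=1}^{n} c_i \, s_i}{\sum_{i=1}^{n} c_i}
\]
```

For example, two courses worth 4 and 2 credits with scores of 90 and 75 give (4*90 + 2*75) / (4 + 2) = 510 / 6 = 85.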

1. On the eve of the decisive battle

Let’s prepare a tool first: the HttpFox plug-in.

This is an HTTP protocol analysis plug-in that shows the timing and content of page requests and responses, as well as the cookies used by the browser.

In my case I simply installed it in Firefox, where it looks like this:

You can view the corresponding information very intuitively.

Click Start to begin capturing, Stop to pause capturing, and Clear to clear the captured content.

Before each use, click Stop and then Clear, so that what you see afterwards is only the data captured while accessing the current page.

2. Go deep behind enemy lines

Let’s go to Shandong University’s grade inquiry site and see what information is sent when logging in.

First go to the login page, open HttpFox, click Clear, and then click Start to begin capturing:

After entering your personal information, make sure HttpFox is still capturing. Then click OK to submit the information and log in.

You can see that HttpFox has now captured three requests:

Now click Stop, so that what has been captured is exactly the data exchanged while visiting this page. This is what lets us simulate the login when writing the crawler.

3. Pao Ding Jie Niu (dissecting the ox)

At first glance we captured three requests, two GETs and one POST, but we still don’t know what exactly they are or how to use them.

So, we need to check the captured content one by one.

Look at the POST request first:

Since it is a POST request, we can go straight to the PostData tab.

You can see that two fields are posted, stuid and pwd.

The Type column shows a redirect, so after the POST completes the browser jumps to the bks_login2.loginmessage page.

So this is the form data submitted when OK is clicked.
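As a rough illustration of what that form submission looks like on the wire, the sketch below encodes the two fields the way a browser would. It uses Python 3's standard library rather than the Python 2.7 modules used later in this article. The field names stuid and pwd are the ones used in the article's code; the values are placeholders.

```python
from urllib.parse import urlencode, parse_qs

# Field names follow the article's code; the values are placeholders.
form_fields = {"stuid": "your_student_id", "pwd": "your_password"}

# urlencode produces the application/x-www-form-urlencoded body that a
# browser sends when the form is submitted.
body = urlencode(form_fields)
print(body)

# parse_qs reverses the encoding, which is handy for checking a captured body.
decoded = parse_qs(body)
print(decoded)
```

Comparing this string with the PostData shown by HttpFox is a quick way to confirm that the crawler will send the same fields as the browser.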

Click the Cookies tab to see the cookie information:

Yes, an ACCOUNT cookie was received, and it will be destroyed automatically when the session ends.
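To see what "destroyed when the session ends" means in practice, here is a minimal Python 3 sketch that parses a Set-Cookie header. The header value is hypothetical (the real ACCOUNT value is issued by the server); the point is only that a cookie carrying neither Expires nor Max-Age is a session cookie.

```python
from http.cookies import SimpleCookie

# Hypothetical header value; the real ACCOUNT value comes from the server.
set_cookie_header = "ACCOUNT=abc123; path=/"

cookie = SimpleCookie()
cookie.load(set_cookie_header)

morsel = cookie["ACCOUNT"]
# A cookie without Expires or Max-Age is a session cookie: the browser
# discards it when the session ends.
is_session_cookie = morsel["expires"] == "" and morsel["max-age"] == ""

print(morsel.value)
print(morsel["path"])
print(is_session_cookie)
```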

So what did we receive after submitting?

Let’s look at the two GET requests that follow.

Start with the first one. Click the Content tab to view what was received. Satisfying, isn’t it? The HTML source code is fully exposed:

It seems this is just the HTML source of the page. Click the Cookies tab to view the cookie information:

Aha, so the HTML page was received only after the cookie was sent.
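The browser's behaviour here (store the cookie, then attach it to the next request to the same host) can be reproduced offline with Python 3's http.cookiejar. This is only a sketch: the cookie value is made up and placed in the jar by hand, whereas in a real session the server sets it during login.

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request

# Hypothetical cookie value; in the real flow the server sets it at login.
account_cookie = Cookie(
    version=0, name="ACCOUNT", value="abc123",
    port=None, port_specified=False,
    domain="jwxt.sdu.edu.cn", domain_specified=False, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

jar = CookieJar()
jar.set_cookie(account_cookie)

# A follow-up request to the same host, like the GET made after logging in.
request = Request("http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html")
jar.add_cookie_header(request)

cookie_header = request.get_header("Cookie")
print(cookie_header)
```

An opener built with HTTPCookieProcessor performs both steps automatically, which is why the crawler never has to copy the cookie by hand.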

Let’s take a look at the last captured request:

At a glance it is just a CSS file called style.css, which is of little use to us.

4. Calmly respond

Now that we know what data we sent to the server and what data we received, the basic process is as follows:

First we POST the student ID and password ---> the server returns a cookie ---> we send the cookie back to the server ---> the server returns the page. Then we fetch the grades page, use regular expressions to extract the scores and credits separately, and calculate the weighted average.

OK, it looks like a very simple exercise. Let’s try it out.

But before experimenting, one question remains unresolved: where is the POST data sent?

Look at the original page again:

The page is obviously built with HTML frames, which means the address in the address bar is not the address of the login form in the right-hand frame.

So how do we get the real address? Right-click and view the page source:

Yes, the frame with name="w_right" is the login page we want.

The original address of the website is:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html

So the real address of the login form page should be:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/xk_login.html
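The same lookup can be done programmatically. The following Python 3 sketch parses a frameset and resolves the src of the frame named w_right against the page address. The frameset markup is hypothetical: only name="w_right" and the two addresses come from this article, and FrameFinder is a helper made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameFinder(HTMLParser):
    """Collect the src attribute of every <frame>, keyed by frame name."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            attributes = dict(attrs)
            self.frames[attributes.get("name")] = attributes.get("src")


page_url = "http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html"
# Hypothetical markup standing in for the real page source.
page_source = """
<frameset cols="200,*">
  <frame name="w_left" src="left.html">
  <frame name="w_right" src="xk_login.html">
</frameset>
"""

finder = FrameFinder()
finder.feed(page_source)

# A relative src is resolved against the address of the framing page.
login_page_url = urljoin(page_url, finder.frames["w_right"])
print(login_page_url)
```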

Entering it confirms this:

It turns out to be the course selection system of Tsinghua University... My guess is that our school was too lazy to build its own page and simply borrowed this one, without even changing the title...

But this is still not the address we need, because the POST data is submitted to the URL given in the ACTION attribute of the form.

In other words, we need to check the source code to know where the POST data is sent:

Well, by the look of it, this is the address the POST data is submitted to.

Written out in full, the address should be as follows:

http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login

(Getting it is simple: just click the link in Firefox to see its full address.)
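If you would rather not click through the browser, the full address can also be derived from the ACTION attribute. A minimal Python 3 sketch follows; the form tag is hypothetical (the article only gives the final address), so the action value below is an assumption chosen to reproduce that address.

```python
import re
from urllib.parse import urljoin

login_page_url = "http://jwxt.sdu.edu.cn:7777/zhxt_bks/xk_login.html"
# Hypothetical form tag; the real action value may be written differently.
form_source = '<form method="post" action="/pls/wwwbks/bks_login2.login">'

# Pull the action attribute out of the opening form tag.
match = re.search(r'<form[^>]*\baction="([^"]+)"', form_source, re.I)

# An action starting with "/" replaces the whole path of the page address.
post_url = urljoin(login_page_url, match.group(1))
print(post_url)
```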

5. Try it out

The next task is to use Python to simulate sending the POST data and obtain the returned cookie.

For how to work with cookies, see this blog post:

http://www.jb51.net/article/57144.htm

We first prepare the POST data, then prepare a CookieJar to receive the cookie, and write the source code as follows:

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Usage: enter the student ID and password
#   Function: print the weighted average of the scores, i.e. the grade point
#---------------------------------------
import urllib
import urllib2
import cookielib

# The CookieJar stores received cookies; the opener sends them back automatically
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# Data to POST
postdata = urllib.urlencode({
    'stuid': '201100300428',
    'pwd': '921030'
})

# Build a custom request
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# Open the login URL
result = opener.open(req)

# Print the returned content
print result.read()


Now look at the result of running it:

OK, with that we can consider the simulated login successful.
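For readers on Python 3, where urllib2 and cookielib no longer exist, a rough equivalent of the login step might look like the sketch below. It is not the author's code: build_login_request is a helper name made up here, the credentials are placeholders, and the request is only constructed, not sent, so it can be inspected offline.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

LOGIN_URL = "http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login"


def build_login_request(student_id, password):
    """Return a POST request carrying the login form (no network access)."""
    # In Python 3 the request body must be bytes, so the form data is encoded.
    body = urlencode({"stuid": student_id, "pwd": password}).encode("ascii")
    return Request(LOGIN_URL, data=body)


# The opener keeps cookies in cookie_jar across requests.
cookie_jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))

login_request = build_login_request("your_student_id", "your_password")
print(login_request.get_method())
print(login_request.data)

# Sending it would be: response = opener.open(login_request)
```

Supplying data is what turns the request into a POST; without it, the same Request would be sent as a GET.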

6. Stealing the sky and swapping the sun

The next task is to use the crawler to obtain the student’s grades.

Let’s look at the source website again.

With HttpFox capturing, click the link to view the grades; the following requests are captured:

Click the first GET request and view its Content: it contains the grades.

To get the link to that page, right-click and inspect the element in the page source to see where the link leads (in Firefox you only need to right-click and choose "View this frame"):

This gives the link to the grades page:

http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre

7. Everything is ready

Now everything is ready, so let’s plug the link into the crawler and see whether we can fetch the grades page.

As HttpFox showed, the server only returns the grades if we send the cookie, so we use Python to send the cookie along with the request for the grades:

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Usage: enter the student ID and password
#   Function: print the weighted average of the scores, i.e. the grade point
#---------------------------------------
import urllib
import urllib2
import cookielib

# Initialise a CookieJar to handle the cookie information
cookie = cookielib.CookieJar()
# Create a new opener that uses our CookieJar
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

# Data to POST
postdata = urllib.urlencode({
    'stuid': '201100300428',
    'pwd': '921030'
})

# Build a custom request
req = urllib2.Request(
    url = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login',
    data = postdata
)

# Open the login URL
result = opener.open(req)
# Print the returned content
print result.read()

# Print the cookie values
for item in cookie:
    print 'Cookie: Name = ' + item.name
    print 'Cookie: Value = ' + item.value

# Open the grades URL; the opener sends the stored cookie automatically
result = opener.open('http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre')
# Print the returned content
print result.read()


Just press F5 to run it and take a look at the captured data:

Since that works without problems, the next step is to process the data a little with regular expressions and pull out the credits and the corresponding scores.

8. Get it at your fingertips

Such a large amount of HTML source code is obviously awkward to work with, so next we use regular expressions to extract the data we need.

For a tutorial on regular expressions, see this blog post:

http://www.jb51.net/article/57150.htm

Let’s take a look at the source code of the grades page:

With this structure, using regular expressions is easy.
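To show how the pattern used in the code below behaves, here is a small Python 3 sketch that runs it against a made-up table. The markup is an assumption (rows of seven p cells, with the credits in the fifth and the score in the seventh), chosen only because it is the shape the pattern expects; the real grades page may differ in detail.

```python
import re

# The pattern from the article: skip to the 5th <p> and capture its text,
# then skip to the 7th <p> and capture its text.
ROW_PATTERN = re.compile(
    r'<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>',
    re.S,
)

# Hypothetical rows; course names and numbers are invented for the example.
sample_page = """
<TR>
<td><p align="center">1</p></td>
<td><p align="center">0012345</p></td>
<td><p align="center">Calculus</p></td>
<td><p align="center">001</p></td>
<td><p align="center">4</p></td>
<td><p align="center">2013-01</p></td>
<td><p align="center">90</p></td>
</TR>
<TR>
<td><p align="center">2</p></td>
<td><p align="center">0054321</p></td>
<td><p align="center">PE</p></td>
<td><p align="center">002</p></td>
<td><p align="center">2</p></td>
<td><p align="center">2013-01</p></td>
<td><p align="center">Excellent</p></td>
</TR>
"""

# Each match is a (credits, score) tuple of strings.
rows = ROW_PATTERN.findall(sample_page)
print(rows)
```

The re.S flag matters here: without it, .*? would not cross the line breaks between cells and nothing would match.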

We will tidy up the code a little, and then use regular expressions to extract the data:

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Usage: enter the student ID and password
#   Function: print the weighted average of the scores, i.e. the grade point
#---------------------------------------
import urllib
import urllib2
import cookielib
import re

class SDU_Spider:
    # Declare the attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'   # login URL
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre' # grades URL
        self.cookieJar = cookielib.CookieJar()                                      # CookieJar that handles the cookie information
        self.postdata = urllib.urlencode({'stuid':'201100300428','pwd':'921030'})   # data to POST
        self.weights = []   # weights, i.e. the credits
        self.points = []    # points, i.e. the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # Open the connection and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # build a custom request
        result = self.opener.open(myRequest)            # open the login page to obtain the required cookie
        result = self.opener.open(self.resultUrl)       # open the grades page to obtain the grade data
        # Print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.print_data(self.weights)
        self.print_data(self.points)

    # Extract the credits and scores from the page source
    def deal_data(self, myPage):
        myItems = re.findall('<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>', myPage, re.S)
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # Print the extracted items one per line
    def print_data(self, items):
        for item in items:
            print item

# Run the crawler
mySpider = SDU_Spider()
mySpider.sdu_init()


My skill is limited, so the regular expression is a bit ugly. The result of running it is shown in the figure:

OK, what remains is just data processing.
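The remaining data processing is the weighted average itself. Here is a minimal Python 3 sketch of that step, assuming credits and scores arrive as lists of strings, and that scores which are not plain integers (not yet published, or given as a word such as "Excellent") are left out. The function name and the sample values are made up for illustration.

```python
def weighted_average(credits, scores):
    """Credit-weighted average over the courses that have a numeric score."""
    total_points = 0.0
    total_credits = 0.0
    for credit, score in zip(credits, scores):
        # Skip courses whose score is missing or a word such as "Excellent".
        if score.isdigit():
            total_points += float(score) * float(credit)
            total_credits += float(credit)
    if total_credits == 0:
        # No numeric scores at all: there is nothing to average.
        return None
    return total_points / total_credits


gpa = weighted_average(["4", "2", "3"], ["90", "Excellent", "80"])
print(gpa)
```

With these sample values the second course is skipped, so the result is (4*90 + 3*80) / (4 + 3) = 600 / 7, roughly 85.71.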

9. Return in triumph

The complete code is as follows. At this point, a complete crawler project is completed.

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Shandong University crawler
#   Version: 0.1
#   Author: why
#   Date: 2013-07-12
#   Language: Python 2.7
#   Usage: enter the student ID and password
#   Function: print the weighted average of the scores, i.e. the grade point
#---------------------------------------
import urllib
import urllib2
import cookielib
import re

class SDU_Spider:
    # Declare the attributes
    def __init__(self):
        self.loginUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login'   # login URL
        self.resultUrl = 'http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre' # grades URL
        self.cookieJar = cookielib.CookieJar()                                      # CookieJar that handles the cookie information
        self.postdata = urllib.urlencode({'stuid':'201100300428','pwd':'921030'})   # data to POST
        self.weights = []   # weights, i.e. the credits
        self.points = []    # points, i.e. the scores
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))

    def sdu_init(self):
        # Open the connection and obtain the cookie
        myRequest = urllib2.Request(url = self.loginUrl, data = self.postdata)  # build a custom request
        result = self.opener.open(myRequest)            # open the login page to obtain the required cookie
        result = self.opener.open(self.resultUrl)       # open the grades page to obtain the grade data
        # Print the returned content
        # print result.read()
        self.deal_data(result.read().decode('gbk'))
        self.calculate_date()

    # Extract the credits and scores from the page source
    def deal_data(self, myPage):
        myItems = re.findall('<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>', myPage, re.S)
        for item in myItems:
            self.weights.append(item[0].encode('gbk'))
            self.points.append(item[1].encode('gbk'))

    # Compute the grade point; scores that are not out yet, or that are
    # given as Excellent/Good rather than as a number, are skipped
    def calculate_date(self):
        point = 0.0
        weight = 0.0
        for i in range(len(self.points)):
            if self.points[i].isdigit():
                point += float(self.points[i]) * float(self.weights[i])
                weight += float(self.weights[i])
        # Avoid dividing by zero when no numeric score was found
        if weight == 0:
            print 'No numeric scores found'
        else:
            print point / weight

# Run the crawler
mySpider = SDU_Spider()
mySpider.sdu_init()


The above is a detailed record of how this crawler came into being. Feels a bit like magic, doesn’t it? Haha, just kidding. Friends who need it can use it as a reference and extend it freely.
