[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong University's grade point calculation as an example)-Python Tutorial-php.cn

Home

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong University's grade point calculation as an example)

黄舟

Jan 21, 2017 pm 02:42 PM

Let’s talk about our school’s website first:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html

To check scores, you need to log in and then display The results of each subject are displayed, but only the results are displayed without the grade points, which is the weighted average score.

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

Obviously calculating grade points manually is a very troublesome thing. So we can use Python to make a crawler to solve this problem.

1. On the eve of the decisive battle

Let’s prepare a tool first: HttpFox plug-in.

This is an http protocol analysis plug-in that analyzes page request and response time, content, and COOKIE used by the browser.

Take me as an example, just install it on Firefox, the effect is as shown:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

Okay View the corresponding information very intuitively.

Click start to start detection, click stop to pause detection, and click clear to clear the content.

Generally before use, click stop to pause, and then click clear to clear the screen to ensure that you see the data obtained by accessing the current page.

2. Go deep behind enemy lines

Let’s go to Shandong University’s score query website to see what information is sent when logging in.

First go to the login page, open httpfox, after clearing, click start to turn on the detection:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

After entering the personal information, make sure httpfox is turned on. Then click OK to submit the information and log in.

You can see at this time that httpfox has detected three pieces of information:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

At this time, click the stop button to ensure that what is captured is the feedback after visiting the page. Data so that we can simulate login when doing crawlers.

3. Pao Ding Jie Niu

At first glance, we got three data, two are GET and one is POST, but what are they? What and how it should be used, we still have no idea.

So, we need to check the captured content one by one.

Look at the POST information first:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

Since it is the POST information, we can just look at the PostData.

You can see that there are two POST data, studid and pwd.

And it can be seen from the Redirect to of Type that after the POST is completed, it jumps to the bks_login2.loginmessage page.

It can be seen that this data is the form data submitted after clicking OK.

Click on the cookie label to see the cookie information:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

Yes, an ACCOUNT cookie was received and will be automatically destroyed after the session ends.

So what information did you receive after submitting?

Let’s take a look at the next two GET data.

Look at the first one first. We click on the content tag to view the received content. Do you feel like eating it alive? -The HTML source code is undoubtedly exposed:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

It seems that this is just the html source code of the page. Click on the cookie to view the cookie-related information:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

Aha, it turns out that the content of the html page was received only after the cookie information was sent.

Let’s take a look at the last received message:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

After a rough look, it should be just a css file called style.css, which doesn’t mean much to us. big effect.

4. Calmly respond

Now that we know what data we sent to the server and what data we received, the basic process is as follows:

First, we POST the student ID and password--->Then return the cookie value

Then send the cookie to the server--->Return the page information.

Get the data from the grades page, use regular expressions to extract the grades and credits separately and calculate the weighted average.

OK, it looks like a very simple sample paper. Then let’s try it out.

But before the experiment, there is still an unresolved problem, which is where is the POST data sent?

Look at the original page again:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

It is obviously implemented using an html framework, that is to say, what we see in the address bar The address is not the address to submit the form on the right.

So how can I get the real address-. -Right click to view page source code:

Yes, that’s right, the one with name="w_right" is the login page we want.

The original address of the website is:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/zhxt_bks.html

So, the real form submission The address should be:

http://jwxt.sdu.edu.cn:7777/zhxt_bks/xk_login.html

After entering it, it is true:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

It’s actually the course selection system of Tsinghua University. . . My guess is that our school was too lazy to create a page, so we just borrowed it. . As a result, the title was not even changed. . .

But this page is still not the page we need, because the page our POST data is submitted to should be the page submitted in the ACTION of the form.

In other words, we need to check the source code to know where the POST data is sent:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

Well, visually this is the address where the POST data is submitted. .

Arrange it into the address bar. The complete address should be as follows:

http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login

(The method of obtaining it is very simple. Just click on the link directly in Firefox browser to see the address of the link)

5. A little test of my skills

The next task is to use python to simulate sending a POST data and get the returned cookie value.

Regarding the operation of cookies, you can read this blog post:

http://blog.csdn.net/wxg694175346/article/details/8925978

We first prepare a POST data, then prepare a cookie to receive, and then write the source code as follows:

# -*- coding: utf-8 -*-  
#---------------------------------------  
#   程序：山东大学爬虫  
#   版本：0.1  
#   作者：why  
#   日期：2013-07-12  
#   语言：Python 2.7  
#   操作：输入学号和密码  
#   功能：输出成绩的加权平均值也就是绩点  
#---------------------------------------  
  
import urllib    
import urllib2  
import cookielib  
  
cookie = cookielib.CookieJar()    
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))  
  
#需要POST的数据#  
postdata=urllib.urlencode({    
    &#39;stuid&#39;:&#39;201100300428&#39;,    
    &#39;pwd&#39;:&#39;921030&#39;    
})  
  
#自定义一个请求#  
req = urllib2.Request(    
    url = &#39;http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login&#39;,    
    data = postdata  
)  
  
#访问该链接#  
result = opener.open(req)  
  
#打印返回的内容#  
print result.read()

Copy after login

After this, look at the effect of the operation:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

ok, in this way, we will calculate that the simulated login is successful.

6. Change the situation

The next task is to use a crawler to obtain the students’ scores.

Let’s look at the source website again.

After opening HTTPFOX, click to view the results and find that the following data was captured:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

##Click on the first GET data and check the content. You can find that the Content is the content of the obtained results.

To get the page link, right-click to view the element from the page source code, and you can see the page that jumps after clicking the link (in Firefox, you only need to right-click , "View this frame", that's it):

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

## so you can get The link to view the results is as follows:

http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopere

7. Everything is ready

Now everything is ready, so just apply the link to the crawler and see if you can see the results page.

As you can see from httpfox, we have to send a cookie to return the score information, so we use python to simulate sending a cookie to request the score information:

# -*- coding: utf-8 -*-  
#---------------------------------------  
#   程序：山东大学爬虫  
#   版本：0.1  
#   作者：why  
#   日期：2013-07-12  
#   语言：Python 2.7  
#   操作：输入学号和密码  
#   功能：输出成绩的加权平均值也就是绩点  
#---------------------------------------  
  
import urllib    
import urllib2  
import cookielib  
  
#初始化一个CookieJar来处理Cookie的信息#  
cookie = cookielib.CookieJar()  
  
#创建一个新的opener来使用我们的CookieJar#  
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))  
  
#需要POST的数据#  
postdata=urllib.urlencode({    
    &#39;stuid&#39;:&#39;201100300428&#39;,    
    &#39;pwd&#39;:&#39;921030&#39;    
})  
  
#自定义一个请求#  
req = urllib2.Request(    
    url = &#39;http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login&#39;,    
    data = postdata  
)  
  
#访问该链接#  
result = opener.open(req)  
  
#打印返回的内容#  
print result.read()  
  
#打印cookie的值  
for item in cookie:    
    print &#39;Cookie：Name = &#39;+item.name    
    print &#39;Cookie：Value = &#39;+item.value  
  
      
#访问该链接#  
result = opener.open(&#39;http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre&#39;)  
  
#打印返回的内容#  
print result.read()

Copy after login

Press F5 to run and take a look at the captured data:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example) Since there is no problem, use regular expressions to convert the data Just process it a little and take out the credits and corresponding scores.

8. At your fingertips

Such a large amount of html source code is obviously not conducive to our processing. Next, we will use regular expressions to extract the necessary data.

For tutorials on regular expressions, you can read this blog post:

http://blog.csdn.net/wxg694175346/article/details/8929576

Let’s take a look at the source code of the score:

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

既然如此，用正则表达式就易如反掌了。

我们将代码稍稍整理一下，然后用正则来取出数据：

# -*- coding: utf-8 -*-  
#---------------------------------------  
#   程序：山东大学爬虫  
#   版本：0.1  
#   作者：why  
#   日期：2013-07-12  
#   语言：Python 2.7  
#   操作：输入学号和密码  
#   功能：输出成绩的加权平均值也就是绩点  
#---------------------------------------  
  
import urllib    
import urllib2  
import cookielib  
import re  
  
class SDU_Spider:    
    # 申明相关的属性    
    def __init__(self):      
        self.loginUrl = &#39;http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login&#39;   # 登录的url  
        self.resultUrl = &#39;http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre&#39; # 显示成绩的url  
        self.cookieJar = cookielib.CookieJar()                                      # 初始化一个CookieJar来处理Cookie的信息  
        self.postdata=urllib.urlencode({&#39;stuid&#39;:&#39;201100300428&#39;,&#39;pwd&#39;:&#39;921030&#39;})     # POST的数据  
        self.weights = []   #存储权重，也就是学分  
        self.points = []    #存储分数，也就是成绩  
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))  
  
    def sdu_init(self):  
        # 初始化链接并且获取cookie  
        myRequest = urllib2.Request(url = self.loginUrl,data = self.postdata)   # 自定义一个请求  
        result = self.opener.open(myRequest)            # 访问登录页面，获取到必须的cookie的值  
        result = self.opener.open(self.resultUrl)       # 访问成绩页面，获得成绩的数据  
        # 打印返回的内容  
        # print result.read()  
        self.deal_data(result.read().decode(&#39;gbk&#39;))  
        self.print_data(self.weights);  
        self.print_data(self.points);  
  
    # 将内容从页面代码中抠出来    
    def deal_data(self,myPage):    
        myItems = re.findall(&#39;<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>&#39;,myPage,re.S)     #获取到学分  
        for item in myItems:  
            self.weights.append(item[0].encode(&#39;gbk&#39;))  
            self.points.append(item[1].encode(&#39;gbk&#39;))  
  
              
    # 将内容从页面代码中抠出来  
    def print_data(self,items):    
        for item in items:    
            print item  
              
#调用    
mySpider = SDU_Spider()    
mySpider.sdu_init()

Copy after login

水平有限，，正则是有点丑，。运行的效果如图：

[Python] Web Crawler (10): The whole process of the birth of a crawler (taking Shandong Universitys grade point calculation as an example)

ok，接下来的只是数据的处理问题了。。

9.凯旋而归

完整的代码如下，至此一个完整的爬虫项目便完工了。

# -*- coding: utf-8 -*-  
#---------------------------------------  
#   程序：山东大学爬虫  
#   版本：0.1  
#   作者：why  
#   日期：2013-07-12  
#   语言：Python 2.7  
#   操作：输入学号和密码  
#   功能：输出成绩的加权平均值也就是绩点  
#---------------------------------------  
  
import urllib    
import urllib2  
import cookielib  
import re  
import string  
  
  
class SDU_Spider:    
    # 申明相关的属性    
    def __init__(self):      
        self.loginUrl = &#39;http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bks_login2.login&#39;   # 登录的url  
        self.resultUrl = &#39;http://jwxt.sdu.edu.cn:7777/pls/wwwbks/bkscjcx.curscopre&#39; # 显示成绩的url  
        self.cookieJar = cookielib.CookieJar()                                      # 初始化一个CookieJar来处理Cookie的信息  
        self.postdata=urllib.urlencode({&#39;stuid&#39;:&#39;201100300428&#39;,&#39;pwd&#39;:&#39;921030&#39;})     # POST的数据  
        self.weights = []   #存储权重，也就是学分  
        self.points = []    #存储分数，也就是成绩  
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookieJar))  
  
    def sdu_init(self):  
        # 初始化链接并且获取cookie  
        myRequest = urllib2.Request(url = self.loginUrl,data = self.postdata)   # 自定义一个请求  
        result = self.opener.open(myRequest)            # 访问登录页面，获取到必须的cookie的值  
        result = self.opener.open(self.resultUrl)       # 访问成绩页面，获得成绩的数据  
        # 打印返回的内容  
        # print result.read()  
        self.deal_data(result.read().decode(&#39;gbk&#39;))  
        self.calculate_date();  
  
    # 将内容从页面代码中抠出来    
    def deal_data(self,myPage):    
        myItems = re.findall(&#39;<TR>.*?<p.*?<p.*?<p.*?<p.*?<p.*?>(.*?)</p>.*?<p.*?<p.*?>(.*?)</p>.*?</TR>&#39;,myPage,re.S)     #获取到学分  
        for item in myItems:  
            self.weights.append(item[0].encode(&#39;gbk&#39;))  
            self.points.append(item[1].encode(&#39;gbk&#39;))  
  
    #计算绩点，如果成绩还没出来，或者成绩是优秀良好，就不运算该成绩  
    def calculate_date(self):  
        point = 0.0  
        weight = 0.0  
        for i in range(len(self.points)):  
            if(self.points[i].isdigit()):  
                point += string.atof(self.points[i])*string.atof(self.weights[i])  
                weight += string.atof(self.weights[i])  
        print point/weight  
  
              
#调用    
mySpider = SDU_Spider()    
mySpider.sdu_init()

Copy after login

以上就是 [Python]网络爬虫（十）：一个爬虫的诞生全过程（以山东大学绩点运算为例）的内容，更多相关内容请关注PHP中文网（www.php.cn）！

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Best Graphic Settings

4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

2 weeks ago By DDD

R.E.P.O. How to Fix Audio if You Can't Hear Anyone

4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Chat Commands and How to Use Them

4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7525

CakePHP Tutorial

1378

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

Related knowledge

Python: Games, GUIs, and More Apr 13, 2025 am 12:14 AM

Python excels in gaming and GUI development. 1) Game development uses Pygame, providing drawing, audio and other functions, which are suitable for creating 2D games. 2) GUI development can choose Tkinter or PyQt. Tkinter is simple and easy to use, PyQt has rich functions and is suitable for professional development.

PHP and Python: Comparing Two Popular Programming Languages Apr 14, 2025 am 12:13 AM

PHP and Python each have their own advantages, and choose according to project requirements. 1.PHP is suitable for web development, especially for rapid development and maintenance of websites. 2. Python is suitable for data science, machine learning and artificial intelligence, with concise syntax and suitable for beginners.

How debian readdir integrates with other tools Apr 13, 2025 am 09:42 AM

The readdir function in the Debian system is a system call used to read directory contents and is often used in C programming. This article will explain how to integrate readdir with other tools to enhance its functionality. Method 1: Combining C language program and pipeline First, write a C program to call the readdir function and output the result: #include#include#include#includeintmain(intargc,char*argv[]){DIR*dir;structdirent*entry;if(argc!=2){

Python and Time: Making the Most of Your Study Time Apr 14, 2025 am 12:02 AM

To maximize the efficiency of learning Python in a limited time, you can use Python's datetime, time, and schedule modules. 1. The datetime module is used to record and plan learning time. 2. The time module helps to set study and rest time. 3. The schedule module automatically arranges weekly learning tasks.

Nginx SSL Certificate Update Debian Tutorial Apr 13, 2025 am 07:21 AM

This article will guide you on how to update your NginxSSL certificate on your Debian system. Step 1: Install Certbot First, make sure your system has certbot and python3-certbot-nginx packages installed. If not installed, please execute the following command: sudoapt-getupdatesudoapt-getinstallcertbotpython3-certbot-nginx Step 2: Obtain and configure the certificate Use the certbot command to obtain the Let'sEncrypt certificate and configure Nginx: sudocertbot--nginx Follow the prompts to select

GitLab's plug-in development guide on Debian Apr 13, 2025 am 08:24 AM

Developing a GitLab plugin on Debian requires some specific steps and knowledge. Here is a basic guide to help you get started with this process. Installing GitLab First, you need to install GitLab on your Debian system. You can refer to the official installation manual of GitLab. Get API access token Before performing API integration, you need to get GitLab's API access token first. Open the GitLab dashboard, find the "AccessTokens" option in the user settings, and generate a new access token. Will be generated

How to configure HTTPS server in Debian OpenSSL Apr 13, 2025 am 11:03 AM

Configuring an HTTPS server on a Debian system involves several steps, including installing the necessary software, generating an SSL certificate, and configuring a web server (such as Apache or Nginx) to use an SSL certificate. Here is a basic guide, assuming you are using an ApacheWeb server. 1. Install the necessary software First, make sure your system is up to date and install Apache and OpenSSL: sudoaptupdatesudoaptupgradesudoaptinsta

What service is apache Apr 13, 2025 pm 12:06 PM

Apache is the hero behind the Internet. It is not only a web server, but also a powerful platform that supports huge traffic and provides dynamic content. It provides extremely high flexibility through a modular design, allowing for the expansion of various functions as needed. However, modularity also presents configuration and performance challenges that require careful management. Apache is suitable for server scenarios that require highly customizable and meet complex needs.

See all articles