
Python method to crawl APP download link

Feb 24, 2017, 03:07 PM

Preparation

Python 2.7.11: Download Python

PyCharm: Download PyCharm

Python 2 and Python 3 are both being released in parallel at the moment; I use Python 2 as the environment here. PyCharm is a fairly efficient Python IDE, but it is paid software.

Basic ideas for implementation

First of all, our target website: an Android market (apk.hiapk.com).

Click [App] to reach the key page:

[Screenshot: the market's app listing page]

After jumping to the application page, we need to pay attention to three places, marked by the red boxes in the picture below:

[Screenshot: the application page with three areas boxed in red]

First note the URL in the address bar, then the free-download button, and then the page-turning options at the bottom. Clicking the "Free Download" button immediately downloads the corresponding APP, so our plan is to grab that click-to-download link and download the APP directly.

Writing a crawler

The first problem to solve: how do we get the download link mentioned above? Here I should briefly introduce how browsers display web pages. Put simply, a browser is a kind of parser: when it receives HTML and other code, it parses and renders it according to the corresponding rules, producing the page we see.

I am using Google Chrome here. Right-click on the page and click "Inspect" to see the original HTML code of the webpage:

[Screenshot: Chrome's Inspect panel showing the page's HTML]

Don't worry if the HTML looks dazzling. Chrome's Inspect panel has a handy little feature that helps us locate the HTML corresponding to any page control.

Location:

[Screenshot: the element-picker arrow in the Inspect panel]

As shown in the picture above, click the small arrow in the rectangular box above, click the corresponding position on the page, and the HTML code on the right will be automatically positioned and highlighted.

Next we locate the HTML code corresponding to the download button:

[Screenshot: the HTML code for the download button]

You can see that the code for the button contains a corresponding download link: [/appdown/com.tecent.mm]. Adding the prefix, the complete download link is http://apk.hiapk.com/appdown/com.tecent.mm
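Joining the prefix onto the relative path can be done by hand or with the standard library. A minimal sketch (the path is taken verbatim from the page above):

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base = "http://apk.hiapk.com"
relative = "/appdown/com.tecent.mm"

# urljoin takes care of the slash between host and path
full_url = urljoin(base, relative)
print(full_url)  # http://apk.hiapk.com/appdown/com.tecent.mm
```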

Getting the HTML of the entire page with Python is very simple: just call "requests.get(url)" with the corresponding URL.

[Screenshot: fetching the page with requests.get]

Next, when extracting the key information from the page, adopt the idea of "grab the big pieces first, then the small ones". You can see that there are 10 APPs on a page, corresponding to 10 li items in the HTML code:

[Screenshot: ten li items in the page's HTML]

Each li tag contains the corresponding APP's attributes (name, download link, and so on). So as a first step, we extract these 10 li tags:



def geteveryapp(self,source):
  everyapp = re.findall('(<li class="list_item".*?</li>)',source,re.S)
  return everyapp



Only simple regular-expression knowledge is needed here.
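To see what the pattern matches, here is a self-contained sketch run against a made-up two-item fragment (the HTML below is an assumption modeled on the market's real markup):

```python
import re

# Hypothetical fragment imitating the market's listing markup
source = '''
<li class="list_item"><a href="/appinfo/com.example.one">App One</a></li>
<li class="list_item"><a href="/appinfo/com.example.two">App Two</a></li>
'''

# re.S lets "." match newlines, so each li block is captured whole;
# the non-greedy .*? stops at the first closing </li>
everyapp = re.findall('(<li class="list_item".*?</li>)', source, re.S)
print(len(everyapp))  # 2
```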

Extract the download link in the li tag:



def getinfo(self,eachclass):
  info = {}
  str1 = str(re.search('<a href="(.*?)">', eachclass).group(0))
  app_url = re.search('"(.*?)"', str1).group(1)
  appdown_url = app_url.replace('appinfo', 'appdown')
  info['app_url'] = appdown_url
  print appdown_url
  return info

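The extraction steps can be traced on a single made-up li block (the markup is a hypothetical stand-in for one real list item):

```python
import re

# A made-up li block standing in for one real list item
eachclass = '<li class="list_item"><a href="/appinfo/com.example.one">App One</a></li>'

# Grab the whole <a href="..."> tag, then pull out the quoted URL
str1 = re.search('<a href="(.*?)">', eachclass).group(0)
app_url = re.search('"(.*?)"', str1).group(1)

# Swapping appinfo -> appdown turns the detail-page path into a download path
appdown_url = app_url.replace('appinfo', 'appdown')
print(appdown_url)  # /appdown/com.example.one
```

As a side note, calling `.group(1)` on the first search would return the URL in one step; the two-step search above just mirrors the article's code.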


The next difficulty is turning pages. After clicking the page-turning button at the bottom, we can see that the address bar changes as follows:

[Screenshot: the address bar on the first page]

[Screenshot: the address bar after turning the page]

Suddenly it all makes sense: we can turn pages by replacing the corresponding id value (the pi parameter) in the URL on each request.



def changepage(self,url,total_page):
  now_page = int(re.search(r'pi=(\d+)', url).group(1))
  page_group = []
  for i in range(now_page,total_page+1):
   link = re.sub(r'pi=\d+','pi=%s'%i,url)
   page_group.append(link)
  return page_group

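The page loop can be sketched on its own, using the listing URL from the full script below. Note that `re.sub` takes an optional count as its fourth positional argument, not flags, so no fourth argument is passed here:

```python
import re

url = 'http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1'

# Build one URL per page by rewriting the pi= query parameter
now_page = int(re.search(r'pi=(\d+)', url).group(1))
page_group = []
for i in range(now_page, 5 + 1):
    page_group.append(re.sub(r'pi=\d+', 'pi=%s' % i, url))

print(page_group[0])   # ...&pi=1
print(page_group[-1])  # ...&pi=5
```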


Crawler results

With the key parts covered, let's first look at the finished crawler in action:

[Screenshot: the crawler's console output]

The results saved in the TXT file look like this:

[Screenshot: download links saved in the TXT file]

Copy them straight into Thunder (Xunlei) for high-speed batch downloading.

Full code attached



#-*- coding:utf8 -*-
import requests
import re
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

class spider(object):
 def __init__(self):
  print u'Starting to crawl'

 def getsource(self,url):
  html = requests.get(url)
  return html.text

 def changepage(self,url,total_page):
  now_page = int(re.search(r'pi=(\d+)', url).group(1))
  page_group = []
  for i in range(now_page,total_page+1):
   link = re.sub(r'pi=\d+','pi=%s'%i,url)
   page_group.append(link)
  return page_group

 def geteveryapp(self,source):
  everyapp = re.findall('(<li class="list_item".*?</li>)',source,re.S)
  return everyapp

 def getinfo(self,eachclass):
  info = {}
  str1 = str(re.search('<a href="(.*?)">', eachclass).group(0))
  app_url = re.search('"(.*?)"', str1).group(1)
  appdown_url = app_url.replace('appinfo', 'appdown')
  info['app_url'] = appdown_url
  print appdown_url
  return info

 def saveinfo(self,classinfo):
  f = open('info.txt','a')
  str2 = "http://apk.hiapk.com"
  for each in classinfo:
   f.write(str2)
   f.writelines(each['app_url'] + '\n')
  f.close()

if __name__ == '__main__':

 appinfo = []
 url = 'http://apk.hiapk.com/apps/MediaAndVideo?sort=5&pi=1'
 appurl = spider()
 all_links = appurl.changepage(url, 5)
 for link in all_links:
  print u'Processing page ' + link
  html = appurl.getsource(link)
  every_app = appurl.geteveryapp(html)
  for each in every_app:
   info = appurl.getinfo(each)
   appinfo.append(info)
 appurl.saveinfo(appinfo)



Summary

The chosen target page has a relatively clear and simple structure, so this is a fairly basic crawler. Apologies that the code is a bit messy. That is all for this article; I hope it helps with your study or work, and feel free to leave a comment if you have questions.

