Python learns requests to crawl web images
免费学习推荐:python视频教程
前言
最近想做一个练练的小游戏给家里的小朋友玩儿,但是苦于选取素材,然后在一个巨佬的博客里找了灵感,就借用一下粉丝的头像试试爬取网页里的图片吧!(感谢各位啦!)
完成总目标:
爬取粉丝头像作为素材,完成一个连连看的小游戏
故本文分为两部分内容:
1、爬取素材部分;
2、利用素材完成连连看小游戏部分(链接)
实现目标:
通过爬虫实现对粉丝头像的爬取并顺序排列存储,作为游戏设计的素材,其中爬取的头像一部分是使用了CSDN的默认头像,存在重复情况,所以还需要去重以得到完整且不重复的图像集
一、准备
1、python环境
2、涉及到的python库需要 pip install 包名
安装
二、代码编写
1.爬取内容
(1)所需要的库
import requestsimport json
(2)得到请求地址
url = 'https://blog.csdn.net//phoenix/web/v1/fans/list?page=1&pageSize=40&blogUsername=weixin_45386875' #关注我的部分请求地址#url = 'https://blog.csdn.net//phoenix/web/v1/follow/list?page=1&pageSize=40&blogUsername=weixin_45386875' #我关注的部分请求地址
请求地址获取方法:
右击所要爬取部分页面,点击 审查元素,找到图中文件
注: 页面选到“TA的粉丝(13)”部分才能出现,如果点击Network什么也没有,则需要刷新页面就会出现页面内容了)
(3)带上请求头发送请求,做一些简单伪装
header = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36', 'Cookie' : 'uuid_tt_dd=10_30826311340-1612520858912-361156; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_30826311340-1612520858912-361156!5744*1*weixin_45386875!1788*1*PC_VC; UN=weixin_45386875; p_uid=U010000; ssxmod_itna=Qui=DKiI3hkbG=DXDnD+r8h9eD53ecxPPit5bP1ODlOaYxA5D8D6DQeGTbcW1AoWGATqFYKmEWiH5/gbO4FpjQGcxLbbYx0aDbqGkqnU40rDzuQGuD0KGRDD4GEDCHHDIxGUBuDeKDR8qDg7gQcCM=DbSdDKqDHR+4FbG4oD8PdS0p=C7Gt3AuQ3DmtSije3r424rQ+iPqWzPDA3DK6jiD==; ssxmod_itna2=Qui=DKiI3hkbG=DXDnD+r8h9eD53ecxPPit5bP1D66Ii40vah303pFcXW0D6QALW==0tWLGtzKPUA76xoU10vpqD6AoOqs1R=Db=3olozYp0wVxUS0r/GeZCqzVezFQc8dZon7efYcr=1nhNr6nWKcTqqaDQYcwYSA+hNaqem=WWuDuDQ/+1PGEsN=atvS7WDp07vFuFDherg0AP0KFw0ea6kcTtK2rh/fy=/De0n1FNk+ONYxCXr=QrdTj6gxCuNNWXvp1IDdl2Ckjc=N/cqV6SmHZIZIuOEqml=dHMroMFDvdMVr8afnyQ+sbGPCbt3xdD07tAdRD7uDQ0gT=Bh7OYblbtYQFDDLxD2tGDD===; UserName=weixin_45386875; UserInfo=9863b829527c49a3ba1622396deaa7d9; UserToken=9863b829527c49a3ba1622396deaa7d9; UserNick=ryc875327878; AU=01F; BT=1612846374580; Hm_up_6bcd52f51e9b3dce32bec4a3997715ac=%7B%22uid_%22%3A%7B%22value%22%3A%22weixin_45386875%22%2C%22scope%22%3A1%7D%2C%22islogin%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isonline%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isvip%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%7D; __gads=ID=94978f740e79c9e5-22c918ed05c600ea:T=1613266189:RT=1613266189:S=ALNI_Mbwb8ad5kdYjogF7yImerVAzKaJuQ; dc_session_id=10_1613272889543.735028; announcement-new=%7B%22isLogin%22%3Atrue%2C%22announcementUrl%22%3A%22https%3A%2F%2Fblog.csdn.net%2Fblogdevteam%2Farticle%2Fdetails%2F112280974%3Futm_source%3Dgonggao_0107%22%2C%22announcementCount%22%3A0%2C%22announcementExpire%22%3A3600000%7D; dc_sid=3784575ebe1e9d08a29b0e3fc3621328; c_first_ref=default; c_first_page=https%3A//www.csdn.net/; c_segment=4; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1613222907,1613266055,1613268241,1613273899; TY_SESSION_ID=82f0aa61-9b28-49b2-a854-b18414426735; c_pref=; c_ref=https%3A//www.csdn.net/; c_page_id=default; dc_tos=qoi2fq; log_Id_pv=925; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1613274327; log_Id_view=905; log_Id_click=658' }
获取请求头的方法:
(4)向网页发送请求
try: data = requests.get(url,headers = header).text data_dist = json.loads(data) except: print('爬取失败') exit ()
2.保存所需图片并去重
定义一些函数备用
#保存文件def save_imag(file_name,img_url): request.urlretrieve(url=img_url,filename='D:\\ryc\python_learning\\10_linkup\\fan_avatar\\'+file_name)
#删除一个文件夹下的所有所有文件def del_file(path): ls = os.listdir(path) for i in ls: c_path = os.path.join(path, i) if os.path.isdir(c_path):#如果是文件夹那么递归调用一下 del_file(c_path) else: #如果是一个文件那么直接删除 os.remove(c_path) print ('文件已经清空完成')
#图像去重def compare_images(pic1,pic2): image1 = Image.open(pic1) image2 = Image.open(pic2) histogram1 = image1.histogram() histogram2 = image2.histogram() differ = math.sqrt(reduce(operator.add, list(map(lambda a,b: (a-b)**2,histogram1, histogram2)))/len(histogram1)) print('differ:',differ) if differ == 0: return 'same' else: return 'diff'
#删除指定位置的图像def del_avatar(path): if os.path.exists(path): # 如果文件存在 os.remove(path) else: print('no such file:%s'%(path)) # 则返回文件不存在
#先清空一下文件夹 del_file('D:\\ryc\python_learning\\10_linkup\\fan_avatar') index = 0 # i 是爬取列表的索引号; index 是保存的图片的索引号 for i in range(0,len(fan_list)): fan_username = fan_list[i]['nickname'] #print('fans_user%s:'%(i+1),fan_username) fan_avatar_url = fan_list[i]['userAvatar'] #print('fans_avatar_url%s:'%(i+1),fan_avatar_url) save_imag('fans_avatar%s.jpg'%(index+1),fan_avatar_url) #print('----------------save_image--fans_avatar%s.jpg'%(index+1)) #图片去重 for j in range(0,index): if index != j : comp_res = compare_images('./fan_avatar/fans_avatar%s.jpg'%(index+1),'./fan_avatar/fans_avatar%s.jpg'%(j+1)) print('--------compare_images:--------'+'./fan_avatar/fans_avatar%s.jpg'%(index+1) + '------with---' + './fan_avatar/fans_avatar%s.jpg'%(j+1)) print('comp_res:',comp_res) if comp_res == 'same': del_avatar('D:\\ryc\python_learning\\10_linkup\\fan_avatar\\fans_avatar%s.jpg'%(index+1)) print('D:\\ryc\python_learning\\10_linkup\\fan_avatar\\fans_avatar%s.jpg'%(index+1)) index = index - 1 break index = index + 1
3、调用
if __name__ == "__main__": spider_fanavatar()
三、完整代码
# 爬取网页图片import requestsfrom urllib import requestimport jsonfrom PIL import Imageimport osimport mathimport operatorfrom functools import reduce#保存文件def save_imag(file_name,img_url): request.urlretrieve(url=img_url,filename='D:\\ryc\python_learning\\10_linkup\\fan_avatar\\'+file_name)#爬取粉丝的头像def spider_fanavatar(): url = 'https://blog.csdn.net//phoenix/web/v1/fans/list?page=1&pageSize=40&blogUsername=weixin_45386875' #关注我的部分请求地址 #url = 'https://blog.csdn.net//phoenix/web/v1/follow/list?page=1&pageSize=40&blogUsername=weixin_45386875' #我关注的部分请求地址 header = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36', 'Cookie' : 'uuid_tt_dd=10_30826311340-1612520858912-361156; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_30826311340-1612520858912-361156!5744*1*weixin_45386875!1788*1*PC_VC; UN=weixin_45386875; p_uid=U010000; ssxmod_itna=Qui=DKiI3hkbG=DXDnD+r8h9eD53ecxPPit5bP1ODlOaYxA5D8D6DQeGTbcW1AoWGATqFYKmEWiH5/gbO4FpjQGcxLbbYx0aDbqGkqnU40rDzuQGuD0KGRDD4GEDCHHDIxGUBuDeKDR8qDg7gQcCM=DbSdDKqDHR+4FbG4oD8PdS0p=C7Gt3AuQ3DmtSije3r424rQ+iPqWzPDA3DK6jiD==; ssxmod_itna2=Qui=DKiI3hkbG=DXDnD+r8h9eD53ecxPPit5bP1D66Ii40vah303pFcXW0D6QALW==0tWLGtzKPUA76xoU10vpqD6AoOqs1R=Db=3olozYp0wVxUS0r/GeZCqzVezFQc8dZon7efYcr=1nhNr6nWKcTqqaDQYcwYSA+hNaqem=WWuDuDQ/+1PGEsN=atvS7WDp07vFuFDherg0AP0KFw0ea6kcTtK2rh/fy=/De0n1FNk+ONYxCXr=QrdTj6gxCuNNWXvp1IDdl2Ckjc=N/cqV6SmHZIZIuOEqml=dHMroMFDvdMVr8afnyQ+sbGPCbt3xdD07tAdRD7uDQ0gT=Bh7OYblbtYQFDDLxD2tGDD===; UserName=weixin_45386875; UserInfo=9863b829527c49a3ba1622396deaa7d9; UserToken=9863b829527c49a3ba1622396deaa7d9; UserNick=ryc875327878; AU=01F; BT=1612846374580; Hm_up_6bcd52f51e9b3dce32bec4a3997715ac=%7B%22uid_%22%3A%7B%22value%22%3A%22weixin_45386875%22%2C%22scope%22%3A1%7D%2C%22islogin%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isonline%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isvip%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%7D; __gads=ID=94978f740e79c9e5-22c918ed05c600ea:T=1613266189:RT=1613266189:S=ALNI_Mbwb8ad5kdYjogF7yImerVAzKaJuQ; dc_session_id=10_1613272889543.735028; announcement-new=%7B%22isLogin%22%3Atrue%2C%22announcementUrl%22%3A%22https%3A%2F%2Fblog.csdn.net%2Fblogdevteam%2Farticle%2Fdetails%2F112280974%3Futm_source%3Dgonggao_0107%22%2C%22announcementCount%22%3A0%2C%22announcementExpire%22%3A3600000%7D; dc_sid=3784575ebe1e9d08a29b0e3fc3621328; c_first_ref=default; c_first_page=https%3A//www.csdn.net/; c_segment=4; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1613222907,1613266055,1613268241,1613273899; TY_SESSION_ID=82f0aa61-9b28-49b2-a854-b18414426735; c_pref=; c_ref=https%3A//www.csdn.net/; c_page_id=default; dc_tos=qoi2fq; log_Id_pv=925; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1613274327; log_Id_view=905; log_Id_click=658' } try: data = requests.get(url,headers = header).text #得到返回的字符串 data_dist = json.loads(data) #将字符串转为字典格式 except: print('爬取失败') exit () fan_list = data_dist['data']['list'] #提取所需内容 #先清空一下文件夹 del_file('D:\\ryc\python_learning\\10_linkup\\fan_avatar') index = 0 # i 是爬取列表的索引号; index 是保存的图片的索引号 for i in range(0,len(fan_list)): fan_username = fan_list[i]['nickname'] #print('fans_user%s:'%(i+1),fan_username) fan_avatar_url = fan_list[i]['userAvatar'] #print('fans_avatar_url%s:'%(i+1),fan_avatar_url) save_imag('fans_avatar%s.jpg'%(index+1),fan_avatar_url) #print('----------------save_image--fans_avatar%s.jpg'%(index+1)) #图片去重 for j in range(0,index): if index != j : comp_res = compare_images('./fan_avatar/fans_avatar%s.jpg'%(index+1),'./fan_avatar/fans_avatar%s.jpg'%(j+1)) print('--------compare_images:--------'+'./fan_avatar/fans_avatar%s.jpg'%(index+1) + '------with---' + './fan_avatar/fans_avatar%s.jpg'%(j+1)) print('comp_res:',comp_res) if comp_res == 'same': del_avatar('D:\\ryc\python_learning\\10_linkup\\fan_avatar\\fans_avatar%s.jpg'%(index+1)) print('D:\\ryc\python_learning\\10_linkup\\fan_avatar\\fans_avatar%s.jpg'%(index+1)) index = index - 1 break index = index + 1 #图像去重def compare_images(pic1,pic2): image1 = Image.open(pic1) image2 = Image.open(pic2) histogram1 = image1.histogram() histogram2 = image2.histogram() differ = math.sqrt(reduce(operator.add, list(map(lambda a,b: (a-b)**2,histogram1, histogram2)))/len(histogram1)) print('differ:',differ) if differ == 0: return 'same' else: return 'diff' #删除指定位置的图像def del_avatar(path): if os.path.exists(path): # 如果文件存在 os.remove(path) else: print('no such file:%s'%(path)) # 则返回文件不存在#删除一个文件夹下的所有所有文件def del_file(path): ls = os.listdir(path) for i in ls: c_path = os.path.join(path, i) if os.path.isdir(c_path):#如果是文件夹那么递归调用一下 del_file(c_path) else: #如果是一个文件那么直接删除 os.remove(c_path) print ('文件已经清空完成')if __name__ == "__main__": spider_fanavatar()
最后
第一部分内容就到这里,第二部分内容将在下一篇文章完成,感兴趣的小伙伴可以关注我,然后去看下一片文章哦!
都读到这里了,各位亲爱的读者留下你们宝贵的赞和评论吧,这将是我继续前进的坚定动力!!!
大量免费学习推荐,敬请访问python教程(视频)
The above is the detailed content of Python learns requests to crawl web images. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



VS Code can be used to write Python and provides many features that make it an ideal tool for developing Python applications. It allows users to: install Python extensions to get functions such as code completion, syntax highlighting, and debugging. Use the debugger to track code step by step, find and fix errors. Integrate Git for version control. Use code formatting tools to maintain code consistency. Use the Linting tool to spot potential problems ahead of time.

In VS Code, you can run the program in the terminal through the following steps: Prepare the code and open the integrated terminal to ensure that the code directory is consistent with the terminal working directory. Select the run command according to the programming language (such as Python's python your_file_name.py) to check whether it runs successfully and resolve errors. Use the debugger to improve debugging efficiency.

VS Code can run on Windows 8, but the experience may not be great. First make sure the system has been updated to the latest patch, then download the VS Code installation package that matches the system architecture and install it as prompted. After installation, be aware that some extensions may be incompatible with Windows 8 and need to look for alternative extensions or use newer Windows systems in a virtual machine. Install the necessary extensions to check whether they work properly. Although VS Code is feasible on Windows 8, it is recommended to upgrade to a newer Windows system for a better development experience and security.

VS Code extensions pose malicious risks, such as hiding malicious code, exploiting vulnerabilities, and masturbating as legitimate extensions. Methods to identify malicious extensions include: checking publishers, reading comments, checking code, and installing with caution. Security measures also include: security awareness, good habits, regular updates and antivirus software.

VS Code is the full name Visual Studio Code, which is a free and open source cross-platform code editor and development environment developed by Microsoft. It supports a wide range of programming languages and provides syntax highlighting, code automatic completion, code snippets and smart prompts to improve development efficiency. Through a rich extension ecosystem, users can add extensions to specific needs and languages, such as debuggers, code formatting tools, and Git integrations. VS Code also includes an intuitive debugger that helps quickly find and resolve bugs in your code.

Python excels in automation, scripting, and task management. 1) Automation: File backup is realized through standard libraries such as os and shutil. 2) Script writing: Use the psutil library to monitor system resources. 3) Task management: Use the schedule library to schedule tasks. Python's ease of use and rich library support makes it the preferred tool in these areas.

VS Code not only can run Python, but also provides powerful functions, including: automatically identifying Python files after installing Python extensions, providing functions such as code completion, syntax highlighting, and debugging. Relying on the installed Python environment, extensions act as bridge connection editing and Python environment. The debugging functions include setting breakpoints, step-by-step debugging, viewing variable values, and improving debugging efficiency. The integrated terminal supports running complex commands such as unit testing and package management. Supports extended configuration and enhances features such as code formatting, analysis and version control.

Yes, VS Code can run Python code. To run Python efficiently in VS Code, complete the following steps: Install the Python interpreter and configure environment variables. Install the Python extension in VS Code. Run Python code in VS Code's terminal via the command line. Use VS Code's debugging capabilities and code formatting to improve development efficiency. Adopt good programming habits and use performance analysis tools to optimize code performance.
