If line 54 of mutiple.py is changed to worker.daemon = False, then after all the images have finished downloading the program just hangs there and never exits.
$ python mutiple.py
Downloaded 253 images in total
Took 57.710124015808105s
...now it is completely stuck; the only way to stop it is kill -9
Next I ran $ pstree -p | grep python. Clearly the main thread and its child threads have not exited. Why is that? queue.join() was used, and the print statements were reached successfully, so the worker threads should already have finished their work.
python(6591)-+-{python}(6596)
             |-{python}(6597)
             |-{python}(6598)
             |-{python}(6599)
             |-{python}(6600)
             |-{python}(6601)
             |-{python}(6602)
             `-{python}(6603)
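Before reaching for kill -9, you can confirm from inside the process what those pstree entries are doing: threading.enumerate() lists every thread that is still alive. A minimal, self-contained sketch (my own illustration, using Python 3 module names; on Python 2 the module is spelled Queue) that reproduces a worker parked forever in queue.get():

```python
import queue
import threading
import time

q = queue.Queue()

def worker():
    # Blocks forever: q.get() with no timeout never returns once the
    # queue is empty -- exactly the state the pstree output shows.
    q.get()

t = threading.Thread(target=worker, name="blocked-worker")
t.daemon = True  # daemon here only so this demo itself can exit
t.start()
time.sleep(0.1)

# threading.enumerate() lists every thread that is still alive; the
# worker shows up because it is parked inside q.get().
alive = [th.name for th in threading.enumerate()]
print(alive)
```

If this were run with t.daemon = False, the interpreter itself would hang on exit, just like the question describes.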
The code of mutiple.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Queue import Queue
from threading import Thread
from time import time
from itertools import chain

from download import setup_download_dir, get_links, download_link


class DownloadWorker(Thread):

    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            item = self.queue.get()
            if item is None:
                break
            directory, link = item
            download_link(directory, link)
            self.queue.task_done()


def main():
    ts = time()
    url1 = 'http://www.toutiao.com/a6333981316853907714'
    url2 = 'http://www.toutiao.com/a6334459308533350658'
    url3 = 'http://www.toutiao.com/a6313664289211924737'
    url4 = 'http://www.toutiao.com/a6334337170774458625'
    url5 = 'http://www.toutiao.com/a6334486705982996738'
    download_dir = setup_download_dir('thread_imgs')
    # Create a queue to communicate with the worker threads
    queue = Queue()
    links = list(chain(
        get_links(url1),
        get_links(url2),
        get_links(url3),
        get_links(url4),
        get_links(url5),
    ))
    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the
        # workers are blocking
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as a tuple
    for link in links:
        queue.put((download_dir, link))
    # Causes the main thread to wait for the queue to finish processing all
    # the tasks
    queue.join()
    print u'Downloaded {} images in total'.format(len(links))
    print u'Took {}s'.format(time() - ts)


if __name__ == '__main__':
    main()

"""
Downloaded 253 images in total
Took 57.710124015808105s
"""
The code of download.py:
#!/usr/bin/env python
import os

import requests
from pathlib import Path
from bs4 import BeautifulSoup


def get_links(url):
    '''
    return the links in a list
    '''
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    return [img.attrs.get('src') for img in
            soup.find_all('p', class_='img-wrap')
            if img.attrs.get('src') is not None]


def download_link(directory, link):
    '''
    download the img by the link and save it
    '''
    img_name = '{}.jpg'.format(os.path.basename(link))
    download_path = directory / img_name
    r = requests.get(link)
    with download_path.open('wb') as fd:
        fd.write(r.content)


def setup_download_dir(directory):
    '''
    set the dir and create a new dir if not exists
    '''
    download_dir = Path(directory)
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir
While a program runs, there is one main thread; if the main thread creates a child thread, the two go their separate ways and run independently. When the main thread finishes and wants to exit, it checks whether its child threads are done; if a child thread is not done, the main thread waits for it to finish before exiting. Sometimes, though, what we want is for the process to exit as soon as the main thread is done, regardless of whether the child threads have finished — that is what setDaemon(True) is for.
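The paragraph above can be demonstrated directly. The sketch below is my own illustration (not the original script): it launches two throwaway interpreters with subprocess, one whose extra thread is a daemon and one whose extra thread is not. The daemon version exits as soon as the main thread finishes; the non-daemon version is still alive after the timeout, which is exactly the hang described in the question.

```python
import subprocess
import sys
import textwrap

# A tiny child program: one extra thread that sleeps for 60 seconds.
SCRIPT = textwrap.dedent("""
    import threading, time
    t = threading.Thread(target=time.sleep, args=(60,))
    t.daemon = {daemon}
    t.start()
    print("main thread done")
""")

# daemon=True: the interpreter exits as soon as the main thread ends,
# abandoning the sleeping thread.
done = subprocess.run([sys.executable, "-c", SCRIPT.format(daemon=True)],
                      capture_output=True, text=True, timeout=10)
print(done.stdout.strip())

# daemon=False: the interpreter waits for the sleeping thread, so the
# child process is still running when the 2-second timeout fires.
try:
    subprocess.run([sys.executable, "-c", SCRIPT.format(daemon=False)],
                   capture_output=True, text=True, timeout=2)
    waited = False
except subprocess.TimeoutExpired:
    waited = True
print(waited)
```

Note that in both cases the child's main thread prints "main thread done"; the only difference is whether the interpreter is allowed to shut down afterwards.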
My understanding is as follows:
1. setDaemon(True) marks a thread as a daemon thread; that is, with this set to True, when the main thread exits, the child threads are forcibly terminated along with it.
2. queue.join() makes the main thread wait until all the queued tasks are finished before continuing.
3. Threads have no terminate function.
Putting the three points above together, with setDaemon(False) the main thread ends up waiting for the child threads to exit. I'm completely stuck.
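For what it's worth, the run() method already contains the way out: the check `if item is None: break`. But main() never puts any None into the queue, so after queue.join() returns (it only waits for every put item to be matched by a task_done(), not for threads to end), the eight workers stay parked in queue.get() forever — and with daemon=False the interpreter waits on those still-alive threads at exit. A minimal sketch of the sentinel shutdown (Python 3 module names; the real (directory, link) download is replaced by a placeholder):

```python
import queue
import threading

NUM_WORKERS = 8
q = queue.Queue()

def run():
    while True:
        item = q.get()
        if item is None:        # sentinel: time to leave the loop
            break
        # ... download_link(directory, link) would go here ...
        q.task_done()

workers = [threading.Thread(target=run) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()                   # note: no daemon flag needed

for item in range(20):          # stand-in for the real (dir, link) tuples
    q.put(item)

q.join()                        # returns once every task got task_done()

for _ in workers:
    q.put(None)                 # one sentinel per worker
for w in workers:
    w.join()                    # now every worker thread really exits

print("all workers exited:", all(not w.is_alive() for w in workers))
```

With the sentinels in place the process exits cleanly even with daemon=False, because no thread is left blocked in q.get() when the main thread finishes.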