How many threads are appropriate for a Python crawler?

anonymity
Release: 2019-06-14 09:53:25

I have recently been planning to crawl data from an e-commerce website. Leaving proxies and distributed crawling aside, let's talk about efficiency first (bearing in mind that requesting too quickly will get you blocked). As novices, the first thing we usually think of is a for loop, which is single-threaded. The next idea is to have the for loop start five threads directly. The problem is that if the loop waits for each request to return before moving on, a URL that has not come back holds up all the rest, and using multiple threads this way gains nothing.
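
A minimal sketch of that pitfall, under stated assumptions: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the URLs are placeholders. Joining each thread inside the loop makes the work serial again, while starting all threads first and joining afterwards lets the waits overlap.

```python
import threading
import time

# Placeholder URLs; no network traffic is generated in this sketch.
urls = [f"https://example.com/item/{i}" for i in range(5)]

def fake_fetch(url, results):
    # Hypothetical stand-in for an HTTP request: wait, then record the URL.
    time.sleep(0.2)
    results.append(url)

def join_inside_loop():
    # Each thread is joined right after it starts, so requests run one by one.
    results = []
    start = time.perf_counter()
    for url in urls:
        t = threading.Thread(target=fake_fetch, args=(url, results))
        t.start()
        t.join()
    return time.perf_counter() - start, results

def join_after_loop():
    # All threads are started first, so their waiting time overlaps.
    results = []
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_fetch, args=(url, results)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, results

serial_time, serial_results = join_inside_loop()
overlap_time, overlap_results = join_after_loop()
print(f"join inside loop: {serial_time:.2f}s, join after loop: {overlap_time:.2f}s")
```

Both versions fetch the same five items; only the second one benefits from having five threads.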

Performance considerations

Once we have decided to use concurrency, should it be multi-threading or multi-processing? Some people are prejudiced against one or the other simply because of Python's GIL (global interpreter lock). Let's look at the difference between the two.

Multi-threading

Generally, running a .py file starts one process, and by default a process has one working thread. Multi-threading means running several threads inside that one process.
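
As a small illustration (the worker names and the `report` function are made up for this sketch), the standard `os` and `threading` modules can show that extra threads live inside the same process as the thread that started them:

```python
import os
import threading

main_pid = os.getpid()                        # ID of the process running this file
main_name = threading.current_thread().name   # the thread this code starts in

seen = []

def report(seen):
    # Record which process and which thread this function is running in.
    seen.append((os.getpid(), threading.current_thread().name))

threads = [
    threading.Thread(target=report, args=(seen,), name=f"worker-{i}")
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("started in:", main_pid, main_name)
for pid, name in seen:
    # Every worker reports the same process ID as the starting thread.
    print("ran in:", pid, name)
```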

But here comes the question: why use multi-threading?

Starting a process requires allocating some memory space, which is like a house that we have to work in. You can think of each thread as a person in that house. Under normal circumstances, a house for 10 people is a different size from a house for 20. Threads in the same process can communicate by default because they share that space (processes cannot communicate by default, although it can be arranged with mechanisms such as pipes). Because threads share data, concurrent access has to be kept correct, which is why the GIL exists in CPython: it ensures that only one thread is executing Python code at any moment.
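
A minimal sketch of threads sharing data inside one process (the counter and function names are illustrative). One caveat worth stating: the GIL keeps the interpreter itself consistent, but it does not make a compound update such as `counter += 1` atomic, so this sketch guards the shared counter with an explicit `threading.Lock`.

```python
import threading

counter = 0                       # shared by every thread in this process
counter_lock = threading.Lock()   # guards updates to the shared counter

def add_many(times):
    global counter
    for _ in range(times):
        # Read-modify-write is several steps, so hold the lock while doing it.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: all four threads updated the same variable
```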

You can think of the GIL like this: if there is an account to be settled in this room, only one person can be doing the calculation at any moment. Now consider: if 5 people can settle the account properly in a 10-square-meter room, why hire 10 people and pay for 20 square meters? So more threads is not automatically better. But note that not everyone needs to be using their brain (CPU computation) at the same time. While one person calculates, the others can do other things; for example, the 5 people split the work, each calculates a part, and each records the result in their own ledger for later reconciliation. That record-keeping is like IO, which is why multi-threading suits IO-bound work. Remember, though, that even for IO-bound work more people is not always better, so the right number still depends on the actual situation.
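
One way to find a workable number is to measure it. The sketch below is only an illustration under stated assumptions: `fake_request` sleeps for a fixed 0.05 seconds in place of a real request, the URLs are placeholders, and the thread counts tried are arbitrary. It uses `ThreadPoolExecutor` from the standard library to run the same simulated IO-bound job with different pool sizes.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Placeholder URLs; nothing is actually requested.
urls = [f"https://example.com/page/{i}" for i in range(20)]

def fake_request(url):
    # Hypothetical stand-in for network IO: the thread just waits.
    time.sleep(0.05)
    return url

def crawl(worker_count):
    # Run every simulated request through a pool of worker_count threads.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        pages = list(pool.map(fake_request, urls))
    return time.perf_counter() - start, pages

timings = {}
for worker_count in (1, 5, 20):
    elapsed, pages = crawl(worker_count)
    timings[worker_count] = elapsed
    print(f"{worker_count:>2} threads: {elapsed:.2f}s for {len(pages)} pages")
```

With only 20 tasks, a pool larger than 20 threads has nothing extra to do, which matches the point that more is not always better. Against a real site the useful limit also depends on response times and on how quickly the site starts blocking requests.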

source:php.cn