How can I improve the parsing efficiency of a Python crawler?
世界只因有你 2017-06-12 09:20:36

We currently crawl with multiple threads on Windows and parse the pages with BeautifulSoup using the lxml parser.

N crawling threads->parsing queue->1 parsing thread->storage queue->1 storage thread

The whole program is bottlenecked by the CPU-bound parsing thread. Simply adding more parsing threads does not help: CPython's global interpreter lock (GIL) lets only one thread run Python bytecode at a time, so the extra threads just add switching overhead and make things slower.

Is there any way to significantly improve the parsing efficiency?

Update: following the advice of the two experts who replied, I plan to switch to
asynchronous crawling->parsing queue->N parsing processes->storage queue->storage thread

Time to get to work.
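
For anyone reading later, here is a rough sketch of what that planned pipeline could look like using only the standard library. It is not the code from this thread: `fetch` and `parse` are hypothetical stand-ins for a real async HTTP client and for the BeautifulSoup/lxml step, and `run_pipeline` and the worker names are made up for illustration.

```python
import asyncio
import multiprocessing as mp
import threading

STOP = None  # sentinel that tells a consumer to shut down


def parse(html):
    """Hypothetical stand-in for the CPU-heavy BeautifulSoup/lxml step."""
    return html.upper()


def parse_worker(parse_q, store_q):
    """Runs in its own process: parse pages until the sentinel arrives."""
    while True:
        html = parse_q.get()
        if html is STOP:
            break
        store_q.put(parse(html))


def store_worker(store_q, sink):
    """Runs in a thread of the main process, since storage is I/O-bound."""
    while True:
        item = store_q.get()
        if item is STOP:
            break
        sink.append(item)  # a real crawler would write to a database or file


async def fetch(url):
    """Hypothetical stand-in for a real asynchronous HTTP request."""
    await asyncio.sleep(0.01)  # simulated network wait
    return f"<p>{url}</p>"


async def crawl(urls, parse_q):
    """Fetch all URLs concurrently and hand each page to the parsers."""
    async def one(url):
        parse_q.put(await fetch(url))
    await asyncio.gather(*(one(u) for u in urls))


def run_pipeline(urls, n_parsers=2):
    parse_q, store_q, results = mp.Queue(), mp.Queue(), []
    # Start the processes before any thread exists in this process.
    parsers = [mp.Process(target=parse_worker, args=(parse_q, store_q))
               for _ in range(n_parsers)]
    for p in parsers:
        p.start()
    storer = threading.Thread(target=store_worker, args=(store_q, results))
    storer.start()

    asyncio.run(crawl(urls, parse_q))

    for _ in parsers:          # one sentinel per parsing process
        parse_q.put(STOP)
    for p in parsers:
        p.join()
    store_q.put(STOP)          # parsers are done, so nothing follows this
    storer.join()
    return results


if __name__ == "__main__":  # required on Windows, where children re-import this module
    print(sorted(run_pipeline([f"http://example.com/{i}" for i in range(5)])))
```

The storage thread keeps draining the storage queue while the parsers run, so the parsing processes never block on exit waiting for their output to be consumed.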


All replies (3)
为情所困

I think the N crawling threads at the front can be replaced with coroutines or a thread pool, because creating threads over and over carries a performance cost. A thread pool reduces that cost, but context switching is still unavoidable, so coroutines should be the better fit.
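
To make the coroutine idea concrete, here is a minimal sketch with `asyncio`. It assumes a hypothetical `fetch` that only simulates network latency; in a real crawler you would swap in an asynchronous HTTP client. The `max_concurrency` parameter is also just an illustrative name.

```python
import asyncio


async def fetch(url):
    """Hypothetical stand-in for a real asynchronous HTTP request."""
    await asyncio.sleep(0.01)  # waits without blocking the event loop
    return f"<html><title>{url}</title></html>"


async def crawl(urls, max_concurrency=10):
    """Fetch all URLs on a single thread, at most max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return url, await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


if __name__ == "__main__":
    pages = asyncio.run(crawl([f"http://example.com/{i}" for i in range(20)]))
    print(len(pages), "pages fetched")
```

All the requests share one thread, so there is no per-request thread creation and no OS-level thread switching; the semaphore just caps how many requests are in flight.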
Replace the single parsing thread with a process pool, and open a few more processes for the CPU-bound work. The rest should not need to change. If that is still not fast enough, rewrite the core part in C/C++. Hope this helps.
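
A process pool for the parsing step might look roughly like this. To keep the sketch dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup with lxml; `TitleParser`, `parse_page`, and `parse_all` are names invented for this example.

```python
from concurrent.futures import ProcessPoolExecutor
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(html):
    """CPU-bound work. Must be a top-level function so it can be pickled."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def parse_all(pages, workers=4):
    """Spread the pages over a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches pages to cut inter-process communication overhead.
        return list(pool.map(parse_page, pages, chunksize=16))


if __name__ == "__main__":  # required on Windows, where children re-import this module
    pages = [f"<html><head><title>Page {i}</title></head></html>"
             for i in range(100)]
    print(parse_all(pages)[:3])
```

Each worker is a separate interpreter with its own GIL, so the parsing really does run in parallel. A worker count close to the number of CPU cores is a sensible starting point.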

刘奇

I use multiple processes. The advantage is that when a single machine is no longer fast enough, you can switch to a distributed crawler at any time.

淡淡烟草味

You can find Tornado-based asynchronous crawlers online; that is what I use.
