We currently run a multi-threaded crawler on Windows and use BeautifulSoup with the lxml parser for parsing.
N crawling threads -> parsing queue -> 1 parsing thread -> storage queue -> 1 storage thread
The whole program is bottlenecked on the CPU-bound parsing thread. Simply increasing the number of parsing threads only adds thread-switching overhead and slows things down (and under CPython's GIL, extra threads cannot run the CPU-bound parsing in parallel anyway).
Is there any way to significantly improve the parsing efficiency?
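For context, a minimal sketch of the current layout, with the fetch URL and the link extraction as placeholders (shutdown handling omitted):

```python
import queue
import threading

import requests
from bs4 import BeautifulSoup

parse_q = queue.Queue()  # crawling threads -> parsing thread
store_q = queue.Queue()  # parsing thread -> storage thread

def crawl(urls):
    # one of the N crawling threads: I/O-bound, mostly waiting on the network
    for url in urls:
        parse_q.put(requests.get(url, timeout=10).text)

def parse():
    # the single parsing thread: CPU-bound lxml work, where everything piles up
    while True:
        soup = BeautifulSoup(parse_q.get(), "lxml")
        store_q.put([a.get("href") for a in soup.find_all("a")])

threading.Thread(target=parse, daemon=True).start()
threading.Thread(target=crawl, args=(["https://example.com"],)).start()
```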
Update: following the advice of the two experts below, I plan to switch to

asynchronous crawling -> parsing queue -> N parsing processes -> storage queue -> storage thread

and get started on the rework.
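A minimal sketch of that layout, assuming aiohttp for the asynchronous fetching, with an asyncio queue standing in for the storage queue and placeholder URL/extraction logic:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp
from bs4 import BeautifulSoup

def parse(html):
    # runs in a worker process, so the CPU-bound lxml work leaves the event loop free
    soup = BeautifulSoup(html, "lxml")
    return [a.get("href") for a in soup.find_all("a")]

async def crawl(session, pool, url, store_q):
    async with session.get(url) as resp:
        html = await resp.text()
    loop = asyncio.get_running_loop()
    links = await loop.run_in_executor(pool, parse, html)
    await store_q.put(links)

async def main(urls):
    store_q = asyncio.Queue()
    with ProcessPoolExecutor(max_workers=4) as pool:
        async with aiohttp.ClientSession() as session:
            await asyncio.gather(*(crawl(session, pool, u, store_q) for u in urls))
    while not store_q.empty():
        print(store_q.get_nowait())  # stand-in for the storage step

if __name__ == "__main__":  # the guard is required for multiprocessing on Windows
    asyncio.run(main(["https://example.com"]))
```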
In fact, I think the "N crawling threads" at the front can be replaced by coroutines or a thread pool, because frequently creating threads wastes performance. A thread pool avoids that creation cost, but context switching is still unavoidable, so coroutines are more appropriate. The "1 parsing thread" can be replaced by a process pool, with several processes opened for the CPU-bound parsing. The rest should not need to change. If that is still not fast enough, rewrite the core parsing part in C/C++.
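For illustration, a minimal sketch of the process-pool part of this suggestion (the sample page and link extraction are placeholders):

```python
from concurrent.futures import ProcessPoolExecutor

from bs4 import BeautifulSoup

def parse(html):
    # CPU-bound work; each call runs in its own worker process, sidestepping the GIL
    return [a.get("href") for a in BeautifulSoup(html, "lxml").find_all("a")]

if __name__ == "__main__":  # required for multiprocessing on Windows
    pages = ["<html><a href='/a'>a</a></html>"]  # stand-in for the parsing queue
    with ProcessPoolExecutor(max_workers=4) as pool:
        for links in pool.map(parse, pages):
            print(links)
```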
I hope this helps.

My approach is multi-process. The advantage of multi-process is that when a single machine's performance is not enough, you can switch to a distributed crawler at any time.
You can find tornado asynchronous crawler examples online; that is what I use.
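For reference, a minimal fetch sketch using Tornado's AsyncHTTPClient (the URL is a placeholder; Tornado 5+ runs on the asyncio event loop):

```python
import asyncio

from tornado.httpclient import AsyncHTTPClient

async def fetch(url):
    # tornado's non-blocking client; many requests can be in flight at once
    response = await AsyncHTTPClient().fetch(url)
    return response.body.decode()

async def main():
    pages = await asyncio.gather(*(fetch(u) for u in ["https://example.com"]))
    print(len(pages))

asyncio.run(main())
```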