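I have been doing PHP development until now and would like to learn crawler development, but I am not sure where to start. Could you suggest a learning path? I am the type who wants to study this in depth. Many of the examples I find online feel like simple scraping, and there is very little material on URL expansion (following links outward from a page) or on the analysis side... Please recommend learning paths, resources, videos, and so on. Tips on pitfalls to avoid would be even better.
Learning is the best investment!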
If you have done web development, writing a crawler is fairly simple. Once you understand that it all comes down to the HTTP protocol, you will be fine.
Here are a few key points:
Crawl speed (the trade-off between throughput and keeping the crawl under control)
  Multithreading
  Multiprocessing
  Message queues
Page analysis
  API discovery: make good use of the Network tab in the browser developer tools (F12)
  XPath, re (regular expressions), and other parsing libraries
Structured data
  Persistence: use a database connection pool to cap the number of open database connections
Anti-crawler measures
  IP bans: use a proxy pool, and think about how to rotate proxies sensibly
  CAPTCHAs: OCR
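To make the speed-versus-control point concrete, here is a minimal Python sketch of a thread pool whose workers share one rate limiter. The names `RateLimiter` and `crawl_all`, the `fetch` callable, and the default values are my own assumptions for illustration, not something from a particular library. In a real crawler, `fetch` would wrap your HTTP client.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Space out request starts by at least `min_interval` seconds, across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed)
            self._next_allowed = start + self.min_interval
        delay = start - now
        if delay > 0:
            time.sleep(delay)


def crawl_all(urls, fetch, workers=4, min_interval=0.5):
    """Fetch every URL with `workers` threads, throttled by one shared limiter.

    `fetch` is a hypothetical callable taking a URL and returning its content.
    Returns a dict mapping each URL to whatever `fetch` returned for it.
    """
    limiter = RateLimiter(min_interval)

    def task(url):
        limiter.wait()
        return url, fetch(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, urls))
```

The threads give you concurrency while waiting on the network, and the limiter keeps the overall request rate polite no matter how many workers you configure. Moving from threads to processes or to a message queue changes how work is distributed, but the same throttling idea still applies.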
You can start by implementing a crawler in PHP to understand the principles. cURL can do the job too; the language is just a tool.
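Since the question asked about URL expansion, here is a rough sketch of that core principle, written in Python but easy to port to PHP: extract links from a page, normalize them, and keep a queue of unvisited URLs plus a set of URLs already seen. The function names, the `fetch` callable, and the same-host restriction are my own assumptions for illustration; only standard-library calls are used.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the raw href values of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in `html`, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    result = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            result.append(absolute)
    return result


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`, staying on the seed's host.

    `fetch` is a hypothetical callable that returns the page HTML as a string,
    or None when the page could not be retrieved.
    Returns the URLs that were fetched successfully, in visit order.
    """
    host = urlparse(seed).netloc
    frontier = deque([seed])  # URLs waiting to be fetched
    seen = {seed}             # every URL ever queued, to avoid duplicates
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

The `seen` set is what stops the crawl from looping forever on pages that link to each other, and `max_pages` is a simple safety cap. In a larger setup the in-memory queue would typically be replaced by a message queue and the set by a shared store.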
Read a book called "Python Web Crawler".