python - What language do companies usually use for web crawlers?
阿神 2017-04-17 17:48:02

What language do companies usually use for web crawlers? When I search for books on JD.com, almost everything is about Java.


Replies (30)
左手右手慢动作

You can try jsoup, an HTML parsing library written in Java.

阿神

Just use Node. JavaScript is the language that understands HTML best.

Peter_Zhu

nodejs +1

洪涛

nodejs +1

伊谢尔伦

Actually, I don't quite agree with the person who wrote the DHT crawler.
Different languages naturally suit different uses; arguing about which one is better or worse without any context is pointless.
1. If you are doing it for fun, crawling a few pages in a targeted way and efficiency is not a core requirement, then any language will do and the performance difference will be small. Of course, if you hit a very complex page and your regular expressions get complicated, the crawler's maintainability will suffer.

2. If you are doing targeted crawling and the target renders its content with dynamic JS,
then simply requesting the page and reading the response will not work. You need a JS engine like the ones in Firefox or Chrome to execute the page's scripts. For this, CasperJS + PhantomJS or SlimerJS + PhantomJS are recommended.

3. If it is large-scale website crawling,
then efficiency, scalability, maintainability, and so on must be considered.
Large-scale crawling involves many problems: distributed crawling, deduplication, task scheduling. None of these is easy once you dig into it.
Language choice matters a lot at this point.

NodeJS: very efficient at crawling. It handles high concurrency well, multi-threaded programming turns into simple iteration plus callbacks, and memory and CPU usage are low, but you have to manage the callbacks carefully.

PHP: frameworks are available everywhere; just pick one. However, PHP's efficiency is genuinely a problem... not much more to say.

Python: I write mostly in Python, and it has good support for all kinds of problems. The Scrapy framework is easy to use and has many advantages.

I don't think JS is very well suited for writing crawlers... efficiency issues. I haven't written one in it, though, so take that with a grain of salt.

As far as I know, big companies also use C++. In short, most crawlers are built by modifying open-source frameworks; few people really reinvent the wheel, and it isn't worth it.

I wrote it casually based on my impressions. Corrections are welcome.
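The deduplication and task-scheduling concerns from point 3 can be sketched at toy scale in any language; here is a minimal Python version, where a seen-set prevents re-fetching a URL and a FIFO frontier acts as the simplest possible scheduler. The `fetch_links` callable and the toy link graph are stand-ins for a real download-and-parse step, not part of any framework.

```python
from collections import deque

def crawl(start_url, fetch_links, max_pages=100):
    """Breadth-first crawl with a seen-set for deduplication.

    fetch_links(url) stands in for the real download+parse step
    and should return an iterable of discovered URLs.
    """
    seen = {start_url}             # dedup: never schedule a URL twice
    frontier = deque([start_url])  # FIFO queue = minimal task scheduler
    visited = []

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph instead of real HTTP, so the logic is easy to check.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}
print(crawl("a", lambda u: graph[u]))  # each page fetched exactly once
```

At real scale the seen-set becomes a Bloom filter or a shared store and the frontier becomes a distributed queue, but the shape of the problem is the same.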

洪涛

Use pyspider: its performance is no worse than Scrapy's, it is more flexible, it has a web UI, and it also supports crawling JS-rendered pages~
You can try it out with its built-in demo~

迷茫

selenium

黄舟

nodejs +1

No, wait, I was wrong.


A high-performance crawler is not like a server, where concurrency is the goal; for efficiency (less duplicated work) it benefits more from parallelism than from concurrency.

Hmm, I was wrong again.


Concurrency and parallelism are pretty much the same thing for a crawler~


No, they're different.

Forget it, nodejs +1.
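For what it's worth, the concurrency-vs-parallelism question has a fairly concrete answer for crawlers: fetching is I/O-bound, so overlapping the waits (concurrency) gives most of the speedup even on a single thread, while parallelism mainly helps CPU-heavy parsing. A minimal Python asyncio sketch, with a simulated 0.1 s fetch standing in for the network:

```python
import asyncio
import time

async def fetch(url):
    # Simulated network wait; a real crawler would use an async HTTP client.
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"

async def crawl_all(urls):
    # All the waits overlap on a single thread: concurrency, not parallelism.
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"http://example.com/{i}" for i in range(10)]
start = time.perf_counter()
pages = asyncio.run(crawl_all(urls))
elapsed = time.perf_counter() - start
print(len(pages), round(elapsed, 1))  # 10 pages in roughly 0.1 s, not 1.0 s
```

Node's event loop gives you the same effect, which is why it keeps getting +1'd in this thread.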

大家讲道理

Most use Python, and of course plenty use Java and C++ as well. Python gets results quickly and has big advantages for small and medium projects. At large scale you need to optimize, or rewrite the performance-bottleneck code in C.

Peter_Zhu

You can try Python's Scrapy.
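Under the hood, what Scrapy's selectors and link following boil down to is extracting `href` attributes and resolving them against the page URL. That core step can be sketched with the standard library alone (the sample HTML and URLs below are made up for illustration):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href=...> tags in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

html = '<a href="/books">Books</a> <a href="http://other.site/p">P</a>'
extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(html)
print(extractor.links)
# ['http://example.com/books', 'http://other.site/p']
```

Scrapy adds scheduling, deduplication, throttling, and pipelines on top of this, which is why it is worth using for anything beyond a toy.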
