I need to automatically collect data from a website's article list and the actual content of the articles in it. The id of each article can be obtained from the list, and each article is fetched through a unified interface (passing the article id as a parameter returns the corresponding JSON); some of that data needs to be collected and then analyzed.
Are there any relatively mature frameworks or ready-made libraries that fit this? (It needs to be multi-threaded and able to run stably 24/7, because the volume to collect is huge.)
In addition, how should I store the collected content (millions to tens of millions of records)? Some of the data is numeric and needs statistical analysis. Can I use MySQL, or are there other more mature and simpler options?
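For reference, a minimal sketch of the pipeline described above, using requests plus a thread pool. The list URL, the detail endpoint, the page range, and the JSON field names ("items", "id") are all assumptions to be swapped for the site's real interface.

```python
import json
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical endpoints -- replace with the site's real list page and JSON interface.
LIST_URL = "https://example.com/api/articles?page={page}"
DETAIL_URL = "https://example.com/api/article/{article_id}"

def fetch_ids(page):
    """Fetch one page of the article list and return the article ids in it."""
    resp = requests.get(LIST_URL.format(page=page), timeout=10)
    resp.raise_for_status()
    # Assumes the list response is JSON with an "items" array carrying "id" fields.
    return [item["id"] for item in resp.json().get("items", [])]

def fetch_article(article_id):
    """Fetch one article's JSON through the unified interface."""
    resp = requests.get(DETAIL_URL.format(article_id=article_id), timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    ids = []
    for page in range(1, 11):  # first 10 list pages, just as a demo
        ids.extend(fetch_ids(page))

    # The thread pool covers the multi-threading requirement; tune max_workers politely.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for article in pool.map(fetch_article, ids):
            print(json.dumps(article, ensure_ascii=False)[:80])
```

For 24/7 operation this would still need retries and a scheduler around it, or a framework such as Scrapy as suggested in the replies below.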
If it's data analysis:
MapReduce works for log analysis.
Dpark can handle PV and UV analysis.
Spark is also good.
Once the data report is produced, you can use Pandas for the analysis and presentation (a minimal Pandas sketch follows after this reply).
If it's data collection, there are plenty of tools.
Why do I get the feeling you're trying to build a search engine? The volume is fairly large, so distributed tooling is recommended.
Using MySQL alone is not really practical.
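To illustrate the Pandas suggestion above, a minimal sketch of the statistics step. The CSV file and the column names (views, likes, comments, published_at) are placeholders for whatever numeric fields actually get collected; the same frame could equally be loaded from MySQL with pandas.read_sql.

```python
import pandas as pd

# Placeholder export of the collected records; could also come from pandas.read_sql.
df = pd.read_csv("articles.csv")

# Basic descriptive statistics for the numeric columns.
print(df[["views", "likes", "comments"]].describe())

# Example aggregation: daily totals, the kind of figure a PV-style report needs.
df["date"] = pd.to_datetime(df["published_at"]).dt.date
daily = df.groupby("date")[["views", "likes"]].sum()
print(daily.tail())
```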
Young man, isn't this just a job for a crawler?
Crawler framework: Scrapy
Database choice: at your data volume, a properly indexed MySQL will last you another 500 years
You can also try MongoDB
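A minimal Scrapy spider along the lines of the suggestion above. The list URL, the detail endpoint, and the "items"/"id" field names are assumptions; writing the results into MySQL or MongoDB would go in an item pipeline.

```python
import json

import scrapy

class ArticleSpider(scrapy.Spider):
    """Sketch: read the list pages for ids, then hit the unified JSON interface per id."""
    name = "articles"
    # Hypothetical URL -- substitute the real list page.
    start_urls = ["https://example.com/api/articles?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        for item in data.get("items", []):
            article_id = item["id"]
            yield scrapy.Request(
                f"https://example.com/api/article/{article_id}",
                callback=self.parse_article,
            )

    def parse_article(self, response):
        # The yielded dict can be persisted to MySQL/MongoDB through an item pipeline.
        yield json.loads(response.text)
```

Run it with `scrapy runspider` and let the CONCURRENT_REQUESTS setting handle the parallelism; Scrapy is asynchronous, so you get the throughput of multi-threading without managing threads yourself.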
You didn't mention the language or environment. For multi-threaded collection, Node.js and Python are the usual choices these days; both can store the data in MySQL or similar, and millions or tens of millions of rows is not a problem.
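As a rough sketch of the MySQL side (connection details and columns are made up, and the JSON column type needs MySQL 5.7+): keep the article id as the primary key and insert in batches, which holds up fine at millions of rows.

```python
import pymysql

# Placeholder connection parameters.
conn = pymysql.connect(host="localhost", user="crawler", password="secret",
                       database="articles", charset="utf8mb4")

with conn.cursor() as cur:
    # Article id as primary key, so re-crawling the same id overwrites cleanly.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS article (
            id BIGINT PRIMARY KEY,
            title VARCHAR(255),
            views INT,
            raw_json JSON
        )
    """)

    # Batched writes keep throughput reasonable; rows here are dummy data.
    rows = [(1, "demo title", 100, "{}")]
    cur.executemany(
        "REPLACE INTO article (id, title, views, raw_json) VALUES (%s, %s, %s, %s)",
        rows,
    )

conn.commit()
conn.close()
```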
Have you ever played with Python Selenium + PhantomJS?
In Python, Scrapy is what you want for this.