First, note that page data can reach the browser in only two ways, so these are the only two kinds of data a crawler has to handle: first, data rendered directly on the server (an MVC controller assigns values to a template page); second, data obtained through an interface and then rendered by JS (the interface returns it).
So, to find the data:
Check whether the response from the directly accessed address already contains the text you want (the case where an MVC controller assigns values to the template page).
If it does not, find out which interfaces the page obtains it through.
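The check described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not a definitive implementation; the same steps map onto PHP's curl functions. The names `fetch_html` and `is_rendered_directly` are hypothetical helpers introduced for this example.

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download the raw response body of a page (hypothetical helper, plain GET)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def is_rendered_directly(html: str, expected_text: str) -> bool:
    """Return True if text visible in the browser is already in the raw HTML.

    If it is missing, the page most likely loads it from an interface via JS.
    """
    return expected_text in html
```

In practice you would copy a short piece of text you can see on the page, call `is_rendered_directly(fetch_html(url), text)`, and only start looking for interfaces in the browser's network panel when the result is false.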
The same applies to following further links:
If the page is rendered directly, you can extract the links with XPath or CSS selectors using a third-party library, which separates the data from the tags.
If it is not rendered directly, you have to assemble the link from the parameters that the JS generates, then request it with cookies attached.
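For the directly rendered case, separating data from tags might look like the sketch below. Note the substitution: to stay self-contained it uses Python's standard `html.parser` module rather than a third-party XPath or CSS selector library. `LinkExtractor` and `extract_links` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> List[Tuple[str, str]]:
    """Return all (href, link text) pairs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

A dedicated XPath or CSS selector library is usually more convenient for deeply nested pages; this version only shows the idea of pulling the link targets and their text out of the markup.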
Note 1: If repeated attempts fail to return the value, replace the cookie manually.
Note 2: If the data comes from an interface, pay attention to the request URL. It has to be updated every day, because the parameters in the URL change; if you do not update it, the crawl will fail. One approach is to store the URL in the database, read it back when crawling, splice in the parameters, and pass the result to curl.
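The database approach in Note 2 could be sketched like this, in Python with the standard `sqlite3` module. The table name `crawl_urls`, its `base_url` column, and the function names are all assumptions made for illustration; the article does not specify a schema or which parameters change.

```python
import sqlite3
from urllib.parse import urlencode


def load_base_urls(conn):
    """Read the stored interface URLs (hypothetical table `crawl_urls`)."""
    rows = conn.execute("SELECT base_url FROM crawl_urls ORDER BY id")
    return [row[0] for row in rows]


def build_url(base_url, params):
    """Append the current query parameters to a stored base URL."""
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + urlencode(params)


def urls_for_today(conn, params):
    """Combine every stored URL with the parameters valid for this run."""
    return [build_url(base_url, params) for base_url in load_base_urls(conn)]
```

Keeping the stable part of the URL in the database and filling in the changing parameters at crawl time means only the parameter values need updating each day, not the crawler code.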
Note 3: I do not know WeChat's rate limit. If the data is not time-sensitive, crawling one round roughly every 10 seconds is enough.
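The pacing in Note 3 can be sketched as a simple loop that waits between rounds. The 10-second default mirrors the figure in the note and is not a documented WeChat limit; `crawl_rounds` and its parameters are hypothetical names, and `crawl_one_round` stands for whatever function performs one pass over your URLs.

```python
import time


def crawl_rounds(crawl_one_round, rounds, delay_seconds=10.0, sleep=time.sleep):
    """Run `crawl_one_round` repeatedly, pausing between rounds.

    `sleep` is injectable so the pacing can be checked without really waiting.
    """
    results = []
    for round_index in range(rounds):
        if round_index > 0:
            sleep(delay_seconds)  # wait before starting the next round
        results.append(crawl_one_round())
    return results
```

If requests start failing, increasing `delay_seconds` is the first thing to try, since the real limit is unknown.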
Most importantly, you do not always have to simulate a login before crawling. Log in normally, then find an interface and open it directly in the browser as a test; if data comes back, that proves the request only needs the cookies and the required parameters. There is no need to simulate scanning the QR code.
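Reusing the cookies from a normal browser login might look like the sketch below, written in Python with the standard library. The cookie names, parameter names, and helper functions are placeholders introduced here; copy the real values from the request shown in your browser's developer tools.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_interface_request(base_url, params, cookies):
    """Build a GET request carrying the query parameters and login cookies."""
    query = urlencode(params)
    url = base_url + "?" + query if query else base_url
    # Cookies copied from the logged-in browser session, joined into one header.
    cookie_header = "; ".join(name + "=" + value for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header, "User-Agent": "Mozilla/5.0"})


def fetch_interface(request, timeout=10.0):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

If `fetch_interface` returns the same data the browser showed, the cookies and parameters are sufficient and no login simulation is needed. In PHP the equivalent would be setting the cookie string and URL on a curl handle before executing it.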