Out of necessity, I wanted to write a simple PHP collection (scraping) program. As usual, I went online, found a pile of tutorials, and copied from them, only to discover that every tutorial out there was vague and none of them actually worked. After puzzling over it for a few days, I finally figured out the reason. I am writing it down here and ask the experts to correct me.
The idea behind a collection program is very simple: first open a page, usually a list page, and extract the addresses of all the links in it; then open those links one by one and look for the content we are interested in; when it is found, save it to the database or process it however we like. Let me explain with a very simple example.
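In skeleton form, the whole program is roughly the following; this is only a sketch, the regular expression here is a generic placeholder, and each step is explained in detail below:

<?php
// Rough skeleton of the collector; details of each step follow.
$listUrl = "http://www.BkJia.com/article/11/index.htm";

// Step 1: open the list page and read its HTML source.
$html = file_get_contents($listUrl);

// Step 2: pull every article link (and its title) out of the HTML.
preg_match_all('/<a href="([^"]+)"[^>]*>([^<]+)<\/a>/', $html, $links);

// Step 3: open each link in turn and process what we find.
foreach ($links[1] as $i => $url) {
    $article = file_get_contents($url); // the article page's source
    // ...look for what we are interested in and save it...
}
?>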
First pick a collection page, usually a list page. The target here is http://www.BkJia.com/article/11/index.htm. It is a list page, and our goal is to collect all the articles on it. Step one is to open the page and pull its content into our program. Generally one of two functions is used: fopen or file_get_contents. We will use fopen as the example here. How do we open it? Very simple: $source = fopen("http://www.BkJia.com/article/11/index.htm", "r"); With that, the page is available to our program. Note that $source is a resource, not text we can process, so we use the function fread to read the content into a variable. That gives us real, editable text. Example:
$content = fread($source, 99999); The second number is how many bytes to read; just fill in a large one. If you write $content out to a text file with file_put_contents, you can see that it is simply the page's HTML source. Once we have the source, we need to pick out the article link addresses inside it. Regular expressions are used here [recommended regex tutorial: http://www.BkJia.com/article/7/all/545.1.htm]. Looking at the source, we can see that the article links all share the same fixed form.
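Cleaned up, the fetch step looks like this. One caveat: on a network stream a single fread can return fewer bytes than requested, so reading in a loop until feof() is the safer variant (the file name page.html is just for illustration):

<?php
// Open the remote list page (needs allow_url_fopen enabled in php.ini).
$source = fopen("http://www.BkJia.com/article/11/index.htm", "r");

// One-shot read as described above:
//   $content = fread($source, 99999);
// Safer: keep reading until the stream ends.
$content = "";
while (!feof($source)) {
    $content .= fread($source, 8192);
}
fclose($source);

// Dump the raw HTML to a file to inspect it.
file_put_contents("page.html", $content);
?>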
So we can write a regular expression for that form and hand it to preg_match_all: $count = preg_match_all($pattern, $content, $art_list); preg_match_all stores every match in the array $art_list and returns the number of matches (a concrete pattern is sketched below).
The array $art_list[1][$s] then holds the link address of the $s-th article, and $art_list[2][$s] holds that article's title. At this point the job can be considered half done.
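Here is a sketch of that extraction step, assuming the links in the source look like <a href="/article/11/123.htm">some title</a>; check the real markup and adjust the pattern accordingly:

<?php
// Continuing from the $content fetched above.
// Group 1 captures the link address, group 2 the title.
$pattern = '/<a href="(\/article\/11\/[^"]+\.htm)"[^>]*>([^<]+)<\/a>/';
$count = preg_match_all($pattern, $content, $art_list);

for ($s = 0; $s < $count; $s++) {
    echo $art_list[1][$s] . "\n"; // the article's link address
    echo $art_list[2][$s] . "\n"; // the article's title
}
?>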
Next, use a for loop to open each link in turn and grab its content the same way we grabbed the titles. Everything up to here matches the tutorials I found online, but when it comes to this for loop, the online tutorials are terrible; I never found an article that explained it clearly. At first I tried to drive the loop with js and similar tricks. Let me give you an example. This is what I did at first:
for ($i = 0; $i < 20; $i++) {
    // the content-collecting part goes in the middle (omitted)
    // after collecting one page, move on to the next
}
But it did not work: using fopen inside the loop to open the next link simply failed with a request error, and driving it with js did not work either. In the end I learned that I had to use echo to output a snippet that sends the browser back to the script itself, where aa.php is the file name of our program and the number after id tells it which page to collect next. That is what really implements the loop and collects page after page. This is the key to a true cycle.
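Putting it together, here is a minimal sketch of how aa.php can drive itself, assuming one article is collected per request and the list has 20 articles; the redirect string and the count of 20 are placeholders to adapt:

<?php
// aa.php - each request collects ONE article, then redirects to
// itself with id+1, so the loop runs across many HTTP requests
// instead of inside one request that may fail or time out.
$id = isset($_GET['id']) ? (int)$_GET['id'] : 0;

if ($id < 20) { // 20 = number of articles on the list page
    // ...fetch $art_list as shown above, open $art_list[1][$id],
    // extract what we are interested in, and save it...

    // hand control to the next round via the browser
    echo "<script>location.href='aa.php?id=" . ($id + 1) . "';</script>";
} else {
    echo "done";
}
?>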
My head is a bit foggy and the writing is a bit messy, so please bear with it. To experts this may be nothing special, but for novices like me it is genuinely helpful.