Sometimes because of work and our own needs, we will browse different websites to obtain the data we need, so crawlers come into being. The following is the process of developing a simple crawler and the problems encountered. To develop a crawler, first you need to know what your crawler is going to be used for. I want to use it to find articles with specific keywords on different websites and get their links so that I can read them quickly.
According to personal habits, I first need to write an interface and clarify my ideas.
1. Go to different websites. Then we need a url input box.
2. Find articles with specific keywords. Then we need an article title input box.
3. Get the article link. Then we need a display container for search results.
[xhtml] view plain copy <p class="jumbotron" id="mainJumbotron"> <p class="panel panel-default"> <p class="panel-heading">文章URL抓取</p> <p class="panel-body"> <p class="form-group"> <label for="article_title">文章标题</label> <input type="text" class="form-control" id="article_title" placeholder="文章标题"> </p> <p class="form-group"> <label for="website_url">网站URL</label> <input type="text" class="form-control" id="website_url" placeholder="网站URL"> </p> <button type="submit" class="btn btn-default">抓取</button> </p> </p> <p class="panel panel-default"> <p class="panel-heading">文章URL</p> <p class="panel-body"> <h3></h3> </p> </p> </p>
Directly enter the code, and then add some style adjustments of your own, and the interface is complete:
Then the next step is the implementation of the function Yes, I use PHP to write it. The first step is to get the html code of the website. There are many ways to get the html code. I won’t introduce them one by one. Here I use curl to get it. Pass in the website url and you can get the html. Code:
[xhtml] view plain copy private function get_html($url){ $ch = curl_init(); $timeout = 10; curl_setopt($ch, CURLOPT_URL, $url); curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); curl_setopt($ch, CURLOPT_ENCODING, 'gzip'); curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'); curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); $html = curl_exec($ch); return $html; }
Although you have obtained the html code, you will soon encounter a problem, that is, the encoding problem, which may make your next step of matching in vain. We will unify the obtained code here. Convert html content to utf8 encoding:
[php] view plain copy $coding = mb_detect_encoding($html); if ($coding != "UTF-8" || !mb_check_encoding($html, "UTF-8")) $html = mb_convert_encoding($html, 'utf-8', 'GBK,UTF-8,ASCII');
Get the html of the website and get the url of the article. Then the next step is to match all a tags under the web page. Regular expressions need to be used. After many tests , and finally got a more reliable regular expression. No matter how complex the structure under the a tag is, as long as it is a tag, it will not be missed: (the most critical step)
[php] view plain copy $pattern = '|<a[^>]*>(.*)</a>|isU'; preg_match_all($pattern, $html, $matches);
The matching results are in $matches , it is probably a multi-dimensional element group like this;
[js] view plain copy array(2) { [0]=> array(*) { [0]=> string(*) "完整的a标签" . . . } [1]=> array(*) { [0]=> string(*) "与上面下标相对应的a标签中的内容" } }
As long as you can get this data, everything else is completely operable. You can traverse this element group, find the a tag you want, and then get the corresponding a tag. You can operate it however you want. Here is a class recommended to make it easier for you to operate the a tag:
[php] view plain copy $dom = new DOMDocument(); @$dom->loadHTML($a);//$a是上面得到的一些a标签 $url = new DOMXPath($dom); $hrefs = $url->evaluate('//a'); for ($i = 0; $i < $hrefs->length; $i++) { $href = $hrefs->item($i); $url = $href->getAttribute('href'); //这里获取a标签的href属性 }
Of course, this is just one way, you can also use regular expressions to match what you want information and play new tricks with the data.
得到并匹配得出你想要的结果,下一步当然就是传回前端将他们显示出来啦,把接口写好,然后前端用js获取数据,用jquery动态添加内容显示出来:
[php] view plain copy var website_url = '你的接口地址'; $.getJSON(website_url,function(data){ if(data){ if(data.text == ''){ $('#article_url').html('<p><p>暂无该文章链接</p></p>'); return; } var string = ''; var list = data.text; for (var j in list) { var content = list[j].url_content; for (var i in content) { if (content[i].title != '') { string += '<p class="item">' + '<em>[<a href="http://' + list[j].website.web_url + '" target="_blank">' + list[j].website.web_name + '</a>]</em>' + '<a href=" ' + content[i].url + '" target="_blank" class="web_url">' + content[i].title + '</a>' + '</p>'; } } } $('#article_url').html(string); });
上最终效果图:
The above is the detailed content of How to develop a simple crawler with PHP. For more information, please follow other related articles on the PHP Chinese website!