Development of a simple crawler

高洛峰

Nov 22, 2016, 05:28 PM

To develop a crawler, first you need to know what your crawler is going to be used for. I want to use it to find articles with specific keywords on different websites and get their links so that I can read them quickly.

As a matter of habit, I start by sketching the interface, which helps me clarify what the tool needs:

1. Visit different websites, so we need a URL input box.

2. Find articles with specific keywords, so we need an article title input box.

3. Get the article links, so we need a container to display the search results.

<div class="jumbotron" id="mainJumbotron">
    <div class="panel panel-default">

        <div class="panel-heading">Article URL Crawler</div>

        <div class="panel-body">
            <div class="form-group">
                <label for="article_title">Article title</label>
                <input type="text" class="form-control" id="article_title" placeholder="Article title">
            </div>
            <div class="form-group">
                <label for="website_url">Website URL</label>
                <input type="text" class="form-control" id="website_url" placeholder="Website URL">
            </div>

            <button type="submit" class="btn btn-default">Crawl</button>
        </div>
    </div>
    <div class="panel panel-default">

        <div class="panel-heading">Article URLs</div>

        <div class="panel-body" id="article_url">
            <h3></h3>
        </div>
    </div>
</div>


That is all the markup; add some styling of your own and the interface is complete:

[Image: the finished interface]

The next step is to implement the functionality, which I wrote in PHP. First we need the HTML source of the website. There are many ways to fetch it and I won't cover them all; here I use curl, which returns the HTML when given the site's URL:

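private function get_html($url){
    $ch = curl_init();
    $timeout = 10; // seconds allowed for establishing the connection

    curl_setopt($ch, CURLOPT_URL, $url);
    // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Accept gzip-compressed responses and decode them automatically
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    // Send a browser-like User-Agent header
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);

    $html = curl_exec($ch); // false if the request failed

    return $html;
}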
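As a possible refinement of the function above, the sketch below also handles failures, since `curl_exec` returns `false` when a transfer fails. The name `fetch_html`, the redirect handling, the 30-second overall timeout, and the decision to treat HTTP status 400 and above as a failure are all assumptions made for illustration, not part of the original crawler.

```php
<?php
// Hypothetical helper: fetch a page, returning null (not false) on any failure.
function fetch_html(string $url, int $connectTimeout = 10): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body as a string
        CURLOPT_ENCODING       => 'gzip',          // accept and decode gzip
        CURLOPT_FOLLOWLOCATION => true,            // assumption: follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds to connect
        CURLOPT_TIMEOUT        => 30,              // assumption: cap on total time
    ]);

    $html   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Transport error (DNS, refused connection, timeout) or an HTTP error status.
    if ($html === false || $status >= 400) {
        return null;
    }
    return $html;
}
```

A caller can then skip a site when `fetch_html($url)` returns `null` rather than passing `false` on to the matching step.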


Once you have the HTML, you will quickly run into a problem: character encoding. A mismatched encoding can make the matching in the next step fail, so here we convert all fetched HTML to UTF-8:

$coding = mb_detect_encoding($html);

// Convert when the content is not detected as UTF-8 or is not valid UTF-8
if ($coding != "UTF-8" || !mb_check_encoding($html, "UTF-8")) {
    $html = mb_convert_encoding($html, 'utf-8', 'GBK,UTF-8,ASCII');
}
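The conversion can also be wrapped in a small function, sketched below. The name `to_utf8` and the `$fallback` parameter are hypothetical; the sketch assumes that any page that is not valid UTF-8 is encoded as GBK, which suits many Chinese sites but is only a guess for others.

```php
<?php
// Hypothetical helper: return $html as UTF-8.
// Assumption: content that is not valid UTF-8 is in $fallback (GBK by default).
function to_utf8(string $html, string $fallback = 'GBK'): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned untouched.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $fallback);
}

// The two bytes pairs below are the GBK encoding of the characters U+4E2D U+6587.
$gbk  = "\xD6\xD0\xCE\xC4";
$utf8 = to_utf8($gbk);
```

Checking validity first avoids re-encoding pages that are already UTF-8, which would otherwise corrupt them.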


With the site's HTML in hand, we now want the article URLs. The next step is to match every a tag on the page, which calls for a regular expression. After many tests I arrived at a fairly reliable pattern: no matter how complex the structure inside the a tag is, as long as it is an a tag it will not be missed (this is the most critical step):

$pattern = '|<a[^>]*>(.*)</a>|isU';

preg_match_all($pattern, $html, $matches);
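To see what the pattern captures, here is a small sketch run against made-up HTML. One caveat is that `<a[^>]*>` also matches tags whose names merely start with "a", such as `<abbr>`. The variant with `\b` is my suggestion rather than the author's pattern, and `anchor_contents` and both constants are hypothetical names.

```php
<?php
const ORIGINAL_PATTERN = '|<a[^>]*>(.*)</a>|isU';   // the pattern from the article
const STRICT_PATTERN   = '|<a\b[^>]*>(.*)</a>|isU'; // assumption: add a word boundary

// Hypothetical helper: return the inner content of every matched tag.
function anchor_contents(string $html, string $pattern): array
{
    preg_match_all($pattern, $html, $matches);
    return $matches[1]; // capture group 1: the text between <a ...> and </a>
}

$nested = '<p><a href="/a">One</a> and <a class="x" href="/b"><b>Two</b></a></p>';
$abbr   = '<abbr>HTML</abbr> <a href="/x">X</a>';

$fromNested = anchor_contents($nested, ORIGINAL_PATTERN); // ['One', '<b>Two</b>']
$loose      = anchor_contents($abbr, ORIGINAL_PATTERN);   // starts matching at <abbr>
$strict     = anchor_contents($abbr, STRICT_PATTERN);     // ['X']
```

Nested markup inside the link is captured intact, which is the behavior the article relies on. With the original pattern the `<abbr>` example yields a single match that runs from `<abbr>` to the closing `</a>`, while the stricter variant returns only the real link.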


The matches end up in $matches, a multi-dimensional array that looks roughly like this:

array(2) {
    [0]=>
    array(*) {
        [0]=>
        string(*) "the complete a tag"
        .
        .
        .
    }
    [1]=>
    array(*) {
        [0]=>
        string(*) "the content of the a tag at the same index above"
    }
}


Once you have this data, the rest is straightforward: traverse the array, find the a tags you want, and read whichever attributes you need. Here is a class I recommend that makes working with a tags easier:

$dom = new DOMDocument();

@$dom->loadHTML($a); // $a holds some of the a tags obtained above

$xpath = new DOMXPath($dom);

$hrefs = $xpath->evaluate('//a');

for ($i = 0; $i < $hrefs->length; $i++) {

    $href = $hrefs->item($i);

    $url = $href->getAttribute('href'); // read the a tag's href attribute

}
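Tying this back to the goal of finding articles by keyword, the sketch below collects the title and URL of every link whose text contains a keyword. The function name `find_links`, the case-insensitive comparison, and the shape of the returned array are assumptions for illustration. The `<?xml encoding="UTF-8">` prefix is a common workaround to make `loadHTML` read the input as UTF-8.

```php
<?php
// Hypothetical helper: return [['title' => ..., 'url' => ...], ...] for links
// whose visible text contains $keyword (case-insensitive).
function find_links(string $html, string $keyword): array
{
    $dom = new DOMDocument();

    // Collect parser warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];

    // Only a tags that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $title = trim($a->textContent); // text with nested tags stripped
        if ($title !== '' && mb_stripos($title, $keyword) !== false) {
            $links[] = ['title' => $title, 'url' => $a->getAttribute('href')];
        }
    }
    return $links;
}

$sample = '<ul><li><a href="/a">About us</a></li>'
        . '<li><a href="/b"><b>PHP crawler</b> tips</a></li>'
        . '<li><a>PHP no href</a></li></ul>';
$found = find_links($sample, 'php');
```

For the sample markup, only the second link qualifies: the first does not contain the keyword and the third has no `href`.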


Of course, this is only one approach; you can also use regular expressions to extract the information you want and process the data in other ways.

Once you have matched the results you want, the next step is to send them back to the front end for display. Write an API endpoint, fetch its data with JavaScript on the front end, and use jQuery to insert the content dynamically:

var website_url = 'your API endpoint URL';
$.getJSON(website_url, function(data){
    if(data){
        if(data.text == ''){
            $('#article_url').html('<div><p>No links found for this article</p></div>');
            return;
        }
        var string = '';
        var list = data.text;
        for (var j in list) {
            var content = list[j].url_content;
            for (var i in content) {
                if (content[i].title != '') {
                    string += '<div class="item">' +
                        '<em>[<a href="http://' + list[j].website.web_url + '" target="_blank">' + list[j].website.web_name + '</a>]</em>' +
                        '<a href="' + content[i].url + '" target="_blank" class="web_url">' + content[i].title + '</a>' +
                        '</div>';
                }
            }
        }
        $('#article_url').html(string);
    }
});
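The script above reads `data.text`, and within each entry `website.web_url`, `website.web_name`, and a `url_content` list of `title`/`url` pairs. The article does not show the server side, so the following is only a sketch of how a PHP endpoint might build such a payload. The function name `build_payload` and the input keys `host`, `name`, and `links` are hypothetical.

```php
<?php
// Hypothetical helper: turn crawl results into the JSON shape the script reads.
// Assumed input: [['host' => ..., 'name' => ..., 'links' => [['title' => ..., 'url' => ...]]]]
function build_payload(array $results): string
{
    $text = [];
    foreach ($results as $site) {
        $text[] = [
            'website'     => ['web_url' => $site['host'], 'web_name' => $site['name']],
            'url_content' => $site['links'],
        ];
    }

    // An empty string signals "no links", matching the data.text == '' check.
    $payload = ['text' => $text === [] ? '' : $text];

    return json_encode($payload, JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}

$json = build_payload([[
    'host'  => 'example.com',
    'name'  => 'Example',
    'links' => [['title' => 'PHP crawler tips', 'url' => 'http://example.com/b']],
]]);
```

An endpoint would typically send a `Content-Type: application/json` header and then echo the returned string.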


The final result:

[Image: the final result]
