Home Backend Development PHP Tutorial A simple crawler implemented in PHP

A simple crawler implemented in PHP

Jul 29, 2016 am 08:58 AM
content info isset return url

The function of this small crawler is to crawl the URL of the target web page and implement recursive crawling. This small demo is based on the code of netizens and then modified by myself. Since there are too many online versions, I will not @ the original author (I don’t know who is the real author)

The following is the code:

<code><span><?php</span><span>//爬虫类</span><span><span>class</span><span>Crawler</span>{</span><span>private</span><span>$url</span>;
    <span>public</span><span><span>function</span><span>__construct</span><span>(<span>$url</span>)</span>{</span><span>if</span>(!preg_match(<span>"/^(http)s?/"</span>, <span>$url</span>)){
            <span>$url</span> = <span>"http://"</span>.<span>$url</span>;
        }
        <span>$this</span>->url = <span>$url</span>;
    }
    <span>//从给定的url中获取html内容</span><span>protected</span><span><span>function</span><span>_getUrlContent</span><span>(<span>$url</span>)</span>{</span>
        @<span>$handle</span> = fopen(<span>$url</span>, <span>"r"</span>);
        <span>if</span>(error_get_last()){<span>//捕获异常(不一定是错误)</span><span>$err</span> = <span>new</span><span>Exception</span>(<span>"你的URL好像不对!要不换一个?"</span>);
            <span>echo</span><span>$err</span>->getMessage();
            <span>return</span>;
        }
        <span>if</span>(<span>$handle</span>){
            <span>$content</span> = stream_get_contents(<span>$handle</span>,<span>1024</span>*<span>1024</span>);<span>//将资源流读入字符串</span><span>return</span><span>$content</span>;
        }<span>else</span>{
            <span>return</span><span>false</span>;
        }   
    }
    <span>//从html内容中筛选链接</span><span>protected</span><span><span>function</span><span>_filterUrl</span><span>(<span>$web_content</span>)</span>{</span><span>$reg_tag_a</span> = <span>'/<[a|A].*?href=[\'\"]{0,1}([^>\'\"\ ]*).*?>/'</span>;
        <span>$result</span> = preg_match_all(<span>$reg_tag_a</span>,<span>$web_content</span>,<span>$match_result</span>);
        <span>if</span>(<span>$result</span>){
            <span>return</span><span>$match_result</span>[<span>1</span>];
        }
    }
    <span>//判断是否是完整的url</span><span>protected</span><span><span>function</span><span>_judgeURL</span><span>(<span>$url</span>)</span>{</span><span>$url_info</span> = parse_url(<span>$url</span>);
        <span>if</span>(<span>isset</span>(<span>$url_info</span>[<span>'scheme'</span>])||<span>isset</span>(<span>$url_info</span>[<span>'host'</span>])){
            <span>return</span><span>true</span>;
        }
        <span>return</span><span>false</span>;
    }
    <span>//修正相对路径</span><span>protected</span><span><span>function</span><span>_reviseUrl</span><span>(<span>$base_url</span>,<span>$url_list</span>)</span>{</span><span>$url_info</span> = parse_url(<span>$base_url</span>);<span>//分解url中的各个部分</span><span>unset</span>(<span>$base_url</span>);
        <span>$base_url</span> = <span>isset</span>(<span>$url_info</span>[<span>"scheme"</span>])?<span>$url_info</span>[<span>"scheme"</span>].<span>'://'</span>:<span>""</span>;<span>//$url_info["scheme"]为http、ftp等</span><span>if</span>(<span>isset</span>(<span>$url_info</span>[<span>"user"</span>]) && <span>isset</span>(<span>$url_info</span>[<span>"pass"</span>])){<span>//记录用户名及密码的url</span><span>$base_url</span> .= <span>$url_info</span>[<span>"user"</span>].<span>":"</span>.<span>$url_info</span>[<span>"pass"</span>].<span>"@"</span>;
        }
        <span>$base_url</span> .= <span>isset</span>(<span>$url_info</span>[<span>"host"</span>])?<span>$url_info</span>[<span>"host"</span>]:<span>""</span>;<span>//$url_info["host"]域名</span><span>if</span>(<span>isset</span>(<span>$url_info</span>[<span>"port"</span>])){<span>//$url_info["port"]端口,8080等</span><span>$base_url</span> .= <span>":"</span>.<span>$url_info</span>[<span>"port"</span>];
        }
        <span>$base_url</span> .= <span>isset</span>(<span>$url_info</span>[<span>"path"</span>])?<span>$url_info</span>[<span>"path"</span>]:<span>""</span>;<span>//$url_info["path"]路径</span><span>//目前为止,绝对路径前面已经组装完</span><span>if</span>(is_array(<span>$url_list</span>)){
            <span>foreach</span> (<span>$url_list</span><span>as</span><span>$url_item</span>) {
                <span>// if(preg_match('/^(http)s?/',$url_item)){</span><span>if</span>(<span>$this</span>->_judgeURL(<span>$url_item</span>)){
                    <span>//已经是完整的url</span><span>$result</span>[] = <span>$url_item</span>;
                }<span>else</span> {
                    <span>//不完整的url</span><span>$real_url</span> = <span>$base_url</span>.<span>$url_item</span>;
                    <span>$result</span>[] = <span>$real_url</span>;
                }
            }
            <span>return</span><span>$result</span>;
        }<span>else</span> {
            <span>return</span>;
        }
    }
    <span>//爬虫</span><span>public</span><span><span>function</span><span>crawler</span><span>()</span>{</span><span>$content</span> = <span>$this</span>->_getUrlContent(<span>$this</span>->url);
        <span>if</span>(<span>$content</span>){
            <span>$url_list</span> = <span>$this</span>->_reviseUrl(<span>$this</span>->url,<span>$this</span>->_filterUrl(<span>$content</span>));
            <span>if</span>(<span>$url_list</span>){
                <span>return</span><span>$url_list</span>;
            }<span>else</span> {
                <span>return</span> ;
            }
        }<span>else</span>{
            <span>return</span> ;
        }
    }
}


<span>$fp_puts</span> = fopen(<span>"url.txt"</span>,<span>"ab"</span>);<span>//记录url列表</span><span>$fp_gets</span> = fopen(<span>"url.txt"</span>,<span>"r"</span>);<span>//保存url列表</span><span>$current_url</span> = <span>"www.baidu.com"</span>;
<span>do</span>{
    <span>$Crawler</span> = <span>new</span> Crawler(<span>$current_url</span>);
    <span>$url_arr</span> = <span>$Crawler</span>->crawler();
    <span>if</span>(<span>$url_arr</span>){
        <span>foreach</span> (<span>$url_arr</span><span>as</span><span>$url</span>) {
            fputs(<span>$fp_puts</span>,<span>$url</span>.<span>"\n"</span>);
        }
    }
}<span>while</span> (<span>$current_url</span> = fgets(<span>$fp_gets</span>,<span>1024</span>));<span>//不断获得url</span><span>// echo "<pre class="brush:php;toolbar:false">";</span><span>// var_dump($url_arr);</span><span>// echo "<pre/>";</span><span>?></span></span></code>
Copy after login

Because in There may be a lot of new objects during the loop. At that time, I thought of using the singleton mode to avoid excessive memory overhead. Later, I found it too troublesome and let it go. . . .

').addClass('pre-numbering').hide(); $(this).addClass('has-numbering').parent().append($numbering); for (i = 1; i ').text(i)); }; $numbering.fadeIn(1700); }); });

The above introduces a simple crawler implemented in PHP, including various aspects. I hope it will be helpful to friends who are interested in PHP tutorials.

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
Repo: How To Revive Teammates
1 months ago By 尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Detailed explanation of the usage of return in C language Detailed explanation of the usage of return in C language Oct 07, 2023 am 10:58 AM

The usage of return in C language is: 1. For functions whose return value type is void, you can use the return statement to end the execution of the function early; 2. For functions whose return value type is not void, the function of the return statement is to end the execution of the function. The result is returned to the caller; 3. End the execution of the function early. Inside the function, we can use the return statement to end the execution of the function early, even if the function does not return a value.

PHP function introduction—get_headers(): Get the response header information of the URL PHP function introduction—get_headers(): Get the response header information of the URL Jul 25, 2023 am 09:05 AM

PHP function introduction—get_headers(): Overview of obtaining the response header information of the URL: In PHP development, we often need to obtain the response header information of the web page or remote resource. The PHP function get_headers() can easily obtain the response header information of the target URL and return it in the form of an array. This article will introduce the usage of get_headers() function and provide some related code examples. Usage of get_headers() function: get_header

How to get your Steam ID in a few steps? How to get your Steam ID in a few steps? May 08, 2023 pm 11:43 PM

Nowadays, many Windows users who love games have entered the Steam client and can search, download and play any good games. However, many users' profiles may have the exact same name, making it difficult to find a profile or even link a Steam profile to other third-party accounts or join Steam forums to share content. The profile is assigned a unique 17-digit id, which remains the same and cannot be changed by the user at any time, whereas the username or custom URL can. Regardless, some users don't know their Steamid, and it's important to know this. If you don't know how to find your account's Steamid, don't panic. In this article

Why NameResolutionError(self.host, self, e) from e and how to solve it Why NameResolutionError(self.host, self, e) from e and how to solve it Mar 01, 2024 pm 01:20 PM

The reason for the error is NameResolutionError(self.host,self,e)frome, which is an exception type in the urllib3 library. The reason for this error is that DNS resolution failed, that is, the host name or IP address attempted to be resolved cannot be found. This may be caused by the entered URL address being incorrect or the DNS server being temporarily unavailable. How to solve this error There may be several ways to solve this error: Check whether the entered URL address is correct and make sure it is accessible Make sure the DNS server is available, you can try using the "ping" command on the command line to test whether the DNS server is available Try accessing the website using the IP address instead of the hostname if behind a proxy

How to use URL encoding and decoding in Java How to use URL encoding and decoding in Java May 08, 2023 pm 05:46 PM

Use url to encode and decode the class java.net.URLDecoder.decode(url, decoding format) decoder.decoding method for encoding and decoding. Convert into an ordinary string, URLEncoder.decode(url, encoding format) turns the ordinary string into a string in the specified format packagecom.zixue.springbootmybatis.test;importjava.io.UnsupportedEncodingException;importjava.net.URLDecoder;importjava.net. URLEncoder

What is the execution order of return and finally statements in Java? What is the execution order of return and finally statements in Java? Apr 25, 2023 pm 07:55 PM

Source code: publicclassReturnFinallyDemo{publicstaticvoidmain(String[]args){System.out.println(case1());}publicstaticintcase1(){intx;try{x=1;returnx;}finally{x=3;}}}#Output The output of the above code can simply conclude: return is executed before finally. Let's take a look at what happens at the bytecode level. The following intercepts part of the bytecode of the case1 method, and compares the source code to annotate the meaning of each instruction in

What is the difference between html and url What is the difference between html and url Mar 06, 2024 pm 03:06 PM

Differences: 1. Different definitions, url is a uniform resource locator, and html is a hypertext markup language; 2. There can be many urls in an html, but only one html page can exist in a url; 3. html refers to is a web page, and url refers to the website address.

Scrapy optimization tips: How to reduce crawling of duplicate URLs and improve efficiency Scrapy optimization tips: How to reduce crawling of duplicate URLs and improve efficiency Jun 22, 2023 pm 01:57 PM

Scrapy is a powerful Python crawler framework that can be used to obtain large amounts of data from the Internet. However, when developing Scrapy, we often encounter the problem of crawling duplicate URLs, which wastes a lot of time and resources and affects efficiency. This article will introduce some Scrapy optimization techniques to reduce the crawling of duplicate URLs and improve the efficiency of Scrapy crawlers. 1. Use the start_urls and allowed_domains attributes in the Scrapy crawler to

See all articles