A simple crawler implemented in PHP
This small crawler fetches a target web page, extracts the URLs it contains, and crawls them recursively. The demo is based on code shared by other netizens, which I then modified myself. Since there are so many copies of it floating around online, I won't @ the original author (I don't know who the real author is).
The following is the code:
<?php
// Crawler class
class Crawler {
    private $url;

    public function __construct($url) {
        // Prepend a scheme if the URL does not already start with http/https
        if (!preg_match("/^(http)s?/", $url)) {
            $url = "http://" . $url;
        }
        $this->url = $url;
    }

    // Fetch the HTML content of the given URL
    protected function _getUrlContent($url) {
        @$handle = fopen($url, "r");
        if (error_get_last()) { // catch the failure (not necessarily an exception)
            $err = new Exception("Your URL seems wrong! How about trying another one?");
            echo $err->getMessage();
            return;
        }
        if ($handle) {
            $content = stream_get_contents($handle, 1024 * 1024); // read the stream into a string
            return $content;
        } else {
            return false;
        }
    }

    // Extract the links from the HTML content
    protected function _filterUrl($web_content) {
        $reg_tag_a = '/<[a|A].*?href=[\'\"]{0,1}([^>\'\"\ ]*).*?>/';
        $result = preg_match_all($reg_tag_a, $web_content, $match_result);
        if ($result) {
            return $match_result[1];
        }
    }

    // Check whether a URL is already complete (absolute)
    protected function _judgeURL($url) {
        $url_info = parse_url($url);
        if (isset($url_info['scheme']) || isset($url_info['host'])) {
            return true;
        }
        return false;
    }

    // Resolve relative paths against the base URL
    protected function _reviseUrl($base_url, $url_list) {
        $url_info = parse_url($base_url); // split the URL into its components
        unset($base_url);
        $base_url = isset($url_info["scheme"]) ? $url_info["scheme"] . '://' : ""; // scheme: http, ftp, etc.
        if (isset($url_info["user"]) && isset($url_info["pass"])) { // URLs carrying a username and password
            $base_url .= $url_info["user"] . ":" . $url_info["pass"] . "@";
        }
        $base_url .= isset($url_info["host"]) ? $url_info["host"] : ""; // host name
        if (isset($url_info["port"])) { // port, e.g. 8080
            $base_url .= ":" . $url_info["port"];
        }
        $base_url .= isset($url_info["path"]) ? $url_info["path"] : ""; // path
        // at this point the absolute prefix has been fully assembled
        if (is_array($url_list)) {
            $result = array();
            foreach ($url_list as $url_item) {
                if ($this->_judgeURL($url_item)) { // already a complete URL
                    $result[] = $url_item;
                } else { // incomplete URL: prepend the assembled base
                    $result[] = $base_url . $url_item;
                }
            }
            return $result;
        } else {
            return;
        }
    }

    // Crawl: fetch the page and return the list of resolved links
    public function crawler() {
        $content = $this->_getUrlContent($this->url);
        if ($content) {
            $url_list = $this->_reviseUrl($this->url, $this->_filterUrl($content));
            if ($url_list) {
                return $url_list;
            }
        }
        return;
    }
}

$fp_puts = fopen("url.txt", "ab"); // append newly found URLs to the list
$fp_gets = fopen("url.txt", "r");  // read URLs back from the list
$current_url = "www.baidu.com";
do {
    $Crawler = new Crawler(trim($current_url)); // trim() strips the trailing newline fgets() leaves on each line
    $url_arr = $Crawler->crawler();
    if ($url_arr) {
        foreach ($url_arr as $url) {
            fputs($fp_puts, $url . "\n");
        }
    }
} while ($current_url = fgets($fp_gets, 1024)); // keep pulling the next URL from the file
// echo "<pre>";
// var_dump($url_arr);
// echo "</pre>";
?>
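One practical issue with the loop above: url.txt is never deduplicated, so the same link can be appended and re-crawled over and over. Below is a minimal sketch of one way to guard against that, keeping an in-memory set of queued URLs; the $seen array and this variant of the loop are my own illustration, not part of the original code.

<?php
// Hypothetical dedup guard (illustrative): remember every URL already queued
// in url.txt so the same link is never appended, and thus never crawled, twice.
// Assumes the Crawler class defined above is available.
$seen = array();
$fp_puts = fopen("url.txt", "ab");
$fp_gets = fopen("url.txt", "r");
$current_url = "www.baidu.com";
$seen[$current_url] = true; // seed URL counts as queued
do {
    $Crawler = new Crawler(trim($current_url));
    $url_arr = $Crawler->crawler();
    if ($url_arr) {
        foreach ($url_arr as $url) {
            if (!isset($seen[$url])) { // only queue links we have not met before
                $seen[$url] = true;
                fputs($fp_puts, $url . "\n");
            }
        }
    }
} while ($current_url = fgets($fp_gets, 1024));

Since fgets() reads each line of url.txt exactly once, preventing duplicate writes is enough to prevent duplicate crawls. For a long-running crawl the set would eventually need to move out of memory (e.g. into a database), but that is beyond this demo.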
Because the loop may create a lot of new objects, I initially thought of using the singleton pattern to avoid the memory overhead. Later I found it too troublesome and let it go.
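For reference, the idea would look roughly like the sketch below: a private constructor plus a static getInstance() so one shared object is re-pointed at a new URL each iteration instead of being re-instantiated. This is only my illustration of the abandoned approach; the setUrl() helper is hypothetical and not part of the original class.

<?php
// Minimal singleton sketch (illustrative only). The crawl methods from the
// original Crawler class would need to be copied into this class to use it.
class SingletonCrawler {
    private static $instance = null;
    private $url;

    private function __construct() {} // block "new SingletonCrawler()" from outside

    public static function getInstance() {
        if (self::$instance === null) {
            self::$instance = new self(); // created once, reused afterwards
        }
        return self::$instance;
    }

    // Hypothetical setter replacing the constructor argument of the original class
    public function setUrl($url) {
        if (!preg_match("/^(http)s?/", $url)) {
            $url = "http://" . $url;
        }
        $this->url = $url;
        return $this;
    }
}

// Usage inside the loop, with no new object per iteration:
// $url_arr = SingletonCrawler::getInstance()->setUrl($current_url)->crawler();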
The above introduces a simple crawler implemented in PHP. I hope it will be helpful to friends who are interested in PHP tutorials.
