This article describes the design and implementation of a PHP "thief" (scraper) program. I am sharing it for your reference; the details are as follows:
In fact, I have always wanted to make a website of connotative pictures. My earlier idea was to build a CMS and upload some pictures myself.
Once I started on that, I lost motivation and gave up. Later I studied cURL, and it turned out to be a better way to implement the idea.
Using PHP to steal pictures is like wearing socks with sandals: it works, but it hurts to look at.
Let me first talk about the design of the PHP thief program. Standard PHP does not offer multi-threading out of the box, so the steps can only run one after another:
Fetch the HTML page of the target website + parse the HTML page to get the links to the images + read each image in binary mode and save it locally + rename == done
There are two ways to run the program:
The first way: run the program through a browser (it will most likely hang; raising the timeout and the memory limit makes it work, but the wait is painful)
The other way: start PHP from the command line (there is no PHP timeout problem)
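The two ways of running the program differ mainly in their limits. Below is a minimal sketch, assuming you still want the browser route to work. The helper name relaxLimitsForWeb and the 256M value are illustrative and not part of the original program.

```php
<?php
// Hypothetical helper (not in the original program): relax the limits
// only when the script is served through a web SAPI.
function relaxLimitsForWeb(string $memoryLimit = '256M'): bool
{
    if (PHP_SAPI === 'cli') {
        // On the command line max_execution_time defaults to 0 (no limit),
        // so there is nothing to change.
        return false;
    }
    set_time_limit(0);                     // remove the execution time limit
    ini_set('memory_limit', $memoryLimit); // assumed value, tune as needed
    return true;
}

relaxLimitsForWeb();
```

From the command line the same script is simply started with something like `php thief.php` (the file name is an assumption).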
<?php
/**
 * HTML parsing class
 * author: Summer
 * date: 2014-08-22
 */
class Analytical
{
    public function __construct()
    {
        require_once('Class/SimpleHtmlDom.class.php');
        $this->_getDir();
    }

    private function _getDir()
    {
        $dir = "../TMP/HTML/Results/1";   // locally saved HTML pages
        $imgBIG = "../TMP/IMG/JPG/BIG";   // target directory for the images
        $it = new DirectoryIterator($dir . "/");
        foreach ($it as $file) {
            // Use isDot() to filter out the "." and ".." entries
            if (!$file->isDot()) {
                $name = $file->getFilename();
                $dirs = $dir . "/" . $name;
                $tmp = explode(".", $name);
                $html = file_get_html($dirs);
                $ulArr = $html->find('img');
                foreach ($ulArr as $key => $value) {
                    if ($value->class == "u") {
                        $url = "http://www.jb51.net" . $value->src;
                        $infomation = file_get_contents($url);
                        if ($infomation === false) {
                            continue; // download failed, skip this image
                        }
                        $result = $this->saveHtml($infomation, $imgBIG, $tmp[0] . ".jpg");
                        if ($result) {
                            echo $name . "OK\n";
                        }
                    }
                }
            }
        }
    }

    private function saveHtml($infomation, $filedir, $filename)
    {
        if (!$this->mkdirs($filedir)) {
            return 0;
        }
        $sf = $filedir . "/" . $filename;
        $fp = fopen($sf, "wb");               // open the file for binary writing
        if ($fp === false) {
            return 0;
        }
        $written = fwrite($fp, $infomation);  // write the content
        fclose($fp);                          // close the file before returning
        return $written;
    }

    // Create the directory, including missing parent directories
    private function mkdirs($dir)
    {
        if (!is_dir($dir)) {
            if (!$this->mkdirs(dirname($dir))) {
                return false;
            }
            if (!mkdir($dir, 0777)) {
                return false;
            }
        }
        return true;
    }
}

new Analytical();
The above is the process of obtaining the image URLs from the saved HTML pages and downloading the images.
Two important things are used:
1. The PHP DOM parsing library simplehtmldom (a library written in PHP, not a compiled extension)
2. PHP's directory iterator (DirectoryIterator)
Once you understand these two things, there is nothing difficult about this parsing class.
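As a small illustration of the second item, here is a sketch of DirectoryIterator used on its own. The function name listFiles is hypothetical. It also skips subdirectories, which the class above does not do.

```php
<?php
// Hypothetical helper: return the names of the regular files in $dir,
// sorted alphabetically.
function listFiles(string $dir): array
{
    $names = [];
    foreach (new DirectoryIterator($dir) as $entry) {
        // isDot() is true for "." and ".."; isFile() excludes subdirectories
        if ($entry->isDot() || !$entry->isFile()) {
            continue;
        }
        $names[] = $entry->getFilename();
    }
    sort($names);
    return $names;
}
```

DirectoryIterator does not guarantee any particular order, hence the explicit sort.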
What about getting the pages that need to be parsed?
The principle is the same as above: get the URL of the page, read the page with cURL, which returns the HTML as a string, and then save the HTML page locally with a wrapped save function.
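The fetch-and-save step is not shown in the article. The following is only a sketch of how it might look, assuming the cURL extension is loaded. The function names fetchPage and savePage, the 30-second timeout and the example URL are all assumptions.

```php
<?php
// Hypothetical helper: download a page and return its HTML,
// or null if the request fails.
function fetchPage(string $url, int $timeout = 30): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // give up after $timeout seconds
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Hypothetical helper: save the HTML string under $dir/$name,
// creating the directory (and its parents) if needed.
function savePage(string $html, string $dir, string $name): bool
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        return false;
    }
    return file_put_contents($dir . '/' . $name, $html) !== false;
}

// Usage (placeholder URL, directory taken from the parsing class):
// $html = fetchPage('http://example.com/page/1.html');
// if ($html !== null) {
//     savePage($html, '../TMP/HTML/Results/1', '1.html');
// }
```

Passing true as the third argument of mkdir creates missing parent directories in one call, so no recursive helper is needed here.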
In my case the pictures I want to collect sit inside the pages (the site does this to prevent hotlinking), so the design is relatively complicated.
The fetching and parsing steps are separated because the simplehtmldom object is very large, and splitting them apart makes the process clearer.
Some people will surely ask: why not use regular expressions and skip the step of saving the HTML locally? Bingo! I just can't be bothered to write the regular expressions.
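For readers who do want the regular-expression route, here is a rough sketch of what it could look like. This is not the author's method, and the function name extractImageSrcs is hypothetical. It only handles quoted attributes and an exact class value, so it is far less robust than a DOM parser.

```php
<?php
// Hypothetical helper: return the src values of all <img> tags whose
// class attribute is exactly $class.
function extractImageSrcs(string $html, string $class): array
{
    $srcs = [];
    // Step 1: collect every <img ...> tag
    if (!preg_match_all('/<img\b[^>]*>/i', $html, $tags)) {
        return $srcs;
    }
    $classPattern = '/\bclass\s*=\s*["\']' . preg_quote($class, '/') . '["\']/i';
    $srcPattern = '/\bsrc\s*=\s*["\']([^"\']+)["\']/i';
    // Step 2: keep the tags with the wanted class and pull out their src
    foreach ($tags[0] as $tag) {
        if (preg_match($classPattern, $tag) && preg_match($srcPattern, $tag, $m)) {
            $srcs[] = $m[1];
        }
    }
    return $srcs;
}
```

Called on the HTML string returned by cURL with the class "u", this would yield the relative image paths directly, without writing the page to disk first.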