Crawler_movie ftp download address
Site: http://www.dy2018.com/
Database: mysql Account: root Password: 123456
Create table statement: CREATE TABLE dy2008_url (id int(9) NOT NULL AUTO_INCREMENT, url varchar(2000) NOT NULL , status tinyint(2) NOT NULL, PRIMARY KEY(id));
Code:
<?php declare(ticks = 1); pcntl_signal(SIGQUIT, 'signal_handler'); pcntl_signal(SIGTERM, 'signal_handler'); $crawlers_pid = array(); $finish_count = 0; //信号处理函数 function signal_handler($signal) { global $crawlers_pid; if ($signal == SIGQUIT || $signal == SIGTERM) { foreach ($crawlers_pid as $pid) { posix_kill($pid,SIGTERM); } echo "---------- crawl task exit ----------"; global $con;//mysql exit(); } } //GET方式获取链接对应页面内容 function get_page_content($url) { $content = file_get_contents($url); return $content; } //POST方式获取链接对应页面内容 function get_page_content_by_post($url, $arr) { $arr = http_build_query($arr); $opts = array ( 'http' => array('method' => 'POST', 'header' => 'Content-type:application/x-www-form-urlencoded'.' Content-Length:'.strlen($data).'"', 'content' => $data) ); $context = stream_context_create($opts); $content = file_get_contents($url,false,$context); return $content; } //dy2018抓取主流程 function run_dy2018() { global $crawlers_pid; global $finish_count; $crawl_urls = array("http://www.dy2018.com/html/tv/hytv/", "http://www.dy2018.com/html/tv/hepai/", "http://www.dy2018.com/html/tv/gangtai/", "http://www.dy2018.com/html/tv/oumeitv/", "http://www.dy2018.com/html/tv/rihantv/", "http://www.dy2018.com/html/tv/tvzz/", "http://www.dy2018.com/0/", "http://www.dy2018.com/1/", "http://www.dy2018.com/2/", "http://www.dy2018.com/3/", "http://www.dy2018.com/4/", "http://www.dy2018.com/5/", "http://www.dy2018.com/6/", "http://www.dy2018.com/7/", "http://www.dy2018.com/8/", "http://www.dy2018.com/9/", "http://www.dy2018.com/10/", "http://www.dy2018.com/11/", "http://www.dy2018.com/12/", "http://www.dy2018.com/13/", "http://www.dy2018.com/14/", "http://www.dy2018.com/15/", "http://www.dy2018.com/16/", "http://www.dy2018.com/17/", "http://www.dy2018.com/18/", "http://www.dy2018.com/19/", "http://www.dy2018.com/20/"); $i = 0; while($i < count($crawl_urls)) { $pid = pcntl_fork(); if($pid == -1) { echo "system error. check it now!"; exit(); } else if($pid > 0){ $crawlers_pid[$i] = $pid; } else { $url = $crawl_urls[$i]; $con = mysql_connect("localhost", "root", "123456"); if(!$con) { die('Count not connect: '.mysql_error()); } mysql_select_db("mysql", $con); crawl_process($url); $finish_count++; } $i++; } //pcntl_waitpid可能会导致信号监听失败 while (true) { if($finish_count == count($crawlers_pid)) { echo "---------- crawl task finish ----------"; mysql_close(); exit(); } sleep(1); } } //从入口链接到其下所有下载页链接抓取过程 function crawl_process($url) { echo "start handle url:".$url; $page_idx = 1; $valid_tag = true; $info_url_pattern = '/\/i\/\d+.html/'; $ftp_url_pattern = '/ftp:\/\/.*?.(swf|avi|flv|mpg|rm|mov|wav|asf|3gp|mkv|rmvb)/i';//^$两个符号不起作用 while($valid_tag) { $page_url = get_page_index_url($url, $page_idx); printf("start crawl url:".$page_url."\n"); $page_content = get_page_content($page_url); $valid_tag = is_valid_page($page_content); if($valid_tag) { $matches_urls = array(); preg_match_all($info_url_pattern, $page_content, $matches_urls); $page_content = mb_convert_encoding($page_content, "UTF-8", "GBK"); for($i=0; $i<count($matches_urls[0]); $i++) { $detail_url = 'http://www.dy2018.com'.$matches_urls[0][$i]; $detail_page_content = get_page_content($detail_url); $detail_page_content = mb_convert_encoding($detail_page_content, "UTF-8", "GBK"); preg_match_all($ftp_url_pattern, $detail_page_content, $ftp_urls); $ftp_links = array(); for($j=0;$j<count($ftp_urls[0]); $j++) { $ftp_links[$j] = $ftp_urls[0][$j]; } $ftp_links_unique = array_values(array_unique($ftp_links)); foreach ($ftp_links_unique as $ftp_link) { mysql_query("insert into dy2018_url (url, status) values('$ftp_link','0')"); // echo mysql_error();//打印mysql错误 } sleep(1); } } $page_idx++; } } //获取页码对应的url链接 function get_page_index_url($url, $idx) { $idx_url = $url; if($idx == 1) { $idx_url = $idx_url.'index.html'; } else if($idx > 1){ $idx_url = $idx_url.'index_'.$idx.'.html'; } return $idx_url; } //根据页面内容判断链接是否有效 function is_valid_page($content) { return $content?true:false; } run_dy2018(); mysql_close(); ?>
Result:
The above introduces the crawler_movie ftp download address, including the relevant content. I hope it will be helpful to friends who are interested in PHP tutorials.

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

HTTP status code 520 means that the server encountered an unknown error while processing the request and cannot provide more specific information. Used to indicate that an unknown error occurred when the server was processing the request, which may be caused by server configuration problems, network problems, or other unknown reasons. This is usually caused by server configuration issues, network issues, server overload, or coding errors. If you encounter a status code 520 error, it is best to contact the website administrator or technical support team for more information and assistance.

The reason for the error is NameResolutionError(self.host,self,e)frome, which is an exception type in the urllib3 library. The reason for this error is that DNS resolution failed, that is, the host name or IP address attempted to be resolved cannot be found. This may be caused by the entered URL address being incorrect or the DNS server being temporarily unavailable. How to solve this error There may be several ways to solve this error: Check whether the entered URL address is correct and make sure it is accessible Make sure the DNS server is available, you can try using the "ping" command on the command line to test whether the DNS server is available Try accessing the website using the IP address instead of the hostname if behind a proxy

How to use NginxProxyManager to implement automatic jump from HTTP to HTTPS. With the development of the Internet, more and more websites are beginning to use the HTTPS protocol to encrypt data transmission to improve data security and user privacy protection. Since the HTTPS protocol requires the support of an SSL certificate, certain technical support is required when deploying the HTTPS protocol. Nginx is a powerful and commonly used HTTP server and reverse proxy server, and NginxProxy

Understand the meaning of HTTP 301 status code: common application scenarios of web page redirection. With the rapid development of the Internet, people's requirements for web page interaction are becoming higher and higher. In the field of web design, web page redirection is a common and important technology, implemented through the HTTP 301 status code. This article will explore the meaning of HTTP 301 status code and common application scenarios in web page redirection. HTTP301 status code refers to permanent redirect (PermanentRedirect). When the server receives the client's

HTTP status code 403 means that the server rejected the client's request. The solution to http status code 403 is: 1. Check the authentication credentials. If the server requires authentication, ensure that the correct credentials are provided; 2. Check the IP address restrictions. If the server has restricted the IP address, ensure that the client's IP address is restricted. Whitelisted or not blacklisted; 3. Check the file permission settings. If the 403 status code is related to the permission settings of the file or directory, ensure that the client has sufficient permissions to access these files or directories, etc.

Differences: 1. Different definitions, url is a uniform resource locator, and html is a hypertext markup language; 2. There can be many urls in an html, but only one html page can exist in a url; 3. html refers to is a web page, and url refers to the website address.

Quick Application: Practical Development Case Analysis of PHP Asynchronous HTTP Download of Multiple Files With the development of the Internet, the file download function has become one of the basic needs of many websites and applications. For scenarios where multiple files need to be downloaded at the same time, the traditional synchronous download method is often inefficient and time-consuming. For this reason, using PHP to download multiple files asynchronously over HTTP has become an increasingly common solution. This article will analyze in detail how to use PHP asynchronous HTTP through an actual development case.

Solution: 1. Check the Content-Type in the request header; 2. Check the data format in the request body; 3. Use the appropriate encoding format; 4. Use the appropriate request method; 5. Check the server-side support.
