
Question: looking for an efficient, workable way to scrape a large number of web pages with PHP

Jun 23, 2016 pm 01:50 PM

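I want to use PHP's cURL to scrape music information from 虾米网 (xiami.com).
It is very slow, though. After about 50 pages the scrape stops and the page hangs, and a second run cannot fetch anything. The site is probably identifying my IP and then refusing further requests, so collecting the data is extremely slow overall.
How should scraping at this scale be done?
The problem could also be in my code.
Part of the code follows.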

$j = 0;
// Starting ID
$id = 200000;
// Holds the scraped data; the goal is 1000 records
$data = array();

// Build the headers once. Appending to $header inside the loop would
// resend an ever-growing list of duplicate headers on each request.
$header = array(
    'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language:zh-CN,zh;q=0.8',
    'Cache-Control:max-age=0',
    'Connection:keep-alive',
    'Cookie:_unsign_token=a35437bd35c221c09a0e6f564e17c225; __gads=ID=7fcc242f6fd63d77:T=1408774454:S=ALNI_Mae8MH6vL5z6q4NlGYzyqgD4jHeEg; bdshare_firstime=1408774454639; _xiamitoken=3541aab48832ba3ceb089de7f39b9b0f; pnm_cku822=211n%2BqZ9mgNqgJnCG0Zu8%2BzyLTPuc%2B7wbrff98%3D%7CnOiH84T3jPCG%2FIr%2BiPOG8lI%3D%7CneiHGXz6UeRW5k4rRCFXIkcoTdd7ym3fZdO2FrY%3D%7Cmu6b9JHlkuGa5pDqnOie5ZDkmeqb4ZTule6V7ZjjlOib7JrmkvdX%7Cm%2B%2BT%2FGIUew96DXsUYBd4HawbrTOXOVI4iyOLIYUqT%2B9P%7CmO6BH2wDcB9rHGsYdwRrH2gfbAN%2FDH8QZBNkF3gDeQqqCg%3D%3D%7Cme6d7oHyneiH84Twn%2BmR64TzUw%3D%3D; CNZZDATA921634=cnzz_eid%3D1437506062-1408774274-%26ntime%3D1408937320; CNZZDATA2629111=cnzz_eid%3D2021816723-1408774274-%26ntime%3D1408937320; isg=075E6FBDF77039CEB63A1BA239420244',
    'Host:www.xiami.com',
    'User-Agent:Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1653.0 Safari/537.36',
);

// Note: $j only counts pages that matched, so the loop keeps requesting
// new IDs for as long as pages fail to match (for example when blocked).
while ($j < 1000) {
    // Remember the ID being fetched so the data is stored under the right key
    $songId = $id++;
    $url = 'http://www.xiami.com/song/' . $songId;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);            // URL to fetch
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);  // Request headers
    // Let cURL send Accept-Encoding and decompress the response itself.
    // A hand-written "Accept-Encoding: gzip" header leaves $content
    // compressed if the server honors it, and the regexes never match.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    curl_setopt($ch, CURLOPT_HEADER, 0);            // Exclude response headers from the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);    // Return the body as a string
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);          // Give up on a request after 20 seconds
    $content = curl_exec($ch);                      // Run the request
    $status = curl_getinfo($ch);                    // Only meaningful after curl_exec()
    // $status['redirect_url'] holds the address the page redirected to
    $curl_errno = curl_errno($ch);
    $curl_error = curl_error($ch);
    curl_close($ch);                                // Close the cURL session

    preg_match('/name="description"\s+content="《(.+)》演唱者(.+),所属专辑《(.+)》/', $content, $matches);
    // Skip the page if the song title is empty
    if (empty($matches[1]) || trim($matches[1]) == '') {
        continue;
    }

    // Matched data
    $data[$songId]['song'] = empty($matches[1]) ? ' ' : $matches[1];
    $data[$songId]['songer'] = empty($matches[2]) ? ' ' : $matches[2];
    $data[$songId]['album'] = empty($matches[3]) ? ' ' : $matches[3];

    preg_match('/album\/(\d+)/', $content, $matches);
    $data[$songId]['albumId'] = empty($matches[1]) ? 0 : $matches[1];
    preg_match('/\/artist\/(\d+)/', $content, $matches);
    $data[$songId]['songerId'] = empty($matches[1]) ? 0 : $matches[1];
    // Lyrics: <div class="lrc_main">
    preg_match('/<div class="lrc_main">(.*)<\/div>/Us', $content, $matches);
    $data[$songId]['lrc'] = empty($matches[1]) ? ' ' : addslashes($matches[1]);
    // Share count: 分享<em>(3269)</em>
    preg_match('/分享<em>\((\d+)\)<\/em>/Us', $content, $matches);
    $data[$songId]['share'] = empty($matches[1]) ? 0 : $matches[1];
    // Comment count: <p class="wall_list_count"><span>920
    preg_match('/<p class="wall_list_count"><span>(\d+)<\/span>/Us', $content, $matches);
    $data[$songId]['comment_count'] = empty($matches[1]) ? 0 : $matches[1];

    // Save to the database here
    // print_r($data);
    $j++;
    usleep(3000);   // Pause 3 ms between requests (usleep takes microseconds)
}

Replies and discussion (solutions)

Try the Snoopy class.

Use Ruby or Go instead.

Are you kidding? Even if you do run it this way, at least run it from the command line....
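A minimal sketch of what running from the command line could look like. The helper names `isCli` and `parseArgs`, and the idea of passing the start ID and record count as arguments, are assumptions for illustration and are not from the thread; the defaults are the values used in the question.

```php
<?php
// Hypothetical helper: true when the script runs under the CLI SAPI.
function isCli(): bool
{
    return PHP_SAPI === 'cli';
}

// Hypothetical helper: read the start ID and record count from the
// argument list, falling back to the values used in the question.
function parseArgs(array $argv): array
{
    $start = isset($argv[1]) ? (int) $argv[1] : 200000;
    $count = isset($argv[2]) ? (int) $argv[2] : 1000;
    return [$start, $count];
}

if (isCli()) {
    // The CLI already defaults to no execution time limit;
    // setting it explicitly just documents the intent.
    set_time_limit(0);
    [$start, $count] = parseArgs($_SERVER['argv'] ?? []);
    fwrite(STDERR, "Would fetch $count songs starting at ID $start\n");
    // The scraping loop from the question would go here.
}
```

Invoked as, for example, `php caiji.php 200000 1000`, the script is not tied to a browser connection or to the web server's request timeout.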

The xiami.com server probably has limits in place that block scraping.
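If the server is limiting requests, one common response is to detect the refusal and wait progressively longer before retrying. The sketch below is only an illustration: the thread does not say what xiami.com actually returns when it blocks a client, so the status codes in `looksBlocked` are assumptions, and all three function names are hypothetical.

```php
<?php
// Hypothetical check: treat a failed transfer (code 0), 403, 429,
// any 5xx status, or an empty body as "probably blocked".
// These codes are an assumption, not observed xiami.com behavior.
function looksBlocked(int $httpCode, string $body): bool
{
    return $httpCode === 0 || $httpCode === 403 || $httpCode === 429
        || $httpCode >= 500 || $body === '';
}

// Exponential backoff: 1, 2, 4, 8, ... seconds, capped at $maxSeconds.
function backoffSeconds(int $attempt, int $maxSeconds = 60): int
{
    return (int) min($maxSeconds, 2 ** $attempt);
}

// Calls $fetch, which must return [httpCode, body], until the response
// no longer looks blocked or $maxAttempts is reached. $sleep can be
// swapped out so the logic can be exercised without real waiting.
function fetchWithRetry(callable $fetch, int $maxAttempts = 5, ?callable $sleep = null): ?string
{
    $sleep = $sleep ?? 'sleep';
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        [$httpCode, $body] = $fetch();
        if (!looksBlocked($httpCode, $body)) {
            return $body;
        }
        $sleep(backoffSeconds($attempt));
    }
    return null; // Gave up after $maxAttempts blocked responses
}
```

With cURL, `$fetch` could return `[curl_getinfo($ch, CURLINFO_HTTP_CODE), (string) curl_exec($ch)]`, with `curl_exec` called before `curl_getinfo`.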

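1. Have each URL request scrape only 10 to 20 records, then redirect and continue scraping. This also keeps the page from timing out. If you are running on a virtual host, occupying the CPU for a long time may get the process killed.

2. Ideally, change the User-Agent and cookies in the header on every URL request.

3. If that still does not work, try 火车头 (LocoySpider)!

4. If 火车头 cannot do it either, give up on this site!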
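Points 1 and 2 above could be sketched roughly as follows. The function names, the `batch` query parameter, and the second User-Agent string are assumptions for illustration; only the first User-Agent, the starting ID 200000 and the script name caiji.php come from the thread.

```php
<?php
// Hypothetical helper: the song IDs handled by one request.
// Batch 0 covers $startId .. $startId + $size - 1, batch 1 the next $size, and so on.
function batchIds(int $startId, int $batch, int $size = 20): array
{
    $first = $startId + $batch * $size;
    return range($first, $first + $size - 1);
}

// Hypothetical helper: cycle through a list of User-Agent strings.
function pickUserAgent(array $agents, int $requestNo): string
{
    return $agents[$requestNo % count($agents)];
}

// Hypothetical helper: the URL that continues with the next batch.
function nextBatchUrl(string $script, int $batch): string
{
    return $script . '?batch=' . ($batch + 1);
}

// Example User-Agent strings; the first is the one from the question,
// the second is an illustrative addition.
$agents = [
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1653.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0',
];
```

Each request would then loop over `batchIds(200000, $batch)`, send `'User-Agent: ' . pickUserAgent($agents, $requestNo)` as a header, and finish with `header('Location: ' . nextBatchUrl('caiji.php', $batch));` to hand over to the next batch.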

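Break the loop up so that the same page is executed repeatedly, one record per run.
Start the first run from a browser or a crontab job by requesting http://localhost/caiji.php?num=1. Each time a run finishes, increment $_GET['num'] by 1 and use curl to call the same script again. Once $_GET['num'] goes past 1000, exit and stop calling curl.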

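// The calling request below gives up after 2 seconds; without this,
// PHP may abort the script once the caller disconnects.
ignore_user_abort(true);

// Read the counter once; a missing or empty num means "do nothing"
$num = isset($_GET['num']) ? (int) $_GET['num'] : 0;

if ($num > 0) {
    $url = 'http://www.xiami.com/song/' . $num;
    // Your scraping code goes here
    $num++;
}

if ($num > 0 && $num < 1001) {
    // Call this same script again with the next number
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://localhost/caiji.php?num=' . $num);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($ch, CURLOPT_TIMEOUT, 2);
    curl_exec($ch);
    curl_close($ch);
} else {
    exit;
}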

