(Repost) A collection of ways to fetch web page content with PHP
①. Fetching web page content with PHP
http://hi.baidu.com/quqiufeng/blog/item/7e86fb3f40b598c67d1e7150.html
<?php
header("Content-type: text/html; charset=utf-8");
?>
1. Using the MSXML2.XMLHTTP COM object (Windows only)
<?php
$xhr = new COM("MSXML2.XMLHTTP");
// The third argument, false, makes the request synchronous
$xhr->open("GET", "http://localhost/xxx.php?id=2", false);
$xhr->send();
echo $xhr->responseText;
?>
2. Using file_get_contents()
<?php
$url = "http://www.blogjava.net/pts";
echo file_get_contents($url);
?>
3. Using fopen()
<?php
if ($stream = fopen('http://www.sohu.com', 'r')) {
    // print all the page starting at the offset 10
    echo stream_get_contents($stream, -1, 10);
    fclose($stream);
}
if ($stream = fopen('http://www.sohu.net', 'r')) {
    // print the first 5 bytes
    echo stream_get_contents($stream, 5);
    fclose($stream);
}
?>
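The file_get_contents() call above has no timeout and no failure check. A minimal sketch of both, assuming allow_url_fopen is enabled for remote URLs; the helper name fetch_page and the user agent string are made up for illustration:

```php
<?php
// Hypothetical helper: read a URL (or a local path) with a timeout
// and an explicit failure check.
function fetch_page(string $url, float $timeoutSeconds = 10.0): string
{
    // The "http" options apply to http:// and https:// URLs only;
    // other wrappers, such as local files, ignore them.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'example-fetcher/0.1', // assumed value
        ],
    ]);

    // file_get_contents() returns false on failure, so compare with ===
    // (an empty page is a valid result and must not count as an error).
    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Could not read: $url");
    }
    return $body;
}
```

It could be called as echo fetch_page('http://www.blogjava.net/pts'); wrapped in a try/catch for RuntimeException.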
②. Fetching web page content with PHP
http://www.blogjava.net/pts/archive/2007/08/26/99188.html
The simple way:
<?php
$url = "http://www.blogjava.net/pts";
echo file_get_contents($url);
?>
Or:
<?php
if ($stream = fopen('http://www.sohu.com', 'r')) {
    // print all the page starting at the offset 10
    echo stream_get_contents($stream, -1, 10);
    fclose($stream);
}
if ($stream = fopen('http://www.sohu.net', 'r')) {
    // print the first 5 bytes
    echo stream_get_contents($stream, 5);
    fclose($stream);
}
?>
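The second and third arguments of stream_get_contents() are a maximum length and a starting offset. A small sketch of what they do, using an in-memory stream (php://memory) in place of a remote page so that it runs without network access; the sample text is made up:

```php
<?php
// Build an in-memory stream so the example needs no network access.
$stream = fopen('php://memory', 'r+');
fwrite($stream, 'Hello, stream world!');
rewind($stream);

// A length of 5 reads at most 5 bytes from the current position.
$firstFive = stream_get_contents($stream, 5);

// A length of -1 means "read to the end"; offset 7 seeks to byte 7 first.
$fromSeven = stream_get_contents($stream, -1, 7);

fclose($stream);

echo $firstFive, "\n"; // Hello
echo $fromSeven, "\n"; // stream world!
```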
③. PHP source for fetching a site's content and saving it as a TXT file
http://blog.chinaunix.net/u1/44325/showart_348444.html
<?php
// ereg() was removed in PHP 7, so preg_match() is used instead.
$my_book_url = 'http://book.yunxiaoge.com/files/article/html/4/4550/index.html';
preg_match('#http://book\.yunxiaoge\.com/files/article/html/[0-9]+/[0-9]+/#', $my_book_url, $myBook);
$my_book_txt = $myBook[0]; // base URL of the book directory
$file_handle = fopen($my_book_url, "r"); // open the index page
if (file_exists("test.txt")) {
    unlink("test.txt"); // start from an empty output file
}
$handle = fopen("test.txt", 'a'); // open the output file once, not on every line
while (!feof($file_handle)) { // loop to the end of the index page
    $line = fgets($file_handle); // read one line
    // find links to the book's chapter pages
    if (preg_match('/href="[0-9]+\.html/', $line, $reg)) {
        $my_book_txt_url = str_replace('href="', '', $reg[0]); // keep only the file name
        $my_book_txt_over_url = $my_book_txt . $my_book_txt_url; // build the URL to fetch
        echo "$my_book_txt_over_url<br>\n"; // line ending assumed; the source is cut off here
        $file_handle_txt = fopen($my_book_txt_over_url, "r"); // open the chapter page
        while (!feof($file_handle_txt)) {
            $line_txt = fgets($file_handle_txt);
            // body lines are recognised by their leading marker
            if (preg_match('/^ .+/', $line_txt, $reg)) {
                $my_over_txt = $reg[0];
                $my_over_txt = str_replace(" ", "    ", $my_over_txt); // filter characters
                // tag name assumed; the source shows only a bare line break here
                $my_over_txt = str_replace("<br />", "", $my_over_txt);
                // (one further str_replace() call is truncated in the source)
                $my_over_txt = str_replace('"', "", $my_over_txt);
                fwrite($handle, "$my_over_txt\n"); // write to the file
            }
        }
        fclose($file_handle_txt); // close each chapter page after reading it
    }
}
fclose($handle);
fclose($file_handle); // close the index page
echo "Done";
?>
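The link-collection step in the script above can also be done in one pass over the whole index page with preg_match_all() rather than line by line. A rough sketch, not the original author's method; the helper name extract_chapter_urls and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: collect chapter links such as href="123.html"
// from an index page and turn them into absolute URLs.
function extract_chapter_urls(string $indexHtml, string $baseUrl): array
{
    // Capture only the numeric file name, e.g. "123.html".
    preg_match_all('/href="([0-9]+\.html)"/', $indexHtml, $matches);

    $urls = [];
    foreach (array_unique($matches[1]) as $file) {
        $urls[] = $baseUrl . $file;
    }
    return $urls;
}

// Sample markup, made up for illustration.
$html = '<a href="1.html">One</a> <a href="2.html">Two</a> <a href="about.html">About</a>';
$chapterUrls = extract_chapter_urls($html, 'http://book.yunxiaoge.com/files/article/html/4/4550/');
print_r($chapterUrls);
```

Only the two numeric chapter pages are returned; about.html does not match the pattern.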
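Below is a bolder approach.
It uses a class called Snoopy.
I first saw it here:
Snoopy, a package for fetching web page content in PHP
http://blog.declab.com/read.php/27.htm
Then there is the Snoopy project site:
http://sourceforge.net/projects/snoopy/
Some brief notes are here:
Code collection: the Snoopy class and simple usage
http://blog.passport86.com/?p=161
Download: http://sourceforge.net/projects/snoopy/
I only discovered this gem today and downloaded it right away to take a look; it uses parse_url.
I am still more used to curl.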
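Snoopy is a PHP class that imitates a web browser. It can fetch web page content and submit forms.
Some of its features:
1. Easily fetch the contents of a web page
2. Easily fetch the text of a web page (with the HTML stripped)
3. Easily fetch the links from a web page
4. Supports proxy hosts
5. Supports basic user/password authentication
6. Supports custom user agent, referer, cookies and header content
7. Supports browser redirects, with a controllable redirect depth
8. Expands links in the page to fully qualified URLs (default)
9. Easily submit data and retrieve the results
10. Supports following HTML frames (added in v0.92)
11. Supports passing cookies on redirects
See the documentation in the download for details.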
<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
// fetchform() returns only the form elements of the page
$snoopy->fetchform("http://www.phpx.com/happy/logging.php?action=login");
print $snoopy->results;
?>
<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;
$submit_url = "http://www.phpx.com/happy/logging.php?action=login";
$submit_vars["loginmode"] = "normal";
$submit_vars["styleid"] = "1";
$submit_vars["cookietime"] = "315360000";
$submit_vars["loginfield"] = "username";
$submit_vars["username"] = "********"; // your username
$submit_vars["password"] = "*******"; // your password
$submit_vars["questionid"] = "0";
$submit_vars["answer"] = "";
$submit_vars["loginsubmit"] = "提 交";
$snoopy->submit($submit_url, $submit_vars);
print $snoopy->results;
?>
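For readers who, like the author, are more used to curl, a rough counterpart of the fetch and submit calls above using PHP's cURL extension might look like this. It is a sketch under assumptions, not part of Snoopy: the helper name curl_request, the user agent string and the redirect limit of 5 are chosen for illustration.

```php
<?php
// Hypothetical helper: GET a URL, or POST form fields to it, with cURL.
function curl_request(string $url, ?array $postFields = null, int $timeout = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects ...
        CURLOPT_MAXREDIRS      => 5,    // ... but at most five of them
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_USERAGENT      => 'example-fetcher/0.1', // assumed value
    ]);

    if ($postFields !== null) {
        // Send the fields as application/x-www-form-urlencoded.
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }
    return $body;
}
```

A login form like the one above would then be posted with curl_request($submit_url, $submit_vars). Keeping cookies across requests would additionally need CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE, which this sketch leaves out.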
Below is the Snoopy README.
NAME:
    Snoopy - the PHP net client v1.2.4

SYNOPSIS:
    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->fetchtext("http://www.php.net/");
    print $snoopy->results;

    $snoopy->fetchlinks("http://www.phpbuilder.com/");
    print $snoopy->results;

    $submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";

    $submit_vars["q"] = "amiga";
    $submit_vars["submit"] = "Search!";
    $submit_vars["searchhost"] = "Altavista";

    $snoopy->submit($submit_url,$submit_vars);
    print $snoopy->results;

    $snoopy->maxframes=5;
    $snoopy->fetch("http://www.ispi.net/");
    echo "<PRE>\n";
    echo htmlentities($snoopy->results[0]);
    echo htmlentities($snoopy->results[1]);
    echo htmlentities($snoopy->results[2]);
    echo "</PRE>\n";
    $snoopy->fetchform("http://www.altavista.com");
    print $snoopy->results;
DESCRIPTION:
    What is Snoopy?

    Snoopy is a PHP class that simulates a web browser. It automates the
    task of retrieving web page content and posting forms, for example.
    Some of Snoopy's features:

    * easily fetch the contents of a web page
    * easily fetch the text from a web page (strip html tags)
    * easily fetch the links from a web page
    * supports proxy hosts
    * supports basic user/pass authentication
    * supports setting user_agent, referer, cookies and header content
    * supports browser redirects, and controlled depth of redirects
    * expands fetched links to fully qualified URLs (default)
    * easily submit form data and retrieve the results
    * supports following html frames (added v0.92)
    * supports passing cookies on redirects (added v0.92)

REQUIREMENTS:
    Snoopy requires PHP with PCRE (Perl Compatible Regular Expressions),
    which should be PHP 3.0.9 and up. For read timeout support, it requires
    PHP 4 Beta 4 or later. Snoopy was developed and tested with PHP 3.0.12.
CLASS METHODS:
    fetch($URI)
    -----------

    This is the method used for fetching the contents of a web page.
    $URI is the fully qualified URL of the page to fetch.
    The results of the fetch are stored in $this->results.
    If you are fetching frames, then $this->results
    contains each frame fetched in an array.

    fetchtext($URI)
    ---------------

    This behaves exactly like fetch() except that it only returns
    the text from the page, stripping out html tags and other
    irrelevant data.

    fetchform($URI)
    ---------------

    This behaves exactly like fetch() except that it only returns
    the form elements from the page, stripping out html tags and other
    irrelevant data.

    fetchlinks($URI)
    ----------------

    This behaves exactly like fetch() except that it only returns
    the links from the page. By default, relative links are
    converted to their fully qualified URL form.

    submit($URI,$formvars)
    ----------------------

    This submits a form to the specified $URI. $formvars is an
    array of the form variables to pass.

    submittext($URI,$formvars)
    --------------------------

    This behaves exactly like submit() except that it only returns
    the text from the page, stripping out html tags and other
    irrelevant data.

    submitlinks($URI)
    -----------------

    This behaves exactly like submit() except that it only returns
    the links from the page. By default, relative links are
    converted to their fully qualified URL form.
CLASS VARIABLES:    (default value in parentheses)
    $host           the host to connect to
    $port           the port to connect to
    $proxy_host     the proxy host to use, if any
    $proxy_port     the proxy port to use, if any
    $agent          the user agent to masquerade as (Snoopy v0.1)
    $referer        referer information to pass, if any
    $cookies        cookies to pass, if any
    $rawheaders     other header info to pass, if any
    $maxredirs      maximum redirects to allow. 0=none allowed. (5)
    $offsiteok      whether or not to allow redirects off-site. (true)
    $expandlinks    whether or not to expand links to fully qualified URLs (true)
    $user           authentication username, if any
    $pass           authentication password, if any
    $accept         http accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
    $error          where errors are sent, if any
    $response_code  response code returned from server
    $headers        headers returned from server
    $maxlength      max return data length
    $read_timeout   timeout on read operations (requires PHP 4 Beta 4+)
                    set to 0 to disallow timeouts
    $timed_out      true if a read operation timed out (requires PHP 4 Beta 4+)
    $maxframes      number of frames we will follow
    $status         http status of fetch
    $temp_dir       temp directory that the webserver can write to. (/tmp)
    $curl_path      system path to cURL binary, set to false if none

EXAMPLES:
    Example:    fetch a web page and display the return headers and
                the contents of the page (html-escaped):

    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->user = "joe";
    $snoopy->pass = "bloe";

    if($snoopy->fetch("http://www.slashdot.org/"))
    {
        echo "response code: ".$snoopy->response_code."<br>\n";
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<p>\n";

        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

    Example:    submit a form and print out the result headers
                and html-escaped page:

    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $submit_url = "http://lnk.ispi.net/texis/scripts/msearch/netsearch.html";

    $submit_vars["q"] = "amiga";
    $submit_vars["submit"] = "Search!";
    $submit_vars["searchhost"] = "Altavista";

    if($snoopy->submit($submit_url,$submit_vars))
    {
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<p>\n";

        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

    Example:    showing functionality of all the variables:

    include "Snoopy.class.php";
    $snoopy = new Snoopy;
    $snoopy->proxy_host = "my.proxy.host";
    $snoopy->proxy_port = "8080";

    $snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
    $snoopy->referer = "http://www.microsnot.com/";

    $snoopy->cookies["SessionID"] = "238472834723489";
    $snoopy->cookies["favoriteColor"] = "RED";

    $snoopy->rawheaders["Pragma"] = "no-cache";

    $snoopy->maxredirs = 2;
    $snoopy->offsiteok = false;
    $snoopy->expandlinks = false;

    $snoopy->user = "joe";
    $snoopy->pass = "bloe";

    if($snoopy->fetchtext("http://www.phpbuilder.com"))
    {
        while(list($key,$val) = each($snoopy->headers))
            echo $key.": ".$val."<br>\n";
        echo "<p>\n";

        echo "<PRE>".htmlspecialchars($snoopy->results)."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";

    Example:    fetch framed content and display the results

    include "Snoopy.class.php";
    $snoopy = new Snoopy;

    $snoopy->maxframes = 5;

    if($snoopy->fetch("http://www.ispi.net/"))
    {
        echo "<PRE>".htmlspecialchars($snoopy->results[0])."</PRE>\n";
        echo "<PRE>".htmlspecialchars($snoopy->results[1])."</PRE>\n";
        echo "<PRE>".htmlspecialchars($snoopy->results[2])."</PRE>\n";
    }
    else
        echo "error fetching document: ".$snoopy->error."\n";
<?php
// Collect every content URL and save them to a file
function get_index($save_file, $prefix = "index_")
{
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open " . $save_file . " failed");
    while ($i < $count) {
        $url = $prefix . $i . ".htm";
        echo "Get " . $url . "...";
        // get_content_url() takes ($host_url, $file_contents). Passing the page's
        // own directory as the host URL is an assumption; the source omits it.
        $url_str = get_content_url(dirname($url) . "/", get_url($url));
        echo " OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}

// Fetch the target multimedia objects
function get_object($url_file, $save_file, $split = "|--:**:--|")
{
    if (!file_exists($url_file)) die($url_file . " not exist");
    $file_arr = file($url_file);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file . " not content");
    $url_arr = array_unique($file_arr);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file " . $save_file . " failed");
    foreach ($url_arr as $url) {
        $url = trim($url); // file() keeps the trailing newline
        if (empty($url)) continue;
        echo "Get " . $url . "...";
        $html_str = get_url($url);
        // Debug leftovers in the source; they stop the script after the first URL:
        // echo $html_str; echo $url; exit;
        $obj_str = get_content_object($html_str, $split);
        echo " OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}

// Walk a directory and process the content of each file
function get_dir($save_file, $dir)
{
    $dp = opendir($dir);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file " . $save_file . " failed");
    while (($file = readdir($dp)) !== false) {
        if ($file != "." && $file != "..") {
            echo "Read file " . $file . "...";
            $file_content = file_get_contents($dir . $file);
            $obj_str = get_content_object($file_content);
            echo " OK\n";
            fwrite($fp, $obj_str);
        }
    }
    closedir($dp);
    fclose($fp);
}

// Fetch the content of the given URL
function get_url($url)
{
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url . " invalid");
    $fp = fopen($url, "r") or die("Open url: " . $url . " failed.");
    $content = "";
    while (!feof($fp)) {
        $content .= fread($fp, 8192);
    }
    fclose($fp);
    if (empty($content)) {
        die("Get url: " . $url . " content failed.");
    }
    return $content;
}

// Fetch the given page over a raw socket
function get_content_by_socket($url, $host)
{
    $fp = fsockopen($host, 80) or die("Open " . $url . " failed");
    $header = "GET /" . $url . " HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    $header .= "Accept-Encoding: gzip, deflate\r\n";
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: " . $host . "\r\n";
    $header .= "Connection: Keep-Alive\r\n";
    //$header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n\r\n";
    $header .= "Connection: Close\r\n\r\n";
    fwrite($fp, $header);
    $contents = "";
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}

// Extract the URLs found in the given content
function get_content_url($host_url, $file_contents)
{
    //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    $rex = '/([hH][rR][eE][Ff])\s*=\s*[\'"]*([^>\'"\s]+)["\'>]*\s*/i';
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all($rex, $file_contents, $r);
    $result = ""; //array();
    foreach ($r as $c) {
        if (is_array($c)) {
            foreach ($c as $d) {
                if (preg_match($reg, $d)) {
                    $result .= $host_url . $d . "\n";
                }
            }
        }
    }
    return $result;
}

// Extract the multimedia file referenced in the given content
function get_content_object($str, $split = "|--:**:--|")
{
    $regx = '/href\s*=\s*[\'"]*([^>\'"\s]+)["\'>]*\s*(<b>.*?<\/b>)/i';
    preg_match_all($regx, $str, $result);
    if (count($result) == 3 && !empty($result[1])) {
        $result[2] = str_replace("<b>多媒体: ", "", $result[2]);
        $result[2] = str_replace("</b>", "", $result[2]);
        return $result[1][0] . $split . $result[2][0] . "\n";
    }
    return ""; // nothing matched
}
?>
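The socket function above returns the raw response, status line and headers included. A small sketch of the two halves of that exchange, building the request text and splitting the reply, assuming a plain (not chunked, not compressed) response; the function names and the canned response are made up for illustration:

```php
<?php
// Build a minimal HTTP/1.1 GET request. Every header line ends with CRLF
// and an empty line terminates the header block.
function build_get_request(string $host, string $path): string
{
    return "GET /" . ltrim($path, '/') . " HTTP/1.1\r\n"
         . "Host: $host\r\n"
         . "Accept: */*\r\n"
         . "Connection: close\r\n"
         . "\r\n";
}

// Split a raw response into its status line, header lines and body.
function split_response(string $raw): array
{
    // The first empty line separates the headers from the body.
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $status = array_shift($headerLines); // first line is the status line

    return [
        'status'  => $status,
        'headers' => $headerLines,
        'body'    => $parts[1] ?? '',
    ];
}

// Canned response, made up for illustration (no network access needed).
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>";
$response = split_response($raw);
echo $response['status'], "\n"; // HTTP/1.1 200 OK
echo $response['body'], "\n";   // <html>hi</html>
```

Asking for gzip in Accept-Encoding, as the function above does, means the body may arrive compressed, and HTTP/1.1 servers may also reply with chunked transfer encoding; this sketch decodes neither.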
Fetching a specific div block and images from a web page with PHP
(2009-06-05 09:56:23)
Tags: php, scraping, images, it
Category: PHP
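1. Get all images on a given page:
<?php
// Fetch the content at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/');
// Collect every img tag into the two-dimensional array $match
preg_match_all('/<img[^>]*>/Ui', $text, $match);
// Print $match
print_r($match);
?>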
-----------------
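2. Get the first image on a given page:
<?php
// Fetch the content at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/');
// Take the first img tag and store it in the array $match (the regex means the same as above)
preg_match('/<img[^>]*>/Ui', $text, $match);
// Print $match
print_r($match);
?>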
------------------------------------
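3. Get a specific div block on a given page (identified by its id):
<?php
// Fetch the content at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');
// Strip line breaks and whitespace (only needed when serializing the content)
//$text = str_replace(array("\r", "\n", "\t", " "), '', $text);
// Take the div tag whose id is PostContent and store it in the array $match.
// The pattern is lost in the source; this one is reconstructed from the comment
// and stops at the first closing </div>, so nested divs are cut short.
preg_match('/<div[^>]*id="PostContent"[^>]*>(.*?)<\/div>/si', $text, $match);
// Print $match[0]
print($match[0]);
?>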
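Regexes like the ones in these examples stop at the first closing tag and are sensitive to attribute order. As an alternative sketch (not the original author's method), PHP's DOM extension can select the block by id and list the images inside it. The helper name images_in_element is made up, and the id is assumed to contain no quote characters:

```php
<?php
// Hypothetical helper: return the src of every <img> inside the element
// with the given id, using the DOM extension instead of a regex.
function images_in_element(string $html, string $id): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $srcs = [];
    // "//*[@id='...']//img" matches img elements at any depth below the block.
    foreach ($xpath->query("//*[@id='$id']//img") as $img) {
        $srcs[] = $img->getAttribute('src');
    }
    return $srcs;
}
```

Calling images_in_element($text, 'PostContent') on the page fetched in example 3 would return the image addresses within that block, including those inside nested divs.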
-------------------------------------------
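4. Examples 2 and 3 combined:
<?php
// Fetch the content at the given address and store it in $text
$text = file_get_contents('http://andy.diimii.com/2009/01/seo%e5%8c%96%e7%9a%84%e9%97%9c%e9%8d%b5%e5%ad%97%e5%bb%a3%e5%91%8a%e9%80%a3%e7%b5%90/');
// Take the div tag whose id is PostContent and store it in the array $match
// (pattern reconstructed from the comment; the source is cut off here)
preg_match('/<div[^>]*id="PostContent"[^>]*>(.*?)<\/div>/si', $text, $match);
// Take the first img tag inside that block and store it in the array $match2
// (call reconstructed from the comment; the source ends at "preg_")
preg_match('/<img[^>]*>/Ui', $match[0], $match2);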
