Home > php教程 > php手册 > PHP Cookbook读书笔记 – 第13章Web自动化

PHP Cookbook读书笔记 – 第13章Web自动化

WBOY
Release: 2016-06-06 19:40:53
Original
1278 people have browsed it

通过GET获得一个指定url的页面内容 有3种方式来获取一个URL的内容: PHP提供的文件函数file_get_contents() cURL扩展 PEAR中的HTTP_Request类 //方式1$page = file_get_contents('http://www.example.com/robots.txt');//方式2$c = curl_init('http://www.ex

PHP Cookbook读书笔记 – 第13章Web自动化通过GET获得一个指定url的页面内容

有3种方式来获取一个URL的内容:

  1. PHP提供的文件函数file_get_contents()
  2. cURL扩展
  3. PEAR中的HTTP_Request类
//方式1
$page = file_get_contents('http://www.example.com/robots.txt');

//方式2
$c = curl_init('http://www.example.com/robots.txt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($c);
curl_close($c);

//方式3
require_once 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/robots.txt');
$r->sendRequest();
$page = $r->getResponseBody();
Copy after login

可以通过这些方式来获取XML文档,通过结合http_build_query()来建立一个查询字符串,可以通过url中加入username@password的形式来访问受保护的页面,通过cURL和PEAR的HTTP_Client类来跟踪重定向。

通过POST获得一个URL

让PHP模拟发送一个POST请求并获得服务器的反馈内容

//1
$url = 'http://www.example.com/submit.php';
$body = 'monkey=uncle&rhino=aunt';
$options = array('method' => 'POST', 'content' => $body);
$context = stream_context_create(array('http' => $options));
print file_get_contents($url, false, $context);

//2
$url = 'http://www.example.com/submit.php';
$body = 'monkey=uncle&rhino=aunt';
$c = curl_init($url);
curl_setopt($c, CURLOPT_POST, true);
curl_setopt($c, CURLOPT_POSTFIELDS, $body);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($c);
curl_close($c);

//3
require 'HTTP/Request.php';
$url = 'http://www.example.com/submit.php';
$r = new HTTP_Request($url);
$r->setMethod(HTTP_REQUEST_METHOD_POST);
$r->addPostData('monkey','uncle');
$r->addPostData('rhino','aunt');
$r->sendRequest();
$page = $r->getResponseBody();
Copy after login

通过Cookie获得一个URL

//2
$c = curl_init('http://www.example.com/needs-cookies.php');
curl_setopt($c, CURLOPT_COOKIE, 'user=ellen; activity=swimming');
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($c);
curl_close($c);

//3
require 'HTTP/Request.php';
$r = new HTTP_Request('http://www.example.com/needs-cookies.php');
$r->addHeader('Cookie','user=ellen; activity=swimming');
$r->sendRequest();
$page = $r->getResponseBody();
Copy after login

通过Header获得一个URL

通过修改header中的信息可以来伪造 Referer 或 User-Agent 后请求目标URL,不少防盗链网站经常会采用判断Referer中的信息来源决定是否允许下载或访问资源。需要具备一些HTTP的HEADER背景知识。

标记网页

其实这个代码经过简单修改还可以应用到替换网页中的敏感关键字,这在天朝是很有用的一个功能

$body = '
Copy after login

I like pickles and herring.

<img  src="/inc/test.jsp?url=http%3A%2F%2Fwww.cnblogs.com%2FExcellent%2Fadmin%2Fpickle.jpg&refer=http%3A%2F%2Fwww.cnblogs.com%2FExcellent%2Farchive%2F2011%2F11%2F25%2F2262978.html" alt="PHP Cookbook读书笔记 – 第13章Web自动化" >A pickle picture

I have a herringbone-patterned toaster cozy.

Herring is not a real HTML element!
';

$words = array('pickle','herring');
$patterns = array();
$replacements = array();
foreach ($words as $i => $word) {
    $patterns[] = '/' . preg_quote($word) .'/i';
    $replacements[] = "<span>\\0</span>";
}

// Split up the page into chunks delimited by a
// reasonable approximation of what an HTML element
// looks like.
$parts = preg_split("{(])*>)}",
                    $body,
                    -1,  // Unlimited number of chunks
                    PREG_SPLIT_DELIM_CAPTURE);
foreach ($parts as $i => $part) {
    // Skip if this part is an HTML element
    if (isset($part[0]) && ($part[0] == 's
    $parts[$i] = preg_replace($patterns, $replacements, $part);
}

// Reconstruct the body
$body = implode('',$parts);

print $body;
Copy after login

提取页面所有链接

也是一个很不错的功能,在做采集之类的程序时可以用的上

采用了tidy扩展的实现方式:

$doc = new DOMDocument();
$opts = array('output-xml' => true,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
$doc->loadXML(tidy_repair_file('linklist.html',$opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a/@href') as $node) {
    $link = $node->nodeValue;
    print $link . "\n";
Copy after login

通过正则提取链接:

$html = file_get_contents('linklist.html');
$links = pc_link_extractor($html);
foreach ($links as $link) {
    print $link[0] . "\n";
}

function pc_link_extractor($html) {
    $links = array();
    preg_match_all('/]*)[\"\']?[^>]*>(.*?)/i', $html,$matches,PREG_SET_ORDER); foreach($matches as $match) { $links[] = array($match[1],$match[2]); } return $links;
Copy after login

将文本转换为HTML

bbcode的概念和这个很像,所以将这个贴出来

function pc_text2html($s) {
  $s = htmlentities($s);
  $grafs = split("\n\n",$s);
  for ($i = 0, $j = count($grafs); $i 
<p>'.$grafs[$i].'</p>
<pre class="brush:php;toolbar:false">';  }  return implode("\n\n",$grafs);}
Copy after login

将HTML转换为文本

已经有现成的代码来实现这个功能http://www.chuggnutt.com/html2text.php

用这个函数strip_tags( ) 可以

Related labels:
source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Recommendations
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template