curl data collection series: single-page collection function get_html

Jul 21, 2016 pm 02:52 PM

When collecting data, we often use curl together with regular expressions to extract what we need. Based on my own work experience, I will share some of the custom functions I commonly use here on Cnblogs. If anything in my writing is wrong, please let me know.

This is a series that cannot be finished in a day or two, so I will publish it one part at a time.

Rough outline:

1. curl data collection series: single-page collection function get_html

2. curl data collection series: multi-page parallel collection function get_htmls

3. curl data collection series: regular expression processing function get_matches

4. curl data collection series: code separation

5. curl data collection series: parallel logic control function web_spider

...

Single-page collection is the most commonly used operation in data collection. Sometimes, when the target server restricts access, it is the only method that can be used. It is slow but easy to control, so having a reusable curl function to call is very important.

Baidu and NetEase are sites everyone knows, so we will use their home pages as examples.

The simplest way to write it:

$url = 'http://www.baidu.com';
$ch = curl_init($url);
// Return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Give up after 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$html = curl_exec($ch);
// curl_exec returns false on failure
if ($html !== false) {
    echo $html;
}

Because this is used so often, it is worth wrapping it in a function, with curl_setopt_array applying the options:

function get_html($url, $options = array()) {
    // Always return the response as a string and time out after 5 seconds
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $html = curl_exec($ch);
    curl_close($ch);
    // curl_exec returns false on failure; pass that through to the caller
    if ($html === false) {
        return false;
    }
    return $html;
}

$url = 'http://www.baidu.com';
echo get_html($url);
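
One thing to be aware of: get_html assigns CURLOPT_RETURNTRANSFER and CURLOPT_TIMEOUT after it receives $options, so a caller cannot change the 5-second timeout. If that flexibility is wanted, one possible variant treats the two values as defaults instead. This is only a sketch, not the author's code, and the names get_html_options and get_html_with_defaults are made up for illustration:

```php
<?php
// Hypothetical helper (not from the original article): merge caller options
// with defaults so that caller-supplied values take precedence.
function get_html_options(array $options = array()) {
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    );
    // The + operator keeps the left-hand value when a key exists in both arrays.
    return $options + $defaults;
}

// Hypothetical variant of get_html that uses the helper above.
function get_html_with_defaults($url, array $options = array()) {
    $ch = curl_init($url);
    curl_setopt_array($ch, get_html_options($options));
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // false on failure, the response body otherwise
}
```

With this variant, passing array(CURLOPT_TIMEOUT => 10) would raise the timeout to 10 seconds, while calls that pass nothing keep the same defaults as before.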

Sometimes you need to pass specific options to get the correct page. For example, suppose we now want to fetch the NetEase home page:

$url = 'http://www.163.com';
echo get_html($url);

You will see a blank page with nothing on it. Let's write a function around curl_getinfo to see what is going on:

function get_info($url, $options = array()) {
    $options[CURLOPT_RETURNTRANSFER] = true;
    $options[CURLOPT_TIMEOUT] = 5;
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    // The body is not needed here; the request is run only to collect transfer details
    curl_exec($ch);
    // curl_getinfo returns details about the transfer, including http_code
    $info = curl_getinfo($ch);
    curl_close($ch);
    return $info;
}

$url = 'http://www.163.com';
var_dump(get_info($url));
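
The array returned by get_info can also be checked in code rather than read by eye. The following is a minimal sketch under a few assumptions: the helper name describe_info is made up for illustration, and it relies on the http_code and redirect_url keys that PHP's curl_getinfo reports (redirect_url is filled in when redirects are not being followed):

```php
<?php
// Hypothetical helper (not from the original article): summarise the array
// returned by curl_getinfo() as a short status string.
function describe_info(array $info) {
    $code = isset($info['http_code']) ? (int) $info['http_code'] : 0;
    if ($code >= 300 && $code < 400) {
        // A 3xx status means the server wants us to go somewhere else.
        $target = !empty($info['redirect_url']) ? $info['redirect_url'] : '(unknown)';
        return "redirect ($code) to $target";
    }
    if ($code >= 200 && $code < 300) {
        return "ok ($code)";
    }
    // 0 usually means no HTTP response was received at all.
    return "unexpected status ($code)";
}

// Usage with the article's get_info:
// echo describe_info(get_info('http://www.163.com'));
```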

The output shows that http_code is 302, which means the request was redirected. In this case we need to pass an option that tells curl to follow the redirect:

$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
echo get_html($url, $options);

You will notice that this page is different from the one we see in a desktop browser. Why?

It seems the options are still not enough for the server to determine what kind of device the client is, so it returns a generic version of the page.

Looks like we also have to send a User-Agent with CURLOPT_USERAGENT:

$url = 'http://www.163.com';
$options[CURLOPT_FOLLOWLOCATION] = true;
$options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
echo get_html($url, $options);

OK, now the page comes out as expected. Through its $options parameter, the get_html function can be extended to cover cases like this.
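
If the same redirect and User-Agent options are needed for many URLs, they could be collected in a small preset instead of being repeated at every call. This is just one possible arrangement: the function name browser_options and the redirect limit of 5 are assumptions for illustration, while CURLOPT_MAXREDIRS is a standard curl option that caps how many redirects are followed:

```php
<?php
// Hypothetical preset (not from the original article): options for fetching
// a page the way a desktop browser would.
function browser_options(
    $userAgent = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0',
    $maxRedirects = 5
) {
    return array(
        CURLOPT_FOLLOWLOCATION => true,          // follow 3xx redirects
        CURLOPT_MAXREDIRS      => $maxRedirects, // but stop after this many hops
        CURLOPT_USERAGENT      => $userAgent,    // identify as a desktop browser
    );
}

// Usage with the article's get_html:
// echo get_html('http://www.163.com', browser_options());
```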

Of course, there are other ways to achieve this. When you already know the exact URL of the NetEase home page, you can simply request it directly:

$url = 'http://www.163.com/index.html';
echo get_html($url);

This also collects the page normally.

That's all for today. Bye-bye!
