When collecting data, we often use curl plus regular expressions to fetch and extract the data we need. Based on my own work experience, I will share some commonly used custom functions that I wrote, here on Blog Garden. If anything in my writing is wrong, please point it out.
This is a series, and it cannot be finished in a day or two, so I will publish the parts one at a time.
Rough outline:
1. curl data collection series: single-page collection function get_html
2. curl data collection series: multi-page parallel collection function get_htmls
3. curl data collection series: regular expression processing function get_matches
4. curl data collection series: code separation
5. curl data collection series: parallel logic control function web_spider
...
Single-page collection is the most commonly used function in data collection. Sometimes, when the server restricts access, it is the only method that can be used. It is slow but easy to control, so it is important to have a good general-purpose curl wrapper function.
Most of us are familiar with Baidu and NetEase, so we will use the homepages of these two sites as examples.
The simplest way to write it:
    $url = 'http://www.baidu.com';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);           // give up after 5 seconds
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html !== false) {
        echo $html;
    }
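When curl_exec returns false, the check above simply prints nothing, which makes failures hard to diagnose. Here is a minimal sketch of how the reason could be surfaced with curl_errno and curl_error. The function name fetch_with_error and the shape of the returned array are my own choices for illustration, not part of the original code:

```php
<?php
// Hypothetical helper: fetch a page and keep the curl error details on failure.
function fetch_with_error($url, $timeout = 5) {
    $ch = curl_init($url);
    if ($ch === false) {
        // curl_init itself can fail; report that in the same array shape
        return array('html' => false, 'errno' => -1, 'error' => 'curl_init failed');
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $html = curl_exec($ch);
    $result = array(
        'html'  => $html,            // page content, or false on failure
        'errno' => curl_errno($ch),  // 0 when no error occurred
        'error' => curl_error($ch),  // human-readable message, '' when no error
    );
    curl_close($ch);
    return $result;
}
```

A caller could then test $result['html'] === false and print $result['errno'] and $result['error'] instead of showing an empty page.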
Because this is used so often, it is worth wrapping in a function, with curl_setopt_array setting all the options at once:
    function get_html($url, $options = array()) {
        // These two options are always set, overriding any value passed in by the caller
        $options[CURLOPT_RETURNTRANSFER] = true; // return the page as a string
        $options[CURLOPT_TIMEOUT] = 5;           // give up after 5 seconds
        $ch = curl_init($url);
        curl_setopt_array($ch, $options);
        $html = curl_exec($ch);
        curl_close($ch);
        // curl_exec returns false on failure, otherwise the page content
        return $html;
    }

    $url = 'http://www.baidu.com';
    echo get_html($url);
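One thing to note is that get_html always forces the timeout to 5 seconds, even if the caller passes a different value. If caller-supplied options should win instead, one possible variation is to merge them over a set of defaults with the array union operator. The names build_curl_options and get_html_with_defaults below are hypothetical, and this is a sketch of an alternative design rather than the function used in this series:

```php
<?php
// Hypothetical helper: caller options take priority, defaults fill the gaps.
function build_curl_options($options = array()) {
    $defaults = array(
        CURLOPT_RETURNTRANSFER => true, // return the page as a string
        CURLOPT_TIMEOUT        => 5,    // default timeout in seconds
    );
    // With "+", keys already present on the left are kept, so the caller wins.
    // Unlike array_merge, "+" also preserves the integer CURLOPT_* keys.
    return $options + $defaults;
}

// Variant of get_html (assumed name) that uses the merged options.
function get_html_with_defaults($url, $options = array()) {
    $ch = curl_init($url);
    curl_setopt_array($ch, build_curl_options($options));
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // false on failure, page content otherwise
}
```

With this variation, passing array(CURLOPT_TIMEOUT => 10) would raise the timeout to 10 seconds, while callers that pass nothing still get the 5-second default.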
Sometimes extra options are needed to get the correct page. For example, try fetching the NetEase homepage:
    $url = 'http://www.163.com';
    echo get_html($url);
You will see a blank page with nothing in it. Let's write a function around curl_getinfo to see what happened:
    function get_info($url, $options = array()) {
        $options[CURLOPT_RETURNTRANSFER] = true;
        $options[CURLOPT_TIMEOUT] = 5;
        $ch = curl_init($url);
        curl_setopt_array($ch, $options);
        curl_exec($ch);            // run the request; only the transfer info is needed
        $info = curl_getinfo($ch); // array with http_code, url, total_time, etc.
        curl_close($ch);
        return $info;
    }

    $url = 'http://www.163.com';
    var_dump(get_info($url));
The output shows an http_code of 302, meaning the request was redirected. To follow the redirect, pass an extra option:
    $url = 'http://www.163.com';
    $options = array();
    $options[CURLOPT_FOLLOWLOCATION] = true; // follow the 302 redirect
    echo get_html($url, $options);
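Before turning on CURLOPT_FOLLOWLOCATION, it can be useful to see where the 302 actually points. As a rough sketch, the array returned by curl_getinfo carries http_code and, in the PHP versions I am aware of, a redirect_url entry for redirect responses. The function name redirect_target is hypothetical, and the sample array is hand-written rather than real output:

```php
<?php
// Hypothetical helper: given the array from curl_getinfo(), return the
// redirect target for a 3xx response, or null if there is none.
function redirect_target(array $info) {
    $code = isset($info['http_code']) ? (int) $info['http_code'] : 0;
    $isRedirect = $code >= 300 && $code < 400;
    if ($isRedirect && !empty($info['redirect_url'])) {
        return $info['redirect_url'];
    }
    return null;
}

// Hand-written array shaped like curl_getinfo() output.
// The target URL is made up for illustration.
$info = array('http_code' => 302, 'redirect_url' => 'http://example.com/landing');
var_dump(redirect_target($info));
```

When redirects are followed automatically, CURLOPT_MAXREDIRS can additionally cap the number of hops so that a redirect loop does not run until the timeout.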
With redirects followed, a page does come back, but why is it different from the one we see in a desktop browser?
It seems the options are still not enough: the server cannot tell what kind of device the client is, so it returns a generic version of the page.
It looks like we have to send a User-Agent as well:
    $url = 'http://www.163.com';
    $options = array();
    $options[CURLOPT_FOLLOWLOCATION] = true;
    $options[CURLOPT_USERAGENT] = 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0';
    echo get_html($url, $options);
OK, now the page comes out. Through its $options parameter, this get_html function can handle basically any extension of this kind.
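Since the server chooses the page version from the User-Agent, it can be convenient to keep a few presets around. Here is a small sketch, assuming a hypothetical helper named user_agent_options. The desktop string is the one used in the example above, while the mobile string is only an illustrative value:

```php
<?php
// Hypothetical helper: build curl options for a named User-Agent preset.
function user_agent_options($device = 'desktop') {
    $agents = array(
        // Same string as in the NetEase example
        'desktop' => 'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0',
        // Illustrative mobile value; any real browser string could be used
        'mobile'  => 'Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A403 Safari/8536.25',
    );
    if (!isset($agents[$device])) {
        $device = 'desktop'; // fall back to the desktop preset
    }
    return array(
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => $agents[$device],
    );
}
```

The returned array can be passed straight to get_html, for example get_html($url, user_agent_options('mobile')), to compare what the server sends to different clients.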
Of course, there are other ways to do it. If you already know the exact URL of the NetEase page, you can simply fetch it directly:
    $url = 'http://www.163.com/index.html';
    echo get_html($url);
This also collects the page normally.
That's all for today. Bye-bye!