


Crawler in Practice: Using PHP to Crawl JD.com Product Information
JD.com, one of China's largest general e-commerce platforms, can list tens of thousands of products every day. It offers consumers a wide selection and competitive prices. Sometimes, however, we need to collect JD product information in bulk so that we can filter, compare, and analyze it quickly, and that calls for a crawler. This article shows how to write a crawler in PHP that crawls JD.com product information.
- Preparation
First, we need to install the curl extension for PHP and define a few commonly used variables. The specific steps are as follows:
Open a terminal and run the command that matches your system:
sudo apt-get install php-curl   # Ubuntu (or a versioned package such as php7.0-curl)
brew install php                # macOS: Homebrew's php formula already includes the curl extension
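Before going further, it can help to confirm that the extension is actually loaded. The following is a minimal sketch; the helper name `missingExtensions` is hypothetical and not part of the original article.

```php
<?php
// Return the names from $required that are NOT loaded in this PHP runtime.
// missingExtensions() is a hypothetical helper used only for illustration.
function missingExtensions(array $required): array
{
    return array_values(array_filter($required, function ($name) {
        return !extension_loaded($name);
    }));
}

$missing = missingExtensions(array('curl'));
if ($missing) {
    echo "Missing PHP extensions: " . implode(', ', $missing) . "\n";
} else {
    echo "All required extensions are loaded\n";
}
```

If `curl` is reported as missing, install or enable it and check again before running the crawler.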
Next, we define a few simple variables in the PHP code for use in the later steps. The $jgname variable holds the address of the JD.com product list page, and the $skulist variable holds the address of an individual product page. The code is as follows:
$jgname = "https://list.jd.com/list.html?cat=1318,1486,1490&ev=exbrand_13910&sort=sort_rank_asc&trans=1&JL=3_%E5%93%81%E7%89%8C_%E5%B0%8F%E7%B1%B3%EF%BC%88MI%EF%BC%89#J_crumbsBar";
$skulist = "https://item.jd.com/1285310.html";
- Get the product list
Now that the environment and variables are ready, we can start writing the crawler. The first task is to obtain the product list from the target JD page. We use curl to download the list page at $jgname, then use regular expressions to extract each product's link, price, number of reviews, name, and SKU number from the HTML.
The specific code is as follows:
$ch = curl_init(); // initialize curl
curl_setopt($ch, CURLOPT_URL, $jgname); // set the URL to request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response as a string instead of printing it
$result = curl_exec($ch); // run the curl session
curl_close($ch); // close the curl session

// Match every <li> element; the "s" modifier lets "." match across line breaks
preg_match_all('/<li .*?<\/li>/s', (string) $result, $matches);

$goodsinfo = array(); // the product list
foreach ($matches[0] as $item) {
    // Extract the product fields
    preg_match('/sku="(\d+)"/', $item, $skuid);
    preg_match('/标题">\s*([\d\D]+?)\s*<\/a>/', $item, $titlename);
    preg_match('/<strong>¥<\/strong>[\s ]*<i>(\d+\.\d+)<\/i>/', $item, $price);
    preg_match('/<div\s*class="p-commit">[\s ]+<strong[^>]+>(\d+)/', $item, $commentnum);
    preg_match('/<a\s*href="([\d\D]+?)"/', $item, $link);

    // Store the fields; a field that did not match becomes an empty string
    $goods = array(
        "title"      => trim($titlename[1] ?? ''),
        "price"      => trim($price[1] ?? ''),
        "link"       => "https:" . ($link[1] ?? ''),
        "skuid"      => trim($skuid[1] ?? ''),
        "commentnum" => trim($commentnum[1] ?? '')
    );
    array_push($goodsinfo, $goods); // append the product to the list

    // Test output: print the product information
    echo $goods['title'] . " " . $goods['price'] . " " . $goods['commentnum'] . " " . $goods['link'] . "<br>";
}
In the code above, each product's SKU number and link are stored in $goods['skuid'] and $goods['link'], and the other useful fields (title, price, number of reviews) are stored in the same $goods array. Each $goods array is then appended to the $goodsinfo array with the array_push() function. The echo statement inside the loop prints every product, which makes the crawl results easy to check.
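Because the listing above needs a live page, the extraction logic is hard to try in isolation. The sketch below applies the same patterns to a small inline fragment. The markup in `$sample` is hypothetical and only modelled on those patterns, and `firstGroup` and `parseListItem` are illustrative helper names. The real JD.com markup may differ and changes over time.

```php
<?php
// Hypothetical list-item markup, modelled on the patterns used in this article.
$sample = <<<'HTML'
<li class="gl-item" sku="1285310">
  <a href="//item.jd.com/1285310.html" title="标题">
    Example phone
  </a>
  <strong>¥</strong> <i>1999.00</i>
  <div class="p-commit">
    <strong class="comment">3200</strong>
  </div>
</li>
HTML;

// Return the first capture group of $pattern in $subject, or '' when there is no match.
function firstGroup(string $pattern, string $subject): string
{
    return preg_match($pattern, $subject, $m) === 1 ? trim($m[1]) : '';
}

// Extract the product fields from one <li> fragment.
function parseListItem(string $item): array
{
    return array(
        'skuid'      => firstGroup('/sku="(\d+)"/', $item),
        'title'      => firstGroup('/标题">\s*([\d\D]+?)\s*<\/a>/', $item),
        'price'      => firstGroup('/<strong>¥<\/strong>\s*<i>(\d+\.\d+)<\/i>/', $item),
        'commentnum' => firstGroup('/<div\s*class="p-commit">\s+<strong[^>]+>(\d+)/', $item),
        'link'       => 'https:' . firstGroup('/<a\s*href="([\d\D]+?)"/', $item),
    );
}

print_r(parseListItem($sample));
```

Testing the patterns against a saved fragment like this makes it easier to notice when a change in the page layout breaks one of them.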
- Get product details
We now have the product list from the JD list page. The next step is to fetch the details of each product and store them in its $goods array. Since the previous step already saved each product's SKU number and link, all that remains is to open each link and extract the useful fields. The specific code is as follows:
foreach ($goodsinfo as &$goods) {
    // Rebuild the product page link from the SKU number
    $link = "https://item.jd.com/" . $goods['skuid'] . ".html";
    $goods['link'] = $link;

    // Download the product page with curl
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $link);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = (string) curl_exec($ch);
    curl_close($ch);

    // The product can be bought unless the page says "无货" (out of stock).
    // The check reuses $html, so the page is downloaded only once.
    $canBuy = !preg_match('/无货/', $html);

    // Extract the spec blocks and the product name ("商品名称") with regular expressions
    preg_match_all('/<div\s*class="Ptable".*?>[\s ]+<div\s*class="Ptable-item".*?>[\s ]+([\d\D]*?)<\/div>/', $html, $items);
    preg_match('/<strong>商品名称<\/strong><em>(.*?)<\/em>/', $html, $name);
    if (isset($name[1])) {
        $goods['title'] = $name[1];
    }
    echo $goods['title'];

    if ($canBuy) {
        // Map the labels shown on the page to keys in $goods
        $fields = array(
            "价格" => 'price', "颜色" => 'color', "产地" => 'producePlace',
            "商品编号" => 'goodsn', "型号" => 'model',
            "商品毛重" => 'grossWeight', "规格" => 'specifications',
        );
        foreach ($items[1] as $block) {
            // Strip HTML tags and whitespace. The exact characters removed in the
            // original listing were lost; spaces, tabs and carriage returns are
            // assumed here, and newlines are kept as the separator between pairs.
            $text = strip_tags($block);
            $text = str_replace(array(" ", "\t", "\r"), "", $text);
            // Split the text into key:value pairs, one per line
            preg_match_all('/([\d\D]*?):([\d\D]*?)[\n]/', $text, $pairs);
            for ($i = 0; $i < count($pairs[1]); $i++) {
                $key = trim($pairs[1][$i]);
                if (isset($fields[$key])) {
                    $goods[$fields[$key]] = $pairs[2][$i];
                }
            }
        }
        // Get the number of reviews
        preg_match('/<a\s*href="#comment"\s*target="_self">\s*[\d\D]+?<strong\s*class="curr-num">(\d*)</', $html, $comment);
        if (isset($comment[1])) {
            $goods['commentnum'] = $comment[1];
        }
    }
}
unset($goods); // break the reference left behind by the foreach
This code uses the same technique as the product list step: curl downloads each product's detail page, and regular expressions extract the useful product information. The collected details can be printed as follows:
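unset($goods); // break any reference left over from an earlier foreach by reference
foreach ($goodsinfo as $goods) {
    echo $goods['skuid'] . " " . $goods['title'] . " " . $goods['price'] . " " . $goods['commentnum'] . " " . $goods['link'] . "<br>";
}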
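Regular expressions tend to break when the page markup changes even slightly. As an alternative to the regex-based extraction used in this article, the sketch below reads a spec table with PHP's DOMDocument and DOMXPath classes. The fragment in `$sampleHtml` is hypothetical: it assumes each spec is a `<dl>` holding one `<dt>` label and one `<dd>` value, which may not match the real page. The function name `parseSpecTable` is also illustrative.

```php
<?php
// Hypothetical spec-table fragment; the real JD.com markup may differ.
$sampleHtml = <<<'HTML'
<div class="Ptable">
  <div class="Ptable-item">
    <dl><dt>颜色</dt><dd>黑色</dd></dl>
    <dl><dt>型号</dt><dd>MI-8</dd></dl>
  </div>
</div>
HTML;

// Parse the spec table into an array of label => value pairs.
function parseSpecTable(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // since real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    // The meta tag tells the parser that the input is UTF-8.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $specs = array();
    // Assumption: every <dl> inside a Ptable-item holds one <dt> and one <dd>.
    foreach ($xpath->query('//div[contains(@class, "Ptable-item")]//dl') as $dl) {
        $label = $xpath->query('./dt', $dl)->item(0);
        $value = $xpath->query('./dd', $dl)->item(0);
        if ($label !== null && $value !== null) {
            $specs[trim($label->textContent)] = trim($value->textContent);
        }
    }
    return $specs;
}

print_r(parseSpecTable($sampleHtml));
```

The trade-off is that a DOM parser needs the dom extension and a little more code, but it tolerates attribute reordering and whitespace changes that would defeat a regular expression.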
That completes the whole process. In real applications, the code should be adjusted and optimized to suit actual needs, for example by adding exception handling, setting request headers, and controlling the crawl rate. On that basis, a stable and efficient crawler can be built to collect JD product information and support e-commerce operations and analysis.
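The three improvements named above (exception handling, request headers, and crawl rate) can be sketched as a small fetch helper. This is one possible shape under stated assumptions, not a definitive implementation: the function names `fetchPage`, `fetchWithRetry` and `backoffSeconds`, the User-Agent string, and the timeout and delay values are all illustrative.

```php
<?php
// Delay before the next retry: doubles with each attempt, capped at $max seconds.
function backoffSeconds(int $attempt, float $base = 1.0, float $max = 30.0): float
{
    return min($max, $base * (2 ** ($attempt - 1)));
}

// Download $url with curl and return the body, or throw on failure.
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 5,               // assumed value
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_ENCODING       => '',              // accept any encoding curl supports
        CURLOPT_HTTPHEADER     => array(
            'User-Agent: Mozilla/5.0 (compatible; ExampleCrawler/1.0)', // assumed value
            'Accept-Language: zh-CN,zh;q=0.9',
        ),
    ));
    $body   = curl_exec($ch);
    $error  = curl_error($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // The handle is released automatically when $ch goes out of scope.

    if ($body === false) {
        throw new RuntimeException("Request failed for $url: $error");
    }
    if ($status >= 400) {
        throw new RuntimeException("HTTP $status for $url");
    }
    return $body;
}

// Try up to $maxAttempts times, sleeping between attempts.
function fetchWithRetry(string $url, int $maxAttempts = 3): string
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return fetchPage($url);
        } catch (RuntimeException $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // give up and let the caller handle the error
            }
            usleep((int) (backoffSeconds($attempt) * 1000000));
        }
    }
}

// Usage (performs real network requests, so it is left commented out):
// $html = fetchWithRetry("https://item.jd.com/1285310.html");
// sleep(1); // pause between products to keep the crawl rate low
```

Separating the delay calculation from the network call keeps the retry policy easy to adjust and to test without touching the network.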