This article introduces how to use Jsoup to implement a simple web crawler.
1. Brief description of Jsoup
Java has many libraries that can be used to write crawlers, such as WebMagic, Spider, and Jsoup. In this article we use Jsoup to implement a simple crawler program.
Jsoup provides a convenient API for processing HTML documents. It supports DOM-style document traversal and CSS selector queries, so it is a quick way to learn how to extract data from a page.
2. Quick start
1) Write an HTML page
The product information in the page's table is the data we want to crawl. Product names are in elements with the class pname, and product pictures are in elements with the class pimg.
2) Use HttpClient to read HTML pages
HttpClient is a library for working with the HTTP protocol. It can read an HTML page into a Java program as an input stream. You can download the HttpClient jar package from http://hc.apache.org/.
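As a minimal sketch of this step, the class below fetches a page with the Apache HttpClient 4.x API and returns its body as a string. The class name PageFetcher and the URL in main are hypothetical placeholders, not taken from the original article, and the UTF-8 fallback charset is an assumption.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class PageFetcher {

    /** Downloads the page at the given URL and returns its body as a string. */
    public static String fetch(String url) throws IOException {
        // try-with-resources closes both the client and the response
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            int status = response.getStatusLine().getStatusCode();
            if (status != 200) {
                throw new IOException("Unexpected HTTP status " + status + " for " + url);
            }
            // UTF-8 is used only when the response does not declare a charset (assumption)
            return EntityUtils.toString(response.getEntity(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical address of the product page
        String html = fetch("http://localhost:8080/products.html");
        System.out.println(html);
    }
}
```

Returning the whole page as a string keeps the next step simple, because Jsoup can parse a string directly.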
3) Use Jsoup to parse the HTML string
After adding the Jsoup library, call its parse method on a string containing the HTML page content to obtain a Document object. The Document object lets you retrieve specific content from the page by working with the DOM tree. For the related APIs, see the official Jsoup documentation: https://jsoup.org/cookbook/
Below we use Jsoup to obtain the product name and price information specified in the above html.
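The parsing step might look like the following sketch. The sample HTML is a hypothetical stand-in for the article's page. The pname and pimg classes come from the description above, while the pprice class, the assumption that pimg is set on the img element itself, and the class name ProductParser are illustrative guesses.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ProductParser {

    // Hypothetical page content; "pprice" is an assumed class name
    static final String SAMPLE_HTML =
            "<table>"
          + "<tr><th>Name</th><th>Price</th><th>Picture</th></tr>"
          + "<tr><td class=\"pname\">Laptop</td><td class=\"pprice\">4999</td>"
          + "<td><img class=\"pimg\" src=\"img/laptop.jpg\"></td></tr>"
          + "<tr><td class=\"pname\">Phone</td><td class=\"pprice\">2999</td>"
          + "<td><img class=\"pimg\" src=\"img/phone.jpg\"></td></tr>"
          + "</table>";

    /** Returns one "name | price | image" line per product row in the HTML. */
    public static List<String> parse(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element row : doc.select("tr")) {
            // CSS selectors, evaluated relative to the current row
            Element name = row.select(".pname").first();
            Element price = row.select(".pprice").first();
            Element img = row.select("img.pimg").first();
            if (name == null || price == null || img == null) {
                continue; // header row or incomplete row
            }
            result.add(name.text() + " | " + price.text() + " | " + img.attr("src"));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : parse(SAMPLE_HTML)) {
            System.out.println(line);
        }
    }
}
```

If the page uses relative image paths, passing a base URI as the second argument of Jsoup.parse and reading img.absUrl("src") instead of img.attr("src") yields absolute URLs.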
So far, we have used HttpClient and Jsoup together to crawl data from an HTML page. Next, we make the result more tangible by saving the crawled data to a database and saving the images to the server.
3. Save the crawled page data
1) Save ordinary data to the database
Encapsulate the crawled data in entity beans and store them in the database.
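One possible shape for this step is sketched below using plain JDBC. The Product entity, the ProductDao class, and the product table with pname and pimg columns are all hypothetical, since the original article's schema is not shown here. The caller is assumed to supply an open Connection for whatever database is in use.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ProductDao {

    /** Simple entity holding one crawled product. */
    public static class Product {
        private final String name;
        private final String imageUrl;

        public Product(String name, String imageUrl) {
            this.name = name;
            this.imageUrl = imageUrl;
        }

        public String getName() {
            return name;
        }

        public String getImageUrl() {
            return imageUrl;
        }
    }

    // Hypothetical table: product(pname VARCHAR, pimg VARCHAR)
    private static final String INSERT_SQL =
            "INSERT INTO product (pname, pimg) VALUES (?, ?)";

    /** Inserts one product and returns the number of affected rows. */
    public static int save(Connection conn, Product product) throws SQLException {
        // A prepared statement keeps crawled text from being interpreted as SQL
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, product.getName());
            ps.setString(2, product.getImageUrl());
            return ps.executeUpdate();
        }
    }
}
```

Using bound parameters matters here because crawled page content is untrusted input.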
2) Save the picture to the server
Download each picture directly and save it to the server's local disk.
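A minimal sketch of the download step follows. It uses the JDK's own URL stream and Files.copy rather than HttpClient, which is a substitution made here for brevity. The class name ImageDownloader, the example URL, and the target directory are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ImageDownloader {

    /** Copies the resource at imageUrl into targetDir, keeping its file name. */
    public static Path download(String imageUrl, Path targetDir) throws IOException {
        URL url = URI.create(imageUrl).toURL();
        // Use the last path segment of the URL as the local file name
        String path = url.getPath();
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        if (fileName.isEmpty()) {
            throw new IOException("No file name in URL: " + imageUrl);
        }
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(fileName);
        // Stream the bytes straight to disk, overwriting any older copy
        try (InputStream in = url.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical image address and storage directory
        Path saved = download("http://localhost:8080/img/laptop.jpg", Paths.get("images"));
        System.out.println("Saved to " + saved.toAbsolutePath());
    }
}
```

The image URL passed in must be absolute, so relative src values from the page need to be resolved against the page address first.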
4. Summary
This case is a simple implementation of crawling network data with HttpClient and Jsoup. Crawler technology has many other aspects worth digging into, which will be explained in later articles.