


Crawling page data with jsoup and understanding HTTP message headers
Recommended reading: Hacker Attack and Defense Technology Collection: Web Practical Chapter.
By the way, here is a question to think about: could a website be brought down by hitting it with a large volume of requests through jsoup or Small nameserver? In fact, anyone familiar with jsoup could use its URL parsing to do something quite shameless (the source code stays confidential). Haha. With that said, here is a brief introduction to jsoup.
jsoup is a Java-based HTML parser that can directly parse a URL, an HTML text string, or an HTML file. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
Download the core library from the official site, http://jsoup.org/download, and import it into your project.
1: Parse an HTML text string
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    /**
     * Parse an HTML document supplied as a String.
     */
    public static void parseStringHtml(String html) {
        Document doc = Jsoup.parse(html); // Convert the String into a Document
        Elements e = doc.body().getAllElements(); // body and every element beneath it
        Elements e1 = doc.select("head"); // The head element(s)
        Element e2 = doc.getElementById("p"); // The element with id="p", or null if absent
        System.out.println(e1);
    }
2: Parse HTML fetched from a request URL
    import java.io.IOException;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    /**
     * Fetch HTML through the request address and parse it.
     */
    public static void parseRequestUrl(String url) throws IOException {
        Connection con = Jsoup.connect(url); // Create the request connection
        // Optional request headers that mimic a browser:
        // con.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"); // MIME types the client accepts
        // con.header("Accept-Encoding", "gzip, deflate");
        // con.header("Accept-Language", "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3");
        // con.header("Connection", "keep-alive");
        // con.header("Host", url); // Note: Host should hold only the host (and port), not the full URL
        // con.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0");
        Document doc = con.get(); // Send a GET request and parse the response
        Elements hrefs = doc.select("a[href=/kff517]"); // The attribute filter after the tag name is optional
        Elements test = doc.select("html body div#container div#body div#main div.main div#article_details.details div.article_manage span.link_view");
        System.out.println(hrefs);
        System.out.println(test.text()); // text() returns the text inside the matched nodes, similar to jQuery's text()
    }
3: Parse a local HTML file. This works the same way; only the way the Document is obtained changes.
Here is some information I collected about HTTP message headers. A sample request sent by a browser:
GET /simple.htm HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Host: localhost:8080
Connection: Keep-Alive
The complete HTTP message sent back by the server is as follows:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.1
X-Powered-By: ASP.NET
Date: Fri, 03 Mar 2006 06:34:03 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Fri, 03 Mar 2006 06:33:18 GMT
ETag: "5ca4f75b8c3ec61:9ee"
Content-Length: 37

<html><body>hello world</body></html>
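The response above has three parts: a status line, header lines, and a body that follows the first empty line. A minimal sketch of that split in plain Java is shown here. The class `RawHttpResponse` and its fields are hypothetical names for illustration, not part of jsoup or the JDK, and the body string is an assumed value chosen so that its length matches the Content-Length of 37. A real parser would also need to handle repeated headers and chunked bodies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a jsoup or JDK API.
class RawHttpResponse {
    final String version;
    final int statusCode;
    final String reason;
    final Map<String, String> headers = new LinkedHashMap<>();
    final String body;

    RawHttpResponse(String raw) {
        // Headers and body are separated by the first empty line (CRLF CRLF).
        int split = raw.indexOf("\r\n\r\n");
        String head = split >= 0 ? raw.substring(0, split) : raw;
        body = split >= 0 ? raw.substring(split + 4) : "";

        String[] lines = head.split("\r\n");
        // Status line: HTTP-version SP status-code SP reason-phrase
        String[] status = lines[0].split(" ", 3);
        version = status[0];
        statusCode = Integer.parseInt(status[1]);
        reason = status.length > 2 ? status[2] : "";

        // Every remaining line is "Name: value".
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                headers.put(lines[i].substring(0, colon).trim(),
                            lines[i].substring(colon + 1).trim());
            }
        }
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK\r\n"
                + "Server: Microsoft-IIS/5.1\r\n"
                + "Content-Type: text/html\r\n"
                + "Content-Length: 37\r\n"
                + "\r\n"
                + "<html><body>hello world</body></html>";
        RawHttpResponse r = new RawHttpResponse(raw);
        System.out.println(r.statusCode + " " + r.reason);
        System.out.println("Declared length: " + r.headers.get("Content-Length"));
        System.out.println("Actual length:   " + r.body.length());
    }
}
```

Comparing the declared and actual lengths is a quick sanity check when reading captured messages by hand.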
Note:
Overview of HTTP request headers
An HTTP client program (for example, a browser) must specify the request type (usually GET or POST) when sending a request to the server. If necessary, the client can also send other request headers. Most request headers are optional, with Content-Length as the notable exception: for POST requests, Content-Length must appear so the server knows how long the message body is. (HTTP/1.1 additionally requires Host on every request.)
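To make the Content-Length rule concrete, here is a small sketch that assembles the text of a POST request and fills in Content-Length from the body. The class name `PostRequestText`, the host, the path, and the form body are all made-up example values. The point it tries to show is that the length is counted in bytes of the encoded body (UTF-8 is assumed here), not in characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; it only builds text and sends nothing.
class PostRequestText {
    static String build(String host, String path, String formBody) {
        // Content-Length counts bytes of the encoded body, so encode first.
        byte[] bodyBytes = formBody.getBytes(StandardCharsets.UTF_8);
        return "POST " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "Content-Type: application/x-www-form-urlencoded\r\n"
                + "Content-Length: " + bodyBytes.length + "\r\n"
                + "\r\n"   // empty line ends the header section
                + formBody;
    }

    public static void main(String[] args) {
        // Example values only; no such endpoint is implied by the article.
        System.out.println(build("localhost:8080", "/search", "keyword=jsoup"));
    }
}
```

For the ASCII body `keyword=jsoup` the header comes out as `Content-Length: 13`; a body containing non-ASCII characters would yield a byte count larger than its character count.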
The following are some of the most common request headers.
Accept: The MIME type accepted by the browser.
Accept-Charset: The character sets acceptable to the browser.
Accept-Encoding: The data encoding methods that the browser can decode, such as gzip. A Servlet can return gzip-encoded HTML pages to browsers that support gzip. In many cases this can reduce download time by a factor of 5 to 10.
Accept-Language: The language type desired by the browser, used when the server can provide more than one language version.
Authorization: Authorization credentials, usually sent in reply to a WWW-Authenticate header from the server.
Connection: Indicates whether a persistent connection is required. If the Servlet sees the value "Keep-Alive" here, or the request uses HTTP 1.1 (which uses persistent connections by default), it can take advantage of persistent connections to significantly reduce download time when the page contains multiple elements (such as Applets and images). To achieve this, the Servlet needs to send a Content-Length header in the response. The simplest way is to write the content to a ByteArrayOutputStream first, and then calculate its size before actually writing the content out.
Content-Length: Indicates the length of the request message body.
Cookie: One of the most important request headers; it returns the cookies previously set by the server.
From: The email address of the request sender. It is used only by some special-purpose web clients, not by browsers.
Host: The host and port in the initial URL.
If-Modified-Since: Return the requested content only if it has been modified after the specified date, otherwise return a 304 "Not Modified" response.
Pragma: Specifying a "no-cache" value means that the server must return a refreshed document, even if it is a proxy server and already has a local copy of the page.
Referer: The URL of the page from which the user reached the currently requested page.
User-Agent: The browser type. This value is very useful when the content returned by the Servlet depends on the browser type.
UA-Pixels, UA-Color, UA-OS, UA-CPU: Non-standard request headers sent by certain versions of IE browsers, indicating screen size, color depth, operating system and CPU type.
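Several of the headers above change what a server sends back. The sketch below strings three of them together without any Servlet container: If-Modified-Since (answer 304 when nothing changed), Accept-Encoding (gzip the body when the client accepts it), and the buffering trick from the Connection entry (write into a ByteArrayOutputStream first so Content-Length is known). `HeaderAwareResponder` and `Response` are hypothetical names, header lookup is case-sensitive for brevity, and the `contains("gzip")` test ignores q-values, so treat it as an illustration rather than a complete implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical, framework-free sketch; a real Servlet would use HttpServletRequest/Response.
class HeaderAwareResponder {
    static class Response {
        final int status;
        final Map<String, String> headers = new LinkedHashMap<>();
        final byte[] body;
        Response(int status, byte[] body) { this.status = status; this.body = body; }
    }

    static Response respond(Map<String, String> requestHeaders, String html,
                            ZonedDateTime lastModified) {
        // If-Modified-Since: reply 304 with no body unless the page changed after that date.
        String since = requestHeaders.get("If-Modified-Since");
        if (since != null) {
            ZonedDateTime sinceDate =
                    ZonedDateTime.parse(since, DateTimeFormatter.RFC_1123_DATE_TIME);
            if (!lastModified.isAfter(sinceDate)) {
                return new Response(304, new byte[0]);
            }
        }

        // Accept-Encoding: compress only when the client says it can decode gzip.
        boolean gzip = requestHeaders.getOrDefault("Accept-Encoding", "").contains("gzip");

        // Buffer the whole body first so its size is known before anything is sent.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);
        try {
            if (gzip) {
                try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
                    out.write(raw);
                }
            } else {
                buffer.write(raw);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        Response response = new Response(200, buffer.toByteArray());
        if (gzip) {
            response.headers.put("Content-Encoding", "gzip");
        }
        // Content-Length is the size of the bytes actually sent (after compression).
        response.headers.put("Content-Length", String.valueOf(response.body.length));
        response.headers.put("Last-Modified",
                DateTimeFormatter.RFC_1123_DATE_TIME.format(lastModified));
        return response;
    }
}
```

Because the body is fully buffered, this approach trades memory for a known Content-Length, which is the trade-off the Connection entry describes.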
