
Jsoup crawls page data and understands HTTP message headers

Jun 24, 2016 am 11:55 AM

Recommended reading: Hacker Attack and Defense Technology Collection: Web Practical Chapter.

By the way, here is a question to think about: could a website be brought down by hitting it with a large volume of requests through jsoup or a small nameserver? Anyone familiar with jsoup knows that its URL parsing can be put to some fairly shameless uses (I am keeping that source code to myself). With that said, here is a brief introduction to jsoup.

jsoup is a Java-based HTML parser that can directly parse a URL, an HTML text string, or an HTML file. It provides a very low-effort API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

Download address on the official website: http://jsoup.org/download. Download the core library and import it into your project.

1: Parse an HTML text string

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    /**
     * Parse an HTML document passed in as a String.
     */
    public static void parseStringHtml(String html) {
        Document doc = Jsoup.parse(html);          // convert the String into a Document
        Elements e = doc.body().getAllElements();  // all elements under body (body itself included)
        Elements e1 = doc.select("head");          // the head element set
        Element e2 = doc.getElementById("p");      // the element with id="p", or null if there is none
        System.out.println(e1);
    }

2: Parse a URL

This part is the key point. Some URLs cannot be fetched with a plain connection, for example sites under the CSDN domain. In that case the request headers (notably User-Agent) must be set so that the request looks like it comes from a browser. Otherwise jsoup reports an error such as "HTTP error fetching URL. Status=403" or another HTTP status exception. For the specific HTTP status codes, please refer to the last part of this article or to the recommended book.

    import java.io.IOException;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    /**
     * Fetch and parse the HTML at the given request address.
     */
    public static void parseRequestUrl(String url) throws IOException {
        Connection con = Jsoup.connect(url); // create the request connection
        // Uncomment the headers below when the site rejects the default request (e.g. Status=403).
        // con.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"); // MIME types acceptable to the browser
        // con.header("Accept-Encoding", "gzip, deflate");
        // con.header("Accept-Language", "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3");
        // con.header("Connection", "keep-alive");
        // con.header("Host", new java.net.URL(url).getHost()); // Host takes the host name, not the full URL
        // con.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0");
        Document doc = con.get();
        Elements hrefs = doc.select("a[href=/kff517]"); // the attribute filter after the tag name is optional
        Elements test = doc.select("html body div#container div#body div#main div.main div#article_details.details div.article_manage span.link_view");
        System.out.println(hrefs);
        System.out.println(test.text()); // text() returns the text inside the matched nodes, similar to the jQuery method
    }

3: Parse a local HTML file

This works the same way as the two cases above; only the way the Document is obtained changes.
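
The original gives no code for this case, so the following is only a minimal sketch of how it might look. It assumes the file is encoded as UTF-8; the class name `ParseLocalFile`, the method name `parseLocalHtml`, and the sample HTML are made up for illustration. `Jsoup.parse(File, String)` is the jsoup overload that takes a file and a charset name.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseLocalFile {
    /** Parse a local HTML file into a Document; the UTF-8 charset is an assumption. */
    public static Document parseLocalHtml(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample page to a temporary file so the example is self-contained.
        Path tmp = Files.createTempFile("sample", ".html");
        String html = "<html><head><title>Demo</title></head>"
                    + "<body><p id=\"p\">hello world</p></body></html>";
        Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));

        Document doc = parseLocalHtml(tmp.toFile());
        System.out.println(doc.title());                      // Demo
        System.out.println(doc.getElementById("p").text());   // hello world

        Files.delete(tmp);
    }
}
```

Once the Document exists, the same select and getElementById calls from sections 1 and 2 apply unchanged.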


Some collected notes on HTTP message headers:

GET /simple.htm HTTP/1.1 --- request method, requested resource, and HTTP protocol version
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */* --- the content types the browser can accept
Accept-Language: zh-cn --- accepted language
Accept-Encoding: gzip, deflate --- accepted encodings
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727) --- information about the client machine, including browser type and operating system. Many websites can display your browser and operating system version because they read it from this header.
Host: localhost:8080 --- host and port; on the Internet this is generally the domain name
Connection: Keep-Alive --- whether a persistent connection is requested
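
To make the layout of the request above concrete, here is a minimal Java sketch that splits a raw request into its request line and its headers. The class name `RawRequestParser` and its methods are hypothetical, chosen only for illustration. The sketch assumes well-formed input whose lines end in CRLF, and it does no validation.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class RawRequestParser {
    /** Returns the three parts of the request line: method, target, protocol version. */
    public static String[] parseRequestLine(String raw) {
        String firstLine = raw.split("\r\n", 2)[0];
        return firstLine.split(" ", 3);
    }

    /** Returns the headers in order of appearance; names are lower-cased since they are case-insensitive. */
    public static Map<String, String> parseHeaders(String raw) {
        Map<String, String> headers = new LinkedHashMap<>();
        String[] lines = raw.split("\r\n");
        for (int i = 1; i < lines.length; i++) {      // line 0 is the request line
            if (lines[i].isEmpty()) {
                break;                                // a blank line ends the header section
            }
            int colon = lines[i].indexOf(':');
            if (colon < 0) {
                continue;                             // skip lines that are not "Name: value"
            }
            String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = lines[i].substring(colon + 1).trim();
            headers.put(name, value);
        }
        return headers;
    }

    public static void main(String[] args) {
        String raw = "GET /simple.htm HTTP/1.1\r\n"
                   + "Accept-Language: zh-cn\r\n"
                   + "Accept-Encoding: gzip, deflate\r\n"
                   + "Host: localhost:8080\r\n"
                   + "Connection: Keep-Alive\r\n"
                   + "\r\n";
        String[] requestLine = parseRequestLine(raw);
        System.out.println(requestLine[0] + " | " + requestLine[1] + " | " + requestLine[2]);
        System.out.println(parseHeaders(raw).get("host"));
    }
}
```

Only the first colon on a line separates the name from the value, which is why a value such as localhost:8080 survives intact.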


The complete HTTP message sent back by the server is as follows:
HTTP/1.1 200 OK --- HTTP/1.1 is the protocol used; 200 OK is the status code returned by the server, meaning the request was handled normally
Server: Microsoft-IIS/5.1
X-Powered-By: ASP.NET
Date: Fri, 03 Mar 2006 06:34:03 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Fri, 03 Mar 2006 06:33:18 GMT<CR>
ETag: "5ca4f75b8c3ec61:9ee"
Content-Length: 37

hello world

Note: <CR> was added by me to mark a line break; it has no meaning and can be deleted.
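
As a rough illustration of how a response like this is laid out, the Java sketch below pulls out the status code and separates the body from the headers at the first blank line. The class `RawResponseParser` is hypothetical, and the sample in `main` is an abbreviated message written for this example rather than the exact response shown above.

```java
import java.nio.charset.StandardCharsets;

public class RawResponseParser {
    /** Extracts the numeric status code from the status line, e.g. 200 from "HTTP/1.1 200 OK". */
    public static int statusCode(String raw) {
        String statusLine = raw.split("\r\n", 2)[0];
        return Integer.parseInt(statusLine.split(" ", 3)[1]);
    }

    /** Returns everything after the first blank line, i.e. the message body. */
    public static String body(String raw) {
        int split = raw.indexOf("\r\n\r\n");
        return split < 0 ? "" : raw.substring(split + 4);
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.1 200 OK\r\n"
                   + "Content-Type: text/html\r\n"
                   + "Content-Length: 11\r\n"
                   + "\r\n"
                   + "hello world";
        System.out.println(statusCode(raw));   // 200
        System.out.println(body(raw));         // hello world
        // The body length in bytes is what Content-Length describes.
        System.out.println(body(raw).getBytes(StandardCharsets.US_ASCII).length);   // 11
    }
}
```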

Overview of HTTP request headers
An HTTP client program (for example, a browser) must specify the request type (usually GET or POST) when sending a request to the server. If necessary, the client can also send other request headers. Most request headers are optional. The main exception is Content-Length, which must appear on a POST request that carries a body (unless the body uses chunked transfer encoding). HTTP/1.1 additionally requires the Host header.
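
One detail that is easy to get wrong is that Content-Length counts bytes of the encoded body, not characters. The Java sketch below shows one way to compute it for a POST request. The class `PostContentLength`, the path `/submit`, and the form fields are placeholders invented for this example, and UTF-8 is assumed as the body encoding.

```java
import java.nio.charset.StandardCharsets;

public class PostContentLength {
    /** Content-Length is the size of the encoded body in bytes, not the number of characters. */
    public static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Builds the request line and headers of a POST request; path and host are placeholders. */
    public static String buildPostHead(String path, String host, String body) {
        return "POST " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Content-Type: application/x-www-form-urlencoded\r\n"
             + "Content-Length: " + contentLength(body) + "\r\n"
             + "\r\n";
    }

    public static void main(String[] args) {
        String body = "name=jsoup&lang=java";
        System.out.println(buildPostHead("/submit", "localhost:8080", body) + body);
        // A single non-ASCII character can take more than one byte in UTF-8.
        System.out.println(contentLength("\u00e9"));   // 2
    }
}
```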
The following are some of the most common request headers.

Accept: The MIME types accepted by the browser.
Accept-Charset: The character sets acceptable to the browser.
Accept-Encoding: The data encodings that the browser can decode, such as gzip. A Servlet can return gzip-encoded HTML pages to browsers that support gzip. In many cases this can reduce download time by a factor of 5 to 10.
Accept-Language: The language the browser prefers, used when the server can provide more than one language version.
Authorization: Authorization credentials, usually sent in response to a WWW-Authenticate header from the server.
Connection: Indicates whether a persistent connection is required. If the Servlet sees the value "Keep-Alive" here, or the request uses HTTP 1.1 (which uses persistent connections by default), it can take advantage of persistent connections to significantly reduce download time when the page contains multiple elements (such as applets and images). To achieve this, the Servlet needs to send a Content-Length header in the response. The simplest way is to write the content to a ByteArrayOutputStream first, then calculate its size before actually writing the content out.
Content-Length: Indicates the length of the request message body.
Cookie: Carries the cookies previously set by the server; this is one of the most important request headers.
From: The email address of the request sender. It is used by some special web client programs and is not sent by browsers.
Host: The host and port in the initial URL.
If-Modified-Since: Return the requested content only if it has been modified after the specified date; otherwise return a 304 "Not Modified" response.
Pragma: A value of "no-cache" means that the server must return a fresh document, even if it is a proxy server that already has a local copy of the page.
Referer: Contains the URL of the page from which the user reached the currently requested page.
User-Agent: The browser type. This value is very useful if the content returned by the Servlet depends on the browser type.
UA-Pixels, UA-Color, UA-OS, UA-CPU: Non-standard request headers sent by certain versions of the IE browser, indicating screen size, color depth, operating system, and CPU type.
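
The Accept-Encoding and Connection entries above both come down to buffering the response before sending it. The Java sketch below shows one way this might be done outside a Servlet container, using only the standard library. The class `BufferedGzipBody` and its method names are hypothetical, and the Accept-Encoding check is deliberately simplified (it ignores q-values).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class BufferedGzipBody {
    /** True when the Accept-Encoding value lists gzip (simplified: q-values are ignored). */
    public static boolean acceptsGzip(String acceptEncoding) {
        if (acceptEncoding == null) {
            return false;
        }
        for (String token : acceptEncoding.split(",")) {
            if (token.split(";")[0].trim().equalsIgnoreCase("gzip")) {
                return true;
            }
        }
        return false;
    }

    /** Buffers the whole page in memory, gzip-compressed if requested, so its size is known before sending. */
    public static byte[] encodeBody(String html, boolean gzip) {
        byte[] plain = html.getBytes(StandardCharsets.UTF_8);
        if (!gzip) {
            return plain;
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);   // not expected for an in-memory stream
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder page = new StringBuilder("<html><body>");
        for (int i = 0; i < 200; i++) {
            page.append("<p>hello world</p>");
        }
        page.append("</body></html>");

        boolean gzip = acceptsGzip("gzip, deflate");
        byte[] body = encodeBody(page.toString(), gzip);
        if (gzip) {
            System.out.println("Content-Encoding: gzip");
        }
        // The buffered size is exactly what goes into the Content-Length response header.
        System.out.println("Content-Length: " + body.length);
    }
}
```

Because the body is fully buffered, its length is known before the first byte is written, which is what allows Content-Length to be sent and the connection to be kept alive.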
