Home

Web Front-end

HTML Tutorial

The handwritten crawler program can run successfully, but the efficiency is too low. It only takes more than ten seconds to crawl a piece of data. Please give me some advice to improve the efficiency. Thank you! ! _html/css_WEB-ITnose

The handwritten crawler program can run successfully, but the efficiency is too low. It only takes more than ten seconds to crawl a piece of data. Please give me some advice to improve the efficiency. Thank you! ! _html/css_WEB-ITnose

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Jun 24, 2016 pm 12:25 PM

Parser parses html crawler

import .....
/**
* Get **** data
*/
public class DoMain3 {
/**
* Get the page content based on the webpage URL
*/
public String getHtmlString(String url){
String hs="";
try {
URL u = new URL(url);
HttpURLConnection conn = (HttpURLConnection)u.openConnection();
conn.setRequestProperty("User-Agent","MSIE 7.0");
StringBuffer HtmlString = new StringBuffer();
BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(),"utf-8"));
String line="";
while((line=br.readLine())!=null){
HtmlString.append(line "n");
}
hs=HtmlString.toString();
System.out.println(url);
} catch (Exception e) {
System.out.println("URL地址加载出错！！");
e.printStackTrace();
}
return hs;
}
public static void main(String rags[]){
Dao d = new Dao();
DoMain3 dm = new DoMain3();
String title="";
String section="";
String content="";
String contentTitle="";
int count=110;

String url="http://*************************" ;
if(d.createTable()){
System.out.println("建表成功！！！");
try {
//加载标题页面
Document doc = Jsoup.parse(dm.getHtmlString(url));
Element titles = doc.getElementById("maincontent");
Elements lis=titles.getElementsByTag("li");
//*********************标题****************************
for(int i=1;i Elements a = lis.get(i).getElementsByTag("a");
if(a.toString().equals("")){
title=lis.get(i).text();
contentTitle=title;
String data[]={contentTitle,title,section,content,url};
if(d.pinsertData(data)){
System.out.println("第" (i 1) "题数据插入成功！！！");
System.out.println("*****************" count "*****************");
}else{
System.out.println("第" (i 1) "题节数据插入失败！！！");
System.out.println("*****************" count "*****************");
break;
}
count ;
continue;
}else{
title=a.get(0).text();
url="http://****************" a.get(0).attr("href");
//加载章节页面
Document doc2=Jsoup.parse(dm.getHtmlString(url));
Element sections =doc2.getElementById("maincontent");
Elements ls = sections.getElementsByTag("li");
//**********************节************************
for(int j=0;j Elements link = ls.get(j).getElementsByTag("a");
if(link.toString().equals("")){
section=ls.get(j).text();
contentTitle=title " " section;
}else{
section = link.get(0).text();
url="http:*******************" link.get(0).attr("href");
//加载内容页面
Document doc3=Jsoup.parse(dm.getHtmlString(url));
Element contents=doc3.getElementById("maincontent");
content=contents.text();
//处理内容字符串
content=content.substring(content.indexOf("?") "?".length());
content=content.replace("'", "''");
contentTitle=title " " section;
}
System.out.println("****************" count "******************");
System.out.println("正在读第" (i 1) "题" (j 1) "节");

//往数据库插入数据
String data[]={contentTitle,title,section,content,url};
if(d.pinsertData(data)){
System.out.println("第" (i 1) "题" (j 1) "节数据插入成功！！！");
System.out.println("*****************" count "*****************");
count ;
}else{
System.out.println("第" (i 1) "题" (j 1) "节数据插入失败！！！");
System.out.println("*****************" count "*****************");
break;
}
}//end for
}

System.out.println("No." (i 1) "Question collection completed");

}//end for

System.out.println("Collection completed!!");

} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace( ; , I always pause at these two sentences for a long time when debugging
1.BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"))

2.while ((line=br.readLine())!=null){

HtmlString.append(line "n");
}

Use jsoup, it is very simple and easy to crawl

I used jsoup at the beginning and the efficiency was even lower than this. Just at Document doc = Jsoup.parse(method.getResponseBodyAsString()); I couldn’t walk this step. It was a headache. Someone suggested that I use sax to parse. , but can sax be used to parse html?

Multi-threading to increase bandwidth

**
* 获取**************的数据
* @author wf
*
*/
public class DoMain5 {

public Document getDoc(String url){
Document doc=null;
try {
doc=Jsoup.connect(url).get();
} catch (Exception e) {
System.out.println("文档解析失败！！");
e.printStackTrace();
}
return doc;
}

public static void main(String rags[]){
Dao d = new Dao();
DoMain5 dm = new DoMain5();

String title="";
String section="";
String content="";
String contentTitle="";
int count=630;

String url="******************" ;

if(d.createTable()){
System.out.println("建表成功！！！");

try {
Document doc = dm.getDoc(url);
System.out.println(doc);
Element titles = doc.getElementById("maincontent");
Elements lis=titles.getElementsByTag("li");
//*********************标题****************************
for(int i=1;i Elements a = lis.get(i).getElementsByTag("a");
if(a.toString().equals("")){
title=lis.get(i).text();
contentTitle=title;

String data[]={contentTitle,title,section,content,url};
if(d.pinsertData(data)){
System.out.println("第" (i 1) "题数据插入成功！！！");
System.out.println("*****************" count "*****************");
}else{
System.out.println("第" (i 1) "题节数据插入失败！！！");
System.out.println("*****************" count "*****************");
break;
}
count ;
continue;
}else{
title=a.get(0).text();

url="http:***************" a.get(0).attr("href");
Document doc2=dm.getDoc(url);
Element sections =doc2.getElementById("maincontent");
Elements ls = sections.getElementsByTag("li");
//**********************节************************
for(int j=507;j Elements link = ls.get(j).getElementsByTag("a");
if(link.toString().equals("")){
section=ls.get(j).text();
contentTitle=title " " section;
}else{
section = link.get(0).text();
url="http:****************" link.get(0).attr("href");
Document doc3=dm.getDoc(url);
Element contents=doc3.getElementById("maincontent");
content=contents.text();
//处理内容字符串
content=content.substring(content.indexOf("?") "?".length());
content=content.replace("'", "''");
contentTitle=title " " section;
}
System.out.println("****************" count "******************");
System.out.println("正在读第" (i 1) "题" (j 1) "节");

String data[]={contentTitle,title,section,content,url};

if(d.pinsertData(data)){
System.out.println("第" (i 1) "题" (j 1) "节数据插入成功！！！");
System.out.println("*****************" count "*****************");
count ;
}else{
System.out.println("第" (i 1) "题" (j 1) "节数据插入失败！！！");
System.out.println("*****************" count "*****************");
break;
}
}//end for
}

System.out.println("第" (i 1) "题采集完毕");
break;
}//end for

System.out.println("Collection completed!!");

} catch (Exception e) {

e.printStackTrace();
}

Passed After everyone’s loud suggestions and modifications, the efficiency of this program has been significantly improved, but now it will throw the following two exceptions anytime and anywhere when running. Please give me some advice on how to solve it:

1.java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java :218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.net.www.http .HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream
(HttpURLConnection.java:1064)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:373)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:429)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup. helper.HttpConnection.get(HttpConnection.java:153)
at com.wanfang.dousact.DoMain5.getDoc(DoMain5.java:35)
at com.wanfang.dousact.DoMain5.main(DoMain5.java: 61)

2.java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java: 333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect( SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:519)
at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
at sun.net. www.http.HttpClient.openServer(HttpClient.java:394)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
at sun.net.www.http.HttpClient. (HttpClient.java:233)
at sun.net.www.http.HttpClient.New(HttpClient.java:306)
at sun.net.www.http.HttpClient.New(HttpClient .java:323)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient
(HttpURLConnection.java:852)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect
(HttpURLConnection.java:793)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:718)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection .java:425)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at com.wanfang.dousact.DoMain5.getDoc(DoMain5.java:35)
at com.wanfang.dousact.DoMain5.main (DoMain5.java:87)

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

3 weeks ago By DDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

2 weeks ago By DDD

Where to find the Crane Control Keycard in Atomfall

3 weeks ago By DDD

Saving in R.E.P.O. Explained (And Save Files)

1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows - How To Find The Blacksmith And Unlock Weapon And Armour Customisation

4 weeks ago By DDD

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7564

CakePHP Tutorial

1386

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

Related knowledge

What is the purpose of the <progress> element? Mar 21, 2025 pm 12:34 PM

The article discusses the HTML <progress> element, its purpose, styling, and differences from the <meter> element. The main focus is on using <progress> for task completion and <meter> for stati

What is the purpose of the <datalist> element? Mar 21, 2025 pm 12:33 PM

The article discusses the HTML <datalist> element, which enhances forms by providing autocomplete suggestions, improving user experience and reducing errors.Character count: 159

Is HTML easy to learn for beginners? Apr 07, 2025 am 12:11 AM

HTML is suitable for beginners because it is simple and easy to learn and can quickly see results. 1) The learning curve of HTML is smooth and easy to get started. 2) Just master the basic tags to start creating web pages. 3) High flexibility and can be used in combination with CSS and JavaScript. 4) Rich learning resources and modern tools support the learning process.

What is the purpose of the <meter> element? Mar 21, 2025 pm 12:35 PM

The article discusses the HTML <meter> element, used for displaying scalar or fractional values within a range, and its common applications in web development. It differentiates <meter> from <progress> and ex

What is the purpose of the <iframe> tag? What are the security considerations when using it? Mar 20, 2025 pm 06:05 PM

The article discusses the <iframe> tag's purpose in embedding external content into webpages, its common uses, security risks, and alternatives like object tags and APIs.

What is the viewport meta tag? Why is it important for responsive design? Mar 20, 2025 pm 05:56 PM

The article discusses the viewport meta tag, essential for responsive web design on mobile devices. It explains how proper use ensures optimal content scaling and user interaction, while misuse can lead to design and accessibility issues.

The Roles of HTML, CSS, and JavaScript: Core Responsibilities Apr 08, 2025 pm 07:05 PM

HTML defines the web structure, CSS is responsible for style and layout, and JavaScript gives dynamic interaction. The three perform their duties in web development and jointly build a colorful website.

What is an example of a starting tag in HTML? Apr 06, 2025 am 12:04 AM

AnexampleofastartingtaginHTMLis,whichbeginsaparagraph.StartingtagsareessentialinHTMLastheyinitiateelements,definetheirtypes,andarecrucialforstructuringwebpagesandconstructingtheDOM.

See all articles