


My handwritten crawler runs successfully, but it is far too slow: crawling a single record takes more than ten seconds. Please give me some advice on improving its efficiency. Thank you!
Parser parses html crawler
import .....
/**
* Get **** data
*/
public class DoMain3 {
/**
* Get the page content based on the webpage URL
*/
public String getHtmlString(String url){
String hs="";
try {
URL u = new URL(url);
HttpURLConnection conn = (HttpURLConnection)u.openConnection();
conn.setRequestProperty("User-Agent","MSIE 7.0");
StringBuffer HtmlString = new StringBuffer();
BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(),"utf-8"));
String line="";
while((line=br.readLine())!=null){
HtmlString.append(line + "\n");
}
hs=HtmlString.toString();
System.out.println(url);
} catch (Exception e) {
System.out.println("Failed to load the URL!!");
e.printStackTrace();
}
return hs;
}
public static void main(String rags[]){
Dao d = new Dao();
DoMain3 dm = new DoMain3();
String title="";
String section="";
String content="";
String contentTitle="";
int count=110;
String url="http://*************************" ;
if(d.createTable()){
System.out.println("Table created successfully!!!");
try {
//Load the title page
Document doc = Jsoup.parse(dm.getHtmlString(url));
Element titles = doc.getElementById("maincontent");
Elements lis=titles.getElementsByTag("li");
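//*********************Titles****************************
//Loop header reconstructed from context; the original text was cut off after "i"
for(int i=1;i<lis.size();i++){
Elements a=lis.get(i).getElementsByTag("a");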
if(a.toString().equals("")){
title=lis.get(i).text();
contentTitle=title;
String data[]={contentTitle,title,section,content,url};
if(d.pinsertData(data)){
System.out.println("Title " + (i + 1) + ": data inserted successfully!!!");
System.out.println("*****************" + count + "*****************");
}else{
System.out.println("Title " + (i + 1) + ": data insertion failed!!!");
System.out.println("*****************" + count + "*****************");
break;
}
count++;
continue;
}else{
title=a.get(0).text();
url="http://****************" + a.get(0).attr("href");
//Load the section page
Document doc2=Jsoup.parse(dm.getHtmlString(url));
Element sections =doc2.getElementById("maincontent");
Elements ls = sections.getElementsByTag("li");
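//**********************Sections************************
//Loop header reconstructed from context; the original text was cut off after "j"
for(int j=0;j<ls.size();j++){
Elements link=ls.get(j).getElementsByTag("a");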
if(link.toString().equals("")){
section=ls.get(j).text();
contentTitle=title + " " + section;
}else{
section = link.get(0).text();
url="http:*******************" + link.get(0).attr("href");
//Load the content page
Document doc3=Jsoup.parse(dm.getHtmlString(url));
Element contents=doc3.getElementById("maincontent");
content=contents.text();
//Process the content string
content=content.substring(content.indexOf("?") + "?".length());
content=content.replace("'", "''");
contentTitle=title + " " + section;
}
System.out.println("****************" + count + "******************");
System.out.println("Reading title " + (i + 1) + ", section " + (j + 1));
//Insert the data into the database
String data[]={contentTitle,title,section,content,url};
if(d.pinsertData(data)){
System.out.println("Title " + (i + 1) + ", section " + (j + 1) + ": data inserted successfully!!!");
System.out.println("*****************" + count + "*****************");
count++;
}else{
System.out.println("Title " + (i + 1) + ", section " + (j + 1) + ": data insertion failed!!!");
System.out.println("*****************" + count + "*****************");
break;
}
}//end for
}
System.out.println("Title " + (i + 1) + " collection completed");
}//end for
System.out.println("Collection completed!!");
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
When debugging, execution always pauses for a long time at these two statements:
1.BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));
2.while ((line=br.readLine())!=null){ HtmlString.append(line + "\n");
}
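Both of those statements block on the network: getInputStream() sends the request and waits for the response, and readLine() waits for more bytes to arrive. A minimal sketch of the same fetch with explicit timeouts and chunked reading follows. The class name PageReaderSketch, the method names, and the timeout values are assumptions for illustration, not part of the original program.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the original DoMain3 program.
class PageReaderSketch {

    /** Reads the whole stream in 8 KB chunks instead of line by line. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    /** Fetches a page; the timeout values are illustrative assumptions. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);   // give up if no connection within 5 s
        conn.setReadTimeout(10000);     // give up if no data arrives for 10 s
        conn.setRequestProperty("User-Agent", "MSIE 7.0");
        try (InputStream in = conn.getInputStream()) {
            return readAll(in, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Local sample data so the sketch runs without network access.
        byte[] sample = "<html>\n<body>ok</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(sample), StandardCharsets.UTF_8));
    }
}
```

If the server itself is slow to respond, the reading code will not change much; timing each call to fetch would show whether the ten seconds go to the network or to parsing and database inserts.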
Use jsoup. It makes crawling simple and easy.
I used jsoup at the beginning, and it was even slower than this. Execution got stuck at Document doc = Jsoup.parse(method.getResponseBodyAsString()); and would not get past that step, which was a headache. Someone suggested that I parse with SAX instead, but can SAX be used to parse HTML?
Use multiple threads to make fuller use of the bandwidth.
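One way the multi-threading suggestion might look is sketched below with a fixed thread pool from java.util.concurrent. The class ParallelFetchSketch, the method fetchAll, and the fetcher parameter are hypothetical names introduced for illustration. In the crawler above, the fetcher could be a call such as getHtmlString or getDoc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class; not part of the original program.
class ParallelFetchSketch {

    /**
     * Fetches all URLs concurrently and returns the results in input order.
     * A failed fetch is reported as null at that position.
     */
    static List<String> fetchAll(List<String> urls, Function<String, String> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            List<String> pages = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    pages.add(future.get());   // waits for this page only
                } catch (ExecutionException e) {
                    pages.add(null);           // the fetcher threw; keep the slot
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    pages.add(null);
                }
            }
            return pages;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("page-1", "page-2", "page-3");
        // Stand-in fetcher so the sketch runs without network access.
        List<String> pages = fetchAll(urls, url -> "<html>" + url + "</html>", 3);
        pages.forEach(System.out::println);
    }
}
```

Because the results come back in the same order as the URLs, the database inserts can stay on the main thread. A small pool is probably safer than a large one, since too many simultaneous requests can overload the target site.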
/**
* Get the data from **************
* @author wf
*
*/
public class DoMain5 {
public Document getDoc(String url){
Document doc=null;
try {
doc=Jsoup.connect(url).get();
} catch (Exception e) {
System.out.println("Failed to parse the document!!");
e.printStackTrace();
}
return doc;
}
public static void main(String rags[]){
Dao d = new Dao();
DoMain5 dm = new DoMain5();
String title="";
String section="";
String content="";
String contentTitle="";
int count=630;
String url="******************" ;
if(d.createTable()){
System.out.println("Table created successfully!!!");
try {
Document doc = dm.getDoc(url);
System.out.println(doc);
Element titles = doc.getElementById("maincontent");
Elements lis=titles.getElementsByTag("li");
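//*********************Titles****************************
//Loop header reconstructed from context; the original text was cut off after "i"
for(int i=1;i<lis.size();i++){
Elements a=lis.get(i).getElementsByTag("a");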
if(a.toString().equals("")){
title=lis.get(i).text();
contentTitle=title;
String data[]={contentTitle,title,section,content,url};
if(d.pinsertData(data)){
System.out.println("Title " + (i + 1) + ": data inserted successfully!!!");
System.out.println("*****************" + count + "*****************");
}else{
System.out.println("Title " + (i + 1) + ": data insertion failed!!!");
System.out.println("*****************" + count + "*****************");
break;
}
count++;
continue;
}else{
title=a.get(0).text();
url="http:***************" + a.get(0).attr("href");
Document doc2=dm.getDoc(url);
Element sections =doc2.getElementById("maincontent");
Elements ls = sections.getElementsByTag("li");
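//**********************Sections************************
//Loop header reconstructed from context; the original text was cut off after "j"
for(int j=507;j<ls.size();j++){
Elements link=ls.get(j).getElementsByTag("a");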
if(link.toString().equals("")){
section=ls.get(j).text();
contentTitle=title + " " + section;
}else{
section = link.get(0).text();
url="http:****************" + link.get(0).attr("href");
Document doc3=dm.getDoc(url);
Element contents=doc3.getElementById("maincontent");
content=contents.text();
//Process the content string
content=content.substring(content.indexOf("?") + "?".length());
content=content.replace("'", "''");
contentTitle=title + " " + section;
}
System.out.println("****************" + count + "******************");
System.out.println("Reading title " + (i + 1) + ", section " + (j + 1));
String data[]={contentTitle,title,section,content,url};
if(d.pinsertData(data)){
System.out.println("Title " + (i + 1) + ", section " + (j + 1) + ": data inserted successfully!!!");
System.out.println("*****************" + count + "*****************");
count++;
}else{
System.out.println("Title " + (i + 1) + ", section " + (j + 1) + ": data insertion failed!!!");
System.out.println("*****************" + count + "*****************");
break;
}
}//end for
}
System.out.println("Title " + (i + 1) + " collection completed");
break;
}//end for
System.out.println("Collection completed!!");
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
After applying everyone's suggestions, the efficiency of this program has improved significantly, but it now throws the following two exceptions at unpredictable points while running. Please give me some advice on how to resolve them:
1. java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1064)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:373)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:429)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at com.wanfang.dousact.DoMain5.getDoc(DoMain5.java:35)
at com.wanfang.dousact.DoMain5.main(DoMain5.java:61)
2. java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:519)
at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
at sun.net.www.http.HttpClient.
at sun.net.www.http.HttpClient.New(HttpClient.java:306)
at sun.net.www.http.HttpClient.New(HttpClient.java:323)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:852)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:793)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:718)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:425)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at com.wanfang.dousact.DoMain5.getDoc(DoMain5.java:35)
at com.wanfang.dousact.DoMain5.main(DoMain5.java:87)
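
Both traces are timeouts raised inside getDoc, so one common approach is to allow more time per request and retry a few times before giving up. The sketch below shows only the retry part. RetrySketch, withRetry, and the delay values are hypothetical names and numbers chosen for illustration. With jsoup, the action passed in might be something like () -> Jsoup.connect(url).timeout(10000).get().

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper class; not part of the original DoMain5 program.
class RetrySketch {

    /**
     * Runs the action, retrying only when it fails with SocketTimeoutException.
     * Waits baseDelayMs * attempt between tries; rethrows the last timeout
     * once maxAttempts is exhausted. Other exceptions propagate immediately.
     */
    static <T> T withRetry(Callable<T> action, int maxAttempts, long baseDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (SocketTimeoutException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(baseDelayMs * attempt);  // back off a little more each time
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated fetch that times out twice before succeeding (no network needed).
        String page = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("Read timed out");
            }
            return "<html>ok</html>";
        }, 5, 100L);
        System.out.println(page + " after " + calls[0] + " attempts");
    }
}
```

If timeouts became frequent only after adding threads, the server may be throttling the crawler, in which case fewer threads or a longer delay between requests might help more than retries alone.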
