Heritrix has five processor chains. Most advice online says to do filtering in the Extractor chain, which parses the content of each HTML page and can then filter the links it extracts. For now, though, I only want to keep html, htm, shtml, xshtml and similar pages, and that can be decided from the URL suffix alone. Doing this in an Extractor would be overkill, so I do it in the PostProcessor chain instead. The details are as follows:
FrontierScheduler is a PostProcessor. Its job is to add the links found in the Extractor chain to the Frontier for the next step of processing (file writing, etc.).
Specific method:
1. Find the FrontierScheduler.java file under the org.archive.crawler.postprocessor package
2. Find the protected void schedule(CandidateURI caUri) method of the FrontierScheduler class
3. Rewrite it as follows (this is my version):
    protected void schedule(CandidateURI caUri) {
        // Convert caUri to a String
        String url = caUri.toString();
        // Print it for inspection
        System.out.println("------" + url);
        // Drop URLs that end with one of these suffixes
        if (url.endsWith(".jpeg")
                || url.endsWith(".jpg")
                || url.endsWith(".gif")
                || url.endsWith(".css")
                || url.endsWith(".doc")
                || url.endsWith(".zip")
                || url.endsWith(".png")
                || url.endsWith(".js")
                || url.endsWith(".pdf")
                || url.endsWith(".xls")
                || url.endsWith(".rar")
                || url.endsWith(".exe")
                || url.endsWith(".txt")) {
            return;
        }
        // Pass the remaining URLs on to the next step (writing to local disk, etc.)
        getController().getFrontier().schedule(caUri);
    }
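One limitation of the suffix check above is that String.endsWith is case-sensitive and looks at the whole URL, so something like logo.JPG or app.js?v=2 would still be scheduled. The following is only a rough sketch of a stricter check, not part of Heritrix: the class name SuffixFilter and the method isExcluded are hypothetical names invented for this example, and it assumes the URL is available as a plain String.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; it is not a Heritrix class.
final class SuffixFilter {

    // Same suffixes as in the schedule() method, stored without the leading dot.
    private static final Set<String> EXCLUDED = new HashSet<>(Arrays.asList(
            "jpeg", "jpg", "gif", "css", "doc", "zip", "png",
            "js", "pdf", "xls", "rar", "exe", "txt"));

    // Returns true if the path part of the URL ends with an excluded suffix.
    static boolean isExcluded(String url) {
        String path;
        try {
            // getPath() leaves out the query string and the fragment.
            path = new URI(url).getPath();
        } catch (URISyntaxException e) {
            // Not a parseable URI: fall back to the raw string.
            path = url;
        }
        if (path == null) {
            // Opaque URIs such as mailto: have no path.
            path = url;
        }
        // Only look at the last path segment, so dots in directories are ignored.
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return false; // no suffix at all
        }
        String suffix = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXCLUDED.contains(suffix);
    }

    public static void main(String[] args) {
        System.out.println(isExcluded("http://example.com/logo.JPG?v=2"));
        System.out.println(isExcluded("http://example.com/index.html"));
    }
}
```

If this approach suits your crawl, the long if condition inside schedule() could be replaced by a single call such as if (SuffixFilter.isExcluded(url)) return; while the rest of the method stays the same.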