


Heritrix only crawls specific pages such as html and htm_html/css_WEB-ITnose
Heritrix has 5 chains. It is said on the Internet that processing is done in the Extractor chain. This chain is an extraction chain, which can be responsible for parsing the content of the html page and then further filtering. But currently I just want to filter out html, htm, shtml, xshtml and other files by judging the suffix name. Therefore, doing processing in Extractor is a little bit useful, so I do processing in PostProcessor chain. The detailed introduction is as follows:
FronitierScheduler is a PostProcessor. Its function is to add the links analyzed in Extractor to Froniter for the next step of processing (file writing, etc.).
Specific method:
1. Find the FrontierScheduler.java file under the org.archive.crawler.postprocessor package
2. Find the protected void schedule(CandidateURI caUri) of the FrontierScheduler class ) method
3. My rewriting is as follows:
<span style="font-size:14px;"> protected void schedule(CandidateURI caUri) { //将caUri转为String格式 String url = caUri.toString(); //打印出来查看一下 System.out.println("------" + url); //剔除以特定后缀名结尾的URL if(url.endsWith(".jpeg") ||url.endsWith(".jpg") ||url.endsWith(".gif") ||url.endsWith(".css") ||url.endsWith(".doc") ||url.endsWith(".zip") ||url.endsWith(".png") ||url.endsWith(".js") ||url.endsWith(".pdf") ||url.endsWith(".xls") ||url.endsWith(".rar") ||url.endsWith(".exe") ||url.endsWith(".txt")){ return; } //将未剔除的文件加入到下一步处理(写入到本地磁盘的处理等等) getController().getFrontier().schedule(caUri); }</span>

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



The article discusses the HTML <progress> element, its purpose, styling, and differences from the <meter> element. The main focus is on using <progress> for task completion and <meter> for stati

The article discusses the HTML <datalist> element, which enhances forms by providing autocomplete suggestions, improving user experience and reducing errors.Character count: 159

Article discusses best practices for ensuring HTML5 cross-browser compatibility, focusing on feature detection, progressive enhancement, and testing methods.

The article discusses the HTML <meter> element, used for displaying scalar or fractional values within a range, and its common applications in web development. It differentiates <meter> from <progress> and ex

The article discusses using HTML5 form validation attributes like required, pattern, min, max, and length limits to validate user input directly in the browser.

The article discusses the viewport meta tag, essential for responsive web design on mobile devices. It explains how proper use ensures optimal content scaling and user interaction, while misuse can lead to design and accessibility issues.

The article discusses the <iframe> tag's purpose in embedding external content into webpages, its common uses, security risks, and alternatives like object tags and APIs.

GiteePages static website deployment failed: 404 error troubleshooting and resolution when using Gitee...
