Writing a Zhihu Crawler in Java from Scratch: Storing the Captured Content Locally (2) - Java Tutorial - php.cn

黄舟

Released: 2016-12-24 11:50:45

Original


We wrap these two functions in a FileReaderWriter.java file for later use.
Now back to the Zhihu crawler.
We need to add a method to the Zhihu wrapper class that formats the content before it is written to a local file.

The code is as follows:

public String writeString() {
    // Output labels: 问题 = question, 描述 = description, 链接 = link, 回答 = answer
    String result = "";
    result += "问题：" + question + "\r\n";
    result += "描述：" + questionDescription + "\r\n";
    result += "链接：" + zhihuUrl + "\r\n";
    for (int i = 0; i < answers.size(); i++) {
        result += "回答" + i + "：" + answers.get(i) + "\r\n";
    }
    result += "\r\n\r\n";
    return result;
}
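
Because the method appends to a String inside a loop, every += creates a new intermediate string. The following is a minimal sketch of the same formatting built with a StringBuilder. The class name ZhihuEntry is a hypothetical stand-in for the article's Zhihu class, and the field types are assumptions inferred from how writeString uses them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the article's Zhihu class; only the fields
// that writeString reads are included, and their types are assumed.
class ZhihuEntry {
    String question = "";
    String questionDescription = "";
    String zhihuUrl = "";
    List<String> answers = new ArrayList<>();

    // Produces the same text as the concatenation version above,
    // but appends into a single StringBuilder.
    String writeString() {
        StringBuilder result = new StringBuilder();
        result.append("问题：").append(question).append("\r\n");
        result.append("描述：").append(questionDescription).append("\r\n");
        result.append("链接：").append(zhihuUrl).append("\r\n");
        for (int i = 0; i < answers.size(); i++) {
            result.append("回答").append(i).append("：")
                  .append(answers.get(i)).append("\r\n");
        }
        result.append("\r\n\r\n");
        return result.toString();
    }
}
```

For a handful of answers the difference is negligible; the StringBuilder form mainly pays off when an entry has many long answers.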


OK, that's almost it. Next, replace the System.out.println call in the main method with the following loop.

The code is as follows:

// Write each entry to a local file (the third argument true means append)
for (Zhihu zhihu : myZhihu) {
    FileReaderWriter.writeIntoFile(zhihu.writeString(),
            "D:/知乎_编辑推荐.txt", true);
}
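
The FileReaderWriter class itself is not shown in this part, so the sketch below is only an illustration of what a helper with the same call shape (content, file path, append flag) could look like. The class name FileWriterSketch, the boolean return value, and the UTF-8 encoding are all assumptions, not the article's actual implementation.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper mirroring the call shape of
// FileReaderWriter.writeIntoFile(content, filePath, isAppend).
class FileWriterSketch {
    // Writes content to filePath, creating the file if needed.
    // Appends when isAppend is true, otherwise overwrites.
    // Returns true on success (assumed return type).
    static boolean writeIntoFile(String content, String filePath, boolean isAppend) {
        Path path = Paths.get(filePath);
        StandardOpenOption mode = isAppend
                ? StandardOpenOption.APPEND
                : StandardOpenOption.TRUNCATE_EXISTING;
        // try-with-resources closes the writer even if write() fails
        try (BufferedWriter writer = Files.newBufferedWriter(
                path, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, mode)) {
            writer.write(content);
            return true;
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }
}
```

Passing true as the last argument, as the loop above does, matters here: each Zhihu entry is appended to the file, whereas false would leave only the last entry.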


Run it, and the content that was previously printed to the console is now written to the local txt file.

At first glance everything looks fine, but a closer look reveals a problem: the text still contains many HTML tags, such as <br>.
We can process these tags when producing the output.
First replace <br> with \r\n, the line break written to the file, and then delete all remaining HTML tags, so that the result is much easier to read.

The code is as follows:

public String writeString() {
    // Build the string that will be written to the local file
    // (问题 = question, 描述 = description, 链接 = link, 回答 = answer)
    String result = "";
    result += "问题：" + question + "\r\n";
    result += "描述：" + questionDescription + "\r\n";
    result += "链接：" + zhihuUrl + "\r\n";
    for (int i = 0; i < answers.size(); i++) {
        result += "回答" + i + "：" + answers.get(i) + "\r\n\r\n";
    }
    result += "\r\n\r\n\r\n\r\n";
    // Filter out the HTML tags: turn <br> into a line break first,
    // then strip every remaining tag
    result = result.replaceAll("<br>", "\r\n");
    result = result.replaceAll("<.*?>", "");
    return result;
}


String.replaceAll treats its first argument as a regular expression, so the non-greedy pattern <.*?> matches every remaining tag and the final call deletes them all.
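
To see the two replacements in isolation, here is a small self-contained sketch. The class name TagStripDemo and the sample input are made up for illustration. As an assumption beyond the article's code, the first pattern also accepts the <br/> and <BR> spellings, which the literal "<br>" pattern would leave to be stripped without a line break.

```java
// Hypothetical demo class; not part of the article's crawler.
class TagStripDemo {
    // Converts <br> variants to line breaks, then removes every remaining tag.
    static String stripTags(String html) {
        // (?i) makes the match case-insensitive; \s*/? also accepts <br/> and <br />
        String text = html.replaceAll("(?i)<br\\s*/?>", "\r\n");
        // Non-greedy match: each tag is removed separately, text between tags is kept
        return text.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        String answer = "<b>First line</b><br>Second line<br/><a href=\"#\">link</a>";
        System.out.println(stripTags(answer));
    }
}
```

Note that . does not match line terminators by default, so a tag that spans several lines would not be removed by this pattern; for a crawler that needs robust cleanup, a real HTML parser is the safer choice.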
