Sharing a Method for Fetching the Source Code of Any HTTP Web Page in Java

Sep 28, 2017, 10:09 AM

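This article shows a way to fetch the source code of any HTTP web page in Java. The example retrieves the page source and can also strip the HTML tags from it, which relies on Java regular expressions. It is shared here for reference.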

The code provides two features:

1. Fetch the source code of any HTTP web page
2. Fetch the source of any HTTP web page with the HTML tags removed
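
Item 1 comes down to decoding a byte stream with the right character encoding and collecting its lines. The following is a minimal offline sketch of that step, assuming a `ByteArrayInputStream` as a stand-in for the network stream. The class name `ReadStreamSketch`, the method `readAll`, and the sample markup are illustrative and not part of the article's code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadStreamSketch {
    // Decodes the stream with the named charset and returns its text.
    // readLine() strips line terminators, so a '\n' is appended after each line.
    static String readAll(InputStream in, String charsetName) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charsetName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical page body, encoded as ISO-8859-1 (stand-in for a network response).
        byte[] body = "<p>caf\u00e9</p>\r\n<p>ok</p>".getBytes(StandardCharsets.ISO_8859_1);
        // Decoded with the matching charset, the accented character survives.
        System.out.print(readAll(new ByteArrayInputStream(body), "ISO-8859-1"));
        // Decoded as UTF-8, the lone 0xE9 byte is malformed and becomes U+FFFD.
        System.out.print(readAll(new ByteArrayInputStream(body), "UTF-8"));
    }
}
```

The second call illustrates why the encoding passed to the reader has to match the page: a mismatch does not fail, it silently produces replacement characters.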

The Webpage class:

/**
 * Helper class for working with web pages
 */
package test;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 * @author winddack
 *
 */
public class Webpage {
  private String pageUrl; // URL of the page to fetch
  private String pageEncode = "UTF-8"; // character encoding of the page
  public String getPageUrl() {
    return pageUrl;
  }
  public void setPageUrl(String pageUrl) {
    this.pageUrl = pageUrl;
  }
  public String getPageEncode() {
    return pageEncode;
  }
  public void setPageEncode(String pageEncode) {
    this.pageEncode = pageEncode;
  }
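  // Fetches the source of the page at pageUrl, decoded with pageEncode
  public String getPageSource()
  {
    StringBuilder sb = new StringBuilder();
    // Open a stream from the URL and wrap it in a BufferedReader;
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(pageUrl).openStream(), pageEncode)))
    {
      String line;
      // Read the resource line by line; readLine() strips line terminators, so add one back
      while ((line = in.readLine()) != null)
      {
        sb.append(line).append('\n');
      }
    }
    catch (Exception ex)
    {
      // Report the error and return whatever was read so far
      System.err.println(ex);
    }
    return sb.toString();
  }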
  // Returns the page text with script blocks, style blocks, HTML tags and whitespace removed,
  // cut off after the first Chinese full stop (。) when one is present
  public String getPageSourceWithoutHtml()
  {
    final String regEx_script = "<script[^>]*?>[\\s\\S]*?</script>"; // matches script blocks
    final String regEx_style = "<style[^>]*?>[\\s\\S]*?</style>"; // matches style blocks
    final String regEx_html = "<[^>]+>"; // matches any remaining HTML tag
    final String regEx_space = "\\s+"; // matches runs of spaces, tabs and line breaks
    String htmlStr = getPageSource(); // unprocessed page source
    Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
    Matcher m_script = p_script.matcher(htmlStr);
    htmlStr = m_script.replaceAll(""); // remove script blocks
    Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
    Matcher m_style = p_style.matcher(htmlStr);
    htmlStr = m_style.replaceAll(""); // remove style blocks
    Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
    Matcher m_html = p_html.matcher(htmlStr);
    htmlStr = m_html.replaceAll(""); // remove HTML tags
    Pattern p_space = Pattern.compile(regEx_space, Pattern.CASE_INSENSITIVE);
    Matcher m_space = p_space.matcher(htmlStr);
    htmlStr = m_space.replaceAll(""); // remove whitespace
    // Remove non-breaking spaces, which the whitespace pattern does not match
    htmlStr = htmlStr.replace("&nbsp;", "").replace("\u00A0", "");
    int end = htmlStr.indexOf("。"); // position of the first Chinese full stop, or -1
    if (end >= 0)
    {
      htmlStr = htmlStr.substring(0, end + 1); // keep only the first sentence
    }
    return htmlStr;
  }
}
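
The tag-stripping step can be tried without a network connection. The sketch below applies the same four kinds of expression as `getPageSourceWithoutHtml` to an in-memory string. The class name `StripTagsSketch`, the method `stripTags`, and the sample markup are assumptions made for illustration. Note that regular expressions give only a rough text extraction here, not real HTML parsing, so unusual markup may slip through.

```java
import java.util.regex.Pattern;

class StripTagsSketch {
    // Patterns are compiled once and reused for every call.
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes script blocks, style blocks, remaining tags, then all whitespace.
    static String stripTags(String html) {
        String text = SCRIPT.matcher(html).replaceAll("");
        text = STYLE.matcher(text).replaceAll("");
        text = TAG.matcher(text).replaceAll("");
        return WHITESPACE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical page content used only for this demonstration.
        String html = "<html><head><style>p { color: red; }</style>"
                + "<SCRIPT type=\"text/javascript\">var a = 1 < 2;</SCRIPT></head>"
                + "<body><p>Hello, <b>world</b>!</p></body></html>";
        System.out.println(stripTags(html)); // prints "Hello,world!"
    }
}
```

Script and style blocks are removed before the generic tag pattern runs; otherwise their contents would be left behind as plain text once the surrounding tags were gone.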

Usage:

Webpage page = new Webpage();
page.setPageUrl("http://www.baidu.com");
String code = page.getPageSourceWithoutHtml();
System.out.println(code);
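
The call above relies on the default encoding of the `Webpage` class. For a page in another encoding, the value given to `setPageEncode` has to match what the server sends. One possible source for that value is the `Content-Type` response header, which `URLConnection.getContentType()` returns. The sketch below only parses such a header value; the class name `CharsetSketch`, the method `charsetOf`, and the sample header strings are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSketch {
    // Matches "charset=NAME", with optional spaces around '=' and an optional opening quote.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or the fallback if none is given.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=GBK", "UTF-8")); // prints "GBK"
        System.out.println(charsetOf("text/html", "UTF-8"));              // prints "UTF-8"
    }
}
```

The result could then be passed to `page.setPageEncode(...)` before calling `getPageSource()`. Pages that declare their encoding only in a `<meta>` tag are not covered by this sketch.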
