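Table of Contents
Preface
Splitting Modules by Business Function
Dependencies
Setting Up the Project and Organizing the Package Structure
About the API
Parsing the Post List
Pitfalls Encountered with Jsoup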
V2EX Forum Client: Scraping Post Information (Part 1)

Jun 21, 2016 am 08:58 AM

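Preface

Since I browse V2EX a lot, I decided to build an Android client in my spare time. It is open source here: https://github.com/ihgoo/V2EX

A craftsman who wants to do good work must first sharpen his tools. The whole project is built with Gradle and developed in Android Studio.

My company's projects have also moved to Android Studio. It has one feature I particularly like: when typing a variable name, even if you cannot remember how the name starts, remembering a later part of it is enough for autocomplete to suggest it. It also has plenty of plugins, which make development close to foolproof so you can focus on the business logic.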

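Splitting Modules by Business Function

By business logic, the forum client breaks down into the following modules:

  • Browsing module (not logged in)

    • Browsing the post list
    • Browsing post details
  • User module

    • Login module
    • User information module
  • Modules that require login

    • Replying and browsing post details while logged in
    • Favorites and likes
    • Reply notifications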

Dependencies

For dependencies, I try to stick to native widgets and mature frameworks wherever possible.

    compile 'com.jakewharton:butterknife:6.1.0'
    compile 'com.squareup.retrofit:retrofit:1.9.0'
    compile 'com.squareup:otto:+'
    compile 'com.facebook.fresco:fresco:0.1.0+'
    compile 'com.squareup.okhttp:okhttp-urlconnection:2.0.0'
    compile 'com.squareup.okhttp:okhttp:2.0.0'

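butterknife: an IoC framework written by Jake Wharton, on par with Dagger. There are IDEA/Android Studio plugins that support butterknife, so the findViewById boilerplate can be generated with one click.

retrofit: a powerful network request library.

fresco: an image loading library. Before this I always used ImageLoader. Fresco was released only recently; I chose it because the project needs GIF support and progressive image display.

otto: an event bus framework and a great decoupling tool. With it, everything becomes simpler.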

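Setting Up the Project and Organizing the Package Structure

The package structure is as follows:

app: the app's Application class and related code.

client: definitions of the network request headers, configuration of the network request library, and so on.

core: the base framework, equivalent to the C in an MVC structure (C here meaning Controller).

model: the model layer.

paser: the parsing layer. Whether the input is JSON or HTML, this layer parses it into entity classes.

persistence: constant classes, such as database fields, Intent request fields, and app configuration fields.

ui: the view layer.

utils: a few handy utility classes.

The project is built with Gradle. app is one module, and the other modules are attached to it. The advantage is that the other modules can be swapped out quickly and their source code can be modified (source imported in aar form cannot be modified).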

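About the API

Because the official JSON API limits the number of calls, I decided to parse the HTML pages instead.

The desktop HTML is too large. To save battery and reduce the app's resource usage, I parse the mobile (WAP) pages.

You can disguise the request as coming from a mobile browser by changing the User-Agent field in the request headers. Here is what I use: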

    import retrofit.RequestInterceptor;

    public class ApiHeaders implements RequestInterceptor {
        private String sessionId;

        public void setSessionId(String sessionId) {
            this.sessionId = sessionId;
        }

        public void clearSessionId() {
            sessionId = null;
        }

        @Override
        public void intercept(RequestFacade request) {
            // Present the client as a mobile browser so the server returns the lightweight mobile page.
            request.addHeader("User-Agent", "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19");
            request.addHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
            if (sessionId != null) {
                // No session-specific headers are added yet.
            }
        }
    }
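The same pair of headers can be sketched without Retrofit, which makes it easy to inspect in isolation. This is only a minimal illustration using the JDK: the class name `MobileHeaders` and the method `build` are hypothetical and not part of the project, and the header values are copied from the interceptor above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MobileHeaders {
    // Header values copied from the ApiHeaders interceptor in this article.
    static final String MOBILE_UA =
            "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) "
            + "AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19";
    static final String ACCEPT =
            "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";

    /** Returns the request headers that make the client look like a mobile browser. */
    public static Map<String, String> build() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", MOBILE_UA);
        headers.put("Accept", ACCEPT);
        return headers;
    }

    public static void main(String[] args) {
        // Print the headers in insertion order for a quick visual check.
        for (Map.Entry<String, String> entry : build().entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

With a plain `HttpURLConnection`, each entry of this map could be applied through `setRequestProperty(name, value)` before the request is sent.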

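To parse HTML in Java you can use a library called jsoup, which is comparable to pyquery in Python. It supports jQuery-like selectors for grabbing the resources you want, which is simple, convenient, and blunt. Its one drawback is that it cannot see elements that the page generates through AJAX. In that case you can use HttpUnit, but HttpUnit has performance problems: it has to start a browser engine to run the page and can only scrape the information after the page's JavaScript has finished.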

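Parsing the Post List

In the post list, these are the pieces of data to parse:

  • avatar: author's avatar
  • node: node name
  • title: title
  • small fade (time): time posted
  • small fade author: author's name
  • count_livid: reply count

Below is the jsoup parsing code. After a successful parse, the data is put into the ForumItemBean entity class and returned as a collection to the ListView's adapter.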

    import java.util.ArrayList;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // ForumItemBean, Member and Misc are classes from this project.
    public class PaserFourmList {

        public static ArrayList<ForumItemBean> paser2ForumItem(String string) {
            Document document = Jsoup.parse(string);
            // Each post row carries both the "cell" and the "item" class.
            Elements elements = document.select(".cell").select(".item");
            ArrayList<ForumItemBean> list = new ArrayList<>();

            for (Element element : elements) {
                String avatar = element.select(".avatar").first().attr("src");
                String node = element.select(".node").first().html();
                String username = element.select(".small > strong").first().text();
                String countLivid = element.getElementsByClass("count_livid").text();
                String time = element.select(".small").select(".fade").get(1).text();

                // Inner HTML of the title element, which contains the link to the post.
                String href = element.getElementsByClass("item_title").html();
                int hashIndex = href.indexOf("#");
                if (hashIndex > 12) {
                    // Skips the first 12 characters (presumably `<a href="/t/`)
                    // and stops at '#', leaving the topic id.
                    href = href.substring(12, hashIndex);
                }

                // Drop the trailing "前" ("ago") from the relative time.
                int indexOf = time.indexOf("前");
                if (indexOf != -1) {
                    time = time.substring(0, indexOf);
                }

                ForumItemBean forumItemBean = new ForumItemBean();
                Member member = new Member();
                member.setAvatarMini(avatar);
                member.setUsername(username);
                forumItemBean.setId(Misc.parseInt(href, 0));
                forumItemBean.setMember(member);
                forumItemBean.setLastTime(time);
                forumItemBean.setReplies(Misc.parseInt(countLivid, 0));
                forumItemBean.setTitle(element.select(".item_title").first().select("[href]").html());
                list.add(forumItemBean);
            }
            return list;
        }
    }
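The two string-trimming steps in the parser (pulling the topic id out of the title link and dropping the trailing "ago" character from the time) can be tried on their own. The sketch below is one possible, somewhat more defensive variant and not the project's code: it swaps the fixed `substring(12, ...)` offset for a regular expression, and it assumes the title link looks like `<a href="/t/123456#reply3">`, which is inferred from the offsets used above. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ItemTextParser {
    // Assumed link shape: href="/t/<digits>", optionally followed by "#reply<n>".
    private static final Pattern TOPIC_ID = Pattern.compile("href=\"/t/(\\d+)");

    // U+524D is the "ago" suffix character that the parser above looks for.
    private static final char AGO_SUFFIX = '\u524D';

    /** Returns the topic id found in the title link HTML, or fallback if none is found. */
    public static int extractTopicId(String titleHtml, int fallback) {
        Matcher matcher = TOPIC_ID.matcher(titleHtml);
        if (!matcher.find()) {
            return fallback;
        }
        try {
            return Integer.parseInt(matcher.group(1));
        } catch (NumberFormatException e) {
            // The digit run was too long to fit in an int.
            return fallback;
        }
    }

    /** Cuts the text at the "ago" suffix, if present, and trims surrounding whitespace. */
    public static String stripAgoSuffix(String time) {
        int index = time.indexOf(AGO_SUFFIX);
        String result = (index == -1) ? time : time.substring(0, index);
        return result.trim();
    }

    public static void main(String[] args) {
        System.out.println(extractTopicId("<a href=\"/t/123456#reply3\">Hello</a>", 0));
        System.out.println(stripAgoSuffix("3h" + AGO_SUFFIX));
    }
}
```

Matching on the `/t/` prefix instead of a character offset means that a change in attribute order or extra whitespace in the markup yields the fallback value instead of an exception.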

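Pitfalls Encountered with Jsoup

When using jsoup on a class attribute that contains a space, such as class="content type", you need element.select(".content").select(".type") to parse it successfully; element.select(".content type") finds nothing! The reason is that the space separates two class names, so ".content type" is read as a descendant selector looking for a type tag inside .content.

There are also class names such as content-type: for me, element.select(".content-type") found nothing either, and element.getElementsByClass("content-type") was needed instead.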
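The space case is easier to reason about once the class attribute is seen as a whitespace-separated list of names. The sketch below uses only the JDK and does not call jsoup; `ClassTokens`, `tokens` and `hasAll` are hypothetical names used purely for illustration. It shows that class="content type" carries two separate class names, which is what the standard CSS compound selector `.content.type` (no space) asks for.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ClassTokens {
    /** Splits a class attribute value on whitespace into individual class names. */
    public static Set<String> tokens(String classAttr) {
        Set<String> names = new HashSet<>();
        for (String token : classAttr.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                names.add(token);
            }
        }
        return names;
    }

    /** True if the attribute value contains every one of the wanted class names. */
    public static boolean hasAll(String classAttr, String... wanted) {
        return tokens(classAttr).containsAll(Arrays.asList(wanted));
    }

    public static void main(String[] args) {
        // Two class names on one element: both must be matched separately.
        System.out.println(hasAll("content type", "content", "type"));
        // There is no single class called "content type".
        System.out.println(hasAll("content type", "content type"));
        // A hyphenated name is one single class name.
        System.out.println(hasAll("content-type", "content-type"));
    }
}
```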
