Using XPATH and HTML Cleaner to parse HTML/XML
The Beautiful Life of Sun Vulcan
This article follows the "Attribution-NonCommercial-ShareAlike" Creative Commons License.
Please keep this sentence when reprinting: The Beautiful Life of the Sun Vulcan - This blog focuses on agile development and research on mobile and IoT devices: iOS, Android, HTML5, Arduino, pcDuino. Otherwise, articles from this blog may not be reproduced or reprinted. Thank you for your cooperation.
JANUARY 5, 2010
tags: android, examples, HTML, parse, scraping, XML, XPATH
Hey everyone,
Something I've found to be extremely useful (especially in web-related applications) is the ability to retrieve a website's HTML and parse it for data or whatever else you may be looking for (in my case it is almost always data).
I actually use this technique to do the real-time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you're looking for an example of how to retrieve and parse HTML and run “queries” over it using, say, XPATH, then this post is for you.
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import javax.xml.parsers.ParserConfigurationException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.xml.sax.SAXException;

public class OptionScraper {

    // Example XPath queries in the form of strings; they are used further down.
    private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";
    private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";
    private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

    // TagNode object; its use comes in later.
    private static TagNode node;

    // A method that retrieves a stock option's data based on its name
    // (e.g. GOUAA is one of Google's stock options).
    public static Option getOptionFromName(String name)
            throws XPatherException, ParserConfigurationException, SAXException, IOException {

        // The object that is filled in and returned. Option is my own class (not shown);
        // a no-argument constructor is assumed here.
        Option o = new Option();

        // The URL whose HTML I want to retrieve and parse.
        String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

        // This is where HtmlCleaner comes in; I initialize it here.
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // Open a connection to the desired URL.
        URL url = new URL(option_url);
        URLConnection conn = url.openConnection();

        // Use the cleaner to "clean" the HTML and return it as a TagNode object.
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // Once the HTML is cleaned, you can run your XPath expressions on the node.
        // Each call returns an array of matches (typed as Object, cast to TagNode below).
        Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
        Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
        Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);

        // A simple check to make sure that my XPath was correct and that at least
        // one node was actually returned.
        if (info_nodes.length > 0) {
            // Cast to a TagNode.
            TagNode info_node = (TagNode) info_nodes[0];
            // How to retrieve the contents as a string.
            String info = info_node.getChildren().iterator().next().toString().trim();

            // Some method that processes the string of information
            // (in my case, this was the stock quote, etc.); not shown here.
            processInfoNode(o, info);
        }

        if (time_nodes.length > 0) {
            TagNode time_node = (TagNode) time_nodes[0];
            String date = time_node.getChildren().iterator().next().toString().trim();

            // The date is returned in 15-JAN-10 format, so this is some method I wrote
            // to parse that string into the format that I use; not shown here.
            processDateNode(o, date);
        }

        if (price_nodes.length > 0) {
            TagNode price_node = (TagNode) price_nodes[0];
            double price = Double.parseDouble(
                    price_node.getChildren().iterator().next().toString().trim());
            o.setPremium(price);
        }

        return o;
    }
}
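The date step deserves a quick illustration. processDateNode itself isn't shown in this post, so what follows is only a sketch of how a string like 15-JAN-10 might be parsed with the JDK's java.time classes. The class and method names (OptionDates, parseOptionDate) are made up for this example, and it assumes English month abbreviations and a two-digit year in the 2000s.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original scraper.
final class OptionDates {

    // "d-MMM-yy" matches strings such as 15-JAN-10. parseCaseInsensitive()
    // lets the upper-case "JAN" match the English short month name "Jan".
    private static final DateTimeFormatter FORMAT = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("d-MMM-yy")
            .toFormatter(Locale.ENGLISH);

    private OptionDates() {
    }

    // Parses the scraped text into a LocalDate.
    // Throws java.time.format.DateTimeParseException if the text does not match.
    static LocalDate parseOptionDate(String text) {
        return LocalDate.parse(text.trim(), FORMAT);
    }
}
```

Calling OptionDates.parseOptionDate(date) inside a method like processDateNode would give a LocalDate that can then be stored in whatever form the rest of the application expects.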
So that’s it! Once you include the HtmlCleaner JAR in your build path, everything else is pretty easy. It’s a great tool to use. It does require some knowledge of XPATH, but XPATH isn’t too hard to pick up and is useful to know, so if you don’t know it then take a look at the link.
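If XPATH is new to you, it can help to try the query shapes on a tiny document first. The sketch below uses the JDK's built-in javax.xml.xpath engine rather than HtmlCleaner, so it only works on well-formed markup. The sample markup and the names XPathBasics and firstText are invented for illustration; only the two expressions are taken from the scraper above.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical playground class; it uses the JDK XPath engine, not HtmlCleaner.
final class XPathBasics {

    // Invented, well-formed sample markup that loosely imitates the scraped page.
    static final String SAMPLE =
            "<html><body>"
            + "<div class='yfi_quote'><div class='hd'><h2>Example Option</h2></div></div>"
            + "<table id='price_table'><tr><td><span>12.50</span></td></tr></table>"
            + "</body></html>";

    private XPathBasics() {
    }

    // Returns the trimmed text of the first node matching the expression,
    // or null when nothing matches.
    static String firstText(String xml, String expression) {
        try {
            NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)),
                            XPathConstants.NODESET);
            return matches.getLength() > 0 ? matches.item(0).getTextContent().trim() : null;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or markup: " + expression, e);
        }
    }
}
```

Reading the first expression from left to right: // looks for a div anywhere in the document, [@class='yfi_quote'] keeps only the ones with that class attribute, and each single / steps down to a direct child. In the second expression the repeated // allows any number of elements between the table, the tr and the span.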
Now, a warning to everyone. It’s documented that the set of XPATH expressions recognized by HtmlCleaner is not complete, in the sense that only “basic” XPATH is recognized. What’s excluded? For instance, you can’t use any of the “axes” operators (e.g. parent, ancestor, following, following-sibling, etc.), but in my experience everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull the desired information.
And of course, this technique works for XML documents as well!
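For XML (or HTML that is already well-formed), one possible fallback when a query really needs an axis is the JDK's own XPath engine, which implements the XPath 1.0 axes. The sketch below does not use HtmlCleaner at all; the markup and the names AxisDemo, valueFor and labelFor are invented for illustration, and the lookup text is assumed to contain no single quotes.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo of XPath axes using javax.xml.xpath from the JDK.
final class AxisDemo {

    // Invented sample: label cells followed by value cells.
    static final String SAMPLE =
            "<table>"
            + "<tr><td class='label'>Last Trade</td><td>12.50</td></tr>"
            + "<tr><td class='label'>Open</td><td>12.10</td></tr>"
            + "</table>";

    private AxisDemo() {
    }

    // following-sibling: find the label cell, then step to the next td in the same row.
    static String valueFor(String xml, String label) {
        return evaluate(xml,
                "//td[@class='label'][text()='" + label + "']/following-sibling::td[1]");
    }

    // parent: find the value cell, go up to its row, then down to the label cell.
    static String labelFor(String xml, String value) {
        return evaluate(xml, "//td[text()='" + value + "']/parent::tr/td[@class='label']");
    }

    // Returns the string value of the first match, or "" when nothing matches.
    private static String evaluate(String xml, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or markup: " + expression, e);
        }
    }
}
```

Real-world HTML is rarely well-formed, which is the reason for cleaning it with HtmlCleaner in the first place, so this only helps once the markup has been turned into something an XML parser accepts.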
Hope this was helpful to everyone. Let me know if you’re confused anywhere.
- jwei