RiTa
Name RiHtmlParser
Description Provides various utility functions for fetching and parsing text data from web pages using either the Document-Object-Model (DOM) or regular expressions.

Parses an HTML document and returns the HTML text, with or without the tags stripped. Can also be used for custom parsing, as in the fetchLinks() and fetchLinkText() methods (see the examples below).
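
The description mentions regular expressions as one of the two parsing approaches. The following is a minimal sketch of what regex-based link extraction can look like, using only java.util.regex. It is not RiTa's implementation: the class name RegexLinkSketch, the method hrefs(), and the pattern itself are assumptions made for illustration, and a regex like this will miss unquoted attribute values and other edge cases that a real HTML parser handles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLinkSketch {

  // Hypothetical pattern (not taken from RiTa): matches a quoted href value
  // inside an <a ...> tag. Group 1 is the quote character, group 2 the value.
  private static final Pattern HREF = Pattern.compile(
      "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

  // Returns the href values of all anchor tags found in the given HTML string.
  public static List<String> hrefs(String html) {
    List<String> out = new ArrayList<>();
    Matcher m = HREF.matcher(html);
    while (m.find()) {
      out.add(m.group(2));
    }
    return out;
  }

  public static void main(String[] args) {
    String page = "<a href=\"/one\">First</a> "
        + "<A class='x' HREF='http://example.com/two'>Second</A>";
    List<String> links = hrefs(page);
    for (int i = 0; i < links.size(); i++)
      System.out.println(i + ") " + links.get(i));
  }
}
```

The regex route is quick for well-formed pages, while the callback-based route shown in the custom-parse example is usually more robust against irregular markup.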

Simple Examples:

     RiHtmlParser rhp = new RiHtmlParser();
     System.out.println(rhp.fetch("http://www.google.com"));  // simple fetch
     // -------------------------------------------------------------------
     System.out.println(rhp.fetch("http://www.google.com", true));  // fetch & strip
     // -------------------------------------------------------------------
     String[] links = rhp.fetchLinks("http://www.google.com");  // get links     
     for (int i = 0; i < links.length; i++)                    // & print 'em
       System.out.println(i+") "+links[i]);                   // one by one
     // -------------------------------------------------------------------
     System.out.println(rhp.parse("http://www.google.com"));  // an empty parse
     

Also provides a base implementation so that subclasses can override the handleText(), handleSimpleTag(), handleStartTag(), and handleEndTag() methods to define custom behavior (as below and in RiGoogleParser).

An example of a custom parse to retrieve all linked text:

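     // requires: java.net.URL, java.util.List, java.util.ArrayList,
     // javax.swing.text.MutableAttributeSet,
     // javax.swing.text.html.HTMLEditorKit, javax.swing.text.html.HTML.Tag
     final List<String> links = new ArrayList<String>();
     rhp.customParse(new URL("http://www.google.com"), 
       new HTMLEditorKit.ParserCallback() // an anonymous inner class
       {
         boolean isLink = false;  // true while inside an <a>...</a> element
         public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
           if (t == Tag.A) isLink = true;
         }
         public void handleText(char[] data, int pos) {
           if (isLink) links.add(new String(data));        
         }
         public void handleEndTag(Tag t, int pos) {
           if (t == Tag.A) isLink = false;
         }
       }
     );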

     // print out the link texts that we found 
     for (int i = 0; i < links.size(); i++)
       System.out.println(i+") "+links.get(i)); 
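
The callback above depends on a live URL and on an existing RiHtmlParser instance. As a way to try the same idea offline, here is a hedged, self-contained sketch that feeds an in-memory HTML string to the JDK's own ParserDelegator with the same three callbacks. The class name LinkTextSketch and the method linkText() are hypothetical, and nothing here is claimed to mirror how customParse() is implemented internally.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkTextSketch {

  // Collects the text found between <a> and </a> tags in an HTML string.
  public static List<String> linkText(String html) {
    final List<String> links = new ArrayList<>();
    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
      boolean isLink = false; // true while the parser is inside an anchor

      @Override
      public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) isLink = true;
      }

      @Override
      public void handleText(char[] data, int pos) {
        if (isLink) links.add(new String(data));
      }

      @Override
      public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.A) isLink = false;
      }
    };
    try {
      // third argument: ignore any charset declared inside the document
      new ParserDelegator().parse(new StringReader(html), callback, true);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return links;
  }

  public static void main(String[] args) {
    String page = "<html><body><p>Intro</p>"
        + "<a href=\"/one\">First</a> and <a href=\"/two\">Second</a>"
        + "</body></html>";
    List<String> links = linkText(page);
    for (int i = 0; i < links.size(); i++)
      System.out.println(i + ") " + links.get(i));
  }
}
```

Parsing from a StringReader keeps the sketch independent of network access, which also makes the callback logic easy to unit test.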
    
Constructors
RiHtmlParser(p);
Methods
customParse()   Returns the contents of the URL (an html page) after executing the callbacks defined in the ParserCallback object.

fetch()   Fetches the contents of the URL (an html page), with all HTML tags removed if the stripTags flag is set.

fetchLinks()   Returns a String array of text, one element per link on the page

fetchLinkText()   Returns a String array containing the text of each link, one element per link on the page

handleEndTag()   To be overridden in subclasses

handleSimpleTag()   To be overridden in subclasses

handleStartTag()   To be overridden in subclasses

handleText()   To be overridden in subclasses

parse()   Returns the contents of the URL (an html page) with HTML tags removed after executing the following callbacks (which should be overridden):
  • handleText()
  • handleSimpleTag()
  • handleStartTag()
  • handleEndTag()
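
To illustrate how the four callbacks listed above can cooperate to produce tag-free text, here is a rough sketch built directly on the JDK's HTMLEditorKit.ParserCallback rather than on RiHtmlParser, whose exact method signatures are not shown on this page. The names StripTagsSketch, TextCollector, and stripTags() are hypothetical. The text callback accumulates content, and the tag callbacks only count what they see.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class StripTagsSketch {

  // Hypothetical callback: keeps text content and tallies the tags it sees.
  public static class TextCollector extends HTMLEditorKit.ParserCallback {
    public final StringBuilder text = new StringBuilder();
    public int startTags = 0, simpleTags = 0, endTags = 0;

    @Override
    public void handleText(char[] data, int pos) {
      text.append(data).append(' '); // separate text runs with a space
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
      startTags++;
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
      simpleTags++; // tags without a matching end tag, such as <br>
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
      endTags++;
    }
  }

  // Returns the text content of the given HTML string with the tags dropped.
  public static String stripTags(String html) {
    TextCollector collector = new TextCollector();
    try {
      new ParserDelegator().parse(new StringReader(html), collector, true);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return collector.text.toString().trim();
  }

  public static void main(String[] args) {
    String page = "<html><body><h1>Hello</h1>"
        + "<p>big <b>world</b><br></p></body></html>";
    System.out.println(stripTags(page));
  }
}
```

The tag counts are left unasserted here because the JDK parser may report implied elements (for example an implicit head) in addition to the tags written in the source.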


Usage Web & Application