||||||||||||||||||||||||||||
| Name | RiHtmlParser |
| Description | Provides utility functions for fetching and parsing text from web pages, using either the Document Object Model (DOM) or regular expressions. It parses an HTML document and returns its text, with or without the HTML tags stripped. It can also be used for custom parsing, as in the fetchLinks() and fetchLinkText() methods (see the example below).
Simple examples:
RiHtmlParser rhp = new RiHtmlParser();
System.out.println(rhp.fetch("http://www.google.com")); // simple fetch
// -------------------------------------------------------------------
System.out.println(rhp.fetch("http://www.google.com", true)); // fetch & strip
// -------------------------------------------------------------------
String[] links = rhp.fetchLinks("http://www.google.com"); // get links
for (int i = 0; i < links.length; i++) // & print 'em
  System.out.println(i+") "+links[i]); // one by one
// -------------------------------------------------------------------
System.out.println(rhp.parse("http://www.google.com")); // an empty parse
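The "fetch & strip" behavior shown above can be approximated offline with a regular expression. The following is only a minimal sketch, not RiHtmlParser's actual implementation: the class name StripTagsSketch and the method stripTags() are hypothetical, and the pattern is deliberately naive (it does not decode entities or skip script and style content).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of RiTa: strips tags from an HTML string.
final class StripTagsSketch {

    // Naive assumption: everything from '<' up to the next '>' is a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Replace each tag with a space so adjacent words do not run together,
        // then collapse the whitespace runs that this leaves behind.
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints "Hello world"
    }
}
```

A regular expression is adequate for simple, well-formed markup; for anything irregular, a callback-based parse is usually the more robust choice.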
Also provides a base implementation so that subclasses can override the handleText(), handleSimpleTag(), handleStartTag(), and handleEndTag() methods to define custom behavior (as below and in RiGoogleParser). An example of a custom parse that retrieves all linked text:
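import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;

// rhp is the RiHtmlParser created in the simple examples above
final List<String> links = new ArrayList<String>();
rhp.customParse(new URL("http://www.google.com"),
  new HTMLEditorKit.ParserCallback() // an anonymous inner class
  {
    boolean isLink = false; // true while the parser is inside an <a> element
    public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
      if (t == Tag.A) isLink = true;
    }
    public void handleText(char[] data, int pos) {
      if (isLink) links.add(new String(data));
    }
    public void handleEndTag(Tag t, int pos) {
      if (t == Tag.A) isLink = false;
    }
  }
);
// print out the link texts that we found
for (int i = 0; i < links.size(); i++)
  System.out.println(i+") "+links.get(i));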
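The HTMLEditorKit.ParserCallback used in the custom parse above is a standard JDK class, so the same technique can be tried without RiTa or a network connection. The sketch below is an illustration under stated assumptions, not RiHtmlParser's own code: the class LinkHrefSketch and the method hrefs() are hypothetical names, and the input is an in-memory string fed to the JDK's ParserDelegator. It collects the href attribute of each anchor, roughly what fetchLinks() is described as returning.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of RiTa: collects link targets from HTML text.
final class LinkHrefSketch {

    static List<String> hrefs(String html) throws IOException {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    // The attribute set holds the tag's attributes; href may be absent.
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) found.add(href.toString());
                }
            }
        };
        // The third argument (ignoreCharSet = true) tells the parser not to fail
        // on a charset declaration inside the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return found;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><a href=\"a.html\">First</a> plain "
                    + "<a href=\"b.html\">Second</a></body></html>";
        List<String> links = hrefs(html);
        for (int i = 0; i < links.size(); i++)
            System.out.println(i + ") " + links.get(i));
    }
}
```

Only handleStartTag() is overridden here because the link target lives in the tag's attributes; collecting the visible link text instead requires tracking state across handleStartTag(), handleText(), and handleEndTag(), as in the example above.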
|
|||||||||||||||||||||||||||
| Constructors | RiHtmlParser(p); |
|||||||||||||||||||||||||||
| Methods |
|
|||||||||||||||||||||||||||
| Usage | Web & Application |