Programming a SAX-RSS Parser In 30 Minutes Flat


#1 bcoe


Posted 19 February 2008 - 05:08 PM

Originally on my Blog.
---------------------------

A DOM XML parser, the alternative to a SAX parser, has its place. It is used for storing a PLink Blog, for instance, but for streamed content on your website (like RSS), SAX is the way to go.

I can think of two important reasons for this right off the bat:

* A SAX parse can be stopped midway, so when someone attempts to stream a 50 MB changes.xml file to your site, you can throw an exception and stop the world from caving in.
* SAX is arguably a lot easier to get a handle on than DOM.
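
To make the first point concrete, here is a minimal sketch of stopping a parse from inside a callback. The class name, the method countUpTo(), and the element limit are all made up for illustration, and it uses SAXParserFactory on an in-memory string rather than the XMLReaderFactory and URL used in the full listing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names and the limit are illustrative only.
class AbortingParseDemo {
    // Handler that refuses to process more than maxElements elements.
    static class LimitedHandler extends DefaultHandler {
        private final int maxElements;
        int seen = 0;

        LimitedHandler(int maxElements) {
            this.maxElements = maxElements;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            if (seen >= maxElements) {
                // Throwing from a callback unwinds out of parse().
                throw new SAXException("Element limit reached");
            }
            seen++;
        }
    }

    // Returns how many elements were processed before the parse ended.
    static int countUpTo(String xml, int maxElements) throws Exception {
        LimitedHandler handler = new LimitedHandler(maxElements);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException stop) {
            // Expected when the limit is hit; later elements are never processed.
        }
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        // Five elements in the input, but the handler gives up after three.
        System.out.println(countUpTo("<a><b/><c/><d/><e/></a>", 3));
    }
}
```

A DOM parser would have to build the whole tree before your code got a look at it; here the handler decides after each element whether to carry on.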

SAX is an event-driven approach to XML parsing. This allows you to quickly write tailored code to load your favorite RFCs. Below you'll find all the code you need to load RSS content, and it can easily be extended to other formats like Atom.

The Code

RSSParser.java


import java.io.*;
import java.net.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
public class RSSParser extends DefaultHandler
{
//How many RSS news items should we load before stopping.
private int maximumResults=10;
/*How many elements should we allow before stopping the parse
this stops giant files from breaking the server.*/
private static final int MAX_ELEMENTS=500;
//Keep track of the current element count.
private int ecount=0;
//Keep track of the current news item count.
private int rcount=0;
private String Url="";//Url to parse.
//String to store parsed data to.
private String output="<i>Error parsing RSS feed.</i>";
//Current string being parsed.
private String currentText="";
//Current RSS News Item.
private NewsItem NI=null;
//ArrayList of all current News Items.
private ArrayList<NewsItem> News=new ArrayList<NewsItem>();
//Has the RSS feed's description been set yet?
boolean dSet=false;


//Constructor.
public RSSParser(String Url,int maximumResults){
super();
this.Url=Url;
this.maximumResults=maximumResults;
}


/**
Returns an HTML representation of the news feed being
parsed.
*/
public synchronized String parse(){
try{
XMLReader xr = XMLReaderFactory.createXMLReader();
xr.setContentHandler(this);
xr.setErrorHandler(this);
URL u=new URL(Url);
URLConnection UC=u.openConnection();
/*If we don't set the user-agent property sites like
Google won't let you access their feeds.*/
UC.setRequestProperty ( "User-agent", "www.plink-search.com");
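/*Hand the raw byte stream to the parser so that it can honor
the encoding declared in the XML prolog.*/
xr.parse(new InputSource(UC.getInputStream()));
}catch(Exception e){
/*Deliberately ignored: the handlers throw a SAXException to stop
the parse early. If nothing was parsed, output keeps its default
error text.*/
}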
//Output all the parsed news items as HTML.
for(int i=0;i<News.size();i++){
output+="<div class=\"search"+(i%2)+"\">";
output+=((NewsItem)News.get(i)).toString();
output+="</div>";
}
return(output);
}



////////////////////////////////////////////////////////////////////
// Event handlers.
////////////////////////////////////////////////////////////////////
// Called when the XML file begins.
public void startDocument ()
{
}


//Called when the end of the XML file is reached.
public void endDocument ()
{
/*If we have a partially parsed news item throw it
into our array.*/
if(NI!=null){
rcount++;
News.add(NI);
}
}


//Called when an XML element begins.
public void startElement (String uri, String name,
String qName, Attributes atts) throws SAXException
{
//Discard any whitespace collected between elements.
currentText="";
//qName contains the qualified (prefixed) name of the XML element.
if(qName.equals("item")){
if(NI!=null){
//We've fetched another news item.
rcount++;
//Add it to our ArrayList.
News.add(NI);
if(rcount==maximumResults){
//Maximum results have been reached.
throw new SAXException("\nLimit reached.");
}
}
//Create a new NewsItem to add data to.
NI=new NewsItem();
}
}


//We've reached the end of an XML element.
public void endElement (String uri, String name, String qName) throws SAXException
{
//Wait for the title information.
if(qName.equals("title")&&output.equals("<i>Error parsing RSS feed.</i>")){
output="<h3>"+currentText+" (<a href=\""+Url+"\">RSS</a>)</h3><hr />";
}else if(qName.equals("description")&&NI==null&&!dSet){
/*Add the description of the RSS feed to our
output if it hasn't yet been parsed.*/
output+="<p>"+currentText+"</p>";
dSet=true;
}
if(qName.equals("title")&&NI!=null){//Are we parsing a news item?
NI.setTitle(currentText);
}else if(qName.equals("link")&&NI!=null)
NI.setURL(currentText);
else if(qName.equals("pubDate")&&NI!=null)
NI.setDate(currentText);
else if(qName.equals("description")&&NI!=null)
NI.setDescription(currentText);
currentText="";
//Make sure we don't attempt to parse too long of a document.
ecount++;
if(ecount>MAX_ELEMENTS)
throw new SAXException("\nLimit reached");
}



//Parse characters from the current element we're parsing.
public void characters (char ch[], int start, int length)
{
//May be called several times per element, so append each chunk.
currentText+=new String(ch,start,length);
}



//Testing main method.
public static void main(String args[]){
RSSParser MyRSSParser = new RSSParser("http://www.plink-search.com/headline.xml",2);
System.out.println(MyRSSParser.parse());
}
}


The parser class extends DefaultHandler. DefaultHandler does not do the parsing itself; the XMLReader does that, and DefaultHandler supplies empty implementations of the SAX callbacks so that we only override the ones we care about. Passing this to setContentHandler() and setErrorHandler() registers our class to receive those callbacks. The event-oriented methods called as the parse proceeds are startDocument(), endDocument(), startElement(), and endElement(), and each receives information about the XML file being parsed. The qName argument of startElement() and endElement() provides the name of the current element; for our RSS parser, the interesting ones are title, description, pubDate, and link. The other method of note is characters(), which incrementally provides the character data of the element being parsed. In the example, the parsed RSS data goes into a NewsItem; when the next item begins, or the document ends, the finished NewsItem is added to an ArrayList. At the end of the parse the ArrayList is written out with some additional HTML wrapped around each entry.
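
The callback flow described above can be boiled down to a small sketch that only collects title text. TitleCollector and titlesOf() are hypothetical names, the feed is an in-memory string, and the parser comes from SAXParserFactory instead of XMLReaderFactory, so treat it as an illustration and not as a replacement for the listing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects the text of every <title> element.
class TitleCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    final List<String> titles = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // A new element begins: forget whatever text came before it.
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one element's text in several chunks.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only now is the element's text known to be complete.
        if ("title".equals(qName)) {
            titles.add(text.toString().trim());
        }
    }

    // Parses the given XML string and returns the collected titles.
    static List<String> titlesOf(String xml) throws Exception {
        TitleCollector handler = new TitleCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss><channel><title>Feed</title>"
                + "<item><title>One &amp; Two</title></item></channel></rss>";
        System.out.println(titlesOf(xml)); // prints [Feed, One & Two]
    }
}
```

The text is only read in endElement() because characters() gives no guarantee that it hands over an element's content in a single call.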

Take note of the setRequestProperty() method. Setting the User-agent property is necessary for connecting to various sites like Google.
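
In isolation, that step looks roughly like the sketch below. UserAgentDemo, open(), the example.com address, and the agent string are placeholders; openConnection() only prepares the request, so nothing is sent over the network here.

```java
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper showing only the header set-up, not the download.
class UserAgentDemo {
    // Prepares a connection that will identify itself with the given agent.
    static URLConnection open(String feedUrl, String userAgent) throws Exception {
        URLConnection conn = URI.create(feedUrl).toURL().openConnection();
        // Must be called before the connection is actually opened.
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }

    public static void main(String[] args) throws Exception {
        URLConnection conn = open("http://example.com/feed.xml", "my-feed-reader");
        // Reads back the header that will be sent once we connect.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```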

NewsItem.java


public class NewsItem{
private String Title="";
private String URL="";
private String Description="";
private String Date="";
//Constructor.
public NewsItem(){
}


//Set the title of the NewsItem.
public void setTitle(String Title){
this.Title=Title;
}


//Set the URL (Link) of the NewsItem.
public void setURL(String URL){
this.URL=URL;
}


//Set the description (summary) of the news item.
public void setDescription(String Description){
this.Description=Description;
}


public void setDate(String Date){
this.Date=Date;
}


//Return an HTML representation of our news story.
public String toString(){
String returnMe="";
returnMe+="<a href=\""+URL+"\">"+Title+"</a><br />";
returnMe+="<i>"+Date+"</i>";
returnMe+="<p>"+Description+"</p>";
return(returnMe);
}
}


This class, which our example RSS parser fills in, is fairly self-explanatory. As the RSS file is parsed, the setters store the corresponding XML entries, and the toString() method wraps the instance variables in some HTML.

Conclusion

So there you have it: getting a parser up and running for reading specifications like RSS is fairly simple when using SAX. Using this source as an example, you should be able to throw together a parser for other XML formats. You might find it useful to extend the code to make it more robust, for instance by implementing further methods for error handling.
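
As one possible starting point for that error handling, DefaultHandler also has warning(), error(), and fatalError() callbacks that can be overridden. The sketch below just records what the parser reports; ReportingHandler, check(), and the message format are invented for this example, and it again uses SAXParserFactory on an in-memory string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that keeps a list of everything the parser complains about.
class ReportingHandler extends DefaultHandler {
    final List<String> problems = new ArrayList<>();

    private void record(String level, SAXParseException e) {
        problems.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Malformed XML ends up here; record it, then let the parse fail.
        record("fatal", e);
        throw e;
    }

    // Parses the XML string and returns the recorded problems (empty if none).
    static List<String> check(String xml) throws Exception {
        ReportingHandler handler = new ReportingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException alreadyRecorded) {
            // The fatal error was stored in problems before it was rethrown.
        }
        return handler.problems;
    }

    public static void main(String[] args) throws Exception {
        // The <channel> element is never closed, so the parser reports a fatal error.
        for (String problem : check("<rss><channel></rss>")) {
            System.out.println(problem);
        }
    }
}
```

With something like this in place, the empty catch block in parse() could show the collected messages instead of silently falling back to the default error text.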

-Ben (Developer Hack Wars - The Game of Virtual Hacking)