Posted by bcoe on 02-19-2008, 07:08 PM
Programming a SAX-RSS Parser In 30 Minutes Flat

Originally on my Blog.
---------------------------

A DOM XML parser, the alternative to a SAX parser, has its place. It is used for storing a PLink Blog, for instance. For content streamed into your website (like RSS), though, SAX is the way to go.

I can think of two important reasons for this right off the bat:

* A SAX parse can be stopped midway, so when someone attempts to stream a 50 MB changes.xml file into your site, you can throw an exception and stop the world from caving in.
* SAX is arguably a lot easier to get a handle on than DOM.
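
To make the first point concrete, here is a minimal sketch of stopping a parse midway. LimitedHandler and abortsOn() are hypothetical names made up for this illustration, and the sample XML is parsed from a string rather than from a URL. It assumes that throwing a SAXException from a callback is all that is needed to end the parse, which is the same trick the full parser further down relies on.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: counts elements and aborts the parse once a
// cap is exceeded.
class LimitedHandler extends DefaultHandler {
    private final int maxElements;
    private int seen = 0;
    private boolean limitHit = false;

    LimitedHandler(int maxElements) {
        this.maxElements = maxElements;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        seen++;
        if (seen > maxElements) {
            limitHit = true;
            // Throwing from a callback is what stops the parse early.
            throw new SAXException("Element limit reached");
        }
    }

    // Returns true if the parse was cut short by the limit, false if the
    // whole document was read.
    static boolean abortsOn(String xml, int maxElements) {
        LimitedHandler handler = new LimitedHandler(maxElements);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return false;
        } catch (SAXException e) {
            if (handler.limitHit) {
                return true;
            }
            throw new IllegalStateException("Malformed XML", e);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rss><channel><item/><item/><item/></channel></rss>";
        System.out.println(abortsOn(xml, 3));   // the sample has 5 elements, so the cap is hit
        System.out.println(abortsOn(xml, 10));  // the cap is never reached
    }
}
```

A DOM parser would have to build the whole tree in memory before your code saw any of it, so a cap like this only helps with an event-driven parser.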

SAX is an event-driven approach to XML parsing. This lets you quickly write code tailored to your favorite formats. Below you'll find all the code you need to load RSS content, and it can easily be extended to other formats like Atom.
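
To show what "event-driven" means in practice, here is a small sketch that just records which callbacks fire, and in what order, for a tiny document. EventTrace and trace() are hypothetical names used only for this illustration. It uses the JDK's default SAXParserFactory, which is not namespace-aware, so the element name arrives in qName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records every SAX callback as it fires.
class EventTrace extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
    }

    // Parses the XML string and returns the callbacks in the order they fired.
    static List<String> trace(String xml) {
        EventTrace handler = new EventTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        String xml = "<rss><channel><title>Demo</title></channel></rss>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

The parser drives the program: your code never asks for the next element, it only reacts when one starts or ends.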

The Code
Java Code:
// RSSParser.java

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class RSSParser extends DefaultHandler
{
    //Shown until the feed's title has been parsed.
    private static final String ERROR_OUTPUT="<i>Error parsing RSS feed.</i>";
    /*How many elements should we allow before stopping the parse;
      this stops giant files from breaking the server.*/
    private static final int MAX_ELEMENTS=500;
    //How many RSS news items should we load before stopping.
    private int maximumResults=10;
    //Keep track of the current element count.
    private int ecount=0;
    //Url to parse.
    private String url="";
    //String to store parsed data to.
    private String output=ERROR_OUTPUT;
    //Text of the element currently being parsed.
    private final StringBuilder currentText=new StringBuilder();
    //Current RSS news item, or null when we are outside of an <item>.
    private NewsItem currentItem=null;
    //List of all news items parsed so far.
    private final List<NewsItem> news=new ArrayList<>();
    //Has the RSS feed's description been set yet?
    private boolean dSet=false;


    //Constructor.
    public RSSParser(String url,int maximumResults){
        super();
        this.url=url;
        this.maximumResults=maximumResults;
    }


    /**
    Returns an HTML representation of the news feed being
    parsed.
    */
    public synchronized String parse(){
        //Start from a clean slate in case parse() is called more than once.
        news.clear();
        currentItem=null;
        currentText.setLength(0);
        output=ERROR_OUTPUT;
        dSet=false;
        ecount=0;
        try{
            XMLReader xr=SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            xr.setContentHandler(this);
            xr.setErrorHandler(this);
            URLConnection uc=new URL(url).openConnection();
            /*If we don't set the user-agent property sites like
              Google won't let you access their feeds.*/
            uc.setRequestProperty("User-Agent","www.plink-search.com");
            /*Hand the parser the raw byte stream so that it can honour
              the encoding declared at the top of the XML file.*/
            try(InputStream in=uc.getInputStream()){
                xr.parse(new InputSource(in));
            }
        }catch(Exception e){
            /*Deliberately ignored: hitting one of our limits ends the
              parse with a SAXException, and after any other failure we
              output whatever was parsed before the error.*/
        }
        //Output all the parsed news items as HTML.
        StringBuilder html=new StringBuilder(output);
        for(int i=0;i<news.size();i++){
            html.append("<div class=\"search").append(i%2).append("\">");
            html.append(news.get(i).toString());
            html.append("</div>");
        }
        return html.toString();
    }


    ////////////////////////////////////////////////////////////////////
    // Event handlers.
    ////////////////////////////////////////////////////////////////////
    //Called when the XML file begins.
    @Override
    public void startDocument ()
    {
    }


    //Called when the end of the XML file is reached.
    @Override
    public void endDocument ()
    {
        /*Nothing to do: each news item is stored as soon as its
          closing tag is reached.*/
    }


    //Called when the start of an XML element is reached.
    @Override
    public void startElement (String uri, String name,
                  String qName, Attributes atts) throws SAXException
    {
        //Throw away any whitespace found between the previous tag and this one.
        currentText.setLength(0);
        //qName contains the name of the XML element as written in the file.
        if(qName.equals("item")){
            //Create a new NewsItem to add data to.
            currentItem=new NewsItem();
        }
    }


    //We've reached the end of an XML element.
    @Override
    public void endElement (String uri, String name, String qName) throws SAXException
    {
        String text=currentText.toString().trim();
        currentText.setLength(0);
        if(currentItem==null){
            //We're outside of any news item, so this is feed-level data.
            if(qName.equals("title")&&output.equals(ERROR_OUTPUT)){
                output="<h3>"+text+" (<a href=\""+url+"\">RSS</a>)</h3><hr />";
            }else if(qName.equals("description")&&!dSet){
                /*Add the description of the RSS feed to our
                  output if it hasn't yet been parsed.*/
                output+="<p>"+text+"</p>";
                dSet=true;
            }
        }else if(qName.equals("title")){
            currentItem.setTitle(text);
        }else if(qName.equals("link")){
            currentItem.setURL(text);
        }else if(qName.equals("pubDate")){
            currentItem.setDate(text);
        }else if(qName.equals("description")){
            currentItem.setDescription(text);
        }else if(qName.equals("item")){
            //The news item is complete, so add it to our list.
            news.add(currentItem);
            currentItem=null;
            if(news.size()>=maximumResults){
                //Maximum results have been reached.
                throw new SAXException("Item limit reached.");
            }
        }
        //Make sure we don't attempt to parse too long of a document.
        ecount++;
        if(ecount>MAX_ELEMENTS)
            throw new SAXException("Element limit reached.");
    }


    //Collect characters from the element we're currently parsing.
    @Override
    public void characters (char ch[], int start, int length)
    {
        currentText.append(ch,start,length);
    }


    //Testing main method.
    public static void main(String args[]){
        RSSParser myRSSParser=new RSSParser("http://www.plink-search.com/headline.xml",2);
        System.out.println(myRSSParser.parse());
    }
}

The parser in my example code extends DefaultHandler. DefaultHandler does no parsing itself: it supplies empty implementations of the SAX callback methods, so we only have to override the ones we care about. The call to super() in the constructor simply runs DefaultHandler's own constructor. The actual parsing is done by the XMLReader created in parse(), which calls back into our class as it works through the document.

The event-oriented methods we implement are startElement(), endElement(), startDocument(), and endDocument(). Each call carries information about the XML file being parsed. The qName argument of startElement() and endElement() holds the name of the element in the file, so in the case of our RSS parser we're interested in item, title, description, pubDate, and link. The other method of note is characters(), which delivers the character data of the current element incrementally, possibly over several calls, so the text has to be accumulated until the element ends.

As we parse the RSS data we store it in a NewsItem. Each completed NewsItem is placed into an ArrayList, and at the end of the parse the list is written out with some additional HTML wrapped around it.
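
The point about characters() is easy to miss, so here is a small sketch of buffering the text until the closing tag. TitleCollector, collect(), title(), and calls() are hypothetical names for this illustration only. The sample title mixes plain text with a CDATA section, which is the sort of input where a parser may hand over the text in more than one piece. The exact number of calls depends on the parser, so the sketch prints it rather than assuming it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: buffers character data until the closing tag.
class TitleCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private int calls = 0;       // how many times characters() was invoked
    private String title = "";

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0);     // every element starts with an empty buffer
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);  // this may be only part of the text
        calls++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("title")) {
            title = buffer.toString();     // complete only once the element ends
        }
    }

    String title() {
        return title;
    }

    int calls() {
        return calls;
    }

    // Parses the XML string and returns the handler so its results can be read.
    static TitleCollector collect(String xml) {
        TitleCollector handler = new TitleCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        TitleCollector result =
                collect("<item><title>Cats <![CDATA[& dogs]]></title></item>");
        System.out.println(result.title());
        System.out.println(result.calls() + " characters() call(s)");
    }
}
```

Treating a single characters() call as the whole text would work on small test feeds and then silently truncate titles on real ones, which is why the buffer is only read in endElement().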

Take note of the setRequestProperty() call. Setting the User-Agent property is necessary for connecting to some sites, Google among them, which turn away requests that don't send one.
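
Here is the header step pulled out on its own, as a minimal sketch. FeedConnection and prepare() are hypothetical names, and the URL and user-agent string are placeholders. It leans on the fact that openConnection() only creates the connection object, so headers can still be set (and read back) before any request goes out.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper: prepares, but does not yet open, a feed connection.
class FeedConnection {
    static URLConnection prepare(String feedUrl, String userAgent) {
        try {
            // No request is sent until getInputStream() or connect() is called.
            URLConnection connection = new URL(feedUrl).openConnection();
            connection.setRequestProperty("User-Agent", userAgent);
            return connection;
        } catch (IOException e) {
            throw new IllegalArgumentException("Bad feed URL: " + feedUrl, e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL and user-agent string, for illustration only.
        URLConnection connection =
                prepare("http://example.com/feed.xml", "my-feed-reader/1.0");
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

Request properties have to be set before the stream is opened. Calling setRequestProperty() on a connection that is already connected throws an IllegalStateException.
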
Java Code:
// NewsItem.java

public class NewsItem{
    private String title="";
    private String url="";
    private String description="";
    private String date="";

    //Constructor.
    public NewsItem(){
    }


    //Set the title of the NewsItem.
    public void setTitle(String title){
        this.title=title;
    }


    //Set the URL (Link) of the NewsItem.
    public void setURL(String url){
        this.url=url;
    }


    //Set the description (summary) of the news item.
    public void setDescription(String description){
        this.description=description;
    }


    //Set the publication date of the news item.
    public void setDate(String date){
        this.date=date;
    }


    //Return an HTML representation of our news story.
    @Override
    public String toString(){
        String returnMe="";
        returnMe+="<a href=\""+url+"\">"+title+"</a><br />";
        returnMe+="<i>"+date+"</i>";
        returnMe+="<p>"+description+"</p>";
        return returnMe;
    }
}

This class, which is filled in by our example RSS parser, is fairly self-explanatory. As the RSS file is parsed, the setters store the corresponding XML entries, and the toString() method wraps the instance variables in some HTML.

Conclusion

So there you have it: getting a parser up and running for reading formats like RSS is fairly simple when using SAX. Using this source as an example, you should be able to throw together a parser for other XML formats. You might find it useful to extend the code to make it more robust. You might, for instance, override further handler methods to take error handling into account.
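
As one possible starting point for the error handling, here is a sketch that overrides the three error callbacks DefaultHandler inherits from the ErrorHandler interface: warning(), error(), and fatalError(). RSSParser already registers itself with setErrorHandler(this), so methods like these would be called on it. ReportingHandler and problemsIn() are hypothetical names, and recording problems in a list is just one assumption about what you would want to do with them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records parser problems instead of ignoring them.
class ReportingHandler extends DefaultHandler {
    private final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add(describe("warning", e));
    }

    @Override
    public void error(SAXParseException e) {
        problems.add(describe("error", e));
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        problems.add(describe("fatal", e));
        throw e;  // the document is not well-formed, so give up here
    }

    private static String describe(String level, SAXParseException e) {
        return level + " at line " + e.getLineNumber() + ": " + e.getMessage();
    }

    // Parses the XML string and returns every problem the parser reported.
    static List<String> problemsIn(String xml) {
        ReportingHandler handler = new ReportingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Already recorded by fatalError() above.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        // The closing </channel> tag is missing, so the parser reports a fatal error.
        for (String problem : problemsIn("<rss><channel></rss>")) {
            System.out.println(problem);
        }
    }
}
```

With something like this in place, parse() could tell "the feed was broken" apart from "we stopped on purpose because a limit was reached", instead of treating every exception the same way.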

-Ben (Developer Hack Wars - The Game of Virtual Hacking)

Last edited by Jordan; 02-20-2008 at 07:15 AM. Reason: Added code tags