
Need help with regular expressions


This topic has been archived. This means that you cannot reply to this topic.
5 replies to this topic

#1
Wiizle

    Newbie

  • Members
  • Pip
  • 4 posts
Hi all,

I'm fairly new to retrieving web pages and extracting workable information from them. This is my first attempt at pulling usable data from a public website, which I'm free to use for this project. I've managed to retrieve a single line of information, but I need the information in blocks. The content is dynamic, so assuming a fixed number of lines per block is not an option.

I've written the following regexes:
<TR>\s*<TD>[\w|\s]*<\/TD>\s*<\/TR>\s*

<TR>\s*<TD>[\w|\.|\-]* <\/TD>\s*<TD><FONT COLOR=black>[\w|\s|\-]*<\/TD>\s*<TD><FONT COLOR=black>[\w|\-|\s|\.|\,]*<\/FONT><\/TD>\s*<TD><FONT COLOR=black>[\w|\s|\,]*<\/FONT><\/TD>\s*<TD><FONT COLOR=black>[\w|\s|\.|\,]*<\/FONT><\/TD>\s*<\/TR>\s*

And this is a piece of the source I'm trying to retrieve the information from:
<TABLE VALIGN=TOP CELLSPACING=4 CELLPADDING=0>

<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek 6 ma 22-03-2010</TH></TR>

<TR><TH ALIGN=LEFT> Tijd </TH><TH ALIGN=LEFT> Naam </TH><TH ALIGN=LEFT> Groepen </TH><TH ALIGN=LEFT> Ruimten </TH><TH ALIGN=LEFT> Personen </TH></TR>   <TR>


 <TD>18.30-19.15 </TD>  <TD><FONT COLOR=black>CMD-2 deeltijd HC Interaction Design</TD>  <TD><FONT COLOR=black>CMD-2dt-p3</FONT></TD>  <TD><FONT COLOR=black>OVk45</FONT></TD>  <TD><FONT COLOR=black>A.J. Reurings</FONT></TD>    </TR><P>

   <TR>

 <TD>19.15-21.00 </TD>  <TD><FONT COLOR=black>CMD-2 deeltijd WS Interaction Design</TD>  <TD><FONT COLOR=black>CMD-2dt-p3.01, CMD-2dt-p3.02, CMD-2dt-p3.03, CMD-2dt-p3.04, CMD-2dt-p3.05, CMD-2dt-p3.06</FONT></TD>  <TD><FONT COLOR=black>SL431</FONT></TD>  <TD><FONT COLOR=black>A. Reuneker</FONT></TD>    </TR><P>


   <TR>

 <TD>21.00-21.45 </TD>  <TD><FONT COLOR=black>CMD-2 deeltijd Projectoverleg maandag</TD>  <TD><FONT COLOR=black>CMD-2dt-p3.05, CMD-2dt-p3.06</FONT></TD>  <TD><FONT COLOR=black>SL433</FONT></TD>  <TD><FONT COLOR=black></FONT></TD>    </TR><P>

<TR><TH COLSPAN=5> <HR> </TH></TR>

<P>


<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek 8 ma 05-04-2010</TH></TR>

   <TR>

 <TD>Pasen</TD>    </TR><P>

<TR><TH COLSPAN=5> <HR> </TH></TR>

<P>

<P>

<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek 9 wo 14-04-2010</TH></TR>

<TR><TH ALIGN=LEFT> Tijd </TH><TH ALIGN=LEFT> Naam </TH><TH ALIGN=LEFT> Groepen </TH><TH ALIGN=LEFT> Ruimten </TH><TH ALIGN=LEFT> Personen </TH></TR>   <TR>

 <TD>18.30-22.30 </TD>  <TD><FONT COLOR=black>CMD-2 deeltijd Assessment</TD>  <TD><FONT COLOR=black>CMD-2dt-p3.01, CMD-2dt-p3.02, CMD-2dt-p3.03, CMD-2dt-p3.04, CMD-2dt-p3.05, CMD-2dt-p3.06, CMD-2dt-p3.07, CMD-2dt-p3.08, CMD-2dt-p3.09, CMD-2dt-p3.10, CMD-2dt-p3.11, CMD-2dt-p3.12</FONT></TD>  <TD><FONT COLOR=black>SL434, SL454, SL845</FONT></TD>  <TD><FONT COLOR=black>P.J.G. Deters, J.P. van der Linden, K.J. van Oenen, L. van Noorden, T. Zweers</FONT></TD>    </TR><P>

<TR><TH COLSPAN=5> <HR> </TH></TR>

</TABLE>

I'm trying to get the information in blocks like these:
<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek 6 ma 22-03-2010</TH></TR>

<TR><TH ALIGN=LEFT> Tijd </TH><TH ALIGN=LEFT> Naam </TH><TH ALIGN=LEFT> Groepen </TH><TH ALIGN=LEFT> Ruimten </TH><TH ALIGN=LEFT> Personen </TH></TR>   <TR>


 <TD>18.30-19.15 </TD>  <TD><FONT COLOR=black>CMD-2 deeltijd HC Interaction Design</TD>  <TD><FONT COLOR=black>CMD-2dt-p3</FONT></TD>  <TD><FONT COLOR=black>OVk45</FONT></TD>  <TD><FONT COLOR=black>A.J. Reurings</FONT></TD>    </TR><P>

   <TR>

 <TD>19.15-21.00 </TD>  <TD><FONT COLOR=black>CMD-2 deeltijd WS Interaction Design</TD>  <TD><FONT COLOR=black>CMD-2dt-p3.01, CMD-2dt-p3.02, CMD-2dt-p3.03, CMD-2dt-p3.04, CMD-2dt-p3.05, CMD-2dt-p3.06</FONT></TD>  <TD><FONT COLOR=black>SL431</FONT></TD>  <TD><FONT COLOR=black>A. Reuneker</FONT></TD>    </TR><P>


   <TR>

 <TD>21.00-21.45 </TD>  <TD><FONT COLOR=black>CMD-2 deeltijd Projectoverleg maandag</TD>  <TD><FONT COLOR=black>CMD-2dt-p3.05, CMD-2dt-p3.06</FONT></TD>  <TD><FONT COLOR=black>SL433</FONT></TD>  <TD><FONT COLOR=black></FONT></TD>    </TR><P>

<TR><TH COLSPAN=5> <HR> </TH></TR>

<P>

<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek 8 ma 05-04-2010</TH></TR>

   <TR>

 <TD>Pasen</TD>    </TR><P>

<TR><TH COLSPAN=5> <HR> </TH></TR>

<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek 9 wo 14-04-2010</TH></TR>

<TR><TH ALIGN=LEFT> Tijd </TH><TH ALIGN=LEFT> Naam </TH><TH ALIGN=LEFT> Groepen </TH><TH ALIGN=LEFT> Ruimten </TH><TH ALIGN=LEFT> Personen </TH></TR>   <TR>

 <TD>18.30-22.30 </TD>  <TD><FONT COLOR=black>CMD-2 deeltijd Assessment</TD>  <TD><FONT COLOR=black>CMD-2dt-p3.01, CMD-2dt-p3.02, CMD-2dt-p3.03, CMD-2dt-p3.04, CMD-2dt-p3.05, CMD-2dt-p3.06, CMD-2dt-p3.07, CMD-2dt-p3.08, CMD-2dt-p3.09, CMD-2dt-p3.10, CMD-2dt-p3.11, CMD-2dt-p3.12</FONT></TD>  <TD><FONT COLOR=black>SL434, SL454, SL845</FONT></TD>  <TD><FONT COLOR=black>P.J.G. Deters, J.P. van der Linden, K.J. van Oenen, L. van Noorden, T. Zweers</FONT></TD>    </TR>

<TR><TH COLSPAN=5> <HR> </TH></TR>

I hope someone can help me figure this out.

Thanks!

#2
WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
It looks like you need to use parentheses instead of brackets.
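
To illustrate the difference: square brackets define a set of single characters, and inside them `|` is a literal pipe rather than "or", while parentheses group whole alternatives. A minimal sketch, assuming Python since the thread does not say which language is used (the variable names are made up for the example):

```python
import re

# Brackets: a set of single characters. The '|' here is a literal pipe.
with_pipes = re.compile(r"[\w|\s]*")
# The same set written without the stray pipes.
without_pipes = re.compile(r"[\w\s]*")

# The bracket version with pipes also accepts a literal '|' in the text.
accepts_pipe = with_pipes.fullmatch("a|b") is not None
rejects_pipe = without_pipes.fullmatch("a|b") is None

# Parentheses: alternation between whole sequences.
group = re.compile(r"(?:Tijd|Naam)")
# Brackets with the same content: any ONE of the characters T, i, j, d, |, N, a, m.
char_set = re.compile(r"[Tijd|Naam]")

group_matches_word = group.fullmatch("Naam") is not None
set_matches_word = char_set.fullmatch("Naam") is not None  # False: only one character allowed
```

So `[\w|\s]` and `(\w|\s)` happen to match nearly the same single characters, but only the parenthesised form can express "this sequence or that sequence".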

#3
Wiizle

Hi WingedPanther,

Thank you for your response. The first thing I tried was using brackets. My final and best attempt at getting the blocks is the regex below.

/<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek [0-9]* [a-z][a-z] [0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]<\/TH><\/TR>\s*(<TR><TH ALIGN=LEFT> Tijd <\/TH><TH ALIGN=LEFT> Naam <\/TH><TH ALIGN=LEFT> Groepen <\/TH><TH ALIGN=LEFT> Ruimten <\/TH><TH ALIGN=LEFT> Personen <\/TH><\/TR>|<TR>\s*<TD>(\w|\s)*<\/TD>\s*<\/TR>)*\s*(<TR>\s*<TD>(\w|\.|\-)* <\/TD>\s*<TD><FONT COLOR=black>(\w|\s|\-)*<\/TD>\s*<TD><FONT COLOR=black>[\w|\-|\s|\.|\,]*<\/FONT><\/TD>\s*<TD><FONT COLOR=black>(\w|\s|\,)*<\/FONT><\/TD>\s*<TD><FONT COLOR=black>(\w|\s|\.|\,)*<\/FONT><\/TD>\s*<\/TR>)*\s*/

But this one retrieves only the first entry of a day instead of all of them.
If anyone can give me a push in the right direction, please do! :)

EDIT: With the parentheses, this regex also finds a lot of tiny things that I can't explain. With the brackets, the results are more specific.
Thanks again!
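
A likely reason only the first entry is found: in the posted source every data row ends with `</TR><P>`, but the row part of the pattern allows only whitespace after `</TR>`, so the repetition stops at the first `<P>`. The sketch below reproduces this on a shortened two-column version of the rows. Python and the names `ROW`, `strict`, and `tolerant` are assumptions for illustration; the thread does not name a language. It also uses non-capturing groups `(?:...)`, because capturing groups inside a repetition produce extra sub-results, which may be where the unexplained tiny results come from.

```python
import re

# Two shortened data rows in the page's layout: each </TR> is followed by a stray <P>.
sample = (
    "<TR>\n <TD>18.30-19.15 </TD>  <TD><FONT COLOR=black>HC Design</TD>    </TR><P>\n"
    "   <TR>\n <TD>19.15-21.00 </TD>  <TD><FONT COLOR=black>WS Design</TD>    </TR><P>\n"
)

# One data row; only whitespace is allowed after </TR>, as in the posted regex.
ROW = r"<TR>\s*<TD>[\w.-]* </TD>\s*<TD><FONT COLOR=black>[\w\s-]*</TD>\s*</TR>\s*"

# Repeating the row as-is stops at the first <P>.
strict = re.compile("(?:" + ROW + ")+")
# Allowing optional <P> tags after each row lets the repetition continue.
tolerant = re.compile("(?:" + ROW + r"(?:<P>\s*)*)+")

rows_strict = strict.search(sample).group(0).count("<TR>")
rows_tolerant = tolerant.search(sample).group(0).count("<TR>")
```

Here `rows_strict` is 1 and `rows_tolerant` is 2, which matches the "only the first one" behaviour described above.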

#4
Deadlock

    Learning Programmer

  • Members
  • PipPipPip
  • 81 posts
Do you want to extract the information inside the table without the HTML tags? For example:

<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek 9 wo 14-04-2010</TH></TR>
should be:
Werkweek 9 wo 14-04-2010

More explanation will help us solve your problem.

#5
Wiizle

Hi Deadlock,

Removing the HTML tags is not a problem. I'd like to get the information in blocks like those in my first post.
So the blocks should contain HTML. To see exactly what they should look like, please check the example results in my first post.

Thanks!

#6
Wiizle

I've been trying to finish it myself, but no luck yet. Is it even possible to do this with regex? I thought it might have something to do with a terminating line. I'm still hoping someone can push me in the right direction!
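
For the layout shown in the first post, this does look doable with a regex. One possible approach, sketched here in Python (an assumption, since the thread does not name a language), treats the `Werkweek` header row as the start of a block and the `<HR>` separator row as its end, and takes everything in between with a lazy `.*?`. The name `BLOCK` and the shortened sample are hypothetical.

```python
import re

# Start: the "Werkweek <n> <day> <dd-mm-yyyy>" header row.
# End: the first <HR> separator row that follows it.
# re.DOTALL lets '.' cross line breaks; '.*?' stops at the nearest separator.
BLOCK = re.compile(
    r"<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek \d+ [a-z]{2} \d{2}-\d{2}-\d{4}</TH></TR>"
    r".*?"
    r"<TR><TH COLSPAN=5> <HR> </TH></TR>",
    re.DOTALL,
)

# Shortened sample in the same layout as the page source in the first post.
sample = """<TABLE VALIGN=TOP CELLSPACING=4 CELLPADDING=0>
<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek 8 ma 05-04-2010</TH></TR>
   <TR>
 <TD>Pasen</TD>    </TR><P>
<TR><TH COLSPAN=5> <HR> </TH></TR>
<P>
<TR><TH ALIGN=LEFT COLSPAN=5 >Werkweek 9 wo 14-04-2010</TH></TR>
   <TR>
 <TD>18.30-22.30 </TD>  <TD><FONT COLOR=black>CMD-2 deeltijd Assessment</TD>    </TR><P>
<TR><TH COLSPAN=5> <HR> </TH></TR>
</TABLE>"""

# No capturing groups, so findall returns each whole block (HTML included).
blocks = BLOCK.findall(sample)
```

This relies on every block ending in the `<HR>` row. If the page layout is less regular than the sample, an HTML parser would probably be more robust than a regex.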