
Issue parsing a non-alphanumeric value with regexp


#1
onething

Hi, when scraping Google with a tool called ZennoPoster, I need a regexp to parse the results. I'm using these regexps, which work fine for alphanumeric values:

(?<=\'\)\" href\=\").*?(?=\"\>\<EM\>)

(?<=\" href\=\").*?(?=\<\/A\>\<\/H3\>\<BUTTON class\=vspib type\=submit\>\<\/BUTTON\>\<\/SPAN\>) 

But when I need things like

inurl:

or

/

Characters like / or : come into the equation, and they seem to break the match: the regexp appears to treat them as the end of the value and starts looking for the closing text. The value carries on, but the regexp is now expecting the markup that comes after the value and is getting alphanumeric characters instead.

If anyone who knows regexp can tell me how to get these non-alphanumeric characters treated as part of the value to be parsed, I'd appreciate it. Thanks.

#2
onething

I realised this issue was nonsense. My regexp is still rubbish because it only parses a few of the Google results rather than all of them, but that's for me to work out, as I'm the one with the page text to parse. This is clearly an HTML tag issue and not a regexp one, as I state in the following thread:

http://forum.codecal...html#post299445
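
For anyone landing here with the same question, a quick way to check that / and : are not the problem: in a regexp, . matches any character except a newline, so a lazy .*? runs straight through them until the lookahead text appears. Below is a minimal sketch in Python (not ZennoPoster's own regexp engine, which may differ in details). The html string is a made-up stand-in, since the real page text is not shown in this thread, and the pattern is the first one from the question with the unneeded backslashes dropped.

```python
import re

# Hypothetical markup for illustration only; the actual page text is not in the thread.
html = '<H3><A onclick="x(\')" href="http://example.com/a/b?q=inurl:foo"><EM>foo</EM></A></H3>'

# First pattern from the question, minus the redundant escapes:
# lookbehind for ')" href=", lazy match, lookahead for "><EM>.
pattern = re.compile(r"""(?<='\)" href=").*?(?="><EM>)""")

match = pattern.search(html)
value = match.group(0) if match else None

# The captured value still contains the / and : characters.
print(value)
```

If a pattern like this only catches some results, the likely cause is that the surrounding tags differ from what the lookbehind and lookahead spell out literally, which fits the HTML tag explanation above.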



