
Couple of regexp issues


#1
onething

1. I can't seem to capture the question mark in the first line. The pattern also inexplicably matches the second string, which doesn't make sense to me, as I thought the lookbehind ruled that out. Here are the two lines:



<H3><A href="/question/index;_ylt=AuceFBRGAkkNJn5iiu3ZDYYjzKIX;_ylv=3?qid=20070704123624AA9H28e"><STRONG class=highlight>Accountant</STRONG>?</A></H3>

<P>...to do to get into university to be an <STRONG class=highlight>accountant</STRONG> ? what requirement do I need? how about the average...</P>



with:



(?<=<H3><A href="/question/index;\w+={[a-z, A-Z, 0-9]*_[a-z, A-Z, 0-9]|[a-z, A-Z, 0-9]*}*;_\w+=\d\?\w+=[a-z, A-Z, 0-9]*">){[a-z, A-Z, 0-9]* <STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG> [a-z, A-Z, 0-9]*\?|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>[a-z, A-Z, 0-9]*|[a-z, A-Z, 0-9]*<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>}?(?=\<\</A></H3>)



in order to match only the first line. However, I get both, not as one match but as two. It baffles me why the second line is picked up, because the pattern is clearly aimed only at the first line. Why it would accept a <P> line when the pattern asks for an <A> is beyond me. I can't help thinking the issue is somewhere in the spin (the {...|...} alternatives), but the more I look, the less I see it. As far as I can tell, nothing inside the spin should cause this. I've put the A and the H3 there explicitly, and yet it still matches both lines. The key is in what it captures from the second line:

Quote

----------------------------------- match # 0 -----------------------------------
<STRONG class=highlight>Accountant</STRONG>
----------------------------------- match # 1 -----------------------------------
to do to get into university to be an <STRONG class=highlight>accountant</STRONG>



It seems to think there's an (?=\<\</A></H3>) after that </STRONG>, but all there is is a space, and besides, when I leave only letters with no spaces I get the same result. There's certainly no H3 to be seen, so I don't know what match 1 is matching against.



All the spin inside is there because I'm matching variations in a bigger file, and those I have covered. I'm also surprised that I'm not picking up the question mark at the end of the text I'm looking for. I've tried putting it all over the place: inside the spin, outside it, with and without line breaks, to no avail. I'd appreciate a hand, thanks.
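One possible reading of the two matches, as a sketch in Python rather than in whatever tool produced the output above: in Python (and, as far as I know, in most PCRE-style flavors) `{` and `}` are literal characters unless they form a quantifier such as `{2,3}`, so they do not group the alternatives. Each `|` then splits the whole pattern, the lookbehind belongs only to the first alternative and the lookahead only to the last, and the two middle alternatives can match anywhere with no H3 in sight. The names `WORDS`, `STRONG`, `ungrouped` and `grouped` are made up for this example, and the lookbehind is replaced by a plain prefix with a capture group because Python only allows fixed-width lookbehinds.

```python
import re

text = "\n".join([
    '<H3><A href="/question/index;_ylt=AuceFBRGAkkNJn5iiu3ZDYYjzKIX;_ylv=3'
    '?qid=20070704123624AA9H28e"><STRONG class=highlight>Accountant</STRONG>?</A></H3>',
    '<P>...to do to get into university to be an <STRONG class=highlight>'
    'accountant</STRONG> ? what requirement do I need? how about the average...</P>',
])

# Character class copied from the original pattern (it also allows commas and spaces).
WORDS = r"[a-z, A-Z, 0-9]*"
STRONG = r"<STRONG class=highlight>" + WORDS + r"</STRONG>"

# The two middle alternatives on their own, as they behave when the braces
# do not group: no lookbehind and no lookahead constrain them.
ungrouped = re.compile(STRONG + WORDS + "|" + WORDS + STRONG)
loose_matches = [m.group(0) for m in ungrouped.finditer(text)]

# Same alternatives inside a real group, with the H3 prefix and suffix applying
# to every alternative and an optional literal "?" inside the capture.
grouped = re.compile(
    r'<H3><A href="[^"]*">'
    r"((?:" + STRONG + WORDS + "|" + WORDS + STRONG + r")\??)"
    r"</A></H3>"
)
strict_matches = grouped.findall(text)

print(loose_matches)
print(strict_matches)
```

If this reading is right, the first list reproduces match # 0 and match # 1 from the quote, and the second list holds only the H3 link text with its trailing question mark.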


2. I'm having trouble matching line breaks of a certain kind. I'd like to match every line that comes right before a line starting with a phrase like '0 stars', '1 star' and so on.



An example of this is the following:



Quote

Sue an accountant who filed your taxes incorrectly when penalty is involved?

An accountant who handled...it ok to sue the accountant for the penalty?

0 Stars In United States - Asked by monaya - 6 answers - 3 years ago


I want the line in the middle. So I thought about the following:



(?<=\^)$(?=\^\\d)

and without the inexplicable double escape in front of the d: (?<=\^)$(?=\^\d)

but it doesn't work. I tried a ? in front of the line break, like this:

 (?<=\?\^)$(?=\^\\d)

and without the escape, but that didn't work either. What am I doing wrong?
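For comparison, a minimal sketch of the same idea in Python, which may differ from the flavor actually in use here. In Python `\^` matches a literal caret character rather than the start of a line, and `^` and `$` are zero-width positions, so the attempts above would be looking for carets in the text. The sketch instead captures a whole line and uses a lookahead to require that the next non-empty line starts with digits followed by "Star" or "Stars". The names `sample`, `before_rating` and `summaries` are made up for the example, and the blank lines in `sample` are assumed from the quote.

```python
import re

sample = (
    "Sue an accountant who filed your taxes incorrectly when penalty is involved?\n"
    "\n"
    "An accountant who handled...it ok to sue the accountant for the penalty?\n"
    "\n"
    "0 Stars In United States - Asked by monaya - 6 answers - 3 years ago\n"
)

# ^ with re.MULTILINE anchors at the start of each line.
# (.+) captures one non-empty line; \n+ skips the line break(s) after it;
# the lookahead checks, without consuming, that a rating line comes next.
before_rating = re.compile(r"^(.+)\n+(?=\d+ Stars?\b)", re.MULTILINE)

summaries = before_rating.findall(sample)
print(summaries)
```

Written this way, the rating line itself is never part of the match, so it stays available if a second pattern needs to read the star count.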

#2
dargueta

Is this Perl, pcregrep, sed, or something else? It depends on that.



