Regex issue

4 replies to this topic

#1
wim DC

    Writes binary right handed and hex left handed

  • Members
  • 2,084 posts
  • Programming Language:Java, JavaScript, PL/SQL
  • Learning:Java
Simple thing I want: a regex which finds all Java keywords in a multiline String.
Pretty simple, I thought:
abstract|continue|for|new|switch|assert|default|goto|package|synchronized|boolean|do|if|private|this|break|double|implements|protected|throw|byte|else|import|public|throws|case|enum|instanceof|return|transient|catch|extends|int|short|try|char|final|interface|static|void|class|finally|long|strictfp|volatile|const|float|native|super|while
Okay, not the prettiest regex, but impossible to improve and it gets the job done... not.

Because obviously this will also find the 'int' in 'print' and the 'if' in 'sniff' etc.
So I figured every keyword must be preceded by either:
1) a space (assume tabs have been converted to spaces), or
2) nothing (start of line)
And every keyword must be followed by a space.

So the long regex becomes REALLY long now :P
(^|\s)abstract\s|(^|\s)continue\s|(^|\s)for\s|(^|\s)new\s|(^|\s)switch\s|(^|\s)assert\s|(^|\s)default\s|(^|\s)goto\s|(^|\s)package\s|(^|\s)synchronized\s|(^|\s)boolean\s|(^|\s)do\s|(^|\s)if\s|(^|\s)private\s|(^|\s)this\s|(^|\s)break\s|(^|\s)double\s|(^|\s)implements\s|(^|\s)protected\s|(^|\s)throw\s|(^|\s)byte\s|(^|\s)else\s|(^|\s)import\s|(^|\s)public\s|(^|\s)throws\s|(^|\s)case\s|(^|\s)enum\s|(^|\s)instanceof\s|(^|\s)return\s|(^|\s)transient\s|(^|\s)catch\s|(^|\s)extends\s|(^|\s)int\s|(^|\s)short\s|(^|\s)try\s|(^|\s)char\s|(^|\s)final\s|(^|\s)interface\s|(^|\s)static\s|(^|\s)void\s|(^|\s)class\s|(^|\s)finally\s|(^|\s)long\s|(^|\s)strictfp\s|(^|\s)volatile\s|(^|\s)const\s|(^|\s)float\s|(^|\s)native\s|(^|\s)super\s|(^|\s)while\s|".*?"

And this works BUT not when 2 keywords follow each other. And that's where I need your help ^^
for example:
public static final
3 keywords in a row.
'public' is found, because it is preceded by nothing (start of line) and is followed by a space. But then 'static' is not found, because it has no leading space left to match: that space was already consumed by the match for 'public'.
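
The behaviour described above can be reproduced with a small sketch. The keyword list is shortened to three words for readability, and the class name, the `find` helper and the use of `Pattern.MULTILINE` (so that `^` matches at each line start) are assumptions made for this illustration, not something taken from the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConsumedSpaceDemo {
    // Shortened keyword list (illustration only); same shape as the long regex:
    // (start of line OR whitespace) + keyword + whitespace.
    static final Pattern KEYWORD =
            Pattern.compile("(^|\\s)(public|static|final)\\s", Pattern.MULTILINE);

    // Collects the keyword (group 2) of every match, in order.
    static List<String> find(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = KEYWORD.matcher(source);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // The match for "public" includes the space after it, so the next
        // search starts at the 's' of "static" with no leading whitespace left.
        System.out.println(find("public static final int x")); // [public, final]
    }
}
```

Each match swallows its trailing whitespace, and `find()` resumes after the end of the previous match, so every second keyword in a run is skipped.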

Any ideas?




Note: I actually tried to run through the String word by word, after having split it into an array. But the long regex I've got now, which works fine apart from not finding 2 keywords in a row, is 3 times as fast as looping, so...

#2
gregwarner

    Programming God

  • Members
  • 853 posts
  • Location:Arkansas
Check out this page from "Engineering a Compiler" by Keith Cooper and Linda Torczon for a quick little synopsis on whether to implement the keyword checker directly into the DFA or to treat them as identifiers in the DFA and use a Hash Table for keyword lookup:
Engineering a Compiler - Keith Cooper, Linda Torczon - Google Books

Perhaps that'll help your decision.

Edit: You need to scroll up a tiny bit from where that link lands you, to get to the beginning of the relevant section: (§ 2.5.4)


#3
wim DC

Hmm reading about that HashTable over there led me to trying HashSet (I was actually using ArrayList at first, not even sorting and binarysearch, noooo just .contains(..) :D, no wonder that was slow)
And daaamn, a HashSet is fast :P

I'll see if I can do something with a HashSet....
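
A minimal sketch of what the HashSet approach might look like, assuming the source is split on whitespace and each word is looked up in the set. The class name, the `find` helper and the shortened keyword list are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordSetDemo {
    // Shortened keyword list (illustration only); the real set would hold all keywords.
    static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("public", "static", "final", "int", "if"));

    // Splits on runs of whitespace and keeps the words that are keywords.
    static List<String> find(String source) {
        List<String> found = new ArrayList<>();
        for (String word : source.split("\\s+")) {
            if (KEYWORDS.contains(word)) { // average O(1) lookup, unlike ArrayList.contains
                found.add(word);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find("public static final int sniff = print;"));
    }
}
```

Because every word is checked on its own, consecutive keywords are no problem here. The trade-off is that a keyword glued to punctuation, such as `if(`, is not found unless the splitting is made smarter.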

#4
lespauled

    Programming Professional

  • Members
  • 231 posts
  • Programming Language:C, C++, C#, JavaScript, PL/SQL, Delphi/Object Pascal, Visual Basic .NET, Pascal, Transact-SQL, Bash
Have you tried \b( your java keyword list )\b
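
A short sketch of this suggestion in Java, with a shortened keyword list and hypothetical class and method names. `\b` is a zero-width word boundary, so it does not consume the space between two keywords, and it does not match inside a longer word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Shortened keyword list (illustration only), wrapped in word boundaries.
    static final Pattern KEYWORD =
            Pattern.compile("\\b(?:public|static|final|int|if)\\b");

    // Collects every keyword match, in order.
    static List<String> find(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = KEYWORD.matcher(source);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Consecutive keywords are all found; 'int' in 'print' and 'if' in 'sniff' are not.
        System.out.println(find("public static final int sniff = print;"));
        // prints [public, static, final, int]
    }
}
```

Note that a word boundary also occurs next to punctuation, so `if(x)` matches as well. Keywords inside comments or string literals would still match with this pattern alone.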

#5
wim DC

lespauled said:

Have you tried \b( your java keyword list )\b
Orrr, I could use this easy solution :D

Thanks mate.



