View Single Post
  #1 (permalink)  
Old 01-13-2008, 04:05 PM
bartonski bartonski is offline
Newbie
 
Join Date: Jan 2008
Posts: 2
Credits: 0
Rep Power: 0
bartonski is on a distinguished road
Default pattern recognition, template generator

Well, it's a lazy Sunday afternoon, a perfect time to noodle over a problem that's been bothering me for years.

I'm looking for an algorithm which will take an arbitrary piece of repeating text and create a template which shows which parts are constant and which parts repeat. I'll give several examples:

fixed width text:

Code:
Fred            Mbago   27 Fleet St.    Florence        MI      30444
Terrence        Swinge  100 Fnox Ln.    Ann Arbor       MI      28540
Bastian         Bux     54 Fnoo Ct.     Louisville      KY      40333
The thing that remains constant between lines is that there is always a certain amount of white space between columns, for example between the first and last name, there's always at least 8 spaces ('Terrence' is the longest first name, there are 8 characters between that and 'Swinge').

In this case, I would like the algorithm to create a pattern which shows the characters on a single line which vary, and which ones remain constant, perhaps the following regular expression:

Code:
(........)        (......)  (............)    (..........)      (..)      (.....)
The dot character represents a single character, the spaces are literal, and the parentheses group data that is to be captured.

The same data, but comma delimited would look like this:

Code:
Fred,,Mbago,27 Fleet St.,Florence,MI,30444
Terrence,,Swinge,100 Fnox Ln.,Ann Arbor,MI,28540
Bastian,,Bux,54 Fnoo Ct.,Louisville,KY,40333
the regular expression that would match one of these lines would look like this:
Code:
(.+),,(.+),(.+),(.+),(..),(.+)
finally, a multi-line pattern might look like this:

Code:
Address:
        Fred
        Mbago
        27 Fleet St.
        Florence
        MI
        30444
Address:
        Terrence
        Swinge
        100 Fnox Ln.
        Ann Arbor
        MI
        28540
Address:
        Bastian
        Bux
        54 Fnoo Ct.
        Louisville
        KY
        40333
The regular expression for capturing one of these stanzas might look like this:

Code:
/Address:\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n/m
The algorithm doesn't have to produce regular expressions per se; I use them here as an illustration... simply outputting one type of character for something that varies and a literal for something that stays constant would be sufficient... but it is important to me to be able to identifiy patterns which span more than a single line of data.

I'm thinking that some sort of stochastic algorithm would be good... genetic algorithm or simulated annealing maybe... break down the originating text into small blocks, start putting blocks together at random, and see how much will fit the text as a whole...

I'm not stuck on writing something like this myself, although I will if I have to; if someone can point me to the right key-words on google, that will be fine; I just don't exactly know what to look for right now.

Anyway... something to chew on.
Reply With Quote

Sponsored Links