I'm looking for an algorithm which will take an arbitrary piece of repeating text and create a template which shows which parts are constant and which parts repeat. I'll give several examples:
fixed width text:
Fred Mbago 27 Fleet St. Florence MI 30444 Terrence Swinge 100 Fnox Ln. Ann Arbor MI 28540 Bastian Bux 54 Fnoo Ct. Louisville KY 40333
The thing that remains constant between lines is that there is always a certain amount of white space between columns, for example between the first and last name, there's always at least 8 spaces ('Terrence' is the longest first name, there are 8 characters between that and 'Swinge').
In this case, I would like the algorithm to create a pattern which shows the characters on a single line which vary, and which ones remain constant, perhaps the following regular expression:
(........) (......) (............) (..........) (..) (.....)
The dot character represents a single character, the spaces are literal, and the parentheses group data that is to be captured.
The same data, but comma delimited would look like this:
Fred,,Mbago,27 Fleet St.,Florence,MI,30444 Terrence,,Swinge,100 Fnox Ln.,Ann Arbor,MI,28540 Bastian,,Bux,54 Fnoo Ct.,Louisville,KY,40333
the regular expression that would match one of these lines would look like this:
(.+),,(.+),(.+),(.+),(..),(.+)
finally, a multi-line pattern might look like this:
Address: Fred Mbago 27 Fleet St. Florence MI 30444 Address: Terrence Swinge 100 Fnox Ln. Ann Arbor MI 28540 Address: Bastian Bux 54 Fnoo Ct. Louisville KY 40333
The regular expression for capturing one of these stanzas might look like this:
/Address:\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n/m
The algorithm doesn't have to produce regular expressions per se; I use them here as an illustration... simply outputting one type of character for something that varies and a literal for something that stays constant would be sufficient... but it is important to me to be able to identifiy patterns which span more than a single line of data.
I'm thinking that some sort of stochastic algorithm would be good... genetic algorithm or simulated annealing maybe... break down the originating text into small blocks, start putting blocks together at random, and see how much will fit the text as a whole...
I'm not stuck on writing something like this myself, although I will if I have to; if someone can point me to the right key-words on google, that will be fine; I just don't exactly know what to look for right now.
Anyway... something to chew on.


Sign In
Create Account

Back to top









