Jump to content

pattern recognition, template generator

- - - - -

This topic has been archived. This means that you cannot reply to this topic.
1 reply to this topic

#1
bartonski

bartonski

    Newbie

  • Members
  • Pip
  • 2 posts
Well, it's a lazy Sunday afternoon, a perfect time to noodle over a problem that's been bothering me for years.

I'm looking for an algorithm which will take an arbitrary piece of repeating text and create a template which shows which parts are constant and which parts repeat. I'll give several examples:

fixed width text:


Fred            Mbago   27 Fleet St.    Florence        MI      30444

Terrence        Swinge  100 Fnox Ln.    Ann Arbor       MI      28540

Bastian         Bux     54 Fnoo Ct.     Louisville      KY      40333


The thing that remains constant between lines is that there is always a certain amount of white space between columns, for example between the first and last name, there's always at least 8 spaces ('Terrence' is the longest first name, there are 8 characters between that and 'Swinge').

In this case, I would like the algorithm to create a pattern which shows the characters on a single line which vary, and which ones remain constant, perhaps the following regular expression:

(........)        (......)  (............)    (..........)      (..)      (.....)


The dot character represents a single character, the spaces are literal, and the parentheses group data that is to be captured.

The same data, but comma delimited would look like this:

Fred,,Mbago,27 Fleet St.,Florence,MI,30444

Terrence,,Swinge,100 Fnox Ln.,Ann Arbor,MI,28540

Bastian,,Bux,54 Fnoo Ct.,Louisville,KY,40333


the regular expression that would match one of these lines would look like this:

(.+),,(.+),(.+),(.+),(..),(.+)

finally, a multi-line pattern might look like this:

Address:

        Fred

        Mbago

        27 Fleet St.

        Florence

        MI

        30444

Address:

        Terrence

        Swinge

        100 Fnox Ln.

        Ann Arbor

        MI

        28540

Address:

        Bastian

        Bux

        54 Fnoo Ct.

        Louisville

        KY

        40333


The regular expression for capturing one of these stanzas might look like this:

/Address:\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n/m

The algorithm doesn't have to produce regular expressions per se; I use them here as an illustration... simply outputting one type of character for something that varies and a literal for something that stays constant would be sufficient... but it is important to me to be able to identifiy patterns which span more than a single line of data.

I'm thinking that some sort of stochastic algorithm would be good... genetic algorithm or simulated annealing maybe... break down the originating text into small blocks, start putting blocks together at random, and see how much will fit the text as a whole...

I'm not stuck on writing something like this myself, although I will if I have to; if someone can point me to the right key-words on google, that will be fine; I just don't exactly know what to look for right now.

Anyway... something to chew on.

#2
fahlyn

fahlyn

    Learning Programmer

  • Members
  • PipPipPip
  • 35 posts
I'm assuming that you want this to be dynamic enough to be able to handle "any" pattern fed into this template generator without having to select delimiters or anything like that. I think, in order to do that, you'll first need to define the different types of characters that a pattern could contain. You've got alphanumeric characters, special characters (!,@,#,$,%...), and other special characters (\n,\t,\r,\s...). I'm not sure what the best way would be to do this, but just off the top of my head, it seems like you need to go through record by record and keep track of the position and character type of each character in the record. Then, as you go through each row in your file compare the positions and character types to the previously tracked positions and character types.

Anyway, I've never had the opportunity to use them, but Perl's formats might be something that would help in doing something like this.
Visit My Google Group Here: Web Development Innovation