Lost Password?

Go Back   CodeCall Programming Forum > Software Development > General Programming > Programming Theory

Programming Theory Discuss programming theory, algorithm efficiency, logic, and other any other category where math and computer science overlap.

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 01-13-2008, 04:05 PM
bartonski bartonski is offline
Newbie
 
Join Date: Jan 2008
Posts: 2
Rep Power: 0
bartonski is on a distinguished road
Default pattern recognition, template generator

Well, it's a lazy Sunday afternoon, a perfect time to noodle over a problem that's been bothering me for years.

I'm looking for an algorithm which will take an arbitrary piece of repeating text and create a template which shows which parts are constant and which parts repeat. I'll give several examples:

fixed width text:

Code:
Fred            Mbago   27 Fleet St.    Florence        MI      30444
Terrence        Swinge  100 Fnox Ln.    Ann Arbor       MI      28540
Bastian         Bux     54 Fnoo Ct.     Louisville      KY      40333
The thing that remains constant between lines is that there is always a certain amount of white space between columns, for example between the first and last name, there's always at least 8 spaces ('Terrence' is the longest first name, there are 8 characters between that and 'Swinge').

In this case, I would like the algorithm to create a pattern which shows the characters on a single line which vary, and which ones remain constant, perhaps the following regular expression:

Code:
(........)        (......)  (............)    (..........)      (..)      (.....)
The dot character represents a single character, the spaces are literal, and the parentheses group data that is to be captured.

The same data, but comma delimited would look like this:

Code:
Fred,,Mbago,27 Fleet St.,Florence,MI,30444
Terrence,,Swinge,100 Fnox Ln.,Ann Arbor,MI,28540
Bastian,,Bux,54 Fnoo Ct.,Louisville,KY,40333
the regular expression that would match one of these lines would look like this:
Code:
(.+),,(.+),(.+),(.+),(..),(.+)
finally, a multi-line pattern might look like this:

Code:
Address:
        Fred
        Mbago
        27 Fleet St.
        Florence
        MI
        30444
Address:
        Terrence
        Swinge
        100 Fnox Ln.
        Ann Arbor
        MI
        28540
Address:
        Bastian
        Bux
        54 Fnoo Ct.
        Louisville
        KY
        40333
The regular expression for capturing one of these stanzas might look like this:

Code:
/Address:\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n\s+(.+)\n/m
The algorithm doesn't have to produce regular expressions per se; I use them here as an illustration... simply outputting one type of character for something that varies and a literal for something that stays constant would be sufficient... but it is important to me to be able to identifiy patterns which span more than a single line of data.

I'm thinking that some sort of stochastic algorithm would be good... genetic algorithm or simulated annealing maybe... break down the originating text into small blocks, start putting blocks together at random, and see how much will fit the text as a whole...

I'm not stuck on writing something like this myself, although I will if I have to; if someone can point me to the right key-words on google, that will be fine; I just don't exactly know what to look for right now.

Anyway... something to chew on.
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote

Sponsored Links
  #2 (permalink)  
Old 02-14-2008, 07:23 AM
fahlyn's Avatar   
fahlyn fahlyn is offline
Learning Programmer
 
Join Date: Nov 2007
Age: 25
Posts: 35
Rep Power: 4
fahlyn is on a distinguished road
Send a message via ICQ to fahlyn Send a message via AIM to fahlyn Send a message via MSN to fahlyn Send a message via Yahoo to fahlyn
Default

I'm assuming that you want this to be dynamic enough to be able to handle "any" pattern fed into this template generator without having to select delimiters or anything like that. I think, in order to do that, you'll first need to define the different types of characters that a pattern could contain. You've got alphanumeric characters, special characters (!,@,#,$,%...), and other special characters (\n,\t,\r,\s...). I'm not sure what the best way would be to do this, but just off the top of my head, it seems like you need to go through record by record and keep track of the position and character type of each character in the record. Then, as you go through each row in your file compare the positions and character types to the previously tracked positions and character types.

Anyway, I've never had the opportunity to use them, but Perl's formats might be something that would help in doing something like this.
__________________
Visit My Google Group Here: Web Development Innovation
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply



Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On

Similar Threads
Thread Thread Starter Forum Replies Last Post
Template matching in Matlab? tommy_chai General Programming 0 12-16-2007 03:36 AM
Template Sites TVDinner Website Design 4 06-01-2007 10:32 PM
asp.net template help streulich ASP, ASP.NET and Coldfusion 4 12-01-2006 07:55 PM


All times are GMT -5. The time now is 05:57 PM.

Contest Stats

Xav ........ 161.68
neerlin ........ 100
satrian ........ 100
delia ........ 100
chili5 ........ 70.08
morefood2001 ........ 42.41
MeTh0Dz|Reb0rn ........ 28.44
RyanTuosto ........ 20
gamiR ........ 19.64
John ........ 14.46

Contest Rules

CodeCall Goal

Goal: 100,000 Posts
Complete: 68%

Ads