Hi,

I found an O(NP) Diff algorithm for finding the edit distance and the number of deletes.

However, I am trying to modify the algorithm to output a whole edit script in terms of deletes, inserts and unchanged characters or sequences.

Here is the algorithm:

http://research.janelia.org/myers/Papers/np_diff.pdf

And here it is in code:

comp.programming: An O(NP) Sequence Comparison Algorithm

I have not been able to find an easy way to modify the algorithm, especially while maintaining o(NP). Actually I am not even sure it is possible to modify this algorithm to output a whole edit script. I know it's a greedy algorithm, but I do not understand it fully.

In the version I have posted it outputs edit distance but it can be trivially modified to to output number of deletions, as these quantities are simply related.

Any help in getting an edit script algo, or if you can suggest an alternative algorithm, I would be grateful. I am looking for a fast and accurate algorithm that can deal with very large documents with significant differences.

Many free or commercial algorithms I have tried fail badly on very large documents, reducing them to a few massive delete and insert blocks. And I know better performance is possible with those documents as I have seen it used in some specialised red-line features of some software. Unfortunately those are not available for me. What I am really looking for is a DLL or code, commercial or free. A console application might also be acceptable as long as the output is simple and can be used by another program.

Thanks in advance! I really need the help!