So I had this idea for a lossless compression algorithm the other day, and I was wondering what others would think of it. It really just takes two simple unrelated ideas and puts them together. I doubt it'd be of much use in the real world, but it's something to ponder.
1) Take your file and attempt to sort its bytes, but keeping track of the changes so that it's possible to reconstruct the original. It doesn't need to be completely sorted, just enough so that there are large groups of bytes with similar or identical values. So if we have CAEBECDE as our input, we may get CCAEEEBD as our output. The more a file is sorted, the more additional data it requires to restore the original file from the sorted contents, so there needs to be a balance.
2) Divide the output from step 1 into blocks, like so: CC A EEE B D
3) Compress each block separately with adaptive Huffman encoding. Since the blocks are more or less sorted, byte frequencies will be more skewed, enabling more efficient compression per block.
4) Recombine the blocks into a compressed file.
Questions:
1) What would be the best sorting algorithm to use? We need to consider:
a) Speed
b) Amount of auxiliary data required to restore original order
c) Memory requirements
2) How much sorting should be allowed? Obviously we can't allow the file to be sorted completely, as it would require too much aux data. Or can we?
3) Would this actually result--in the typical case--in better compression than, say, LZW?
4) What modifications could be made to this technique to make it better? What are the potential pitfalls one would encounter?
Edited by dargueta, 27 August 2009 - 05:53 PM.


Sign In
Create Account

Back to top











