Jump to content

Thought Experiment: File Compression

- - - - -

This topic has been archived. This means that you cannot reply to this topic.
6 replies to this topic

#1
dargueta

dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,720 posts
EDIT: I'm an idiot...someone move this to General Programming.

So I had this idea for a lossless compression algorithm the other day, and I was wondering what others would think of it. It really just takes two simple unrelated ideas and puts them together. I doubt it'd be of much use in the real world, but it's something to ponder.

1) Take your file and attempt to sort its bytes, but keeping track of the changes so that it's possible to reconstruct the original. It doesn't need to be completely sorted, just enough so that there are large groups of bytes with similar or identical values. So if we have CAEBECDE as our input, we may get CCAEEEBD as our output. The more a file is sorted, the more additional data it requires to restore the original file from the sorted contents, so there needs to be a balance.

2) Divide the output from step 1 into blocks, like so: CC A EEE B D

3) Compress each block separately with adaptive Huffman encoding. Since the blocks are more or less sorted, byte frequencies will be more skewed, enabling more efficient compression per block.

4) Recombine the blocks into a compressed file.

Questions:
1) What would be the best sorting algorithm to use? We need to consider:
a) Speed
b) Amount of auxiliary data required to restore original order
c) Memory requirements

2) How much sorting should be allowed? Obviously we can't allow the file to be sorted completely, as it would require too much aux data. Or can we?

3) Would this actually result--in the typical case--in better compression than, say, LZW?

4) What modifications could be made to this technique to make it better? What are the potential pitfalls one would encounter?

Edited by dargueta, 27 August 2009 - 05:53 PM.

sudo rm -rf /

#2
BlaineSch

BlaineSch

    Writes binary right handed and hex left handed

  • Members
  • PipPipPipPipPipPipPipPipPip
  • 2,448 posts
Well if you have a counter for each byte you move, say you move one byte 30 times, an array of bytes the exact size of the program would exist... would this really compress?

#3
dargueta

dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,720 posts
I tried it with the bubble sort (just because it's quick to do). I used a single bit to indicate whether a swap had been made or not, and that worked just fine. With eight passes you can sort it enough to double the file size, but there are optimizations you can stick in the sorting algorithm to make it sort faster (cf. Comb Sort). You can use one or two bits per swap for the merge sort, I think, though I haven't tried it yet. Weekend project, perhaps.
sudo rm -rf /

#4
WingedPanther

WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
My concern is that the record of how to "unsort", very different from a sort record, may bloat quite a bit. Any analysis is going to come down to the efficiency of the sort record.
Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

#5
dargueta

dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,720 posts
Exactly. So basically we want to find a sorting algorithm that performs the minimum number of comparisons. I think the merge sort would work decently if we get rid of the merging part, since that would require a lot more extra data.
sudo rm -rf /

#6
ArekBulski

ArekBulski

    Speaks fluent binary

  • Members
  • PipPipPipPipPipPipPipPip
  • 1,376 posts
Are you thinking of sorting it on per byte basis, or to move some bigger and smaller blocks around?

#7
dargueta

dargueta

    Writes binary right handed and hex left handed

  • Moderators
  • 4,720 posts
Depends on the performance. For smaller file sizes individual bytes could work; for larger file sizes blocks would have to do.
sudo rm -rf /