The area scan (mode 0) checks for occurrences of the four 2-bit patterns, where each pair is checked as 'is it this one, or this one?'.
Is it 00 or 01? Is it 00 or 10? Is it 00 or 11? Is it 01 or 10? And so on, without repeating a pair; using the same 'is it this or this' comparisons, you cover every case of a stretch containing 1, 2, 3 or 4 distinct patterns (4 being only one possibility).
The idea is that any data stretch containing only 1, 2 or 3 of the patterns is always compressible, provided it is long enough to cover the overheads. A stretch containing all 4 patterns may or may not be compressible.
By identifying all the separate 1/2/3-pattern stretches, you can process these individually and intentionally leave the 4-pattern stretches as raw, with a length code acting as a counter for how many bytes of raw data follow before code data begins again.
Since a 4-pattern stretch can sometimes be compressed too, an attempt can be made on those stretches where applicable, the results compared against those of the 1/2/3-pattern stretches to see which treatment suits the data, and stretches cancelled where the results don't look feasible.
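As a rough sketch of how that classification could go (Python just for illustration; the window size and the merging of neighbouring windows are my own assumptions, not something fixed in the design):

def area_scan(data: bytes, window_bytes: int = 64):
    # Classify fixed windows of the stream by how many distinct 2-bit patterns they contain.
    stretches = []          # (start_byte, end_byte, distinct_pattern_count)
    for start in range(0, len(data), window_bytes):
        chunk = data[start:start + window_bytes]
        seen = set()
        for byte in chunk:
            for shift in (6, 4, 2, 0):         # four non-overlapping 2-bit slots per byte
                seen.add((byte >> shift) & 0b11)
        stretches.append((start, start + len(chunk), len(seen)))
    # Merge neighbouring windows with the same classification:
    # 1-3 patterns -> compressible stretch, 4 patterns -> leave as raw.
    merged = []
    for s, e, n in stretches:
        kind = 'compressible' if n <= 3 else 'raw'
        if merged and merged[-1][2] == kind and merged[-1][1] == s:
            merged[-1] = (merged[-1][0], e, kind)
        else:
            merged.append((s, e, kind))
    return merged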
This part is still not clearly written and is WIP. This is the idea.
The idea is to do everything in clear steps.
The initial shrinking of the file involves a 'proportional shrink', where the data is shrunk as-is, similar to a Huffman coding process: all bytes are scanned for byte+frequency and then represented as codes, so anything repeating in a row will have all of its codes in a row just the same.
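A minimal sketch of what that proportional shrink could look like as a plain Huffman pass over bytes (I only say it is 'similar to' Huffman, so take this as one way it could look, not the exact process):

import heapq
from collections import Counter

def proportional_shrink_codes(data: bytes):
    # Build Huffman-style prefix codes from byte frequencies (the 'proportional shrink').
    freq = Counter(data)
    # heap items: (frequency, tie-breaker, {byte: code-so-far})
    heap = [(f, i, {b: ''}) for i, (b, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                       # degenerate file: one distinct byte
        return {next(iter(freq)): '0'}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, i2, c2 = heapq.heappop(heap)
        codes = {b: '0' + c for b, c in c1.items()}
        codes.update({b: '1' + c for b, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i2, codes))
    return heap[0][2]                        # byte -> prefix code (bit string)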
Once the proportional shrink is done, separate processes can target the 'positioning' elements: clones/clusters, which are like a 'copy' of a string appearing at (random) places, where you target the ideal clusters, represent each as its own code and keep the 'copy' in the code tree; chunk length, which targets things repeating at set distances (like a video with common data in each frame, where the data recurs a specific length apart); and in a row, which is captured directly as a number.
The idea is that by treating each element/component individually, you gain the highest compression by handling the targeted prevalence in the most suitable way possible. Repetition in a row is captured at the smallest size by representing it as a number. For strings, finding the minimal variable strings and then the clones/clusters that form the ideal set, with minimal codes to represent them, is the smallest possible way.
The point is to incorporate all possible elements/components to target prevalence in the most suitable manner for the smallest file, and also to handle priorities/weights/clashes so that, for example, clone/cluster bypasses anything that is in a row, letting the in a row part take the entry as a number for efficiency.
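A small sketch of the in a row part on its own, capturing runs as a symbol plus a number (the minimum run length worth capturing is an assumption on my part):

def capture_in_a_row(symbols, min_run=4):
    # Replace runs with (symbol, run_length) entries; shorter runs pass through
    # untouched and stay available for clone/cluster/strings.
    out = []
    i = 0
    while i < len(symbols):
        j = i
        while j < len(symbols) and symbols[j] == symbols[i]:
            j += 1
        run = j - i
        if run >= min_run:
            out.append(('RUN', symbols[i], run))   # captured as a number
        else:
            out.extend(symbols[i:j])
        i = j
    return out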
I list a number of approaches: string, polarity gap, nCr and math, for example. These are alternate ways of reading the data bits, and each has the potential to compress some data better than another.
If you have 126 0 bits and 2 1 bits, say at positions 35 and 70, polarity gap identifies that you target bit '1', it starts at position '35', and the next occurrence is '35' apart, and that is pretty much all you need. With something like strings, you would require multiple entries in the code tree for the suitable 0/1 patterns, plus in a row entries of multiple values.
While both compress, the polarity gap method is more suitable for a smaller result.
nCr in general is beneficial for the smallest result on data where one bit value is heavily prevalent, irrespective of placement. nCr of 128,2 is 8128, which fits into 13 bits; even including the overhead of 'polarity 1', '128,2' and an index of up to 8128 as the highest possibility, you expect this to take the least space compared to strings or polarity gap.
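A small sketch of the nCr reading using a standard combinadic rank, which is just one way of assigning the index; for the 128-bit example with 1 bits at 35 and 70 the index is 2450 out of 8128 possibilities, fitting in 13 bits:

from math import comb, ceil, log2

def ncr_rank(n_bits: int, one_positions):
    # Combinadic rank of the set of '1' positions: an index in [0, C(n, k)).
    ps = sorted(one_positions)
    return sum(comb(p, i + 1) for i, p in enumerate(ps))

n, ones = 128, [35, 70]
index = ncr_rank(n, ones)                  # 35 + C(70, 2) = 2450
total = comb(n, len(ones))                 # 8128 possible arrangements
print(index, total, ceil(log2(total)))     # 2450 8128 13 -> the index fits in 13 bits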
There is also math, which looks at a specific pattern and the ADD/SUB shifts to the next one. There is more information on this in 'TO DO' at the bottom, and it still needs writing. The idea is to check ADD/SUB possibilities; it is similar to looking at the makeup of the neighbouring data and its shifts in polarity/position as resembling math, another way to read.
There are some data patterns where the result looks like a prevalent repeating pattern of ADD 1, SUB 2, for example, which will compress the data further than string/polarity gap/nCr: the data appears somewhat random yet has a clear math pattern which will compress.
Math also applies to chunk length, for things appearing at set distances: something like a pattern of 100 apart, 99 apart, 98 apart etc. can be represented as 'length 100, minus 1 each time' using math, which is impossible with static offset length values otherwise.
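A small sketch of checking the gaps between occurrences for a constant ADD/SUB step (only a single constant step is checked here, which is a simplification of the idea):

def fit_add_sub(gaps):
    # If the gaps follow a constant ADD/SUB step, return (start_gap, step); else None.
    if len(gaps) < 2:
        return None
    step = gaps[1] - gaps[0]
    if all(gaps[i + 1] - gaps[i] == step for i in range(len(gaps) - 1)):
        return gaps[0], step               # e.g. (100, -1) for 100, 99, 98, ...
    return None

print(fit_add_sub([100, 99, 98, 97]))      # (100, -1)
print(fit_add_sub([5, 9, 5, 9]))           # None -- not a single ADD/SUB step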
There is also the 'circumstantial' method I mention, applicable to chunk length and in a row, where you have something like 'if there is x and z appears n spots away, y will always follow x', or 'if there is x and z is 2 spots away, y always follows'.
All such ys in the file can then be removed and, similar to an emulator, all patterns are scanned for circumstances such as 'is this one x? If so, is the one 2 spots away z?'
There are some data patterns where such a circumstance is prevalent, and where applicable other common code entries can be rewritten with the same data, taking an extra code in the code tree but making the other data applicable for circumstantial.
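A small sketch of verifying one circumstantial rule over a stream of symbols, where the way the rule is written down is my own choice:

def holds_circumstance(symbols, x, y, z, d):
    # Check: 'whenever x appears and z is d spots away, y immediately follows x'.
    # Returns (rule_holds, times_it_applied).
    applied = 0
    for i, s in enumerate(symbols):
        if s == x and i + d < len(symbols) and symbols[i + d] == z:
            if i + 1 >= len(symbols) or symbols[i + 1] != y:
                return False, applied       # one counter-example breaks the rule
            applied += 1
    return True, applied                    # each application lets a y be dropped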
There are also 'mods' mentioned, which look for data that is a mirror/inverse of existing data to gain more compression: multiple entries can map to the same code plus a mirror/inverse mod code, reducing the code tree and the number and bit length of codes in the file all round.
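A small sketch of the mirror/inverse mod check on bit strings before adding a new dictionary entry (representing patterns as '0'/'1' strings is just my choice for illustration):

def find_mod(entry: str, dictionary: dict):
    # Return (existing_code, mod) if 'entry' is a mirror or inverse of a
    # dictionary pattern (bit strings like '0110'), else None.
    inverse = entry.translate(str.maketrans('01', '10'))
    for pattern, code in dictionary.items():
        if pattern == entry[::-1]:
            return code, 'MIRROR'
        if pattern == inverse:
            return code, 'INVERSE'
    return None

codes = {'0110': 'A'}
print(find_mod('1001', codes))    # ('A', 'INVERSE') -- reuses A's code plus a mod flag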
There is also more in the TO DO at the bottom.
The idea, ultimately, is to incorporate all the multiple approaches/components to capture prevalence in the most suitable manner, dynamically and with hardcoded priorities/weights/clash rules, in order to give the smallest possible result when using multiple components concurrently.
A temporary mapping file is used to record all the placements of the data all round. This will require generous temporary space, such as for all the proportional placements of strings. Once clone/cluster is done, the file is updated to reflect the changes and reductions; then in a row is applied to reduce the temporary mapping file even further.
The idea is to take time working on the temporary mapping file itself, once built from the proportional data, to ultimately have the ideal mapping file with which to write the final file.
For something like minimal variable strings, once you detect the most effective variable-length string in the file, all of its positions need to be mapped in order to rescan the file, find the next most effective variable-length string without a clash, and accurately keep all the position data.
The temporary mapping file is used to keep track of these things.
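A rough sketch of that loop, with an in-memory position map standing in for the temporary mapping file; the 'effectiveness' score and the string length limits are assumptions of mine:

def pick_strings(data: bytes, min_len=2, max_len=8, rounds=3):
    # Greedily pick the most effective substrings and map their positions,
    # masking them so the next pick cannot clash.
    taken = [False] * len(data)             # positions already claimed
    mapping = []                            # (substring, [positions])
    for _ in range(rounds):
        best, best_saving, best_pos = None, 0, []
        for length in range(min_len, max_len + 1):
            counts = {}
            for i in range(len(data) - length + 1):
                if not any(taken[i:i + length]):
                    counts.setdefault(data[i:i + length], []).append(i)
            for s, pos in counts.items():
                saving = (len(pos) - 1) * length   # rough: bytes covered minus one copy
                if saving > best_saving:
                    best, best_saving, best_pos = s, saving, pos
        if best is None:
            break
        claimed = []
        for p in best_pos:                  # claim non-overlapping occurrences of the winner
            if not any(taken[p:p + len(best)]):
                claimed.append(p)
                for k in range(p, p + len(best)):
                    taken[k] = True
        mapping.append((best, claimed))
    return mapping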
Other common compression tools generally don't deal with variable-length data and confine themselves to 8-bit symbols, which means this part isn't as necessary for them; these tools also work in chunks, where each chunk can be dealt with and written for what it is, rather than one process over the entire file as a whole.
The point is to aim for the smallest possible compressed result, dynamically.
There is even the option of allowing multiple code trees of different types, such as giving a specific area whose data has a common frequency profile a static-length code tree just for that particular area, while the next area gets a variable code tree set up to suit its data.
Rather than building one code tree to suit the entire file, you can adapt to the content and data layout to compress the data further.
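A small sketch of deciding per area between a static-length code and a variable one, comparing fixed bits per symbol against the area's entropy (the area size and the decision rule are assumptions of mine):

from collections import Counter
from math import ceil, log2

def choose_tree_per_area(data: bytes, area_size=4096):
    # For each area, pick a static-length code when fixed bits per symbol are
    # already close to the area's entropy; otherwise use a variable code tree.
    plans = []
    for start in range(0, len(data), area_size):
        area = data[start:start + area_size]
        freq = Counter(area)
        static_bits = ceil(log2(len(freq))) if len(freq) > 1 else 1
        entropy = -sum((c / len(area)) * log2(c / len(area)) for c in freq.values())
        kind = 'static' if static_bits <= entropy + 0.1 else 'variable'
        plans.append((start, kind, static_bits, round(entropy, 2)))
    return plans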
This is a method which can take its time to create the file, but once created it extracts very simply, and it is essentially packed together tight from all that processing.
Giving examples of different formats isn't as clear as the idea of looking at the prevalence in the file.
If a file is all 0 bits, it can be captured by string and then in a row, and a 4GB file becomes the few bytes you expect it to be.
It all depends on the content of the data in the file and its prevalence, irrespective of 'file type'; the file is just an entry in the 0/1 data spectrum with prevalence, and the multiple approaches/components are there to find that prevalence.
From the initial WIP I was doing, I did some work on the area scan, minimal variable string, clone/cluster and in a row as a number. The rest, including the mentions in TO DO regarding moving data/fake extracting and other ways to further reduce the data, is still there.
From an example during that time of a 2.25MB 1024x768 BMP with 8 colours, where each colour fills 4 lines' worth (a value repeated 4096 times) and that cluster of 32 lines repeats 24 times, PNG was 5KB, RAR was 2KB and 7z was 1KB.
I figured this should be around 50B, considering it is a basic pattern: '4095 more times' as a number is very small and is used often, plus the repetition of the 32-line layer a further '23' times.
DEFLATE apparently got this to under 100B and with another shrink to suit 5 bits per entry and the header it would be around 50B.
This file is expected to reduce via its common strings, with chunk length accurately identifying the 32 lines repeating 24 times and in a row getting the accurate run counts for the colours, so that the file would be around 50B.
Alternatively, without chunk length, the data was expected to reduce with an internal repeat capturing the 24 times as encapsulation and still be around 50B: colour information reduced to codes, a couple of entries for '4095' and '23' for in a row, and a layered code tree with minimal codes all round, built for the data using the temporary mapping file.
Using smallest possible results and enough time, you figure a file with such prevalence should be around 50B.
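A back-of-envelope on why ~50B looks plausible for that BMP; every field width below is a rough assumption of mine, only meant to show the ballpark:

# Rough estimate for the 1024x768, 8-colour BMP example.
palette_bits   = 8 * 24          # 8 colours at 24-bit RGB
run_entry_bits = 8 * (3 + 13)    # per cluster: 8 runs of (3-bit colour code + 13-bit count 4095)
repeat_bits    = 5               # the 32-line cluster repeats a further 23 times
header_bits    = 16 * 8          # dimensions, tree/format overhead, etc.

total_bits = palette_bits + run_entry_bits + repeat_bits + header_bits
print(total_bits / 8)            # ~57 bytes -- in the same ballpark as 50B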
Other tools out there are cheap on space/time; 7z for example is essentially capable of clone/in a row, but doesn't target chunk-length data, on top of being weak with its chunk spacing.
Something like 4GB of data in a row would come out far from the few bytes expected when using 7z/RAR etc. This approach can be specific to the exact placement if there is a stray 1 bit in there, with exact and specific numbers for in a row, which would be smaller than the jump expected from 7z/RAR.
When using the raw components targeting the smallest possible result, you expect data to compress far smaller than with other tools.
The compression depends on the file, the prevalence it has, and the components used.
There is a 50/50 mentioned of files with and without prevalence, where the latter provides the slots for the compressed files. There is also mention of a CPU/space/bandwidth ratio, with prevalence being the element for transformation: for space/time purposes you cut down on CPU and increase space, and also relieve bandwidth, for example.
If one looks at minimal variable strings, minimal clone/cluster, chunk length, and in a row as a number, that alone should floor 7z, RAR and the other common tools in expected size. For media as well, with common data repeating at chunk lengths, it should be captured to a small size compared to other, possibly bloated approaches. It goes for the minimal compressed result with the components, working with the temporary mapping file and taking its time.
As for speed, ASM is suggested, along with skipping any GUI counter elements for direct processing, but it would involve reading the file a number of times, perhaps a few hundred left-to-right passes depending on the data, with each process reducing the temporary mapping file as it gets adjusted.
It is doable in reasonable time, using careful constraints to do the minimum scans required, and consider also that to be thorough in this way, this is what is mathematically required.
Extraction, once the file is tightly packed, is simply a matter of looking up the codes in the file against the code tree, and is very fast and straightforward. The compressed file itself should be very tightly packed.
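A small sketch of that extraction step, walking the bit stream against a prefix code tree:

def extract(bitstream: str, code_tree: dict):
    # Decode a prefix-coded bit string using the code tree (code -> symbol).
    out, current = [], ''
    for bit in bitstream:
        current += bit
        if current in code_tree:
            out.append(code_tree[current])
            current = ''
    return out

tree = {'0': 'A', '10': 'B', '11': 'C'}
print(extract('0101100', tree))     # ['A', 'B', 'C', 'A', 'A']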
I hope this is enough detail for you.