
Volunteer Request for Compression Tool


#1 SoraK05

Posted 19 November 2014 - 05:04 AM

Hello.

I have posted in this forum in the past about a WIP I was putting together, which (seemingly) was not read and acknowledged for the WIP it was.

I am willing to post a further-developed WIP which should be functional for elements found in common media (audio/video) and which in general aims to reduce a file to an optimal minimal size.

It is a WIP and suggestions are welcome.

 

Between what is written and the 'TO DO' at the bottom, the general descriptions required for the overview are there.

Part of the writeup is fully functional to implement (including some TO DO items which should be implementable), and as written it should beat lossless compression for general media, among other types of data.

 

A 4 GB file of 0 bits should compress to the few bytes it should be. A file with repetition occurring every 10 spots should have that entry stored in a few bytes. Instead of looking only at 8-bit patterns to store, variable-length results can be obtained for an ultimately smaller code tree, fewer written entries and shorter lengths. Multiple ways to code, as well as multiple code trees, can be used dynamically to suit the data.

There is also more. Other tools in general do not produce these expected sizes.

 

 

When this is done without any GUI counters and with direct (blank CLI) processing, it should be usable with minimal scanning in mind for the more optimal selections. Source code which can be ported to ASM would also make it snappy.

 

 

This should make files much smaller than RAR/7z and the other compression methods, even those used in media, considering it combines multiple compression techniques and targets the smallest result.

 

 

 

For those willing to read and give constructive feedback, it is appreciated :)

Even the strings aspect alone would make for something functional that should compress to much smaller sizes than the tools out there, as more of an 'all-in-one'.

It is a WIP and some text may be jumpy; however, read the TO DO at the bottom.



#2 WingedPanther73

Posted 19 November 2014 - 07:48 AM

I think you are incorrect in assuming people didn't realize you had a WIP. What you failed to realize was we were attempting to offer constructive criticism or get you to iron out unclear details.

 

A general comment on the document, overall. I realize it's a technical document, but MS Word gives it a reading grade level of 14.0. I know, many of us are college graduates, but this implies the text is NOT as easy to read as it probably could be.

 

Lines 1-65 feel like a general motivational passage, but introduce many terms that have not been defined yet in the document and will not be universally understood. I would advise, for clarity, defining the various terms, or perhaps distributing that content throughout the document where it will be more readily understood. I was able to follow it mainly due to our previous discussions.

 

Mode 0: you list 15 general entries, but do not describe the logic behind them. After doing some analysis, I realized the pattern is the following:

 

1) list all possible bit patterns composed of the four bit pairs, from 1 pair up to 4 pairs in length

2) restrict the list to only those items where a bit pair is followed by a "higher" bit pair (00 can be followed by 01, but 01 cannot be followed by 00)

3) implication: this means that when reading a string in the source document, you know you start a new sequence when a bit pair is <= the current bit pair. This is also why you are limited to 4 pairs (the longest possible sequence before you hit the <= condition)

 

Explaining this logic will help implementers and assist in analyzing the impact of this analysis on various types of files.
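For reference, the 15 entries under rules 1 and 2 are exactly the non-empty subsets of the four bit pairs listed in ascending order. A minimal sketch of that reading (my interpretation, not code taken from the document):

from itertools import combinations

# Bit pairs in "ascending" order: 00 < 01 < 10 < 11.
PAIRS = ["00", "01", "10", "11"]

# Rules 1 and 2 above: sequences of 1 to 4 pairs in which each pair is strictly
# "higher" than the one before it, i.e. the non-empty subsets of the four pairs
# written out in ascending order.
patterns = ["".join(combo)
            for length in range(1, 5)
            for combo in combinations(PAIRS, length)]

print(len(patterns))   # 15
print(patterns)        # '00', '01', ..., '0001', '0010', ..., '00011011'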

 

Lines 130-141 would be served well by some examples of the analysis on various bit sequences, as it is not completely clear what the scan is doing, and a few short examples on various types of data would probably clarify what's being done. I'm thinking 20-100 bytes per example would probably be adequate.

 

Lines 149-160 are fairly unclear on an initial reading. My guess is that the suggested example above will make it clearer, as the reader will then have a clear understanding of what is done in Mode 0. Again, fleshing out the examples could be helpful as well.

 

Lines 174-201 may be more understandable if moved lower. Discussing advantages of nCr, for example, when nCr hasn't been fully introduced is very difficult. Think of the approach used in a math textbook, where a term is first defined and then discussed. Sometimes there is a concrete example to motivate the definition of a term, followed by that definition, and only then by further discussion.

 

Lines 220-461: in this case, we have an example with many calculations, but nothing has been defined in advance.

 

Lines 462-471: Discussion of temporary files is mentioned, along with further analysis. Examples of both would probably be helpful.

 

Lines 498-511: It would seem that polarity gaps would have limited benefit over the STRINGS method. Can you present an example of a string where polarity gaps are preferable?

 

Lines 516-526: It took a couple of readings to get at what you're after with this. While an interesting idea, I have several concerns with it. First, you talk about how, with 98% 0 and 2% 1, the number of combinations is comparatively very small, but that is not the case for any file of non-trivial size. For example, 1000 bits that are 98% 0 have about 3.39e41 possible arrangements. So we need a ~140 bit integer to store that, plus the overhead of iterating through all those possible strings to get to the "correct" one for both compression and decompression. Any algorithm that throws O(n!) time complexity into the mix is probably a VERY BAD IDEA ™.
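For anyone wanting to check that arithmetic, a quick sketch (assuming exactly 2% ones, i.e. 20 ones in 1000 bits):

from math import comb, log2

n, k = 1000, 20        # 1000 bits, 98% zeros / 2% ones
count = comb(n, k)     # number of distinct arrangements
print(count)           # ~3.39 * 10**41
print(log2(count))     # ~137.96, so roughly 138 bits just to index one arrangement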

 

Lines 530-539: I have deep concerns about both the time complexity of this method, as well as the actual space required to store the results. In general, I would expect something that is amenable to this method to already compress well under the STRINGS method. Can you provide examples where this method is superior to STRINGS?

 

I did not go on beyond this, as I think there's already enough here for major discussion.

 

I have a few more concerns that may or may not have been addressed, but I'll raise them here as a simple "is it dealt with?": you are using temporary files a lot. Given that files that are worth compressing tend to be large, I think this will be a necessity for your algorithm, but it also introduces a significant problem for it: disk access is generally slow. Depending on the intensity of the data read/writes and where they occur inside of other time-complexity intense operations, this could cause compression/decompression of files to change from a slow process to an infeasible process. Can you address this clearly?

 

For temp files, I did not see any indication of what the file formats might look like, nor what a temp file for stage 3 of 5 (etc) in processing might look like. They could certainly be optimized later, but having a clearer indication of what you intend to record would help.

 

Do you have a fully worked out example? The only example I saw was Mode 1: STRINGS, scan 1.

 

Can you comment on how much the algorithms in this WIP have changed from the previous version?

 

I still have two major concerns left over from the previous WIP document(s): Few examples, and somewhat unclear algorithms. It is difficult to analyze the effectiveness of your ideas based on what has been presented. Of interest for anyone seeking to analyze this would be the following:

1) average number of computations required to compress a 1KB file.

2) average number of computations required to decompress a compressed 1KB file.

3) average compression ratio of a 1KB file.

4) average compression ratio of a .BMP file (large amounts of repeating data)

5) average compression ratio of a .XML file (text)

6) average compression ratio of a .JPG file (compressed data)

7) average compression ratio of a .DOC file (mixture of text and binary data)

 

Obviously, having an implementation would be key to answering a number of the above questions, but I'm still unsure how to actually do that.




#3 SoraK05

Posted 21 November 2014 - 10:55 PM

The area scan (mode 0) checks for iterations of 2-bit patterns, where each is checked as 'is it this one, or this one?'.

Is it 00 or 01? Is it 00 or 10? Is it 00 or 11? Is it 01 or 10? Without repeating, and using the same combinations of 'is it this or this', you have the iterations for 1, 2, 3 and 4 patterns (4 patterns being 1 possibility only).

 

The idea is that any data stretch containing iterations of 1, 2 or 3 patterns is always compressible, providing enough length for overheads. A 4-pattern stretch may or may not be compressible.

By identifying all the separate 1/2/3-pattern stretches, you can process these individually and intentionally treat the 4-pattern stretches as raw, with codes for length acting as a counter for how many bytes, for example, are raw data before code data begins again.

Since a 4-pattern stretch can also be compressible, an attempt can be made on the 4-pattern stretches where applicable, comparing results to those of the data in the 1/2/3-pattern ones to see which treatment is suitable, and where possible cancelling stretches where the results don't look feasible.

This part is still not clearly written and is WIP. This is the idea.
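A minimal sketch of how a stretch's pattern count could be measured, assuming the stretch is read as non-overlapping 2-bit pairs (that read-out is my assumption; the writeup does not pin it down):

def distinct_pairs(bits: str) -> int:
    """Count how many of the four 2-bit patterns occur in a stretch,
    reading it as non-overlapping pairs."""
    return len({bits[i:i + 2] for i in range(0, len(bits) - 1, 2)})

print(distinct_pairs("00000000"))       # 1-pattern stretch
print(distinct_pairs("0001000100"))     # 2-pattern stretch
print(distinct_pairs("00011011" * 4))   # 4-pattern stretch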

 

 

The idea is to do everything in clear steps.

The initial shrinking of the file involves a 'proportional shrink' where the data is shrunk as it is, similar to a Huffman coding process where all bytes are scanned for byte+frequency and then represented as codes - anything repeating in a row will have all its codes in a row just the same.
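The 'proportional shrink' is described as Huffman-like, so plain byte-level Huffman coding is a reasonable stand-in to illustrate the step (a standard technique, not necessarily the exact scheme intended here):

import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """Assign shorter bit codes to more frequent byte values."""
    freq = Counter(data)
    if len(freq) == 1:                    # degenerate case: a single symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, {byte value: code so far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

data = b"aaaaaaaabbbbccd"
codes = huffman_codes(data)
encoded = "".join(codes[b] for b in data)
print(codes)                             # byte value -> bit code; frequent bytes get shorter codes
print(len(encoded), "bits instead of", len(data) * 8)

Note that, exactly as described, a run of identical bytes simply becomes a run of identical codes; the run itself is not exploited until the later 'in a row' step.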

 

Once the proportional shrink is done, separate processes can target the 'positioning' elements: clones/clusters, similar to a 'copy' of a string which appears around the file (at random), where you can target the ideal clusters, represent them with their own code and keep the 'copy' in the code tree; chunk length, which targets things repeating at set distances (like a video with common data in each frame, where data recurs a specific length apart); and in a row, which is captured directly as a number.

 

The idea is that by treating each element/component individually, you can gain the highest compression by treating the targeted prevalence in the most suitable way possible. Repetition in a row is captured in the smallest form by using a number to represent it. Looking at strings, if you find the minimal variable strings and then the clones/clusters for the ideal set of these, with minimal codes to represent them, it is the smallest possible way.

The point is to incorporate all possible elements/components to target prevalence in the most suitable manner for the smallest file, and also to consider priorities/weights/clashes, so that something like clone/cluster bypasses anything that is in a row and the in-a-row part gets its entry as a number for efficiency.

 

 

I list a number of approaches: string, polarity gap, nCr and math, for example. These are alternate ways of reading data bits, and each has the potential to compress data better than another.

If you have 126 0 bits and 2 1 bits, for example at positions 35 and 70, then using polarity gap you identify that you target bit '1', it starts at position '35', the next occurrence is '35' apart, and that is pretty much all you need. For something like strings, you would require multiple entries in the code tree for the suitable patterns of 0/1, plus in-a-row entries of multiple values.

While both compress, the polarity gap method is more suitable for a smaller result.
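A minimal sketch of the gap idea as described (target bit, first position, then the distances between occurrences):

def polarity_gaps(bits: str, target: str = "1"):
    """Describe a bit string by the first position of the rarer bit
    and the gaps between its occurrences."""
    positions = [i for i, b in enumerate(bits) if b == target]
    gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    return len(bits), target, gaps

bits = ["0"] * 128
bits[35] = bits[70] = "1"
print(polarity_gaps("".join(bits)))   # (128, '1', [35, 35]) -- starts at 35, next one 35 apart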

 

Using nCr in general on data where there is a large prevalence of 0 or 1 in particular, irrespective of placement, is beneficial for the smallest result. nCr of 128,2 is 8128, which fits into 13 bits; even including the overhead for 'polarity 1', '128,2' and the index of up to 8128, you expect this to take the least space of strings/polarity gap/nCr.
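One standard way to turn 'which 2 of the 128 positions hold the ones' into a single number in [0, 8128) is combinatorial ranking; a sketch for illustration (the writeup does not specify the exact encoding):

from math import comb

def rank_positions(ones):
    """Map a sorted list of one-positions to a single index in [0, C(n, k))."""
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(ones)))

print(comb(128, 2))               # 8128 possible placements, so the index fits in 13 bits
print(rank_positions([35, 70]))   # 2450 -- one number identifying this particular placement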

 

There is also math, which looks at a specific pattern and the shifts in ADD/SUB to the next one. There is more information on this in the 'TO DO' at the bottom, and it still needs writing. The idea is to check ADD/SUB possibilities; this is similar to looking at the makeup of the neighbouring data and its shifts in polarity/position, resembling math, as another way to read.

There are some data patterns where the result could look like a prevalent repeating pattern of ADD 1, SUB 2, for example, which will compress the data more than string/polarity gap/nCr: the data appears somewhat random yet has a clear math pattern which will compress.

Math also applies to the chunk length of things appearing at set distances, where something like a pattern of 100 apart, 99 apart, 98 apart etc. can be represented as 'length 100, minus 1 each time' using math, which is impossible with static offset length values otherwise.
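A small sketch of the delta idea: store the first distance and then the change to each following distance, so an arithmetic run like 100, 99, 98, ... collapses into a start value plus a constant step:

def delta_encode(values):
    """Keep the first value, then the difference to each following value."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

gaps = list(range(100, 90, -1))   # distances of 100, 99, 98, ... 91 apart
print(delta_encode(gaps))         # [100, -1, -1, ...] -- a highly repetitive stream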

 

 

There is also the 'circumstantial' method I mention, applicable to chunk length and in a row, where you have something like 'if there is x and z appears n spots away, y will always follow x', or 'if there is x and z is 2 spots away, y always follows'.

All ys in the file will be removed and, similar to an emulator, all patterns will be scanned for circumstances such as 'is this one x? if so, is the one 2 spots away z?'

There are some data patterns where such a circumstance is prevalent, and where applicable other common code entries can be rewritten with the same data and given an extra code in the code tree, with the other data handled as circumstantial.

There are also mods mentioned which look for data that is a mirror/inverse, to include more compression, where multiple entries can apply to the same mirror/inverse mod code and reduce the code tree and the number/length of code bits in the file all round.

 

 

There is also more in the TO DO at the bottom.

 

The idea is ultimately to incorporate all the multiple approaches/components to capture compression/prevalence in the suitable manner, dynamically, with hardcoded priorities/weights/clashes, in order to give you the smallest possible result when using multiple components concurrently.

 

 

A temporary mapping file is used to record all the placements of the data all round. This will require generous temporary space, such as for all proportional placements for strings. Then, once clone/cluster is done, the file is updated to reflect the changes and reductions. Then in a row is applied to reduce the temporary mapping file even further.

The idea is to take time working on the temporary mapping file itself, once built with the proportional data, to ultimately have the ideal mapping file to use to write the final file.

 

For something like minimal variable strings, once you detect the most effective variable-length string in the file, all its positions need to be mapped in order to rescan the file and find the next most effective variable-length string without a clash, and to accurately keep all position data.

 

The temporary mapping file is used to keep track of these things.
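The writeup never pins down what a mapping-file record contains; purely as an illustration of the kind of information it would have to carry (every field name here is hypothetical):

from dataclasses import dataclass, field

@dataclass
class MappingEntry:
    """Hypothetical record in the temporary mapping file: which component
    produced the entry, what the entry stands for, and where it occurs."""
    component: str      # e.g. "string", "clone_cluster", "chunk_length", "in_a_row"
    pattern: str        # the bit string (or reference) the entry represents
    positions: list = field(default_factory=list)   # bit offsets in the original file
    repeat: int = 1     # run count, filled in later by the 'in a row' pass

entries = [MappingEntry("string", "0110", positions=[0, 4, 8, 12])]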

 

 

Other common compression tools generally don't deal with variable-length data and confine themselves to 8 bits, which means this part isn't as necessary for them; these tools also work in chunks, so each chunk can be dealt with and written for what it is, rather than as a process on the entire file as a whole.

 

The point is to aim for the smallest possible compressed result, dynamically.

There is even the option to allow for multiple code trees of different types, such as having a specific area with a common frequency of data suited to a static-length code tree just for that particular area, while the next area has a variable code tree of a specific setup to suit its data.

Rather than building one code tree to suit the entire file, you can adapt to the content and data layout to compress the data further.

 

 

This is a method which can take its time to create, but once created can extract very simply and is essentially packed together tight with all that processing :)

 

Giving examples of different formats isn't as clear as the idea of looking at the prevalence in the file.

If a file is all 0 bits, it can be captured by string and then in a row, and a 4 GB file can be the few bytes you expect it to be.

It is all dependent on the content of the data in the file and its prevalence, irrespective of 'file type': the file is an entry in the data spectrum of 0/1 with prevalence, where the multiple approaches/components are able to find that prevalence.

 

 

 

From the initial WIP I was doing, I did some work on the area scan, minimal variable string, clone/cluster and in a row as a number. The rest including mentions in TO DO regarding moving data/fake extracting and other things to further reduce the data is there.

 

As an example from that time: for a 2.25 MB 1024x768 BMP with 8 colours, where each colour fills 4 lines (4096 pixels) and that cluster of 32 lines repeats 24 times, PNG was 5KB, RAR was 2KB and 7z was 1KB.

I figured this should be around 50B, if you consider it is a basic pattern and there is prevalence of '4095 more times' as a number, which is very small and used often, plus the repetition of the 32-line layer repeating '23' more times.

DEFLATE apparently got this to under 100B, and with another shrink to suit 5 bits per entry plus the header it would be around 50B.

 

It is expected that this file would reduce for its common strings, that chunk length would accurately identify the 32 lines repeating 24 times, and that in a row would get the accurate counts for the colours, so that the file would be around 50B.

Alternatively, without chunk length, it was expected that the data would reduce and that with an internal repeat it would capture the 24 times as encapsulation and still be around 50B, having reduced the colour information to codes, a couple of entries for '4095' and '23' for in a row, and a layered code tree with minimal codes all round, with the data assembled using the temporary mapping file.

 

Using smallest possible results and enough time, you figure a file with such prevalence should be around 50B.
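A back-of-the-envelope check of the 50B figure, assuming a naive description of that exact pattern (24-bit colour values, byte-aligned counts, and ignoring what a real container format would add):

# 1024x768, 8 colours, each colour filling 4 full lines (4096 pixels per run),
# and the resulting 32-line block repeated 24 times.
colour_table = 8 * 3   # eight 24-bit colour values
run_length   = 2       # "4096 pixels per run" fits in 2 bytes, stored once
block_repeat = 1       # "block repeats 24 times" fits in 1 byte
overhead     = 8       # rough allowance for a header / code tree
print(colour_table + run_length + block_repeat + overhead)   # ~35 bytes

This only shows that a description of this particular pattern is a few tens of bytes; it says nothing about what a general-purpose tool, with its own headers and trees, would actually produce.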

Other tools out there are cheap on space/time, and 7z for example is capable of clone/in a row essentially and doesn't target chunk length data on top of being weak with its chunk spacing.

Something like 4GB in a row would be far from the few bytes expected when using 7z/RAR etc. This can be specific to the exact placement if there is a 1 bit in there and have exact and specific numbers for in a row, which would be smaller than the jump expected in 7z/RAR.

 

 

When using the raw components targeting the smallest possible results, you expect data to compress far smaller than with other tools.

The compression depends on the file, the prevalence it has, and the components used.

There is a 50/50 mentioned of files with and without prevalence, where the latter provides the slots for the compressed file. There is mention of a ratio of cpu/space/bandwidth, with prevalence being the element for transformation, where for space/time purposes you cut down on cpu and increase space, and also relieve bandwidth for example.

 

 

 

If one looks at minimal variable strings, minimal cluster/clone, chunk length, and in a row as a number, this should by itself floor 7z, RAR and the other common tools in the expected size. For media as well, with common repeating data at chunk lengths, it should be captured to a small size compared to other, possibly bloated approaches. It is going for the minimal compressed result with the components, working with the temporary mapping file and taking its time.

As for speed, ASM is suggested, along with skipping any GUI counter elements for direct processing, but it would involve reading the file a number of times, perhaps a few hundred left-to-right passes depending on the file, reducing the temporary mapping file with each process as it gets adjusted.

 

 

It is doable in reasonable time, using careful constraints to do the minimal scans required, and consider also that to be thorough in this way, this is what is mathematically required.

Extraction, once the file is tightly packed, is simply a matter of looking at the codes in the file and the code tree, and is very fast and straightforward. The compressed file itself should be very tightly packed.

 

 

I hope this is enough detail for you.




#4 WingedPanther73

Posted 24 November 2014 - 06:07 AM

[Quoted from SoraK05's previous post: the mode 0 area scan and the treatment of 1/2/3-pattern versus 4-pattern stretches.]

 

 

Okay, first of all, being a WIP is not an excuse for not addressing issues that people have raised. Why does this WIP version not address the numerous points that were raised in the last one?

 

With that said, identifying 15 patterns requires 4 bits to store. With that in mind, how are you actually going to do compression, when you will be expanding the number of bits more often than shrinking them?

 

 


 

[Quoted from SoraK05's previous post: the proportional shrink, the positioning elements (clone/cluster, chunk length, in a row), and the priorities/weights/clashes between components.]

 

 

Unfortunately, you have not provided clear steps. That is the one big flaw that was pointed out to you last time, and it remains an issue with your WIP. You really, really need to spell out those clear steps for us. Ideally, with examples. Any analysis of efficiency, space or time, is nearly impossible without clear steps.

 

 

[Quoted from SoraK05's previous post: the string, polarity gap, nCr and math approaches.]

 

The above is an excellent example of what I mean. You present three methods in broad generalities, but no detail. nCr is the only one that is fairly self-evident from an encoding perspective, but there are still deep concerns I brought up which you have not addressed.

 

When discussing compression, it's important to remember the context of what we're doing. For the business world, as an example, when a file reaches a certain size we just put it on an external hard drive and mail it physically. So, if you come up with a compression mechanism that, on average, shrinks a file to 1/100th the size of the original, but takes a month to do it compared to simply copying the file to an external hard drive for a day and overnighting it to the destination, there is no value. If the decompression is also time prohibitive, then no matter how fantastic it is, it's useless.

 

Until we can analyze some of your algorithms in detail, we have no idea what we're dealing with. I realize it's a WIP, but now you know what you need to focus on: providing the details required for analysis, not generating new methods.

 

 

[Quoted from SoraK05's previous post: the circumstantial method, the mirror/inverse mods, and the temporary mapping file.]

 

Again, details. I can tell you that GIT uses difference files to store changes from one version of a file to the next, but unless I show you the file format and explain it, that doesn't mean you can analyze how it works or provide feedback on it. The fact that you have several different methods involved means your temporary mapping file has to keep track of the method as well as the result. And it has to (presumably) do it in such a way that it's relatively easy to reverse the process, as I would expect the temporary mapping file to be part of the resulting compressed file format. But without details, I don't know for sure. What I envision as an implementation and what you envision could be very, very different.

 

 

[Quoted from SoraK05's previous post: multiple code trees, the 8-colour BMP example, and the comparison with 7z/RAR.]

A BMP with only 8 colors sounds highly unrealistic. Almost all images allow the user access to a MINIMUM of 256 colors, and generally the full 16.7 million. Even the web supports all colors. As the number of colors goes up, the number of repetitions drops. Using an unrealistic example for discussion of possible compression savings is generally not going to be persuasive, as those are considered outlier scenarios anyway. Nobody is working to optimize compression of such unusual files for a reason: they don't exist in the real world.

 

 

[Quoted from SoraK05's previous post: expected compression results, speed, and extraction.]

The fact that you're concerned about skipping a GUI interface, when that will likely NOT be the factor that slows down processing, is very concerning to me. You are also asserting that time will not be a factor, when I've demonstrated that the naive implementation of part of your compression (nCr) absolutely WILL have time as a factor. Without further explanation, I can only attribute your claim to either a non-obvious implementation I haven't thought of which you haven't provided, or wishful thinking. And we come back to details. You are making claims that you cannot back up, because the details are not present to confirm or refute them.

 

What is almost certain is that you are going to take longer to process files than standard compression techniques. This is for the simple reason that instead of having a single pass to compress, there will be many. Moreover, these additional passes are more complicated. So, for a file I would expect 7zip to take an hour to compress (as an example), I would expect your algorithm to take around eight hours (assuming there are no computational time-bombs, which I suspect there are).

 

At this point, everything comes down to details. I cannot give you a lot of constructive feedback because there are very few details. There are ideas for algorithms, but no actual algorithms. There are ideas for processing strategies, but no file format specifications. There are claims of compression results, but no method to verify those claims.

 

Finally, here's the real world concern that makes me interested in your project: one of my jobs is working with customer databases to analyze issues they're having with our software. Around once every couple months, I need to get a database from a customer. We have an FTP site they can upload files to with a fiber connection to the internet. Customers are generally the limiting factor on uploading their files. Databases range from 100MB to 500GB in size. For a 100MB file, compressing it before uploading isn't really a factor, though it is a good idea. For a 500GB file, uploading it is almost never a feasible option. Most databases compress to around 1/10 their size with 7zip or WinZip. Your mechanism would have to compress files well enough to make FTP a viable option for large files, while doing it fast enough to account for the savings in time. In general, databases over 30GB are not good candidates for FTP, even after zipping.




#5 SoraK05

Posted 24 November 2014 - 06:48 AM

I'm not a programmer. I can't give you pseudocode steps including all ints, loops or I would do it :)

 

I suggest an overview, which is something that can be transformed into an algorithm suitable for a computer language.

My steps, if I were to write them down, may be some hybrid of a high- and low-level language, and in general a programmer should be able to tell that 'minimal variable scan requires scanning for individual bit strings over and over' and be able to scan for individual bit strings throughout the file, or 'look for the heaviest combination of results for clone/cluster', which will scan the temporary mapping file for entries and combinations over and over.

While the exact steps are not there, this is something a programmer should be capable of.

 

 

The 4-pattern scan has to do with separating compressible and uncompressible areas.

A stretch in a file may contain 1, 2, 3 or 4 of the 2-bit patterns. A stretch may include 00, 01 and 10 in different iterations, qualifying it as a 3-pattern stretch.

A stretch of 00 for a while is a 1-pattern stretch.

The point is that any of the 1/2/3-pattern stretches will always be able to compress, providing the basic minimum length for overheads. A 3-pattern combination at the least can have one pattern represented by 1 bit and the other 2 represented by 2 bits.

The 4-pattern areas require prevalence to compress, unlike the 1/2/3-pattern ones which will always compress.

The point is to identify the suitable offset ranges for the 1/2/3-pattern areas to separate them from the 4-pattern ones.

I have been thinking that perhaps only using a 3 scan should suffice for this.

There is more that I am working on myself regarding being suitable to the data arrangements and areas in 4-pattern stretches, where for example some individual portions may qualify, to be able to dynamically compress the data to the specific content (this particular region will be deemed compressible from having many repeating strings in it, as opposed to the same strings not constituting compression in the rest of the file).

 

 

This tool aims to be very specific about reducing a file to its minimum and looks for minimum results.

Even without being very specific to the data contents, such as using the area scan of mode 0 or the variable pattern approach which I am thinking about, this should beat 7z and in fair time.

The example of the BMP demonstrates something very basic about the capability of the tools out there to compress, where this should get it to the smallest you would expect, and this would have the same effect on general data all round.

 

 

A game, for example, which has common file assets in its USA and PAL versions would generally be able to have one specific string associated as the clone/cluster and have a single code represent each occurrence, rather than having this data repeat. It can do the equivalent of looking at 'copies' which are 'random'. If a game has common files in the USA/PAL versions, in general you will have one instance in the code tree and each occurrence will be represented by a single code.

7z works in chunks and looks for repetition of a pattern appearing (as bytes) left-to-right, associating an 'offset' and 'length' in general, with a repeat to accommodate things like 0 bits repeating in a row.

 

 

This in essence is a chunk-constrained form of dealing with the equivalent of 'pattern / clone/cluster / in a row', and is far from efficient. It can do things quickly using a sliding window, capturing the offsets+lengths determined in its sliding-window cache as it scans left-to-right.

It cannot deal with chunk length at all, and even for the pattern / clone/cluster / in a row it does deal with, on 'ultra' its capacity is limited by chunks and the thoroughness isn't programmed in, leaving things like the common game files completely missed for the single entry (and the minimal representation) they could be.

 

 

Common compression tools are very loose and cheap, with time/space in mind, when capturing pattern / clone/cluster / in a row. These tools are generally incapable of much more, and don't even have a way to internally repeat and get more compression from the data shuffle after compression; instead you may be able to get some more done if you recompress the file once or twice.

 

 

If you want to be thorough, you require the time and patience.

GUI counters and the code to constantly redraw the GUI itself can be very intensive for the computer, where calculations are constantly done to give data feedback and actually construct the image to output to screen.

You may find that even with a counter that printfs an int from 0 up to 16,777,216, printing each increment with a \n takes far more time than printing only the final 16,777,216 at the end. Not accounting for a GUI and counters to begin with allows for intense, dedicated processing and avoids a reasonable detraction.
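The per-iteration output cost is easy to demonstrate (scaled down here to 100,000 iterations), though as noted elsewhere in the thread it is usually not the dominant cost of the algorithm itself:

import time

N = 100_000                    # scaled down from 16,777,216

start = time.perf_counter()
count = 0
for _ in range(N):
    count += 1
print(count)                   # one line of output at the end
quiet = time.perf_counter() - start

start = time.perf_counter()
for i in range(N):
    print(i)                   # progress output on every increment
noisy = time.perf_counter() - start

print(f"print once: {quiet:.3f}s, print every step: {noisy:.3f}s")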

 

 

The temporary mapping file is used to identify 'entries' before assigning a code tree to them, and the positions of those entries. If you do a minimal variable string scan, for example, you get the minimal variable string which fits the file, and then you need to mark down every location of its occurrence in the temporary mapping file in order to be able to accurately rescan the original data and skip all those locations to get the next result, and so on. The final result allows for a code tree if desired, using the temporary mapping file to acknowledge which entries go to which exact positions, so that the file is eventually written in conjunction with the temporary mapping file for the 'entries' and their 'positions'.

 

 

The writeup, while giving some vague WIP [TO DO] parts and only an expected overview for some of it, gives a defined description of what the tool should do that a programmer should be able to work from. This detail, in conjunction with the expectation of things like the temporary mapping file, should be enough.

I have not perfected the expectation of storing the code tree, but this is something a programmer should know. I have not looked at the most economical and effective code tree and data construction. The premise however is to have the code tree(s), and to have the ideal data file constructed after all data is gathered and the temporary mapping file is finalized.

There would be 2 code trees, where the first contains data pertaining to the strings / in a row as a number (according to the bit branch), and so on. After the temporary mapping file gets updated, the total number of entries expected to be written may reduce altogether, as well as all the entries to be written and their positions. A second tree is made which refers to the first tree, and this tree is the one used for the final file, so that all the data is fairly packed together in advance.

 

 

 

7z/rar and so on are cheap on space/time, working with chunks and aiming in general to deal with pattern (8-bit) / clone/cluster / in a row only, and they are not thorough/effective at even that on their highest settings (as demonstrated by a very basic example looking at these components and even an internal repeat in the BMP, or even a 4GB file of 0 bits which should be a few bytes and will be more).

 

This is definitely something aiming to be much more thorough than 7z/rar even if chunk length isn't in there, and files in general are expected to be far smaller.

 

If you look at benchmarks, you find that these tools are not high on the list and much more of the data in files can compress. I propose a detailed overview of a dynamic method to create multiple code trees to suit data arrangements, static and variable, which is something I know of no tool doing - I only know of arithmetic coding accommodating something similar.

 

 

This tool aims for the smallest possible result with the components/elements (variable string, minimal ideal clone/cluster/copy, repetition in a row directly as a number) and the assembling of the file to suit, when processed in advance.

Results should be far smaller than 7z/rar, and without any GUI at all it should definitely be done in decent time - not the 1 hour / 8 hours of your example, but less (when not using chunk length and mimicking a far more optimal result for what these tools look at).



#6 WingedPanther73

Posted 24 November 2014 - 08:01 AM

1) I'm not looking for pseudocode. I'm looking for a concrete enough description of the logic to be able to analyze it. At a minimum, worked out examples with explanations for each type of logic would be required.

 

For reference, I can give you the quadratic formula, and most any programmer can implement it. If I further note requirements that the program should display both real and complex solutions, that's sufficient.

 

At this point, you don't even have a sufficiently detailed overview.

 

2) Your BMP example is highly unrealistic, as I already stated. It is pretty much the opposite of general data.

 

3) You missed my point about the GUI. It can be CPU intensive, but that doesn't mean it is the best point of optimization. It depends on how it's written. I have far more concerns about the internal logic than anything else.

 

4) You are attempting to make comparisons to 7z when you don't have an algorithm yet. Don't worry about how good your algorithm is compared to others, worry about having one.

 

5) You are right, being thorough takes time. However, as I tried to point out, your algorithm is potentially (we can't analyze it yet) so computationally intensive that it becomes useless.

 

Finally, I have a very deep concern with your responses. I have asked for two concrete things: more details, and more examples. The one thing you have not provided in your responses are those two things. The one thing that you have not provided since the LAST version of your WIP is those two things. So, I have one question:

 

When will you provide more details about how your compression works, and when will you provide additional examples?




#7 WingedPanther73

Posted 24 November 2014 - 09:24 AM

As a side note, and to further the WIP: I am thinking about taking your current document and rewriting it. My goals are two-fold: first, to increase the general readability of the document and perhaps get more people involved, and second, to make it easier to identify gaps that need to be filled in.

 

If you have any suggestions about file formats, I'd be open to them. Alternatively, there may be value in reorganizing it as a wiki, so sections can be dealt with in discrete chunks and analyzed more carefully. A wiki page could also be kept as a living document to further discussion, hold code, etc.




#8 SoraK05

Posted 24 November 2014 - 01:15 PM

That would be nice.

In regards to specific details and methods, the general descriptions, in my opinion, should be self-explanatory as to the purpose and means of each approach.

 

Perhaps when I have written down the basics I can look to make visual examples, beyond the self-explanatory 'once you have entries, you can search for combinations of these, weighted as prefixes, to develop the most ideal minimal and heaviest set of combinations and thus dynamic clone/cluster results', for example.

 

To me this is a description of how to go about the process in a definitive manner, without diagrams.

 

I am still updating methods (TO DOs, perhaps shortcuts / reduced scanning). If you read the writing for each of the components, such as minimal variable string, it mentions the exact method to approach it, and there is one example for this. When you look at chunk length, it mentions the constrained scanning of looking for all distances for patterns and then listing the frequencies, used by weight, to acknowledge the ideal set of distances apart, as well as a repeat similar to clone/cluster immediately afterwards to group these together and be more dynamic, with multiple results in the same entry.

Also, where chunk length has a result such as '10 length' and this occurs 1000 times, each of the '10 length' entries will be aligned in the updated temporary mapping file, so that when 'in a row' is run it captures these as a '1000 times' entry. When the file is ultimately read, it will see the code for the 10 length and the next one as 1000 times, and these 2 codes in the file will cover the repetition in a minimal representation.

 

The components rely on inheritance and on being written in the appropriate sequence.

 

 

 

These things are written in the component sections. Some things are mentioned or missing and have some 'TO DO' with them, and at a minimum an overview expectation.

The only thing missing at the moment, beyond perhaps some idea for formatting the code tree / headering part, is the part I mentioned about being dynamic in compressing specific regions, to be more specific about compressible/uncompressible areas.

 

The dynamic multiple code tree part is mentioned with a written description.

 

Perhaps there are no examples but the method to approach the component is definitively written.

 

As I update more on my end, I can provide that. A wiki would be nice, with each component described more clearly. Each component also mentions variations for space/time to play with, such as static 8-bit strings rather than minimal variable strings (most common in tools), reading chunk length entries only left-to-right rather than dynamically in the file, and more.

 

 

 

EDIT:

Depending on the file, there may be more or fewer scans left-to-right to determine the required data all round, also depending on space/time constraining. Scans continue once the temporary mapping file gets updated. A file may be scanned a few times, or up to a few hundred times. When done intensely, a few hundred left-to-right passes with temporary updates of the mapping file should not be over the top.

 

These are the steps necessary, so beyond asm this is the mathematical requirement.

 

Setting the data to 8 bit and even forcing to do the data in chunks would give you something similar to most tools out there.

 

The great thing is the versatility once all the options are there.

 

And also, my comparison to 7z refers to the code behind 7z (sliding window + offset+length) and what it picks up, in comparison to what this will provide depending on settings (and it can be more thorough).

 

 

EDIT:

I did notice for chunk length scanning you can stop scanning when the distance to the next pattern is beyond half of the file, or is impossible for example. This constraint isn't mentioned and should reduce the total scanning.

 

EDIT:

In regards to file format, the temporary mapping file is used to identify entries and their positions before the final code trees are made.

 

There is a global code tree containing multiple bit branches, with data to identify each bit branch: one is for strings, one is a direct number for repetition in a row, one is a reference to go down the strings branch and combine results for a clone/cluster entry, and so on.

As the temporary mapping file eventually reduces to its more confined form, a second code tree is made at that point which relates to the entries left to process, and everything relies on the original code tree. An entry which is a clone/cluster will look at the original code tree for it, and its coding relates to the strings on that tree, for example.

Keeping the bit branches separate on the original code tree allows for distinguishing the separate components and treating the data differently. Keeping a secondary code tree specifically for the final form of the temporary mapping file, relating to the main one, allows for specific minimal codes relevant to the layout of that final form.

An internal repeat for some position parts (MODE 2 only) will make another code tree which relates to the second, main data tree and is generally there to accommodate newer entries where applicable, which are treated as layered when read, for example.

If the dynamic approach for multiple code trees is done, there will be multiple code trees made for the data part (the second tree part), all with IDs and terminators as mentioned in the coding part. There is incremental coding as well, which can perhaps be ignored for now as it is more beneficial when compression constraints are more thorough.

Something dynamic for static/variable however would be nice, and wouldn't take too long, since again the temporary mapping file, when ready, should have a reduced number of entries all round to write codes for.




#9 0xDEADBEEF

Posted 24 November 2014 - 01:33 PM

Personally, i'd like to see a worked example of the first method - then once that's fully described the next, and the next.

 

I don't have any experience with compression methods - so I read WingedPanther's posts with interest.


Creating SEGFAULTs since 1995.


#10 WingedPanther73

WingedPanther73

    A spammer's worst nightmare

  • Moderator
  • 17757 posts

Posted 25 November 2014 - 05:38 AM

That would be nice.

In regards to specific details and methods, the general descriptions in my opinion should be self-explanatory as to the purpose and means of the approach.

 

I agree they should be, but they aren't. This is why I keep pressing you for more details.

 


Perhaps when I have written down the basics I can look at making visual examples beyond the self-explanatory 'once you have entries, you can search for combinations of these, weighted as prefixes, to develop the most ideal minimal and heaviest set of combinations and thus dynamic clone/cluster results', for example.

 

I strongly encourage you to start doing the examples now. There are two reasons for this:

1) It increases the clarity of your writing, making it easier for others to understand what you're doing. You have a lot of things in your head, including your vision of what the processing should look like, that we don't. Since your descriptions tend to be very light on the details, a worked out example should correct that issue.

2) I suspect that you haven't worked out examples for all your ideas, even in your head. When you actually work out an example, you can find solutions and problems with an idea that you won't find in a general statement. As an example: when I first read your nCr idea, my first thought was to iterate through every single possibility. Since that would be an O(n!) process, I knew that would NEVER be a feasible strategy. As I thought about it, however, it occurred to me that I could use a strategy where I step down through the digits, calculating nCr (which is an O(n) calculation, so stepping through all of them is O(n^2)) on the remaining digits, and determine the "count" of that string as a reversible process (sketched below). This would give an O(n^2) process for calculating the result. The compression ratio would then be strictly dependent on the imbalance between the binary digits. Of course, this ignores the fact that repetition analysis would probably give you better results faster :)
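
A minimal Python sketch of that rank-based (combinadic) idea; it assumes Python 3.8+ for math.comb, and the function names are mine.

from math import comb

def rank(bits):
    # Position of `bits` among all strings of the same length with the same
    # number of 1s, in lexicographic order.
    ones = bits.count("1")
    r = 0
    for i, b in enumerate(bits):
        remaining = len(bits) - i - 1
        if b == "1":
            # every string with a 0 here and the same 1-count sorts earlier
            r += comb(remaining, ones)
            ones -= 1
    return r

def unrank(r, length, ones):
    # Reverse of rank(): rebuild the bit string from (rank, length, 1-count).
    out = []
    for i in range(length):
        remaining = length - i - 1
        with_zero = comb(remaining, ones)   # strings that put a 0 at this spot
        if r < with_zero:
            out.append("0")
        else:
            out.append("1")
            r -= with_zero
            ones -= 1
    return "".join(out)

s = "0001000010"                      # 10 bits, two 1s
print(rank(s))                        # -> 16, with 0 <= rank < C(10,2) = 45
print(unrank(rank(s), len(s), 2))     # -> "0001000010"

Storing (length, number of 1s, rank) is reversible, and the rank needs only about log2(nCr) bits, which is small exactly when the 0/1 ratio is imbalanced.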

 

[Bunch of stuff skipped]

 

In your final edit, you talk a LOT about the code tree... do you actually spec out the code tree at all in your document? I realize I skipped a lot, but I would have expected discussion of the code tree to be critical to documenting the various parts of processing. For example, if you find a longish string that repeats often, I would expect the notation of where that string occurs, what it is, etc. in the compressed file to be kind of critical information. What I would NOT expect is for it to result in a code tree. I would expect it to be a list of offsets.


Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

My MineCraft server site: http://banishedwings.enjin.com/


#11 SoraK05

SoraK05

    CC Addict

  • Advanced Member
  • PipPipPipPipPip
  • 115 posts

Posted 25 November 2014 - 06:05 AM

The specific construction of the code tree is not fully written. The general concept is one main code tree used for the main file entries, with bit branches that represent different ways to read data (one branch is for strings, another is for clone/cluster, which looks at combinations of codes from the strings branch, another is a plain number as-is for repetition in a row, etc.). Once the temporary mapping file is updated, the total number of entries is reduced, as are the positions/placements at which to write them. At that point a final code tree can be built that relates to those entries, with its data linking directly to the relevant bit branch and code on the main tree. This way, using a main tree and then a final code tree, you can have minimal codes suited specifically to the final temporary mapping data; a rough sketch of how such final codes could be assigned from entry counts follows below.
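
As one possible concrete form of that final code tree, here is a Huffman-style sketch in Python that assigns shorter codes to more frequent entries; the entry ids and counts are made up for illustration, and nothing here is meant as the exact tree layout from the document.

import heapq

def prefix_codes(counts):
    # `counts` maps an entry id to how often it appears in the reduced
    # mapping file; more frequent entries get shorter prefix-free codes.
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    codes = {entry: "" for entry in counts}
    heap = [(freq, i, [entry]) for i, (entry, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tag = len(heap)                       # tie-breaker keeps the heap tuples comparable
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for e in group1:
            codes[e] = "0" + codes[e]
        for e in group2:
            codes[e] = "1" + codes[e]
        heapq.heappush(heap, (f1 + f2, tag, group1 + group2))
        tag += 1
    return codes

print(prefix_codes({"run:3x40": 1, "clone:[1,2]": 2, "str:0101": 7}))
# -> {'run:3x40': '00', 'clone:[1,2]': '01', 'str:0101': '1'}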

 

There is a method to dynamically look through variable/static areas for multiple trees, which can create multiple code trees; it can perhaps also be applied to the original tree to save a few bytes and allow some more (smaller) files in the data spectrum to compress.

 

 

I don't have examples. However, each area that does not have a [TO DO] has a written description of what is expected. There is no direct example, but something like clone/cluster does have a written expectation: do something similar to the variable string scan, but specifically on the entries made in the temporary mapping file, to get the groups of entries that are eligible and the most suitable group (i.e. it can be mixed, with a result of 2, 4, 3, 2, etc., which are the exact suitable combinations).

Examples may assist, but the descriptions should be there. Only the [TO DO] items lack an exact description that can be used, and some may be vaguer than others that are closer to a full description.

 

 

It is also WIP. Variations such as applying a 7z-type method of listing offsets rather than direct coding, as well as ironing out the very specific constraints for when to use which method, are still to be done. The different methods are mentioned. There is a vague mention of using something like nCr when counting all 0/1 bits and the ratio is above a certain percentage, of using something like gap length when the ratio is below the nCr threshold but higher than the strings threshold, and of strings when the ratio is closer to 50%.

With more development and application of the methods, the boundaries that give the highest compression, i.e. the constraints for when to use which method, can be determined; a rough selector along those lines is sketched below.
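
A minimal Python sketch of such a selector; the 90% and 60% boundaries are placeholders of mine, since the real thresholds still have to be determined.

def choose_method(bits):
    # Pick a coding method from the 0/1 ratio.
    ones = bits.count("1")
    skew = max(ones, len(bits) - ones) / len(bits)
    if skew >= 0.90:
        return "nCr"          # overwhelming imbalance: enumerative (nCr) coding
    if skew >= 0.60:
        return "gap length"   # moderate imbalance: code the gaps between the rare bits
    return "strings"          # near 50/50: variable-length string scan

print(choose_method("0000000001" * 10))   # -> "nCr"
print(choose_method("0101010101" * 10))   # -> "strings"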

 

As for nCr, the expectation is to only use it when the ratio is overwhelming, such as 90%/10%; the file is expected to be the smallest possible when using it, and the number of iterations at such a ratio is expected to be minimal.

For the nCr component there could be ways to jump and skip in a mathematical manner, based on placement and odd/even positions, to reach the iteration number more quickly than scanning from 0. This has more to do with the coding of the components and is one of the finer steps.

 

These are all things that can be addressed, for the purpose of being thorough not only about the smallest size but also about the coding steps. This is open to suggestions.

I do, however, think that something more thorough even with only strings from the MODE 1 selection (a lot more than other tools) would be great to use for general files as it develops.

When you decide to read through and perhaps rewrite it or make a wiki, you can see how much already has descriptions, for example, and what can be added to be more specific.

I have left some things to the discretion of the programmer, ahead of doing exact specific research such as the precise code tree layout or the code steps for particular parts.

 

 

Also, clone/cluster as mentioned (including part of the information below) should assist in actually compressing, and representing more compactly, a longer stretch of data that is also a clone of something elsewhere, rather than keeping it raw or only slightly compressed with a reference to its offset, for example.

 

I updated my [TO DO] with some ideas btw.

 

 


- Deal with 4 Pattern Areas specifically
Where there is only a 4-pattern area in the file, or where there are 4-pattern areas in general, one can look at the other 1-3 pattern data to compress what is possible.
However, it is possible that a 4-pattern area has compression of its own. It may have something repeating at a chunk length or in a row through most of it, or it could have secluded areas with prevalent clone data.

The 4-pattern area can be dealt with individually as a full process, and from whatever compression is found, such as chunk length or in a row, those specific parts are kept while the rest is deemed raw where applicable.

Where some areas may be suitable for clones/clusters in a generally prevalent region, a scan has to be done on the data to determine, from the closeness of the compressed data (and using the other 1-3 pattern data), which areas are eligible to be counted for compression overall and break even on the lengths of the raw and compressed areas.

**IDEA**
Using the data from the pattern scans (a 3-pattern scan should get results for the 1/2/3 areas, which are always compressible since a 3-pattern stretch can always use 1-bit, 2-bit, 2-bit codes), retain the proportions of the dimensions of the stretches as the file is scanned and processed.
At the end, after the dynamic code tree scan is done and before writing the contents, and keeping the proportions of the stretches of the specific 4-pattern areas, scans can be done dynamically on the weight of entries for regions where some content is compressible as it is and other areas are not. This compares the original data length against the coded data lengths to confirm which specific regions are compressible or uncompressible (see the break-even sketch after this list). Clone/cluster, chunk length, in a row, and the population and weight within a region are also considered, in order to specifically locate the areas that do not break even against those that do. The final results get length entries, i.e. codes whose specific purpose is to give the length of the next section that will be treated as raw when written, up to the part of the data that is code data.

- Weights and priorities (clashes)
Iron out everything regarding methods, weights and priorities (clashes) so that the smallest result will dynamically be achieved.

- Shortcuts
Work on all applicable constraints. An example for chunk length is avoiding scans that seek the next possible hit where the length, given the file, is impossible.

- Incorporate Codes
Where clone/cluster is done, it is possible that it clashes with, or also incorporates, results from the chunk-length and in-a-row passes that follow. Similarly for chunk length, its result can clash with or incorporate results from the in-a-row pass that follows.

For both clone/cluster and chunk length, results that are specifically in a row are acknowledged in advance and ignored, only marking an entry point at the first entry so that the in-a-row pass picks these up accurately.

For this reason, when clone/cluster is done, its results are acknowledged and its regions are marked on the temporary mapping file but are not yet updated. When chunk length is done, any effects in the regions qualified for the earlier clone/cluster are updated in its data. As with chunk length, it otherwise remains as is, and anything in a row is treated for what it is.
The final clone/cluster and chunk-length results are then taken and updated, incorporating the mixed results of the variations, so that each entry accurately ends up in a non-clashing, smallest form.
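
A minimal Python sketch of the break-even check referenced in the **IDEA** item above; the per-region length-entry size of 16 bits is a placeholder of mine.

def classify_regions(regions, length_entry_bits=16):
    # `regions` is a list of (raw_bits, coded_bits) pairs for the stretches
    # kept through the scans. A region only counts as compressed if its coded
    # form plus the length entry separating it from the surrounding raw data
    # still beats the raw form.
    decisions = []
    for raw_bits, coded_bits in regions:
        if coded_bits + length_entry_bits < raw_bits:
            decisions.append("coded")
        else:
            decisions.append("raw")
    return decisions

# Example: the first region breaks even, the second does not.
print(classify_regions([(4096, 1200), (512, 600)]))   # -> ['coded', 'raw']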


Edited by SoraK05, 25 November 2014 - 06:12 AM.


#12 WingedPanther73

WingedPanther73

    A spammer's worst nightmare

  • Moderator
  • 17757 posts

Posted 25 November 2014 - 08:03 AM

SoraK, you wrote a LOT about the ideas of the code tree, but not a spec. There was no example. Just general ideas about what it might look like that may or may not conform at all with the various processing techniques you would use.

 

When I think, for example, about "3 instances of '000100100' at positions 93, 472, and 823", that doesn't correspond to a tree at all in my mind. I can't envision placing it into a tree. I'm not even convinced that a tree will be the appropriate overall data structure for your compressed file.
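
For what it's worth, a tiny Python sketch of what that list-of-offsets record could look like for this example; the record shape is mine, not something from the document.

from dataclasses import dataclass
from typing import List

@dataclass
class RepeatRecord:
    pattern: str          # the repeated bit string, stored once
    offsets: List[int]    # every position where it occurs

record = RepeatRecord(pattern="000100100", offsets=[93, 472, 823])
# For a pattern this short the offsets may well cost more than the raw bits;
# the point is the shape of the record, not the savings.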

 

So, you wrote numerous paragraphs that don't communicate meaning.

 

I almost get the impression that you're collecting ideas for things you could do faster than you're getting them down and hammering out details. At some point, you have to stop brainstorming and start putting meat on them.


Programming is a branch of mathematics.
My CodeCall Blog | My Personal Blog

My MineCraft server site: http://banishedwings.enjin.com/




