Greetings everyone, this is my first time posting on these forums, and I am at my wits end. Over the last few years I have been diligently wracking my brain to complete a very simple compression program that I believe has just been overlooked because of heavy mathematic algorithm programs that offer very marginal compression on the files people really want to compress IE random data streams like video and audio.
So to address these complex, memory and processing heavy applications required to decompress and display heavily processed raw data. I've come up with a system which may seem odd and bizarre at first, but which has the potential to change the way data is stored.
First envision a system where only one true copy of the original file in it's full size ever has to exist (that's like me owning the only original DVD or Blu-ray).
Every other manifestation of the original file need only take up 50-60 bytes per 1kb. 93% compression.
If you are a mathematic purist, you've probably already stopped reading this, because like so many others, you are a firm believer in the way things are now. I'm not talking about thinking outside the box, in fact, i'm talking about thinking so far inside the box, that it's like the contents inside the box aren't even aware that the box exists (IE mankind vs the entire universe). Mathematics are said to be the language of the universe. Yet that language is unable to explain more than it can.
Ok, enough rambling from me. I apolgize if i've offended any mathematicians who decide to read further.
The theory:
PI - it's been around for thousands of years, it's even represented in the bible and seen observed in dozens of ancient mathematics.
We've deciphered PI to the 2.3 Trillionth digit past the decimal in our quest to truely find it's end. In my theory we've looked so far into PI already we've seen both our past, present, and future. There are 10 deminsions to mathematics - 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. everything else is just a translation from this system, binary uses 1 and 0, and hexdecimal uses A-F to represent the combinations of 10, 11, 12, 13, 14, and 15.
It truely is a fundamental number to everything about the way we live, and everything about the way we will live in the future.
I hypothesize that if we connect to PI, if we reinvision it's use as a tool instead of a contest to see which computer is the fastest, we can finally overcome dozens of challenges that have prevented us from progressing as a civilization.
Ok, now for some numbers and the heart of how this works:
Taking the first 16 Digits of PI past the decimal.
1415 9265 3589 7932
All decimal digits, comprising of 0-9. Now let's decompose the number into binary representations of yes or no (yes this is the number and no it's not the number)
98 7654 3210
00 0000 0001
00 0000 0010
00 0000 0100
00 0000 1000
00 0001 0000
00 0010 0000
00 0100 0000
00 1000 0000
01 0000 0000
10 0000 0000
(in hex)
001 applied to 1415 9265 3589 7932 = 0000 0000 0000 0000
002 applied to 1415 9265 3589 7932 = 1010 0000 0000 0000
004 applied to 1415 9265 3589 7932 = 0100 0000 0000 0000
008 applied to 1415 9265 3589 7932 = 0000 0000 1000 0010
010 applied to 1415 9265 3589 7932 = 0100 0000 0000 0000
020 applied to 1415 9265 3589 7932 = 0001 0001 0100 0000
040 applied to 1415 9265 3589 7932 = 0000 0010 0000 0000
080 applied to 1415 9265 3589 7932 = 0000 0000 0000 1000
100 applied to 1415 9265 3589 7932 = 0000 0000 0010 0000
200 applied to 1415 9265 3589 7932 = 0000 1000 0001 0100
I know this doesn't seem impressive, in fact it's quite simple. Although, if you
apply the same method to the first 8192 digits of PI tho, and expand the results to include not just the primary 10 examples above but all 1022 possiblities of 001-3FE (000 and 3FF would produce all 1's) and you've created 1022 x 8192 bits (1kb) = 1022 KB of data, that did not previously exist, and all you would need to access any of those strings of data is the address where to go into PI (1 place) and which 10 bit key to use to retrieve the 1kb. So regardless of the random nature of the first 8192 digits of PI, we've just made a system that would allow you to compress those strings of data into less than 2 bytes. Let's expand this a bit:
Let's say we capped ourselves at 40,008,191 digits of PI, that's 40,000,000 sets of data, that we can retrieve 10 base keys for each 8192 bit sample (shifting over 1 bit per sample range) So we have a sample data size that is
~20MB. And we have 400,000,000 1kb templates. It is safe to assume that we can create any string of 1kb using the templates found from this system. And all we would need is about 6 bytes per template for addressing. And to limit ourselves to using no more than 10 templates to create a 1kb segment. Even if we didn't limit ourselves to using only 10 templates we could use a little over a hundred before we ever reached the initial size of the input. I know there will probably be a large amount of data that we can't create using this system. But that doesnt' mean this system shouldn't be explored. There are alot of open ends, and billions of possiblities. Such as: couldn't we try to compress the addresses that have already be "translated" down to 60 bytes? then apply the system all over again? does this unlock recursive compression? wouldn't that break mathematical laws in place? does this already prove that mathematical laws in place are bogus? Does this system use any real mathematics anyways?
In nature when 2 beings of extreme complexity mate - you get another being of extreme complexity that is completely different in every way from the parents - however, they still maintain the basic characteristics of the species, etc.
All i'm saying is, maybe this system, altho seemingly impossible to implement is in the same boat that the ill fated desktop computer, laptop, cell phone, electric car, and cold fusion where/are in.
Thank you for your time.
William Dover
Compression Program Theory - need an honest opinion
Started by Lightdark2, Jan 17 2009 06:00 PM
29 replies to this topic
#1
Posted 17 January 2009 - 06:00 PM
|
|
|
#2
Posted 17 January 2009 - 07:08 PM
Aside from the fact that it won't work, it's very nice. Here is the fundamental problem: there are (2^8)^1024 different possible 1kB files. There are (2^8)^61-1 different files less than 61 Bytes. Your desire to use PI is nice, but I suspect you haven't accounted for the necessary space to find the digits you need. I think you're assuming that the digits will accommodate what you want. I'm sorry, but they just won't.
#3
Posted 18 January 2009 - 07:45 PM
Ok, i see your point. And it's true, the system cannot make every possiblity, but it was never designed to make every possibility, it will only create instances of purely random sequences. Sequences that cannot be compressed via normal compression algorithms based on mathematic principles. Because PI is a purely random number, if i where to create 10 sequences per every 8192 digit sample, for the first 40,008,191 digits of PI. I would then have 400,081,910 x 1kb of data that would be completly accessible via 6 bytes of addressing. That's a drop in the bucket compared to the possibilities of merging each of those sequences with each other and melding them out into unique strands up to 60 bytes (10 strings combined),
(1 string + 1 string) = 12 bytes of addressing
(1 string + 1 string + 1 string) = 18 bytes of addressing
(1 string + 1 string + 1 string + 1 string) = 24 bytes of addressing
etc, it'll basically exponentially increase until all unique strands are found and addresses are compiled.
Will it take alot of resources = yes
What doesnt' take alot of resources?
And also, let's not restrict ourselves to PI as the only number this system will work with, Part of that 6 bytes of addressing allows for a pattern switch. PI, 1/PI, e, etc. Any infinitely long random number that a home based computer can easily manifest the first 40 million digits of, is fair game. The addresses of them all will still remain 6 bytes per 1kb strand. Merge some variables from PI with e, and who knows how many unique combinations are found, the possibilities are as limitless as 2^8192 power (of course only the ones that are random tho).
-WD
(1 string + 1 string) = 12 bytes of addressing
(1 string + 1 string + 1 string) = 18 bytes of addressing
(1 string + 1 string + 1 string + 1 string) = 24 bytes of addressing
etc, it'll basically exponentially increase until all unique strands are found and addresses are compiled.
Will it take alot of resources = yes
What doesnt' take alot of resources?
And also, let's not restrict ourselves to PI as the only number this system will work with, Part of that 6 bytes of addressing allows for a pattern switch. PI, 1/PI, e, etc. Any infinitely long random number that a home based computer can easily manifest the first 40 million digits of, is fair game. The addresses of them all will still remain 6 bytes per 1kb strand. Merge some variables from PI with e, and who knows how many unique combinations are found, the possibilities are as limitless as 2^8192 power (of course only the ones that are random tho).
-WD
#4
Posted 19 January 2009 - 09:00 AM
In what sense are you saying that PI is a purely random number? If you mean that the sequence of digits exhibits patterns consistent with random sequences of digits, I can see your point. Unfortunately, that doesn't necessarily mean that you have a good match within PI for the data you are trying to compress. Also, once you start looking at various functions of e and PI, you start adding to the complexity of both the search and the space to store that function.
I would be interested in seeing a working example of your idea. I suspect you will find that the devil's in the details of actually getting it to work.
I would be interested in seeing a working example of your idea. I suspect you will find that the devil's in the details of actually getting it to work.
#5
Posted 19 January 2009 - 02:12 PM
It's not really about matching strings to already existing data. It's about forming as many different strings as possible, using the least amount of data for addressing. And even if it wasn't used for general compression practices consider the following:
Let's say i use 4 different strings of data, and only the first 8 digits of each string.
00 = PI
01 = 1/PI
10 = e
11 = 1/e
let's say we only use even numbers as 0's and odd numbers are 1's and switch them with 1 bit, 0=even 1 = odd as 1's
000 = first 8 digits of PI, odd numbers are 0, even numbers are 1
001 = first 8 digits of PI, odd numbers are 1, even numbers are 0
010 = first 8 digits of 1/PI, odd numbers are 0, even numbers are 1
011 = first 8 digits of 1/PI, odd numbers are 1, even numbers are 0
100 = first 8 digits of e, odd numbers are 0, even numbers are 1
101 = first 8 digits of e, odd numbers are 1, even numbers are 0
110 = first 8 digits of 1/e, odd numbers are 0, even numbers are 1
111 = first 8 digits of 1/e, odd numbers are 1, even numbers are 0
PI = 1415 9265
1/PI = 3183 0988
e = 7182 8182
1/e = 3678 7944
000 result = 0100 0110 in hex = 46
001 result = 1011 1001 in hex = B9
010 result = 0010 1011 in hex = 2B
011 result = 1101 0100 in hex = D4
100 result = 0011 1011 in hex = 3B
101 result = 1100 0100 in hex = C4
110 result = 0101 0011 in hex = 53
111 result = 1010 1100 in hex = AC
Very simple and to the point. We now have 8 strings of 1 byte that can be compressed to 3 bits indefinitely.
Let's alter it a little bit. Let's make the program cycle through.
02468 and 13579
x0000 x0000
0x000 0x000
00x00 00x00
000x0 000x0
0000x 0000x
So as that
PI = 1415 9265
1/PI = 3183 0988
e = 7182 8182
1/e = 3678 7944
000 result =
all = 0100 0110 = 46 in hex
0only=0000 0000 = 00 in hex
2only=0000 0100 = 04 in hex
4only=0100 0000 = 40 in hex
6only=0000 0010 = 02 in hex
8only=0000 0000 = 00 in hex
001 result =
all=1011 1001 = B9 in hex
1only=1010 0000 = A0 in hex
3only=0000 0000 = 00 in hex
5only=0001 0001 = 11 in hex
7only=0000 0000 = 00 in hex
9only=0000 1000 = 08 in hex
010 result =
all=0010 1011 = 2B in hex
0only=0000 1000 = 08 in hex
2only=0000 0000 = 00 in hex
4only=0000 0000 = 00 in hex
6only=0000 0000 = 00 in hex
8only=0010 0011 = 23 in hex
011 result =
all=1101 0100 = D4 in hex
1only=0100 0000 = 40 in hex
3only=1001 0000 = 90 in hex
5only=0000 0000 = 00 in hex
7only=0000 0000 = 00 in hex
9only=0000 0100 = 04 in hex
100 result =
all=0011 1011 = 3B in hex
0only=0000 0000 = 00 in hex
2only=0001 0001 = 11 in hex
4only=0000 0000 = 00 in hex
6only=0000 0000 = 00 in hex
8only=0010 1010 = 2A in hex
101 result =
all=1100 0100 = C4 in hex
1only=0100 0100 = 44 in hex
3only=0000 0000 = 00 in hex
5only=0000 0000 = 00 in hex
7only=1000 0000 = 80 in hex
9only=0000 0000 = 00 in hex
110 result =
all=0101 0011 = 53 in hex
0only=0000 0000 = 00 in hex
2only=0000 0000 = 00 in hex
4only=0000 0011 = 03 in hex
6only=0100 0000 = 40 in hex
8only=0001 0000 = 10 in hex
111 result =
all=1010 1100 = AC in hex
1only=0000 0000 = 00 in hex
3only=1000 0000 = 80 in hex
5only=0000 0000 = 00 in hex
7only=0010 1000 = 28 in hex
9only=0000 0100 = 04 in hex
So now we have 8 variables with 6 sub variables
Now the way the program should cycle through the sub variables is pretty easy.
let's say i had 04 in my data, and i wanted to compress it. ok, now the trick to this is to divide the data up into timed segments, let's say we group all the data in the file into groups of 10, and this particular 04 appears in the fourth position in the 10 segment slice it's in:
xx xx xx 04 xx xx xx xx xx xx
now the program won't know that the data in the fourth position is approximately 04. but what it will know is that the 2 byte checksum for entire 10 byte segment is 28 19 so now we need to store the checksum of 2 bytes for each 10 segment slice of data. (this is just an example the segment size could easily be much larger) so we find one of our 8 base variables that has an 04 segment. in this case we'll pick the first one that comes up 000 - the 2only sub-variable is 04. Now because we have to have a 2 byte checksum stored for the 10 segements, we'll always have at least (16bits+30 bits)/80bits for every 10 byte segment. But we'll decrease the overall size of the file by ~42%
23 11 28 04 2A 90 44 03 46 AC
23 = 010
11 = 100
28 = 111
04 = 000
2A = 100
90 = 011
44 = 101
03 = 110
46 = 000
AC = 111
of course this is just sample data, there a dozen ways this method could be interpretted and used. Just something to think about i guess, i'm only one person, there's no way i could think of every hang up, if i could i guess i wouldnt' be asking for opinions =)
-WD
Let's say i use 4 different strings of data, and only the first 8 digits of each string.
00 = PI
01 = 1/PI
10 = e
11 = 1/e
let's say we only use even numbers as 0's and odd numbers are 1's and switch them with 1 bit, 0=even 1 = odd as 1's
000 = first 8 digits of PI, odd numbers are 0, even numbers are 1
001 = first 8 digits of PI, odd numbers are 1, even numbers are 0
010 = first 8 digits of 1/PI, odd numbers are 0, even numbers are 1
011 = first 8 digits of 1/PI, odd numbers are 1, even numbers are 0
100 = first 8 digits of e, odd numbers are 0, even numbers are 1
101 = first 8 digits of e, odd numbers are 1, even numbers are 0
110 = first 8 digits of 1/e, odd numbers are 0, even numbers are 1
111 = first 8 digits of 1/e, odd numbers are 1, even numbers are 0
PI = 1415 9265
1/PI = 3183 0988
e = 7182 8182
1/e = 3678 7944
000 result = 0100 0110 in hex = 46
001 result = 1011 1001 in hex = B9
010 result = 0010 1011 in hex = 2B
011 result = 1101 0100 in hex = D4
100 result = 0011 1011 in hex = 3B
101 result = 1100 0100 in hex = C4
110 result = 0101 0011 in hex = 53
111 result = 1010 1100 in hex = AC
Very simple and to the point. We now have 8 strings of 1 byte that can be compressed to 3 bits indefinitely.
Let's alter it a little bit. Let's make the program cycle through.
02468 and 13579
x0000 x0000
0x000 0x000
00x00 00x00
000x0 000x0
0000x 0000x
So as that
PI = 1415 9265
1/PI = 3183 0988
e = 7182 8182
1/e = 3678 7944
000 result =
all = 0100 0110 = 46 in hex
0only=0000 0000 = 00 in hex
2only=0000 0100 = 04 in hex
4only=0100 0000 = 40 in hex
6only=0000 0010 = 02 in hex
8only=0000 0000 = 00 in hex
001 result =
all=1011 1001 = B9 in hex
1only=1010 0000 = A0 in hex
3only=0000 0000 = 00 in hex
5only=0001 0001 = 11 in hex
7only=0000 0000 = 00 in hex
9only=0000 1000 = 08 in hex
010 result =
all=0010 1011 = 2B in hex
0only=0000 1000 = 08 in hex
2only=0000 0000 = 00 in hex
4only=0000 0000 = 00 in hex
6only=0000 0000 = 00 in hex
8only=0010 0011 = 23 in hex
011 result =
all=1101 0100 = D4 in hex
1only=0100 0000 = 40 in hex
3only=1001 0000 = 90 in hex
5only=0000 0000 = 00 in hex
7only=0000 0000 = 00 in hex
9only=0000 0100 = 04 in hex
100 result =
all=0011 1011 = 3B in hex
0only=0000 0000 = 00 in hex
2only=0001 0001 = 11 in hex
4only=0000 0000 = 00 in hex
6only=0000 0000 = 00 in hex
8only=0010 1010 = 2A in hex
101 result =
all=1100 0100 = C4 in hex
1only=0100 0100 = 44 in hex
3only=0000 0000 = 00 in hex
5only=0000 0000 = 00 in hex
7only=1000 0000 = 80 in hex
9only=0000 0000 = 00 in hex
110 result =
all=0101 0011 = 53 in hex
0only=0000 0000 = 00 in hex
2only=0000 0000 = 00 in hex
4only=0000 0011 = 03 in hex
6only=0100 0000 = 40 in hex
8only=0001 0000 = 10 in hex
111 result =
all=1010 1100 = AC in hex
1only=0000 0000 = 00 in hex
3only=1000 0000 = 80 in hex
5only=0000 0000 = 00 in hex
7only=0010 1000 = 28 in hex
9only=0000 0100 = 04 in hex
So now we have 8 variables with 6 sub variables
Now the way the program should cycle through the sub variables is pretty easy.
let's say i had 04 in my data, and i wanted to compress it. ok, now the trick to this is to divide the data up into timed segments, let's say we group all the data in the file into groups of 10, and this particular 04 appears in the fourth position in the 10 segment slice it's in:
xx xx xx 04 xx xx xx xx xx xx
now the program won't know that the data in the fourth position is approximately 04. but what it will know is that the 2 byte checksum for entire 10 byte segment is 28 19 so now we need to store the checksum of 2 bytes for each 10 segment slice of data. (this is just an example the segment size could easily be much larger) so we find one of our 8 base variables that has an 04 segment. in this case we'll pick the first one that comes up 000 - the 2only sub-variable is 04. Now because we have to have a 2 byte checksum stored for the 10 segements, we'll always have at least (16bits+30 bits)/80bits for every 10 byte segment. But we'll decrease the overall size of the file by ~42%
23 11 28 04 2A 90 44 03 46 AC
23 = 010
11 = 100
28 = 111
04 = 000
2A = 100
90 = 011
44 = 101
03 = 110
46 = 000
AC = 111
of course this is just sample data, there a dozen ways this method could be interpretted and used. Just something to think about i guess, i'm only one person, there's no way i could think of every hang up, if i could i guess i wouldnt' be asking for opinions =)
-WD
#6
Posted 19 January 2009 - 02:31 PM
Here's the biggest problem I see:
You are not working from the string to be compressed to the compressed form, but instead creating compressed forms with meaningful uncompressed forms.
You found 8 bytes that could be compressed in your initial example. Searching through random data, there are 256 possibly bytes, which should appear with equal probability. So, for 1kB, we would expect to have approximately 32 matches, saving us 32x5 = 160 bits or 20 Bytes. However, we need to note where we have compressed data. That will introduce a new overhead, as we can't readily tell which 3 bit substrings of the compressed data are compressed data and which are just raw (uncompressed) data. I would be surprised if the 20 Bytes saved don't get chewed up by the simple process of noting where the savings are for decompression.
Based on my calculations, on truly random data, I don't see where you are getting 42% compression. I'm seeing less than a 1% reduction in data size, at best. I'll be happy to write a short application that will generate small amounts of "noise data" to do the compression against. What I would be looking for in a sample is two programs (compress, decompress) that consistently:
1) get significant compression on the noise data
2) can decompress the compressed data
3) do not have to be retooled for each new data set
My prediction is that you will discover that the additional data you have to add to enable decompression is almost always larger than the size saved in the compressed form of the data.
Also, please understand that I'm not trying to be difficult. I've had the experience of having a great idea (more than once) that I had convinced myself of. The problem was when I tried to make it work in practice, it simply didn't pan out. Implementation of an idea is a great way to discover blind spots and oversights.
I can see the appeal of using randomness to fight a problem of randomness. In principle, it should be possible to compress a file that exhibits suitable signs of being random data by finding the seed that would generate that data. Finding such a seed may or may not be possible, however, and certainly would not be practical.
You are not working from the string to be compressed to the compressed form, but instead creating compressed forms with meaningful uncompressed forms.
You found 8 bytes that could be compressed in your initial example. Searching through random data, there are 256 possibly bytes, which should appear with equal probability. So, for 1kB, we would expect to have approximately 32 matches, saving us 32x5 = 160 bits or 20 Bytes. However, we need to note where we have compressed data. That will introduce a new overhead, as we can't readily tell which 3 bit substrings of the compressed data are compressed data and which are just raw (uncompressed) data. I would be surprised if the 20 Bytes saved don't get chewed up by the simple process of noting where the savings are for decompression.
Based on my calculations, on truly random data, I don't see where you are getting 42% compression. I'm seeing less than a 1% reduction in data size, at best. I'll be happy to write a short application that will generate small amounts of "noise data" to do the compression against. What I would be looking for in a sample is two programs (compress, decompress) that consistently:
1) get significant compression on the noise data
2) can decompress the compressed data
3) do not have to be retooled for each new data set
My prediction is that you will discover that the additional data you have to add to enable decompression is almost always larger than the size saved in the compressed form of the data.
Also, please understand that I'm not trying to be difficult. I've had the experience of having a great idea (more than once) that I had convinced myself of. The problem was when I tried to make it work in practice, it simply didn't pan out. Implementation of an idea is a great way to discover blind spots and oversights.
I can see the appeal of using randomness to fight a problem of randomness. In principle, it should be possible to compress a file that exhibits suitable signs of being random data by finding the seed that would generate that data. Finding such a seed may or may not be possible, however, and certainly would not be practical.
#7
Posted 19 January 2009 - 04:20 PM
It is very difficult to express most of my thinking into plain text for all to read and understand. I wish i had the knowledge to program so i could find for myself whether or not an idea worked before making myself appear idiodic.
I really do appreciate your input Winged. And if you are willing to go through the trouble of making a program to prove/disprove the feasiblity, I welcome your findings.
I guess the easisit way to do it, if it where me programming this. Would be to take the first 40 million digits of each string (PI, 1/PI, e, and 1/e) (approx 20MB appiece)
And use all 1022 keys on each string creating 1022 unique 20MB strands (if you have the disk space, could always try this with just one string)
Then create a 2 part checksum (2 seperate 4 byte hashes ADLER32 and CRC32) for each 8192 binary set, (moving over 1 binary digit per set, this part will also take alot of disk space) After your checksum table is complete, then try to get a checksum match for any input string of 8192 bits (created with a random data generator, or from real data)
That would be the most thorough way to test the system. Altho resource intensive, it would not take long to program i imagine. Tho PI will be the easist to obtain the first 40 million digits of.
-WD
I really do appreciate your input Winged. And if you are willing to go through the trouble of making a program to prove/disprove the feasiblity, I welcome your findings.
I guess the easisit way to do it, if it where me programming this. Would be to take the first 40 million digits of each string (PI, 1/PI, e, and 1/e) (approx 20MB appiece)
And use all 1022 keys on each string creating 1022 unique 20MB strands (if you have the disk space, could always try this with just one string)
Then create a 2 part checksum (2 seperate 4 byte hashes ADLER32 and CRC32) for each 8192 binary set, (moving over 1 binary digit per set, this part will also take alot of disk space) After your checksum table is complete, then try to get a checksum match for any input string of 8192 bits (created with a random data generator, or from real data)
That would be the most thorough way to test the system. Altho resource intensive, it would not take long to program i imagine. Tho PI will be the easist to obtain the first 40 million digits of.
-WD
#8
Posted 20 January 2009 - 05:55 AM
I'm not sure I have a clear enough understanding of the proposed mechanisms to be able to start implementing this. My sense is that you have not fully specified the output file format (I could be wrong). Without that basic information, I'll be hard pressed to make anything for you.
While expressing some of these ideas clearly may be difficult, it has to be done. Whether you end up coding it or someone else does, the process cannot intelligently start without that. Also, until the ideas are stated clearly, it is almost impossible to start probing it for flaws. My training is in mathematics, and one of the things I encountered repeatedly was that I could come up with a good strategy for solving a problem, but getting all the details right was hard. If the details weren't quite right, the whole thing fell apart. I've experienced the same thing in programming. When migrating data from one system to another, for example, if I don't convert every single item correctly, I end up with missing data (at best) or the program crashing (at worst). Even the tutorials I write need lots of checking. I've had tutorial code crash on a regular basis before I get everything worked out in detail.
I know getting all the details down is hard, but it is necessary. What I suspect you will discover is that there are some key flaws in your assumptions. For example, what properties do the digits of pi have to have for your theory to work? Do the digits of pi actually exhibit those properties?
While expressing some of these ideas clearly may be difficult, it has to be done. Whether you end up coding it or someone else does, the process cannot intelligently start without that. Also, until the ideas are stated clearly, it is almost impossible to start probing it for flaws. My training is in mathematics, and one of the things I encountered repeatedly was that I could come up with a good strategy for solving a problem, but getting all the details right was hard. If the details weren't quite right, the whole thing fell apart. I've experienced the same thing in programming. When migrating data from one system to another, for example, if I don't convert every single item correctly, I end up with missing data (at best) or the program crashing (at worst). Even the tutorials I write need lots of checking. I've had tutorial code crash on a regular basis before I get everything worked out in detail.
I know getting all the details down is hard, but it is necessary. What I suspect you will discover is that there are some key flaws in your assumptions. For example, what properties do the digits of pi have to have for your theory to work? Do the digits of pi actually exhibit those properties?
#9
Posted 20 January 2009 - 05:30 PM
My thinking (and this will probably resemble the ramblings of a lunatic) is as follows:
When i first started playing around with this i wanted to create a compression method that used the idea - that all pieces of data will eventually be found within PI at some point.
And the best way for me to describe the sensation of wanting to create something like that would be this example:
Think of a single 3 letter word - car.
Now in your mind, think of every single possible thing you can think about relating them all to that one single word. Images, Video, Names, parts, you can generate billions and billions of pieces of knowledge and draw them from your mind, from one single word. Now i'm not sure if that's true compression, but it is a system of addressing that our own minds use to draw out meaningful data, related to a single thought. Now if the human mind is just a bio-chemical computer, it should be systematically possible to do the same thing with a normal computer system. Now when we "draw" data from our minds, we aren't thinking of mathematical equations to do it, we're simply doing it. No middleman.
Think of all the information in your mind, streaming from a finite stream of data held not within individual memory cells in your mind. But from code stored directly in your DNA. Now that code may just be a simple equation that is processed out via the synapsis in your brain like a normal computer processing binary. But when it makes a match to something stored within that code to an actual thought you've had - then the location within the code where that thought was found is stored, not the actual thought - hense the recall mechanism. which sadly is not perfect - you may notice a recalled memory is malleable - meaning it doesn't always contain the whole truth - just what you recall on the fly - now the more you concentrate on it (the more you let the patterns fill in from the code) the clearer your memory of an event may become. The next point to that is the human minds "boot-mode" direct memory access = hypnosis. Which allows the mind to stop all other functions but basic ones - and concentrate on regenerating an image or an instance stored to near 100% quality. So even the powerful human mind - needs a near stop to process out the code completely.
We have a few benefits over the human mind:
1) all data generated from this system is pure, no "imagined" information
2) no background programs running (the 5 senses, etc) while trying to process information
3) a systematic approach to the entire system, instead of on the fly storage - as found with the mind (sometimes short-term memory will fail to make it completely into long term) meaning the human mind has limited short term (ram) and quite a bit of Long term (rom). And the process of uploading the ram to rom is REM sleep.
I'm sure there are an infiinte amount of other variables in the human brain scenario, but that's just my take on why i wanted to create this system like this.
So basically the first draft of this idea looked at only decimal digit randomization - and by looking at the data obtained by kanada labratories (the japanese lab that calculated PI out to the 2.3 trillionth digit) every single digit appears at roughly the same frequency, no matter how big the sample size they chose was. Which was perfect for my idea. Also the fact that PI has way more sample data to pick through than any other sequence.
So basically i found myself needing a way to isolate the frequence of each indivicual decimal digit. And that was easy, just use a binary switch scenario. 0 for off, 1 for on. I found that i could create 1022 unique keys of any number of sample sizes within PI. 32, 64, 128, 256 etc digits. Now, the reason i've limited myself to only 40 million digits as a beginning frame for the entire program is because it takes a normal home computer, running at 2 ghz, a little under 10 minutes to manifest all 40 million digits from the equation flawlessly. Meaning if a program where ever distributed it could simply be the program that calculated 40 million digits of PI into memory. (20MB) Also the fact that i realized i could use around 7 bytes per sample (8192 digit sample)
00 00 00 00 00 00 00
16x16x16x16x16x16x16 = 268,435,456
So if i went to the first digit in PI past the decimal, and wanted the sample for code 281 (hex) out of 3FE (hex) possible codes. My address would look like
F0 00 00 00 00 12 81
The F at the beginning is the control Bit, it's purpose is to tell the program whether the address is solo, or joint. an F tells the program it's solo, an E tells the program it's additive of the precending address (meaning they are melded together, 1+1=1, 1+0 = 1, 0+1 = 1, 0+0 = 0 style) a D tells the program they are additive of the preceding address (meaning they are melded together 1+1=1, 1+0 = 0, 0+1 = 1, and 0+0 = 0 style) so on and so forth. it is possible to chain together 10 address links per the program. Later revisions may find a better way, but testing is required. Also the program should have a native function that recognizes strings.
F0 00 00 00 00 10 01 E0 00 00 00 00 00 10 02
for instance could be reduced to = F0 00 00 00 00 10 03
Naturally decompression of such a system would require very little real mathematic properties, simple search and translation routines. Meaning processing of these streams would be very fast for any system.
The upside to the entire thing = Data will become a fraction of a size smaller - RAM would be magnitudes bigger than ROM - and the simplistic nature really puts multiple core systems to work.
The downside - there would have to be a major facility dedicated to archiving each and every result of the system for use. The system is repeatable accross the 1/PI, e, and 1/e universes. and 80MB vs 20MB for system usage is a drop in the bucket for the ability to blend PI templates with e and 1/PI templates together for even more results.
The way the results are stored within the system would be simple as well. We'll just have to rely on hash checksum for pre-matching of inputs. let's say ADLER32 and CRC32 just for simple matching. That's up to 70 bytes for the addresss + 8 bytes for the Checksums for every value found by the system.
I'm sure by now i've probably lost you again, but that's all my thinking in text. whether anyone else can understand it but me, is a question i cannot answer. Again thank you for your time.
-WD
When i first started playing around with this i wanted to create a compression method that used the idea - that all pieces of data will eventually be found within PI at some point.
And the best way for me to describe the sensation of wanting to create something like that would be this example:
Think of a single 3 letter word - car.
Now in your mind, think of every single possible thing you can think about relating them all to that one single word. Images, Video, Names, parts, you can generate billions and billions of pieces of knowledge and draw them from your mind, from one single word. Now i'm not sure if that's true compression, but it is a system of addressing that our own minds use to draw out meaningful data, related to a single thought. Now if the human mind is just a bio-chemical computer, it should be systematically possible to do the same thing with a normal computer system. Now when we "draw" data from our minds, we aren't thinking of mathematical equations to do it, we're simply doing it. No middleman.
Think of all the information in your mind, streaming from a finite stream of data held not within individual memory cells in your mind. But from code stored directly in your DNA. Now that code may just be a simple equation that is processed out via the synapsis in your brain like a normal computer processing binary. But when it makes a match to something stored within that code to an actual thought you've had - then the location within the code where that thought was found is stored, not the actual thought - hense the recall mechanism. which sadly is not perfect - you may notice a recalled memory is malleable - meaning it doesn't always contain the whole truth - just what you recall on the fly - now the more you concentrate on it (the more you let the patterns fill in from the code) the clearer your memory of an event may become. The next point to that is the human minds "boot-mode" direct memory access = hypnosis. Which allows the mind to stop all other functions but basic ones - and concentrate on regenerating an image or an instance stored to near 100% quality. So even the powerful human mind - needs a near stop to process out the code completely.
We have a few benefits over the human mind:
1) all data generated from this system is pure, no "imagined" information
2) no background programs running (the 5 senses, etc) while trying to process information
3) a systematic approach to the entire system, instead of on the fly storage - as found with the mind (sometimes short-term memory will fail to make it completely into long term) meaning the human mind has limited short term (ram) and quite a bit of Long term (rom). And the process of uploading the ram to rom is REM sleep.
I'm sure there are an infiinte amount of other variables in the human brain scenario, but that's just my take on why i wanted to create this system like this.
So basically the first draft of this idea looked at only decimal digit randomization - and by looking at the data obtained by kanada labratories (the japanese lab that calculated PI out to the 2.3 trillionth digit) every single digit appears at roughly the same frequency, no matter how big the sample size they chose was. Which was perfect for my idea. Also the fact that PI has way more sample data to pick through than any other sequence.
So basically i found myself needing a way to isolate the frequence of each indivicual decimal digit. And that was easy, just use a binary switch scenario. 0 for off, 1 for on. I found that i could create 1022 unique keys of any number of sample sizes within PI. 32, 64, 128, 256 etc digits. Now, the reason i've limited myself to only 40 million digits as a beginning frame for the entire program is because it takes a normal home computer, running at 2 ghz, a little under 10 minutes to manifest all 40 million digits from the equation flawlessly. Meaning if a program where ever distributed it could simply be the program that calculated 40 million digits of PI into memory. (20MB) Also the fact that i realized i could use around 7 bytes per sample (8192 digit sample)
00 00 00 00 00 00 00
16x16x16x16x16x16x16 = 268,435,456
So if i went to the first digit in PI past the decimal, and wanted the sample for code 281 (hex) out of 3FE (hex) possible codes. My address would look like
F0 00 00 00 00 12 81
The F at the beginning is the control Bit, it's purpose is to tell the program whether the address is solo, or joint. an F tells the program it's solo, an E tells the program it's additive of the precending address (meaning they are melded together, 1+1=1, 1+0 = 1, 0+1 = 1, 0+0 = 0 style) a D tells the program they are additive of the preceding address (meaning they are melded together 1+1=1, 1+0 = 0, 0+1 = 1, and 0+0 = 0 style) so on and so forth. it is possible to chain together 10 address links per the program. Later revisions may find a better way, but testing is required. Also the program should have a native function that recognizes strings.
F0 00 00 00 00 10 01 E0 00 00 00 00 00 10 02
for instance could be reduced to = F0 00 00 00 00 10 03
Naturally decompression of such a system would require very little real mathematic properties, simple search and translation routines. Meaning processing of these streams would be very fast for any system.
The upside to the entire thing = Data will become a fraction of a size smaller - RAM would be magnitudes bigger than ROM - and the simplistic nature really puts multiple core systems to work.
The downside - there would have to be a major facility dedicated to archiving each and every result of the system for use. The system is repeatable accross the 1/PI, e, and 1/e universes. and 80MB vs 20MB for system usage is a drop in the bucket for the ability to blend PI templates with e and 1/PI templates together for even more results.
The way the results are stored within the system would be simple as well. We'll just have to rely on hash checksum for pre-matching of inputs. let's say ADLER32 and CRC32 just for simple matching. That's up to 70 bytes for the addresss + 8 bytes for the Checksums for every value found by the system.
I'm sure by now i've probably lost you again, but that's all my thinking in text. whether anyone else can understand it but me, is a question i cannot answer. Again thank you for your time.
-WD
#10
Posted 20 January 2009 - 07:35 PM
Here are the problems I see:
1) Hashes have collisions, so a given compressed chunk has no guarantee of having a unique decoding.
2) If the compression/decompression takes longer than downloading a file on dialup, then there has been no meaningful savings.
3) You're talking about 7 bytes for addressing. If you need to store multiple addresses for the file, that can explode. (I'm not clear on this section at all)
4) I still haven't seen a clear indication that your claim of great space savings will work.
5) I'm not certain what is presented could be used as a specification for the algorithm. I'm thinking in terms of: given string XXXXXX in the uncompressed file, locate string YYYYYY in pi and record ZZZZZZ in the compressed file. Then given ZZZZZ in the compressed file index it against YYYYYY in pi to decompress it to XXXXXX.
1) Hashes have collisions, so a given compressed chunk has no guarantee of having a unique decoding.
2) If the compression/decompression takes longer than downloading a file on dialup, then there has been no meaningful savings.
3) You're talking about 7 bytes for addressing. If you need to store multiple addresses for the file, that can explode. (I'm not clear on this section at all)
4) I still haven't seen a clear indication that your claim of great space savings will work.
5) I'm not certain what is presented could be used as a specification for the algorithm. I'm thinking in terms of: given string XXXXXX in the uncompressed file, locate string YYYYYY in pi and record ZZZZZZ in the compressed file. Then given ZZZZZ in the compressed file index it against YYYYYY in pi to decompress it to XXXXXX.
#11
Posted 21 January 2009 - 04:06 AM
Response:
1) Hashes will only be used as a device in the main database for quick searching - they will not actually be compressing anything.
2) compression will not happen on a user end system - it will only be done on files submitted for compression. Decompression is the only thing end users will be able to do - and they will be able to do it very quickly.
3) You will only need 7 bytes of addressing per 1kb in a file, up to 70 bytes per 1kb in a file, so a 1MB file will use between 7168 bytes and 71680 bytes for addressing. The addresses themselves can still be compressed normally (in fact once they are compressed you can attempt to run the system again on the resulting compressed file)
4) I'm sorry i'm trying to be as clear as possible, perhaps i should type out a full example.
5) Your thinking of a mathematical means to program this. When there is no real math invovled. Sure there are a number of formulas at work, but the transformation of the data doesn't involve any direct mathematical manipulation.
-WD
1) Hashes will only be used as a device in the main database for quick searching - they will not actually be compressing anything.
2) compression will not happen on a user end system - it will only be done on files submitted for compression. Decompression is the only thing end users will be able to do - and they will be able to do it very quickly.
3) You will only need 7 bytes of addressing per 1kb in a file, up to 70 bytes per 1kb in a file, so a 1MB file will use between 7168 bytes and 71680 bytes for addressing. The addresses themselves can still be compressed normally (in fact once they are compressed you can attempt to run the system again on the resulting compressed file)
4) I'm sorry i'm trying to be as clear as possible, perhaps i should type out a full example.
5) Your thinking of a mathematical means to program this. When there is no real math invovled. Sure there are a number of formulas at work, but the transformation of the data doesn't involve any direct mathematical manipulation.
-WD
#12
Posted 21 January 2009 - 07:12 AM
5) I'm not thinking mathematical, but algorithmic.


Sign In
Create Account


Back to top









