Hey!
I have a program in which I need to hash based on multiple values. I'll give an example ; say I have 5 arrays :
A = [1,2,3,4,5]
B = [4,7,2,7,3]
C = [1,5,3,4,8]
D = [2,7,1,5,3]
E = [5,4,2,3,1]
I would like to make a function that assigns the same values (hash values) to those arrays, which contain similar numbers at similar locations (if not same values, nearby values).
In the above example, I would want A and C in the same bucket, B and D in the same bucket. However, E would not be in any common bucket as others.
Assume I can make any number of buckets, I have about 1 million arrays, and the number of items in each array is around 100 to 1000, and the values of these items ranges from 1 to any large number.
Basically, I need to hash similar arrays to the same bucket, for locality sensitive checking for similarity of arrays.
Any help would be appreciated, I'm completely lost :| :crying:
Hashing : Need function/algorithm!
Started by ReaveR, May 24 2010 09:06 PM
8 replies to this topic
#1
Posted 24 May 2010 - 09:06 PM
|
|
|
#2
Posted 25 May 2010 - 07:08 PM
Why do you want A and C in the same bucket? The simplest possible hash would be to sum the elements in each array.
#3
Posted 25 May 2010 - 11:37 PM
well A[0] = C[0], A[2] = C[2] and A[3] = C[3], they are pretty similar to each other.
I want to hash similar ones to the same bucket.
The sum method fails, because [0,0,0,0,10] and [1,2,3,4,0] hash to the same bucket, I don't want that at all.
I want to hash similar ones to the same bucket.
The sum method fails, because [0,0,0,0,10] and [1,2,3,4,0] hash to the same bucket, I don't want that at all.
#4
Posted 26 May 2010 - 10:38 AM
There is a huge difference between "x data members the same" and "same hash". Even saying "if two data members match, they should be in the same bucket" doesn't work, because of an example like [1, 2, 3, 4, 5], [1, 2, 6, 7, 8], [7, 8, 9, 10, 11]: The first and second should be in the same bucket, the second and third should be in the same bucket, and the first and third should NOT be in the same bucket.
So the question becomes (again), WHY do you want A and C to hash the same? Is it part of an assignment, or just your own desire? Why would you expect any of such a small sample to have the same hash?
So the question becomes (again), WHY do you want A and C to hash the same? Is it part of an assignment, or just your own desire? Why would you expect any of such a small sample to have the same hash?
#5
Posted 26 May 2010 - 11:15 AM
It's a part of an assignment. It's for locality sensitive hashing.
Basically, I have N (a million, etc.) files, and I have to tell which among them is similar to another. I have an algorithm at the end of which each file will be represented by an array (or a list, I'm using python). Each array will have about 5000+ elements, and then to tell which files are similar, I will have to compare these sets. Now comparing one million arrays in couples would take weeks to finish on the fastest computer, hence I need a non-comparison method to get similar sets hashed into buckets.
Once I have buckets, I only have to check within each bucket. For N files, I will have typically more than N buckets, some will be empty, some will have single files, and some will have multiple files. I will only need to do a check on those buckets which have more than a single file (array) in them, to get my answer. This will cut down the process (in theory) to a few hours at the worst case scenario.
I have already got the sets ready in the form of arrays, now I need to separate the arrays into buckets.
However, the problem is a little simplified because you can assume that at maximum each set has 1500 elements (say).
So now that the problem is clear (sorry for not explaining it in full earlier), I need help :|
:confused::confused::confused::confused:
Basically, I have N (a million, etc.) files, and I have to tell which among them is similar to another. I have an algorithm at the end of which each file will be represented by an array (or a list, I'm using python). Each array will have about 5000+ elements, and then to tell which files are similar, I will have to compare these sets. Now comparing one million arrays in couples would take weeks to finish on the fastest computer, hence I need a non-comparison method to get similar sets hashed into buckets.
Once I have buckets, I only have to check within each bucket. For N files, I will have typically more than N buckets, some will be empty, some will have single files, and some will have multiple files. I will only need to do a check on those buckets which have more than a single file (array) in them, to get my answer. This will cut down the process (in theory) to a few hours at the worst case scenario.
I have already got the sets ready in the form of arrays, now I need to separate the arrays into buckets.
However, the problem is a little simplified because you can assume that at maximum each set has 1500 elements (say).
So now that the problem is clear (sorry for not explaining it in full earlier), I need help :|
:confused::confused::confused::confused:
#6
Posted 26 May 2010 - 11:33 AM
Why not have a "key" array that is the parent of each bucket, and do all comparison with the key arrays. If a new array fails to have a match, it becomes the key array for a new bucket?
#7
Posted 26 May 2010 - 12:29 PM
That is again based on comparisons. However, it is still very helpful, thanks. It reduces the complexity from O(C(n,2)) to O(n^2). However, if there could be any way to make this free of comparisons, then that would be what I am searching for.
Thanks for the solution, I've been brainstorming for so long, can't believe this didn't strike me. Simple and elegant, thanks so much!
PS : Please see if anyone can find a non-comparison solution!
Thanks for the solution, I've been brainstorming for so long, can't believe this didn't strike me. Simple and elegant, thanks so much!
PS : Please see if anyone can find a non-comparison solution!
#8
Posted 26 May 2010 - 12:54 PM
O(C(n,2)) = O(n^2)...
#9
Posted 26 May 2010 - 01:58 PM
Yes, but in this case n^2 will be reduced because of collisions. the first method always makes all the comparisons.
And I do still need some non comparitive way :(
And I do still need some non comparitive way :(


Sign In
Create Account

Back to top









