Perl Hashes, Bloom Filters, and You
Peter Bowen - Thousand Oaks Perl Mongers
October 21, 2015
Perl Hashes
How do they Work
Attribution
Totally Ripped off from LA.pm in early October 2015
Problem...
- We need a good way to quickly look up a value based on a key.
Arrays to the rescue
my @array = ( ... );
my $index = 100;
my $value = $array[$index];
Not what I want!
- Arrays only support integer indexes.
- I want to be able to use any data, Strings, objects, whatever!
Linked List?
contains(assoc_list, key) {
while (assoc_list && assoc_list.key != key)
assoc_list = assoc_list.next
return assoc_list;
}
get(assoc_list, key) {
node = contains(assoc_list, key)
if (node) return node.value
else return Null
}
store(&assoc_list, key, value) {
node = contains(assoc_list, key)
if (node) node.value = value
else {
newnode = new Node(key, value);
newnode.next = assoc_list
assoc_list = newnode
}
}
Isn't that good enough?
- Scale...
- All operations run in O(n)
- Insering n items takes O(n2)
- Not practical for very large lists
- Everyone knows how to fix this...
Trees!
- Faster!
- Typically O(log n) search time.
- Inserts are faster at O(n log n) time.
- But wait...
- Worst case is O(n)
- Surprisingly Common
- Sort the keys before storing
- Stringy trees are slower to search than linked lists
- Just use and algorythm to fix it!
2-3 Tree
Red Black Tree
AVL Tree
AARGH!
Are you kidding me?
Is there an easier way?
- The solution should be:
- Fast
- Simple to understand and troubleshoot
- Flexible
Hash Tables!
- Algorythms are MUCH easier to understand
- Better Average Behavior
- Worst Case is still the same
Keys
- Almost always strings
- Non-strings are often converted to strings - hopefully unique
- Not really useful without unique keys
How does it work?
- Create an array of reasonable size
- Hash the key creating an integer (Magic)
- Put it into the right bucket
But Wait...
- How do you create the hashed value?
- What happens when more than one key hashes?
- What happens when you run out of buckets?
- Is it performant?
Hashing Algorithm
- Find a good pseudo random number generator
- Seed it with the key
- Generate a random number
- Divide it by the number of buckets
- The remainder is the index
Code
my %hash;
#Give Perl a hint if you know how big it is...
# keys(%hash) = 1_000_000;
foreach my $number ( 1 .. 1_000 ) {
$hash{$number}++;
print $number, " : ", scalar %hash, "\n";
}
Optimizing the Randomizer
- Specific Data Types
- Generally not worth your time
Collisions
- When Storing:
- Use a list
- Append the new item to the front of the list
- When Retreiving:
- Find the right bucket
- Traverse the list looking for the key match
Scaling the Table
- Double the size of the table
- Redistribute the data
- Perl stores the hashed value with the key
- Doubling the table decreases the frequency
-
Performance
- Store and Retrieve are at O(log n)
- Worst Case is O(n) - Although extemely rare
- Rebuilding takes O(n)
Bloom Filters
Problem
I have a value that I need to compare to a large set of values.
Use an Array!
- Ummmm... have to traverse the array
A better way
Can I quickly eliminate a whole class of values as not being in the list before I have to go and search for them?
Bloom Filter
Space/Time Trade-offs in Hash Coding with Allowable Errors
- Rapidly eliminate a value from a set
- Burton H. Bloom - 1970
How does it work
- Building the Filter
- Create a large binary list
- Get two hashing algorythms
- Apply both values to the binary list
How does it work?
- Apply both algorythms to the value
- If they're not both there, then false
- If they are there, then use the longer check