Perl Hashes, Bloom Filters, and You

Peter Bowen - Thousand Oaks Perl Mongers

October 21, 2015

Perl Hashes

How do they Work

Attribution

Totally Ripped off from LA.pm in early October 2015

Problem...

  • We need a good way to quickly look up a value based on a key.

Arrays to the rescue


    my @array = ( ... );
    my $index = 100;
    my $value = $array[$index];
    

Not what I want!

  • Arrays only support integer indexes.
  • I want to be able to use any data, Strings, objects, whatever!

Linked List?


contains(assoc_list, key) {
    while (assoc_list && assoc_list.key != key)
        assoc_list = assoc_list.next
    return assoc_list;
}

get(assoc_list, key) {
    node = contains(assoc_list, key)
    if (node) return node.value
    else      return Null
}

store(&assoc_list, key, value) {
    node = contains(assoc_list, key)
    if (node) node.value = value
    else {
        newnode = new Node(key, value);
        newnode.next = assoc_list
        assoc_list = newnode
    }
}

Isn't that good enough?

  • Scale...
    • All operations run in O(n)
    • Insering n items takes O(n2)
  • Not practical for very large lists
  • Everyone knows how to fix this...

Trees!

Trees!

  • Faster!
    • Typically O(log n) search time.
    • Inserts are faster at O(n log n) time.
  • But wait...
    • Worst case is O(n)
    • Surprisingly Common
      • Sort the keys before storing
      • Stringy trees are slower to search than linked lists
      • Just use and algorythm to fix it!

2-3 Tree

Red Black Tree

AVL Tree

AARGH!

Are you kidding me?

  • So Complicated

Is there an easier way?

  • The solution should be:
    • Fast
    • Simple to understand and troubleshoot
    • Flexible

Hash Tables!

Hash Tables!

  • Algorythms are MUCH easier to understand
  • Better Average Behavior
  • Worst Case is still the same
    • Rare - Extremely Rare

Keys

  • Almost always strings
  • Non-strings are often converted to strings - hopefully unique
  • Not really useful without unique keys

How does it work?

  1. Create an array of reasonable size
  2. Hash the key creating an integer (Magic)
  3. Put it into the right bucket

But Wait...

  • How do you create the hashed value?
  • What happens when more than one key hashes?
  • What happens when you run out of buckets?
  • Is it performant?

Hashing Algorithm

  1. Find a good pseudo random number generator
  2. Seed it with the key
  3. Generate a random number
  4. Divide it by the number of buckets
  5. The remainder is the index

Notes

Code


    my %hash;

    #Give Perl a hint if you know how big it is...
    # keys(%hash) = 1_000_000;

    foreach my $number ( 1 .. 1_000 ) {
        $hash{$number}++;
        print $number, " : ", scalar %hash, "\n";
    }

Optimizing the Randomizer

  • Specific Data Types
  • Generally not worth your time

Collisions

  • When Storing:
    1. Use a list
    2. Append the new item to the front of the list
  • When Retreiving:
    1. Find the right bucket
    2. Traverse the list looking for the key match

Scaling the Table

  1. Double the size of the table
  2. Redistribute the data
  • Perl stores the hashed value with the key
  • Doubling the table decreases the frequency

Performance

  • Store and Retrieve are at O(log n)
  • Worst Case is O(n) - Although extemely rare
  • Rebuilding takes O(n)

Bloom Filters

Problem

I have a value that I need to compare to a large set of values.

Use an Array!

  • Ummmm... have to traverse the array

Use a Hash!

  • Ummm... Memory Usage?

A better way

Can I quickly eliminate a whole class of values as not being in the list before I have to go and search for them?

Bloom Filter

Space/Time Trade-offs in Hash Coding with Allowable Errors

  • Rapidly eliminate a value from a set
  • Burton H. Bloom - 1970

How does it work

  • Building the Filter
    1. Create a large binary list
    2. Get two hashing algorythms
    3. Apply both values to the binary list

How does it work?

  1. Apply both algorythms to the value
  2. If they're not both there, then false
  3. If they are there, then use the longer check

Example

Bloomfilter Example