Data Structure Basics pt.4

Table of Contents

Review

So far, in all previous data structures we’ve either use a numeric index or no index at all, like in stacks and queues. The index ones usually are zero-based and are good at accessing it’s own objects. Those data structures are great for keeping data in order. The downside is that we need to know exactly where in the collection is the object we need. Which forces us to search for it. Unfortunately these data structures aren’t the best for searching for items. To learn the best way to search those data structures we need to talk about efficient search algorithms and that’s beyond the scope of this series of articles.

For now let’s talk about data structures that are better designed for searching.

Associative Arrays

Associative arrays give us the ability to use meaningful keys to access each value inside the collection. This effectively creates a meaningful relationship between those two making them a pair. That means that no matter how we sort the pairs inside a collection keys will always return the same values. Actually, since each key must be unique, there is no need to sort them. Values can be almost anything and it’s not required to be unique.

Hash Values

A basic understanding of how hashing works is necessary to understand hash tables.

We can think of a hashing as in cooking. First we chop our ingredients. Then we mix them together until we get a homogeneous dough. Similarly, in order to produce the hash value for an object we take some of its parts and return a simplified integer representation of the whole object. This is very useful in data structures because we can easily find data therein.

Hashing is not encryption. Hashing functions are typically one-way since information gets lost in the process.

Some hashing rules:

  • Hashing should be deterministic under the same context; should always return the same hash value for a given object.
  • Two objects that are equal should return the same hash.
  • Two different objects may return the same hash.

Hashing Collision

Although most languages know how to deal with hashing collision we should be aware of its implications. Imagine we have a hashing function that turns every letter in a given object into a number based on their place in the alphabet. Numbers stay the same, all other characters get discarded. Finally, it returns their sum. Given the object Sam Jones 04/04/1990 our function would compute it like 19 + 1 + 13 + 10 + 15 + 14 + 5 + 4 + 4 + 1990 and return 2075. If we were to pass Fay Adams 10/10/1985 to our function it would return the same number.

So a hash collision happens when we have at least two objects with different data but the same hash value.

Hash Values in Ruby

All instances respond to #hash which returns their hash value. However, when writing our own classes we should override it with our own implementation as to guarantee that the == behaves as expected specially if we override it.

Hash Tables

The main advantage of hash tables over arrays and linked lists is the speed in which they can find an object. They are also good at adding and deleting object.

Hash tables can be either fixed or flexible. In either case, hash tables are usually initialized with buckets waiting for content. A ‘bucket’ is how we call entries in a hash table.

Hash tables, behind the scenes, behave like arrays. When we add a new bucket to a hash table, the key gets passed through a hashing function. Usually the hash value gets reduced, and the number returned is the index of the array behind the scenes. A similar thing happens when we search for an item, and like we saw back in arrays, as long as we know the index of the item we are looking for the size and flexibility of an array are irrelevant.

Managing Collisions

No matter how good a hashing function is, its bound to create a single hash value for more than one object. Hash tables implement ways to deal with it. For instance, the Separate Chaining Technique; whenever there’s a collision the bucket’s value turns into a linked list or an array. Other algorithms include: open addressing, Robin Hood hashing, and 2-choice hashing. But they are beyond the scope of this series.

Associative Arrays in Ruby

Hash tables are implemented in Ruby via the Hash class. Do not confuse it the #hash method mentioned above.

Sets

This is probably the simplest data structure to use, its usually:

  • An unordered collection of objects.
  • No index, sequence or key is set to access an object.
  • No duplicates allowed.
  • Fast lookup - for checking existence, not retrieval.

Behind the scenes, sets behave similarly to hash tables. The main difference is that the object itself gets passed through the hashing function. The hash value is the index of the bucket where it will be stored. That’s also the reason why there can’t be duplicates and why they are really fast at finding.

The reason why sets are not good at retrieving objects. If we think it through, it makes no sense to try to retrieve the very same object we are using to find it. Instead a set is really fast at letting us know whether it contains an element or not.

Sets in Ruby

Ruby has a built-in implementation of sets. Is a hybrid of Array’s intuitive inter-operation facilities and Hash’s fast lookup. For further details check the docs.

Before we go

This time we covered the basics on associative arrays. In the next article in this series we will talk about tree data structures.