Overview
Introduction
Over at the R Language subreddit, a college student has an interesting problem prompt. The redditor writes:
Hi, the question asks us to create a sample of standard normal distribution, n=100. Then group it into 10 groups of 10 numbers in ascending order (group 1 is the 10 smallest numbers, where group 10 is the 10 largest numbers). I then struggle on this part, “Give a vector of values (let’s call it y) which says which group each observation falls in, in the original vector x. So if the first entry in x falls in the fifth group the vector y will have at first entry value 5 and so on.”
Another helpful redditor links to a question thread over at Stack Overflow which roughly boils down to “just use cut
”. That’s not so fun. Maybe we can tackle the problem differently.
Instead, let’s solve the problem using hash tables. For those not in the know, a hash table is a standard computer science data object that consists of key, value pairs. Python’s dict
objects are an example. Hash tables are usually generated through the application of a hash function, which takes some standard data object and associates it with key values in a useful manner.
In R, we can do all of this with lists. The name will be our key, and the stored object will be our value. It might be a little old fashioned, but “hashing” a vector can teach us a lot about lists in R and functions in R. Many new users might not recognize how powerful these objects actually are.
Building a hash table
Let’s begin. As always, let’s grab a couple packages. As I prefer functional programming in R, we’ll be using magrittr
today.
If you haven’t seen this package before, please check out its wonderful introductory vignette. To cut to the chase, magrittr
implements a “pipe” with the infix function %>%
. The pipe takes some like this
And turns it into this:
Specifically speaking, the pipe moves the left-hand side into the first argument of the right-hand side. If it’s the only argument, you don’t even need to provide parentheses. If you don’t want the left-hand side to be the first argument, use can set the exact spot with a dot .
. It’s pretty amazing, and there are many other functional manipulation tools in the package. Seriously, read the vignette (or at least read it when you’re done here).
Like the Wikipedia article explained, we need a hash function in order to use hash tables. Here’s one.
Our hash function implements an algorithm consisting of three functions.
seq_along
take any vector and generates an integer sequence of the same length.cut
takes a numeric vector and divides it intonbins
intervals. Setting thelabel
argument toFALSE
lets us keep the bin number as the label. This will be the key component of our hash table.split
takes a factor or an object that can be coerced into a factor and divides the vector into groups. We will sort our input vectorx
to make the grouping more meaningful.
This function should output a list of length nbins
with a vector in each bucket. Using cut
has the advantage of controlling overflow. If the modulo division of the length of x
by the number of bins does not equal zero, cut will automatically select which buckets receive the extra values. This saves us from having to write code to check if the vector can be “equally” divided by the number of bins we want.
Searching for values
Now that we can hash numeric vectors, it would be nice to take a value y
and find it in our buckets. First, let’s make an function to check if a value is within a sequence.
Note that we have two different definitions of “within” above. For continuous variables, we would want an idea of “within” that encompasses the range of the variable. For discrete variables, we might prefer a “within” function that checks to see if that exact value is in the bin.
With these functions in hand, let’s apply them to our the table.
The first part of lookup
allows us to use the a string in the arguments to get either of our first two functions. First, we ensure that the user has supplied one of the two acceptable function names, then we use get
to retrieve it. Next vapply
calls the function use each vector in contained in the entries of the list, resulting in a logical vector. which
tells us which entries in the logical vector are TRUE
, and we subtract 1
in order to index from zero. This should make the Computer Scientists happy (I hope you’re happy).
For convenience sake, we should immediately combine the preceding functions into a single call.
Both functions begin with a simple assertion that ensures we get either a hash table, as a list, or everything we need to make one. The important stuff happens in the return line. The first function above allows us to search for an individual value y
in a hash table formed from x
.
We extend that in the second, vectorized, function. Now, we can search for all of the values in a vector, and get all of their hash locations. We keep the assertion and elements to create a hash table in order to avoid creating multiple copies of the hash table when calling a_hash_lookup
with lapply
.
Last but not least, to solve the redditor’s problem, we create another lookup that can find values in a vector against a hashed version of itself. We provide the added option of specifying which values in the vector through the id
argument.
Testing the lookup
Now comes the fun part, actually testing our work. We’ll start with a vector of 100 normally distributed random variables.
Does the hash function work?
Sure looks like it. Even better, we can see how cut
handles overflow. Next up, let’s test the lookup function for single values.
The second expression above, which calls which
tells us the sorted position of x[2]
. It is 31st largest value, since sort
is ascending by default. When we use 10 bins, it is in the 3rd bucket of our hash table (again, we’re indexing from 0. ARE YOU HAPPY!?!?).
Now we make sure that our lookup can handle more than one id
value.
Which it does. Unfortunately, the continuous example masks an unfortunate quick lurking in our code. What if our value matches multiple bins? By default, our function returns all of the matches. For example, take a random sample of integers between 1 and 10.
We can see that the value 4
appears in multiple bins. We might want to create another version of our lookup to trim the results and avoid that behavior. Of course, the type of trimming we want might vary depending on the situation, so we should let the user choose. Some good options are the first, last
Although it would be nice to just pass a simple function to one call of vapply
, we don’t have that option. median
takes only one argument by default, while head
and tail
both take two. To solve this, we use switch
to pull the function we want from a set of possible values.
Now do we get what we want?
Sure looks like it. The output is a vector, and we have ensured no multiple matches.
Let’s do something practical
R is a language primarily designed for statisticians, and so far, nothing has resembled statistics all that much. But there are (inefficient) practical examples of the hash table lookup functions that we built. For example, what if we wanted to bin the numerical vectors in a data frame. Well, now that we have a trimmed version of our lookup function, it’s quite easy!
We drop the fifth column from binning iris
, since it’s a factor. A factor doesn’t necessarily have an implicit order, like numeric values, and so our original hash function doesn’t make much sense. Maybe we can build a different one later.
Nonetheless, for our purposes, we’ve binned the numeric values in iris
. Then again, it would be nice to include the ranges of the values as labels to our binned results as well as save the binned values as factors. After all, bins are often thought of as categorical values. If only there were a function for that. Wait… there’s cut
!
Which answers the redditor’s question! She should probably just use cut
. But at least now we have a full understanding of why!
If we add tidyr::gather
, we’ve surprisingly backed our way into a histogram.
This obviously isn’t the easiest way to get that plot (geom_histogram
is), but it’s nice to have stumbled upon it. In general, cut
should outperform the functions we’ve already written, since calls an Internal function called .bincode
. All of R’s internals are written in C. The difference is pretty extreme.
Still, we’ve essentially built a pure R framework for a function like cut
. More importantly, we’ve had the chance to see some of the amazing flexibility available in R’s lists, and hopefully had a little fun along the way.