 Michael Quinn

Nerd dad who spends way too much time thinking about statistical computing, face-punching and other adventurous things.

Going Old School with the Spam Email Dataset

Introduction

John Marden was a Professor Emeritus when I attended UIUC. While I didn’t have the chance to attend any of his classes, he had a huge influence on my academic career. He is the author of an incredible series of “Old School” courses notes on a variety of topics in statistical analysis.

In honor of Dr. Marden, we’re doing fitting some Old School classification models to an Old School Machine learning dataset. I’ll be looking at two models today: linear discriminant analysis (LDA) and Naive Bayes (NB). I’ll touch a bit on the theory behind each model and the classification problem itself. Most importantly, I’ve developed code to implement each model, as I think both are excellent learning opportunities.

In that regard, this blog now has a package. You can find it on Github, and you can install the most recent version with devtools.

Code is a means of communication, and good code is well-documented code. The easiest way to do that was just put the package together and throw in Roxygen headers for each function. That said, the blog’s sister package is only meant to illustrate the exercises on the blog. It is permanently a work in progress, and nothing was written for perfect application in every context. Caveat emptor.

In today’s post, I’ll be working with the Hewlett-Packard Spam Data. It’s available in the UCI Repository Of Machine Learning Databases. While it’s not that old as far as standard statistical datasets are concerned (the data were published in 1999), it’s already considered a classic classification dataset. The familiarity of the data should help us focus on the real issue at hand, the classification algorithms.

The Problem

A Formal Description of Classification

Consider a multivariate dataset that consists of a $n \times p$ set of predictors $\mathbf{X}$, with $p$ columns corresponding to different variables and $n$ rows for each observation. We assume that each row is independently distributed. For classification problems, each observation belongs to one of $K$ groups described by the length-$n$ vector $y$.

Terminology is always a bit of a problem, but I’ll do my best to adhere to the terminology outlined by Hastie, Tibishirani and Friedman in The Elements of Statistical Learning. The vector $y$ is our outcome or target, and our problem is one of supervised learning. In the spam dataset, an observation can belong to either the “spam” or “not spam” group, meaning that our problem is binary.

A prediction is ultimately a function. Given a single observation of $\mathbf{X}$, let’s call is $x_j$ we want a function that produces our best estimate of the target $\hat{y}_j$. Formally, we want $\mathrm{G}()$ such that

What do we mean by “best” estimate of the target? Well, that’s a bit of a tricky question. Informally, we could define “best” as the estimator that gets us closest to the target. For whatever reason, statisticians prefer to formalize this statement through a loss function ($L$), which is a statement about our error. Here, we’ll use $0-1$ loss:

Our risk is the expected value of our loss function,

Expanding for a finite number of observations $n$, we get

If you dig through a Math Stats book for awhile, you’ll come across the property that the expectation of an indicator function is just the probability of the event. Let’s plug that in.

Minimizing our risk is equivalent to minimizing the probability statement in the left-hand side. Moreover, this minimization problem is equivalent to maximizing a slightly different probability statement. Thanks to the complement property of probabilities,

This gives us what we’re ultimately looking for. We want a function that maximizes the probability that our prediction is the same as the target, given the data that we have. Formally, that’s

In a multi-class problem, we would need to calculate the probability for each class separately and then pick accordingly. In the binary case, finding the max is much easier. Let $\mathrm{G} (x_j) = 1$ if

We can also just treat this as a ratio. This has some nice mathematical properties once we implement our different classifiers.

The fraction on the right-hand-side is known as an expression of odds. For a variety of reasons, it is often much easier to work with the log odds. This implies that $\mathrm{G} (x_j) = 1$ if

The Naive Bayes Classifier

A lot of the differences in our models come from how we evaluate the probability statements that form the log odds. For example, we can apply Bayes Rule. By that, I mean

In the numerator of the expression above, $P(y_i = 1)$ is called the prior probability. It is a statement about the distribution of the groups within our data. The other numerator component, $P(\mathbf{X} = x_i | y_i = 1)$, is called the likelihood. In Bayesian statistics, we often get to ignore the denominator, so we often say that the posterior is proportional to the prior times the likelihood.

Since we’re basing our classification rule on a ratio, we don’t even need to worry about normalizing constants. This is a big advantage of evaluating the odds instead of comparing the individual probability statements. Our problems simplifies to $\mathrm{G} (x_i) = 1$ if

For the Naive Bayes classifier, we assume that each column of $\mathbf{X}$ is normally distributed and independent. This is often too general of an assumption to make, but Naive Bayes tends to perform well regardless. This means that the $j$-th column of the $\mathbf{X}$ matrix now has the following pdf:

Assuming that the columns are independently distributed gives us an easy shortcut to calculating the likelihood in the definition of Bayes Rule above: the likelihood is now just the product of the pdf of each column. Formally,

Big products like this are often hard to work with, but applying the log takes the product and turns it into a sum. Plugging everything back into the odds ratio and simplifying, we get $\mathrm{G} (x_i) = 1$ if

where $p_k$ are the group proportions, $\mu_{kj}$ are the group means for each column and $\sigma_{kj}^2$ are the column variances. This is a pretty easy classification rule to implement.

Linear Discriminant Analysis

So that’s Naive Bayes. Remember, the form the of the classifier is based almost entirely on our assumptions about how we’ll evaluate the log odds. This isn’t the only viable set of assumptions we could apply. Often, Naive Bayes is too general. Instead, we could assume that $\mathbf{X}$ follows a multivariate (MV) normal distribution. That is,

The pdf for the MV normal distribution is similar to the univariate normal, with some additional matrix algebra.

Where $| \Sigma |$ is the determinant of the covariance matrix. Keep in mind that in the above equation we’re dealing with vectors and matrices, not scalars.

With this assumption, we can implement Linear or Quadratic Discriminant Analysis (LDA or QDA). The former has been around for almost 100 years now. Like most of Statistics, it was created by RA Fisher. You can read the original article online, along with everything else published in the very unfortunately named Annals of Eugenics. The name’s so bad that Wiley has to throw up a disclaimer. Ouch.

The methods differ over assumptions about the covariance matrix used in the problem. In LDA, we assume that different groups have the same covariance. We do not make this assumption in QDA. Grouped covariance makes the math in LDA a lot simpler, it also saves us from estimating a bunch of extra parameters. For that reason, QDA tends to overfit and you need a lot of data ( even more than the spam data ) to get improved performance. I’ll show that in this post, but I won’t actually implement QDA (the remaining proof is left as an exercise for the student…).

To estimate the covariance matrix in LDA, we use a pooled covariance:

Where $\mathbf{X}_{ki}$ is the $i$-th observation within the $k$-th group, and $\mu_k$ is the group mean.

Returning to the log odds and plugging in the MV normal pdf, we get the following classification rule. In LDA, this is usually referred as the discriminant function $d_1(x)$.

That’s still pretty ugly. Luckily, there are plenty of opportunities for simplification. Dropping out like terms and carrying the logarithm through the division, we get

Separating the terms that depend on $x$ from those that don’t, we get

Or more simply,

where

and

In multivariate statistics, a quadratic form is the matrix operation $a’ X a$ where $a$ is a length $n$ vector and $X$ is an $n \times n$ matrix. The calculation of $b$ above uses two of them. This expression will come in handy later.

If you haven’t noticed already, the expression $d_1(x) = a’ x + b$ follows the canonical definition of a hyperplane. In discriminant analysis it is often referred to as the separating plane. If none of those words mean anything to you, don’t worry, it’s just the multivariate definition of the classical linear equation $y = ax + b$. Formally speaking, the separating plane maximizes the ratio of two sums of squares, $d_R$,

Where $B$ is the between-group sum of squares

And $W$ is the within-group sum of squares, which is essentially the pooled covariance estimate that we described above.

In the two dimensional binary case, it looks something like this. For now, ignore the function LDA and just focus on the image. I’ll talk about it a lot more soon. We have two color-coded classes with a discriminating line running between them. Observations on one side of the line would be classified as one group and vice versa. Obviously, this is not a perfect classifier, but it is nonetheless optimal for a linear split.

Implementing Each Classifier

You can view the complete code for each classifier on Github. Since they’re a little on the long side, I’ll save you all from posting the complete code here. Here’s LDA, and here’s Naive Bayes. Both functions depend on the purrr package (thanks Hadley!) for some functional programming tools. You can learn about it here. If I had to pick a favorite package, it would probably be it.

The implementation of each classifier follows a similar strategy:

• Use model.frame to convert the formula input into a standard data frame. The target is always the first column and the remainder is always the set of predictors.
• Perform some assertions and clean up factors. I created a simple function to_number to convert a factor to numeric. It uses purrr::compose to chain as.character and as.numeric.
• Get the components of each equation: prior probabilities, means and variances.
• Last map the components to the data matrix and collect the results.

The mapping function does most of the work for each classification method. In LDA, the mapper calculates the equation of the discriminating plane.

In Naive Bayes, the mapper calculates the log odds with the column means and variances for each group.

The function log_odds function calcates the kernel function for each class class_dist and adds the log of the variance ratio.

A Test

So, which function is a better fit for our spam problem? Well, we’re statisticians. Let’s create a test.

• Randomly split spam_data into test and training components
• Fit each model to the former and score on the latter
• Repeat many, many times (or 100 times. That’s many, many. Right?)

Since this is a classification problem, I used the F1 score as a performance metric. This score is the harmonic mean of two separate metrics:

• precision the proportion of cases predicted to be true that are actually true
• recall the proportion of true cases classified as true.

In code, we implement this with three simple equations.

Before we get to the test, we should check out the spam dataset in detail. Let’s pull it in from the web and coerce the target into a factor. We’ll be comparing our methods against implementations within caret, and that package requires factors for classification.

An EDA should show some potential pitfalls in the test.

mean sd min p25 median p75 max
word_freq_make 0.1046 0.3054 0 0.000 0.000 0.000 4.540
word_freq_address 0.2130 1.2906 0 0.000 0.000 0.000 14.280
word_freq_all 0.2807 0.5041 0 0.000 0.000 0.420 5.100
word_freq_3d 0.0654 1.3952 0 0.000 0.000 0.000 42.810
word_freq_our 0.3122 0.6725 0 0.000 0.000 0.380 10.000
word_freq_over 0.0959 0.2738 0 0.000 0.000 0.000 5.880
word_freq_remove 0.1142 0.3914 0 0.000 0.000 0.000 7.270
word_freq_internet 0.1053 0.4011 0 0.000 0.000 0.000 11.110
word_freq_order 0.0901 0.2786 0 0.000 0.000 0.000 5.260
word_freq_mail 0.2394 0.6448 0 0.000 0.000 0.160 18.180
word_freq_receive 0.0598 0.2015 0 0.000 0.000 0.000 2.610
word_freq_will 0.5417 0.8617 0 0.000 0.100 0.800 9.670
word_freq_people 0.0939 0.3010 0 0.000 0.000 0.000 5.550
word_freq_report 0.0586 0.3352 0 0.000 0.000 0.000 10.000
word_freq_addresses 0.0492 0.2588 0 0.000 0.000 0.000 4.410
word_freq_free 0.2488 0.8258 0 0.000 0.000 0.100 20.000
word_freq_business 0.1426 0.4441 0 0.000 0.000 0.000 7.140
word_freq_email 0.1847 0.5311 0 0.000 0.000 0.000 9.090
word_freq_you 1.6621 1.7755 0 0.000 1.310 2.640 18.750
word_freq_credit 0.0856 0.5098 0 0.000 0.000 0.000 18.180
word_freq_your 0.8098 1.2008 0 0.000 0.220 1.270 11.110
word_freq_font 0.1212 1.0258 0 0.000 0.000 0.000 17.100
word_freq_000 0.1016 0.3503 0 0.000 0.000 0.000 5.450
word_freq_money 0.0943 0.4426 0 0.000 0.000 0.000 12.500
word_freq_hp 0.5495 1.6713 0 0.000 0.000 0.000 20.830
word_freq_hpl 0.2654 0.8870 0 0.000 0.000 0.000 16.660
word_freq_george 0.7673 3.3673 0 0.000 0.000 0.000 33.330
word_freq_650 0.1248 0.5386 0 0.000 0.000 0.000 9.090
word_freq_lab 0.0989 0.5933 0 0.000 0.000 0.000 14.280
word_freq_labs 0.1029 0.4567 0 0.000 0.000 0.000 5.880
word_freq_telnet 0.0648 0.4034 0 0.000 0.000 0.000 12.500
word_freq_857 0.0470 0.3286 0 0.000 0.000 0.000 4.760
word_freq_data 0.0972 0.5559 0 0.000 0.000 0.000 18.180
word_freq_415 0.0478 0.3294 0 0.000 0.000 0.000 4.760
word_freq_85 0.1054 0.5323 0 0.000 0.000 0.000 20.000
word_freq_technology 0.0975 0.4026 0 0.000 0.000 0.000 7.690
word_freq_1999 0.1370 0.4235 0 0.000 0.000 0.000 6.890
word_freq_parts 0.0132 0.2207 0 0.000 0.000 0.000 8.330
word_freq_pm 0.0786 0.4347 0 0.000 0.000 0.000 11.110
word_freq_direct 0.0648 0.3499 0 0.000 0.000 0.000 4.760
word_freq_cs 0.0437 0.3612 0 0.000 0.000 0.000 7.140
word_freq_meeting 0.1323 0.7668 0 0.000 0.000 0.000 14.280
word_freq_original 0.0461 0.2238 0 0.000 0.000 0.000 3.570
word_freq_project 0.0792 0.6220 0 0.000 0.000 0.000 20.000
word_freq_re 0.3012 1.0117 0 0.000 0.000 0.110 21.420
word_freq_edu 0.1798 0.9111 0 0.000 0.000 0.000 22.050
word_freq_table 0.0054 0.0763 0 0.000 0.000 0.000 2.170
word_freq_conference 0.0319 0.2857 0 0.000 0.000 0.000 10.000
char_freq_; 0.0386 0.2435 0 0.000 0.000 0.000 4.385
char_freq_( 0.1390 0.2704 0 0.000 0.065 0.188 9.752
char_freq_[ 0.0170 0.1094 0 0.000 0.000 0.000 4.081
char_freq_! 0.2691 0.8157 0 0.000 0.000 0.315 32.478
char_freq_\$ 0.0758 0.2459 0 0.000 0.000 0.052 6.003
char_freq_# 0.0442 0.4293 0 0.000 0.000 0.000 19.829
capital_run_length_average 5.1915 31.7294 1 1.588 2.276 3.706 1102.500
capital_run_length_longest 52.1728 194.8913 1 6.000 15.000 43.000 9989.000
capital_run_length_total 283.2893 606.3479 1 35.000 95.000 266.000 15841.000

The spam dataset is pretty sparse. As it turns out, this will cause a few problems for our Naive Bayes classifier. If we randomly split the dataset for cross validation, there is a nontrivial chance that one or more of our columns with be constant and have zero variance. You can’t divide by zero, and we’ll get an NaN returned.

I added a couple functions to check for zero variance columns and remove them from a data frame. This was added to the function naive_bayes().

Our target is close to balanced. This is very helpful in classification, and we shouldn’t have to worry about a skewed target resulting from random sampling.

As a check for our implementation for each method, I’ll pull out the same classification methods in the caret packaged. For all intents and purposes, this is the gold standard for predictive modeling in R. It has a wonderul site as well, which is a thorough tutorial on predictive modeling as much as it is package documentation. If we compete well with caret, we’ve done a good job. caret has LDA, QDA and Naive Bayes. We’ll use all three, along with the option of removing zero variance columns from our training data.

Since we’ll be calling all of these models against the same datasets, it’s easiest to just store these as a list and invoke them. We’ll add another function called score_model() with methods for our methods and caret’s default train class. This will standardize all of our outputs and make them easier to compare.

We need a simple helper function to split the dataset.

And the function rerun in purrr will generate all 100 splits for us.

The actual test is a relatively simple pipeline. We take our list of data frames. We call each of the models with the combination of map and invoke_map. The results in a list of lists, where each item in the top level of the list contains a list of five fit models. With that, we map again. We takes the test data out of dfs and use the models in the list of lists to get our scores.

The standardized outputs of score_model, which has methods for all of the classes of models involved, makes sure that everything is eventually consistent. It’s the grease that makes the whole machine run.

So, what did we get? Right now, we have a pretty messy nested list. To clean this up, we need to make some data frames and bind them together. After that, we can put together a little plot. Since we’re comparing results across distinct categories (the models), a boxplot is best. Our version of LDA performed exactly the same as caret’s. YAY! Either one is the best, but I’m pretty sure you know which one I love more.

Our version of Naive Bayes lagged a little, but it’s way ahead of caret’s QDA and Naive Bayes. Why’s that? It took awhile to diagnose, but it turns out that caret’s default correction for zero variance is pretty aggressive. So much so that it adversely affects the model’s performance. We have much tighter standards for zero variance columns, and end up rejecting a much smaller number of them.

Moreover, the version of Naive Bayes used by caret comes from the package klaR, which does not use log odds in its classification rule. There’s probably a good reason for this, but the result is something that doesn’t do that well with very sparse data frames. We get a long stream of warnings about probabilities close to 0 or 1 is the odds. It ultimately has to round, which introduces a host of problems.

We could expand on this test by throwing in some other methods, but that’s probably enough for today. Please, take the time to check out the implementation of each method in the blog’s package. It’s surprisingly simple and straightforward. And please don’t hesitate to let me know your thoughts.