Introduction

The glmnetLRC package enables lasso and elastic-net logistic regression classification (LRC) with an arbitrary loss function. It makes it easy to construct, from virtually any number of quantitative predictors, a binary classifier that assigns an example, or observation, to one of two classes.

glmnetLRC extends the glmnet package by making it possible to train lasso or elastic-net LRCs using a customized, discrete loss function to measure classification error. This allows users to assign different loss values to false positive and false negative errors. The logistic regression parameter estimates are obtained by maximizing the elastic-net penalized likelihood function, which contains several tuning parameters. These tuning parameters are estimated by minimizing the expected loss, which is calculated using cross validation.
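
For readers who want the fit in symbols, here is a brief sketch using the standard glmnet formulation of elastic-net penalized logistic regression (this is the usual glmnet objective, not notation specific to glmnetLRC): with a binary response $y_i \in \{0, 1\}$ and predictor vector $x_i$, the coefficients minimize

$$
-\frac{1}{N}\sum_{i=1}^{N}\Big[y_i\,(\beta_0 + x_i^\top\beta) - \log\big(1 + e^{\beta_0 + x_i^\top\beta}\big)\Big] \;+\; \lambda\Big[(1-\alpha)\,\tfrac{1}{2}\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Big],
$$

where λ controls the overall strength of the penalty and α mixes the ridge and lasso terms. glmnetLRC then chooses α, λ, and the classification threshold τ by minimizing the cross-validated expected loss.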

You can find the standard help files for package functions by following the bold blue links.

–Landon Sego and Alexander Venzin

Package installation

Begin by installing dependencies from CRAN:

install.packages(c("devtools", "glmnet", "plyr"))

The Smisc package (which is imported by glmnetLRC) contains C code and requires compilation. To do this

  • on a Mac, you’ll need Xcode
  • on Windows, you’ll need to install Rtools
  • on Linux, compilation should take place “automatically”

With the compilation tools in place, you can now install the Smisc and glmnetLRC packages from the PNNL Github site as follows:

devtools::install_github("pnnl/Smisc")
devtools::install_github("pnnl/glmnetLRC")

Now load the package as usual:

library(glmnetLRC)

Vignette

The methods in the glmnetLRC package were originally implemented to automate the process of determining the curation quality of mass spectrometry samples (Amidan et al. 2014). Those same data are used here to demonstrate how to train your own classifier. In the sections that follow, we show how to use the glmnetLRC package to train LRC models, extract coefficients, predict the binary class of new observations, summarize the performance of those predictions, and create diagnostic plots.

Preliminaries

Let’s begin by loading the package and the training data:

# Load the VOrbitrap Shewanella QC data
data(traindata)

# A view of the first two rows and first 12 columns
traindata[1:2, 1:12]
      Instrument_Category Instrument Dataset_ID Acq_Time_Start Acq_Length
pt701           VOrbitrap VOrbiETD03     251690     12/31/2011         98
pt702           VOrbitrap VOrbiETD03     251706       1/1/2012         98
                                          Dataset Dataset_Type
pt701 QC_Shew_11_06_col2A_30Dec11_Cougar_11-10-11      HMS-MSn
pt702 QC_Shew_11_06_col2C_30Dec11_Cougar_11-10-11      HMS-MSn
      Curated_Quality XIC_WideFrac XIC_FWHM_Q1 XIC_FWHM_Q2 XIC_FWHM_Q3
pt701            good     0.297090     19.3820     21.1900     24.3149
pt702            good     0.305519     19.3785     21.1812     24.3262
# Columns 9 to 96 contain various measures of dataset quality that
# we will use to predict the "Curated_Quality"
predictors <- as.matrix(traindata[,9:96])

We fit the LRC model by calling glmnetLRC(), which requires a binary response variable coded as a factor. The order of the factor levels is important: the class we want to predict with the greatest sensitivity should be encoded as the second level. To illustrate how this is done, consider the Shewanella QC data, where the objective is to be sensitive to predicting poor datasets. Hence we code poor last, as follows:

response <- factor(traindata$Curated_Quality,
                   levels = c("good", "poor"),
                   labels = c("good", "poor"))

levels(response)
[1] "good" "poor"

Using lossMatrix(), we can define a discrete loss matrix. For the curation of dataset quality, predicting good when the dataset is poor is considerably worse (Loss = 5) than predicting poor when the dataset is good (Loss = 1). Correct predictions have zero loss:

# Define the loss matrix
lM <- lossMatrix(c("good","good","poor","poor"),
                 c("good","poor","good","poor"),
                 c(     0,     1,     5,     0))

# Observe the structure of the loss matrix
lM
           Predicted.good Predicted.poor
Truth.good              0              1
Truth.poor              5              0

Training

To train an elastic-net model, the user needs to supply a handful of arguments to glmnetLRC(). The mandatory arguments are the true class labels, truthLabels (which, in this case, is the response object we created above), and the matrix of predictor variables, predictors. Noteworthy additional arguments include

  • the loss matrix lossMat, if you want something other than 0-1 loss;
  • tauVec, a vector of potential values of the threshold parameter τ ∈ (0, 1) that are used to dichotomize the predicted probabilities from the logistic regression into two class labels;
  • alphaVec, a vector of potential values of the elastic-net mixing parameter α ∈ [0, 1], which governs the tradeoff between the two regularization penalties. When α = 0, L2 regularization (the ridge penalty) is used, and when α = 1, L1 regularization (the lasso penalty) is used;
  • cvFolds, the number of cross validation folds;
  • cvReps, the number of times the cross validation process is repeated with a different random partition of the data; and
  • nJobs, the number of parallel jobs to run during the training process.

Heavier sampling of tauVec or alphaVec (i.e., sequences of greater length) leads to increased computation time, but more of the parameter space will be sampled, potentially leading to a better classifier.
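
As an illustration of how these arguments fit together, a fully spelled-out call might look like the sketch below. The grids and cross validation settings here are example values chosen for exposition, not the package defaults; the fit we actually use in this vignette appears in the next code block.

# Illustrative call only: the tuning grids and CV settings are example values,
# not the package defaults
glmnetLRC_custom <- glmnetLRC(response, predictors,
                              lossMat  = lM,
                              tauVec   = seq(0.05, 0.95, by = 0.05), # threshold grid
                              alphaVec = seq(0, 1, by = 0.2),        # elastic-net mixing grid
                              cvFolds  = 5,                          # folds per CV replicate
                              cvReps   = 10,                         # number of CV replicates
                              nJobs    = 4)                          # parallel jobs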

We now call glmnetLRC(). Be advised that if you are actually running the code shown below, it will take a while. Fewer values of alphaVec or tauVec and a smaller value of cvReps will make it run faster.

# Set the number of cores to be one less than the number available
njobs <- max(1, parallel::detectCores() - 1)

# Fit the LRC model
glmnetLRC_fit <- glmnetLRC(response, predictors, lossMat = lM, 
                           estimateLoss = TRUE, nJobs = njobs)

The call to glmnetLRC() uses cross validation to solve for the optimal parameter settings (α, λ, τ) that minimize the expected loss for the elastic-net LRC, where λ is the glmnet regularization parameter. Printing the resulting object shows the median value of each parameter over the cross validation replicates, as well as the mean and standard deviation of the expected loss values calculated for each cross validation replicate. In the printed output, Df is the number of non-zero coefficients and %Dev is the fraction of deviance explained, as reported by glmnet:

print(glmnetLRC_fit)
The optimal parameter values for the elastic-net logistic regression fit: 
     Df      %Dev alpha    lambda  tau mean.ExpectedLoss sd.ExpectedLoss
[1,] 12 0.7343156     1 0.0140946 0.15         0.1553846      0.02144056

We can also extract the non-zero coefficients of the elastic-net logistic regression model that was created using the optimal values of α and λ (which were shown by the call to the print() method above):

coef(glmnetLRC_fit)
   (Intercept)   XIC_WideFrac  XIC_Height_Q3     MS1_TIC_Q3     MS1_TIC_Q4 
  9.025824e+00  -2.624710e+01   1.678408e+00   7.354373e-02   2.669031e-01 
     MS2_Count MS2_Density_Q1           C_4A          IS_1A         MS1_2A 
 -6.683437e-05  -1.325541e-03   7.124788e-02   8.513326e-03   9.438797e-04 
          P_1A           P_1B           P_2B 
 -5.924279e-02   1.249349e-02  -6.508943e-04 
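
Before moving on to prediction, it may help to see how these coefficients map to a predicted probability. The sketch below manually applies the logistic link to the first training observation; it assumes the coefficients are reported on the original predictor scale (as glmnet does) and that coef() returns a named numeric vector, as the output above suggests. It is an illustration, not a package function.

# Manual illustration of the logistic link (a sketch, not a package function);
# assumes coef() returns a named vector on the original predictor scale
cf  <- coef(glmnetLRC_fit)
x1  <- unlist(traindata[1, names(cf)[-1]])  # predictor values for the first row
eta <- cf["(Intercept)"] + sum(cf[-1] * x1) # linear predictor
1 / (1 + exp(-eta))                         # predicted probability of "poor"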

Prediction

Now that the classifier has been properly trained and the optimal parameters have been identified, we are interested in making predictions for new data observations. This requires the elastic-net regression model (the output from glmnetLRC()) and the set of new observations to be predicted, newdata. Note that newdata must contain all the columns (with equivalent names) that were used to train the LRC. If true labels are available in newdata, the column containing these true class labels can be specified via the truthCol argument. Additionally, one may wish to carry through a subset of the explanatory variables in newdata. These columns are indicated using keepCols. True labels are not required to make predictions—but they are required to compute performance metrics (sensitivity, specificity, etc.) for the LRC. We begin by testing the classifier by predicting the data used to train it:

# Predict the training data
predictTrain <- predict(glmnetLRC_fit, traindata, 
                        truthCol = "Curated_Quality", keepCols = 1:2)

# Look at beginning of the predicted data.  Note the extra columns that were 
# kept:  "Instrument_Category" and "Instrument"
head(predictTrain)
           Prob PredictClass Curated_Quality Instrument_Category
pt701 0.1642061         poor            good           VOrbitrap
pt702 0.1385627         good            good           VOrbitrap
pt703 0.0524000         good            good           VOrbitrap
pt704 0.3173680         poor            good           VOrbitrap
pt706 0.9917682         poor            poor           VOrbitrap
pt707 0.9948105         poor            poor           VOrbitrap
      Instrument
pt701 VOrbiETD03
pt702 VOrbiETD03
pt703 VOrbiETD03
pt704 VOrbiETD03
pt706 VOrbiETD02
pt707 VOrbiETD02

The Prob column above is the probability, predicted by the logistic regression model, that the observation is a poor dataset. The PredictClass label follows from comparing Prob with the optimal threshold τ identified during training (0.15 here): for example, pt701 (Prob ≈ 0.164) is labeled poor, while pt702 (Prob ≈ 0.139) is labeled good.

We can summarize the performance of the classifier predictions with a call to the summary() method. The performance metrics are oriented toward being sensitive to predicting a poor dataset. Thus, a false positive is predicting a dataset to be poor when it is good, and a false negative is predicting a dataset to be good when it is poor. This orientation results from setting poor as the second level of response.

# Summarize the performance of the new classifier in terms of a variety of metrics:
summary(predictTrain)
Logistic regression classification performance:

                          poor
sensitivity         0.97979798
specificity         0.90265487
false negative rate 0.02020202
false positive rate 0.09734513
accuracy            0.92615385

Summary of predicted probabilities, by true class:

TrueClass: good
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.007506 0.022510 0.036520 0.080950 0.067650 0.802500 
-------------------------------------------------------- 
TrueClass: poor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.02915 0.75690 0.90930 0.81520 0.99050 0.99960 

Now let’s bring in some new data and examine the performance of the classifier:

# Load the data for testing
data(testdata)

# Create a table showing the true number of good and poor datasets
with(testdata, table(Curated_Quality))
Curated_Quality
good poor 
  38   61 
# Predict new data
predictTest <- predict(glmnetLRC_fit, testdata, truthCol = "Curated_Quality")

# Look at the first few rows
head(predictTest)
           Prob PredictClass Curated_Quality
931  0.93749105         poor            good
1449 0.05542810         good            good
1467 0.03719369         good            good
1468 0.03208644         good            good
1470 0.02964643         good            good
1501 0.13626848         good            good
# Summarize the output of predicting the test data
summary(predictTest)
Logistic regression classification performance:

                          poor
sensitivity         0.95081967
specificity         0.84210526
false negative rate 0.04918033
false positive rate 0.15789474
accuracy            0.90909091

Summary of predicted probabilities, by true class:

TrueClass: good
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000252 0.0222400 0.0387600 0.1156000 0.1096000 0.9375000 
-------------------------------------------------------- 
TrueClass: poor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.03867 0.45480 0.87730 0.71300 0.99270 1.00000 
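
These metrics can be reconciled with the class counts tabulated above. With 61 truly poor and 38 truly good test datasets, the sensitivity of 0.9508 corresponds to 58 of the 61 poor datasets being flagged as poor, and the specificity of 0.8421 corresponds to 32 of the 38 good datasets being labeled good; accuracy is then (58 + 32)/99.

# Quick arithmetic check of the reported test-set metrics, derived from the
# good/poor counts tabulated above
c(sensitivity = 58/61, specificity = 32/38, accuracy = (58 + 32)/99)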

We can also create a plot of the probabilities produced by the logistic regression model. The more separated the probabilities of the good and poor datasets, the better the classifier:

plot(predictTest, scale = 0.5, legendArgs = list(x = "topright"))

If we don’t include a truth column in the call to predict(), the summary() method provides a numeric summary of the predicted probabilities and counts the number of observations assigned to each category by the LRC:

predictTest1 <- predict(glmnetLRC_fit, testdata)
summary(predictTest1)
Counts of predicted classes:

good poor 
  35   64 

Summary of predicted probabilities, by predicted class:

PredictedClass: good
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
2.518e-05 2.135e-02 3.396e-02 4.941e-02 7.507e-02 1.385e-01 
-------------------------------------------------------- 
PredictedClass: poor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1614  0.4629  0.8701  0.7212  0.9881  1.0000 

And the plot() method shows the predicted probabilities without distinguishing between the truly good or poor datasets:

plot(predictTest1, scale = 0.5)

Diagnostics

Finally, we would like to get a sense of the distribution of the tuning parameters chosen during the cross validation phase. The plot() method produces a 3 × 3 scatterplot matrix of the optimal triples (α, λ, τ) selected for each cross validation replicate, so each point represents the optimal estimate of (α, λ, τ) for one replicate. The univariate distribution of each parameter is plotted on the diagonal of the scatterplot matrix. Ideally, these distributions will be tight over the cross validation replicates, indicating that the choice of (α, λ, τ) is stable regardless of the particular random partition used for cross validation.

plot(glmnetLRC_fit)

Mathematical Details

A mathematical description of the details in the glmnetLRC package can be found here.

Citation

Please cite the glmnetLRC package using the following reference:

Sego LH, Venzin AM, Ramey JA. 2016. glmnetLRC: Lasso and Elastic-Net Logistic Regression Classification (LRC) with an arbitrary loss function in R. Pacific Northwest National Laboratory. http://pnnl.github.io/glmnetLRC.

Acknowledgements

This package was developed with support from the Signature Discovery Initiative at Pacific Northwest National Laboratory, conducted under the Laboratory Directed Research and Development Program at PNNL, a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy.