Lasso and Elastic-Net Logistic Regression Classification with an Arbitrary Loss Function

Authors: Landon H. Sego [aut, cre], Alexander M. Venzin [ctb], John A. Ramey [ctb]

Version: 0.1.8

License: file LICENSE

Description

Methods for fitting and predicting lasso and elastic-net logistic regression classifiers (LRC) with an arbitrary loss function for the classification error.

Depends

(none)

Imports

glmnet, plyr, Smisc

Suggests

parallel, testthat

Enhances

(none)

Functions

glmnetLRC

Construct a lasso or elastic-net logistic regression classifier (LRC) with an arbitrary loss function

This function extends the glmnet and cv.glmnet functions from the glmnet package. It uses cross validation to identify optimal elastic-net parameters and a threshold parameter for binary classification, where optimality is defined by minimizing an arbitrary, user-specified discrete loss function.

Usage

glmnetLRC(truthLabels, predictors, lossMat = "0-1", lossWeight = rep(1,
  NROW(predictors)), alphaVec = seq(0, 1, by = 0.2), tauVec = seq(0.1, 0.9,
  by = 0.05), cvFolds = 5, cvReps = 100, stratify = FALSE,
  masterSeed = 1, nJobs = 1, estimateLoss = FALSE, verbose = FALSE, ...)

## S3 method for class 'glmnetLRC'
print(x, verbose = TRUE, ...)

## S3 method for class 'glmnetLRC'
plot(x, ...)

## S3 method for class 'glmnetLRC'
coef(object, tol = 1e-10, ...)

## S3 method for class 'glmnetLRC'
predict(object, newdata, truthCol = NULL,
  keepCols = NULL, ...)

## S3 method for class 'glmnetLRC'
missingpreds(object, newdata, ...)

## S3 method for class 'glmnetLRC'
extract(object, ...)

Arguments

truthLabels
A factor with two levels containing the true labels for each observation. If it is more desirable to correctly predict one of the two classes over the other, the second level of this factor should be the class you are most interested in predicting correctly.
predictors
A matrix whose columns are the explanatory regression variables. Note: factors are not currently supported. To include a factor variable with n levels, it must be represented as n - 1 dummy variables in the matrix (a brief sketch of this dummy coding follows this argument list).
lossMat
Either the character string “0-1”, indicating 0-1 loss, or a loss matrix of class lossMat, produced by lossMatrix, that specifies the penalties for classification errors.
lossWeight
A vector of non-negative weights used to calculate the expected loss. The default value is 1 for each observation.
alphaVec
A sequence in [0, 1] designating possible values for the elastic-net mixing parameter, α. A value of α = 1 gives the lasso penalty, and α = 0 gives the ridge penalty. Refer to glmnet for further information.
tauVec
A sequence of τ threshold values in (0, 1) for the logistic regression classifier. For a new observation, if the predicted probability that the observation belongs to the second level of truthLabels exceeds τ, the observation is classified as belonging to the second level.
cvFolds
The number of cross validation folds. cvFolds = length(truthLabels) gives leave-one-out (L.O.O.) cross validation, in which case cvReps is set to 1 and stratify is set to FALSE.
cvReps
The number of cross validation replicates, i.e., the number of times to repeat the cross validation by randomly repartitioning the data into folds and estimating the tuning parameters. For L.O.O. cross validation, this argument is set to 1 as there can only be one possible partition of the data.
stratify
A logical indicating whether stratified sampling should be used to ensure that observations from both levels of truthLabels are proportionally present in the cross validation folds. In other words, stratification attempts to ensure there are sufficient observations of each level of truthLabels in each training set to fit the model. Stratification may be required for small or imbalanced data sets. Note that stratification is not performed for L.O.O. cross validation (when cvFolds = length(truthLabels)).
masterSeed
The random seed used to generate unique (and repeatable) seeds for each cross validation replicate.
nJobs
The number of cores on the local host to use in parallelizing the training. Parallelization takes place at the cvReps level; for example, if cvReps = 1, parallelization provides no benefit, whereas if cvReps = 2 and nJobs = 2, each cross validation replicate is run in its own thread. Parallelization is executed using parLapplyW() from the Smisc package.
estimateLoss
A logical, set to TRUE to calculate the average loss estimated via cross validation using the optimized parameters (α, λ, τ) to fit the elastic net model for each cross validation fold. This can be computationally expensive, as it requires another cross validation pass through the same partitions of the data, but using only the optimal parameters to estimate the loss for each cross validation replicate.
verbose
For glmnetLRC, a logical to turn on (or off) messages regarding the progress of the training algorithm. For the print method, if set to FALSE, printing of information about the glmnetLRC object is suppressed and the results are only returned invisibly.
x
For the print and plot methods: an object of class glmnetLRC (returned by glmnetLRC()), which contains the optimally-trained elastic-net logistic regression classifier.
object
For the coef, predict, and extract methods: an object of class glmnetLRC (returned by glmnetLRC()) which contains the optimally-trained elastic-net logistic regression classifier.
tol
A small positive number, such that coefficients with an absolute value smaller than tol are not returned.
newdata
A dataframe or matrix containing the new set of observations to be predicted, as well as an optional column of true labels. newdata should contain all of the column names that were used to fit the elastic-net logistic regression classifier.
truthCol
The column number or column name in newdata that contains the true labels, which should be a factor (and this implies newdata should be a dataframe if truthCol is provided). Optional.
keepCols
A numeric vector of column numbers (or a character vector of column names) in newdata that will be kept and returned with the predictions. Optional.
…
For glmnetLRC(), these are additional arguments passed to glmnet in the glmnet package. Certain arguments of glmnet are reserved by the glmnetLRC package, and an error message will make that clear if they are used; in particular, arguments that control the behavior of α and λ are reserved. For the plot method, the “…” are additional arguments passed to the default S3 method of pairs. For the print, coef, predict, missingpreds, and extract methods, the “…” are ignored.
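
As noted for the predictors argument, factor variables must be supplied as dummy (indicator) columns. A minimal sketch of that dummy coding, using a hypothetical data frame d that is not part of this package:

# Hypothetical data frame: 'batch' is a 3-level factor, 'x1' is numeric
d <- data.frame(batch = factor(c("A", "B", "C", "A", "B")),
                x1 = rnorm(5))

# model.matrix() expands 'batch' into n - 1 indicator columns; dropping
# the intercept column leaves a numeric matrix suitable for 'predictors'
predictors <- model.matrix(~ batch + x1, data = d)[, -1]
head(predictors)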

Details

For a given partition of the training data, cross validation is performed to estimate the optimal values of α (the mixing parameter of the ridge and lasso penalties) and λ (the regularization parameter), as well as the optimal threshold, τ, which is used to dichotomize the probability predictions of the elastic-net logistic regression model into binary outcomes. (Specifically, if the probability that an observation belongs to the second level of truthLabels exceeds τ, it is classified as belonging to that second level.) In this case, optimality is defined as the set of parameters that minimizes the risk, or expected loss, where the loss function is specified via lossMat (see lossMatrix). By default, the expected loss is calculated such that each observation in the data receives equal weight (see lossWeight).
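
As a minimal sketch of this dichotomization and the expected-loss calculation (the vectors and loss matrix below are hypothetical, not the package internals):

# Hypothetical predicted probabilities of the second level, true labels,
# and threshold
prob  <- c(0.10, 0.85, 0.40, 0.95)
truth <- c("good", "poor", "poor", "poor")
tau   <- 0.5

# Dichotomize the probabilities at tau
predicted <- ifelse(prob > tau, "poor", "good")

# Hypothetical loss matrix: rows index the true class, columns the prediction
lossM <- matrix(c(0, 1,
                  5, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("good", "poor"), c("good", "poor")))

# Expected loss: the average penalty over the observations (equal weights)
mean(lossM[cbind(truth, predicted)])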

glmnetLRC() searches for the optimal values of α and τ by fitting the elastic-net model at the points of the two-dimensional grid defined by alphaVec and tauVec. For each value of α, the vector of λ values is selected automatically by glmnet according to its default arguments. The expected loss is calculated for each (α, λ, τ) triple, and the triple giving rise to the lowest risk designates the optimal model for a given cross validation partition, or cross validation replicate, of the data.

This process is repeated cvReps times, where each time a different random partition of the data is created using its own seed, resulting in another optimal estimate of (α, λ, τ). The final estimate of (α, λ, τ) is given by the respective medians of those estimates. The final elastic-net logistic regression classifier is obtained by fitting the regression coefficients to all the training data using the optimal (α, λ, τ).
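
For instance, a sketch of that median aggregation, using made-up per-replicate optima (not the package's internal code):

# Hypothetical optimal (alpha, lambda, tau) from three cross validation replicates
parms <- data.frame(alpha  = c(0.4, 0.6, 0.4),
                    lambda = c(0.012, 0.015, 0.011),
                    tau    = c(0.45, 0.50, 0.50))

# Final parameter estimates: the element-wise medians
apply(parms, 2, median)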

The methodology is discussed in detail in the online package documentation.

Value

An object of class glmnetLRC, which inherits from classes lognet and glmnet. It contains the object returned by glmnet that has been fit to all the data using the optimal parameters (α, λ, τ). It also contains the following additional elements:
lossMat
The loss matrix used as the criterion for selecting the optimal tuning parameters

parms
A data frame that contains the tuning parameter estimates for (α, λ, τ) that minimize the expected loss for each cross validation replicate. Used by the plot method.

optimalParms
A named vector that contains the final estimates of (α, λ, τ), calculated as the element-wise median of parms

lossEstimates
If estimateLoss = TRUE, this element is a data frame with the expected loss for each cross validation replicate

Methods (by generic)

  • print: Displays the overall optimized values of (α, λ, τ), along with the corresponding degrees of freedom and deviance for the model fit to all the data using the optimized parameters. If estimateLoss = TRUE when glmnetLRC() was called, the mean and standard deviation of the expected loss are also shown. In addition, all of this information is returned invisibly as a matrix. Display of the information can be suppressed by setting verbose = FALSE in the call to print.

  • plot: Produces a pairs plot of the tuning parameters (α, λ, τ), along with their univariate histograms, that were identified as optimal for each of the cross validation replicates. This can provide a sense of the stability of the tuning parameter estimates.

  • coef: Calls the predict method in glmnet on the fitted glmnet object and returns a named vector of the non-zero elastic-net logistic regression coefficients using the optimal values of α and λ.

  • predict: Predict (or classify) new data from a glmnetLRC object. Returns an object of class LRCpred (which inherits from data.frame) that contains the predicted probabilities (Prob) and class (predictClass) for each observation. The Prob column corresponds to the predicted probability that an observation belongs to the second level of truthLabels. The columns indicated by truthCol and keepCols are included if they were requested. The LRCpred class has two methods: summary.LRCpred and plot.LRCpred.

  • missingpreds: Identify the set of predictors in a glmnetLRC object that are not present in newdata. Returns a character vector of the missing predictor names, or character(0) if no predictors are missing (a usage sketch follows this list).

  • extract: Extracts the glmnet object that was fit using the optimal parameter estimates of (α, λ). Returns an object of classes lognet and glmnet that can be passed to the various methods available in the glmnet package.
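
A brief usage sketch for missingpreds(), based on the signature shown above and the package data loaded in the examples below:

# List any model predictors that are absent from the new data;
# character(0) indicates that none are missing
data(glmnetLRC_fit)
data(testdata)
missingpreds(glmnetLRC_fit, testdata)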

References

Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Venzin AM, Smith RD, Sego LH, Tardiff MF, Payne SH. 2014. Signatures for Mass Spectrometry Data Quality. Journal of Proteome Research. 13(4), 2215-2222. http://pubs.acs.org/doi/abs/10.1021/pr401143e

Friedman J, Hastie T, Tibshirani R. 2010. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software. 33(1), 1-22.

Examples

# Load the VOrbitrap Shewanella QC data from Amidan et al.
data(traindata)

# Here we select the predictor variables
predictors <- as.matrix(traindata[,9:96])

# The logistic regression model requires a binary response
# variable. We will create a factor variable from the
# Curated Quality measurements. Note how we put "poor" as the
# second level in the factor.  This is because the principal
# objective of the classifier is to detect "poor" datasets
response <- factor(traindata$Curated_Quality,
                   levels = c("good", "poor"),
                   labels = c("good", "poor"))

# Specify the loss matrix. The "poor" class is the target of interest.
# The penalty for misclassifying a "poor" item as "good" results in a
# loss of 5.
lM <- lossMatrix(c("good","good","poor","poor"),
                 c("good","poor","good","poor"),
                 c(     0,     1,     5,     0))

# Display the loss matrix
lM

# Train the elastic-net classifier (we don't run it here because it takes a long time)

glmnetLRC_fit <- glmnetLRC(response, predictors, lossMat = lM, estimateLoss = TRUE,
                           nJobs = parallel::detectCores())


# We'll load the precalculated model fit instead
data(glmnetLRC_fit)

# Show the optimal parameter values
print(glmnetLRC_fit)

# Show the coefficients of the optimal model
coef(glmnetLRC_fit)

# Show the plot of all the optimal parameter values for each cross validation replicate
plot(glmnetLRC_fit)

# Extract the 'glmnet' object from the glmnetLRC fit
glmnetObject <- extract(glmnetLRC_fit)

# See how the glmnet methods operate on the object
plot(glmnetObject)

# Look at the coefficients for the optimal lambda
coef(glmnetObject, s = glmnetLRC_fit$optimalParms["lambda"] )

# Load the new observations
data(testdata)

# Use the trained model to make predictions about
# new observations for the response variable.
new <- predict(glmnetLRC_fit, testdata, truthCol = "Curated_Quality", keepCols = 1:2)
head(new)

# Now summarize the performance of the model
summary(new)

# And plot the probability predictions of the model
plot(new, scale = 0.5, legendArgs = list(x = "topright"))

# If predictions are made without an indication of the ground truth,
# the summary is necessarily simpler:
summary(predict(glmnetLRC_fit, testdata))

See also

summary.LRCpred, a summary method for objects of class LRCpred, produced by the predict method.

Author

Landon Sego, Alex Venzin

summary.LRCpred

Summarize predictions of logistic regression classifier

Summarize the predicted probabilities of the classifier, and, if possible, calculate accuracy, sensitivity, specificity, false positive rate, and false negative rate.
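
A minimal sketch of how these quantities relate to a set of predictions, with the second factor level (here, a hypothetical "poor") treated as the class of interest; the vectors below are illustrative and this is not the package's internal code:

truth     <- factor(c("good", "poor", "poor", "good", "poor"),
                    levels = c("good", "poor"))
predicted <- factor(c("good", "poor", "good", "good", "poor"),
                    levels = c("good", "poor"))

tab <- table(truth, predicted)
TP <- tab["poor", "poor"]; TN <- tab["good", "good"]
FP <- tab["good", "poor"]; FN <- tab["poor", "good"]

c(sensitivity  = TP / (TP + FN),
  specificity  = TN / (TN + FP),
  falseNegRate = FN / (FN + TP),
  falsePosRate = FP / (FP + TN),
  accuracy     = (TP + TN) / sum(tab))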

Usage

## S3 method for class 'LRCpred'
summary(object, ...)

## S3 method for class 'summaryLRCpred'
print(x, ...)

Arguments

object
An object of class LRCpred returned by predict.glmnetLRC.
x
An object of class summaryLRCpred.
…
Arguments passed to the print method; ignored by the summary method.

Value

Returns a summaryLRCpred object. If truthCol was provided in the call to predict.glmnetLRC, the result is a list with the following elements:

ConfusionMatrixMetrics
A matrix with the sensitivity, specificity, false negative rate, false positive rate, and accuracy for the class designated by the second level of the truthLabels argument provided to glmnetLRC

PredProbSummary
A numeric summary of the predicted probabilities, according to the true class

If truthCol was not provided in the call to predict.glmnetLRC, the result is a list with the following elements:

PredClassSummary
A tabulation of the number of predictions in each class

PredProbSummary
A numeric summary of the predicted probabilities, according to the predicted class

Methods (by generic)

  • print: Prints a summaryLRCpred object in a readable format.

See also

See glmnetLRC for examples.

Author

Landon Sego

plot.LRCpred

Plot the predictions of logistic regression classifier

Usage

## S3 method for class 'LRCpred'
plot(x, pch = c(1, 2), col = c("Blue", "Red"),
  scale = 1, seed = 1, parArgs = NULL, legendArgs = NULL,
  lineArgs = NULL, textArgs = NULL, ...)

Arguments

x
An object of class LRCpred returned by predict.glmnetLRC.
pch
A vector of at most length 2 indicating the plotting symbols to be used to differentiate the two true classes. If truthCol was not specified in the call to predict.glmnetLRC, only the first element is used. This is passed to plot.
col
A vector of at most length 2 indicating the colors of the plotted points in order to differentiate the two true classes. If truthCol was not specified in the call to predict.glmnetLRC, only the first element is used. This is passed to plot.
scale
A numeric value in (0, 1] that controls the scaling of the horizontal axis. A value of 1 corresponds to the standard, linear scale. Values closer to 0 symmetrically zoom in on the axis near 0 and 1 while zooming out in the neighborhood of 0.5. Values of scale closer to 0 are useful when most of the probability predictions are piled up near 0 and 1.
seed
A single numeric value that sets the seed for the random jitter of the vertical axis of the plot.
parArgs
If desired, a list of named arguments that will be passed to par which is called prior to making the plot.
legendArgs
If desired, a list of named arguments that will be passed to legend. If truthCol was not specified in the call to predict.glmnetLRC, no legend is drawn.
lineArgs
If desired, a list of named arguments that will be passed to abline governing the vertical line that indicates the value of τ.
textArgs
If desired, a list of named arguments that will be passed to text governing the text indicating the value of τ.
…
Arguments passed to plot.default.

Value

A plot showing the predicted probabilities of the logistic regression classifier, with a vertical bar showing the value of the probability threshold, τ.

See also

See glmnetLRC for an example.

Author

Landon Sego

lossMatrix

Build a loss matrix

Build an arbitrary loss matrix for discrete classification

Usage

lossMatrix(truthLabels, predLabels, lossValues)

## S3 method for class 'lossMat'
print(x, ...)

Arguments

truthLabels
A character vector of truth labels.
predLabels
A character vector of corresponding predicted labels, which must be the same length as truthLabels.
lossValues
A numeric vector of corresponding loss values, which must be the same length as truthLabels and predLabels.
x
An object of class lossMat
…
Additional arguments passed to print.default.

Details

This function checks the inputs and binds the three arguments columnwise into a dataframe.

Value

An object of class lossMat: a dataframe that contains all the information of the loss matrix to be used in calculating the loss.

Examples

# A 2x2 symmetric loss matrix
lossMatrix(c("a","a","b","b"), c("a","b","a","b"), c(0, 1, 5, 0))

# A 3x2 asymmetric loss matrix
lossMatrix(rep(letters[1:3], each = 2), rep(letters[4:5], 3),
           c(0, 3, 2, 0, 1, 0))

# An unbalanced loss matrix with a missing element.
# Not sure why one would want to do this.
lossMatrix(c("a","a","b"), c("a","b","b"), c(0, 1, 0))

Author

Landon Sego

Data

testdata

Test data set from the Vorbitrap mass spectrometry instrument

Format

A dataframe with 99 observations and 96 variables

References

Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Venzin AM, Smith RD, Sego LH, Tardiff MF, Payne SH. 2014. Signatures for Mass Spectrometry Data Quality. Journal of Proteome Research. 13(4), 2215-2222. http://pubs.acs.org/doi/abs/10.1021/pr401143e

Examples

data(testdata)
str(testdata)

traindata

Training data set from the Vorbitrap mass spectrometry instrument

Format

A dataframe with 325 observations and 99 variables

References

Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Venzin AM, Smith RD, Sego LH, Tardiff MF, Payne SH. 2014. Signatures for Mass Spectrometry Data Quality. Journal of Proteome Research. 13(4), 2215-2222. http://pubs.acs.org/doi/abs/10.1021/pr401143e

Examples

data(traindata)
str(traindata)

glmnetLRC_fit

A glmnetLRC model fit object

This object is returned by the call glmnetLRC_fit <- glmnetLRC(response, predictors, lossMat = lM, estimateLoss = TRUE, nJobs = parallel::detectCores()) in the examples of glmnetLRC. It is preserved here in the package because it is time-consuming to generate.

Format

A glmnetLRC object returned by glmnetLRC

Examples

# Load fitted glmnetLRC model
data(glmnetLRC_fit)