Authors: Landon H. Sego [aut, cre],Alexander M. Venzin [ctb],John A. Ramey [ctb]
Version: 0.1.8
License: file LICENSE
Methods for fitting and predicting lasso and elastic-net logistic regression classifiers (LRC) with an arbitrary loss function for the classification error.
(none)
glmnet, plyr, Smisc
parallel, testthat
(none)
This function extends the glmnet
and cv.glmnet
functions from the glmnet package. It uses cross validation to identify optimal elastic-net parameters and a threshold parameter for binary classification, where optimality is defined by minimizing an arbitrary, user-specified discrete loss function.
glmnetLRC(truthLabels, predictors, lossMat = "0-1", lossWeight = rep(1,
NROW(predictors)), alphaVec = seq(0, 1, by = 0.2), tauVec = seq(0.1, 0.9,
by = 0.05), cvFolds = 5, cvReps = 100, stratify = FALSE,
masterSeed = 1, nJobs = 1, estimateLoss = FALSE, verbose = FALSE, ...)
## S3 method for class 'glmnetLRC'
print(x, verbose = TRUE, ...)
## S3 method for class 'glmnetLRC'
plot(x, ...)
## S3 method for class 'glmnetLRC'
coef(object, tol = 1e-10, ...)
## S3 method for class 'glmnetLRC'
predict(object, newdata, truthCol = NULL,
keepCols = NULL, ...)
## S3 method for class 'glmnetLRC'
missingpreds(object, newdata, ...)
## S3 method for class 'glmnetLRC'
extract(object, ...)
“0-1”
, indicating 0-1 loss, or a loss matrix of class lossMat
, produced by lossMatrix
, that specifies the penalties for classification errors.
glmnet
for further information.
truthLabels
exceeds tau, the observation is classified as belonging to the second level.
cvFolds = length(truthLabels)
gives leave-one-out (L.O.O.) cross validation, in which case cvReps
is set to 1
and stratify
is set to FALSE
.
1
as there can only be one possible partition of the data.
truthLabels
are proportionally present in the cross validation folds. In other words, stratification attempts to ensure there are sufficient observations of each level of truthLabels
in each training set to fit the model. Stratification may be required for small or imbalanced data sets. Note that stratification is not performed for L.O.O (when cvFolds = length(truthLabels)
).
cvReps
level, i.e., if cvReps = 1
, parallelizing would do no good, whereas if cvReps = 2
, each cross validation replicate would be run separately in its own thread if nJobs = 2
. Parallelization is executed using parLapplyW() from the Smisc package.
TRUE
to calculate the average loss estimated via cross validation using the optimized parameters (α, λ, τ) to fit the elastic net model for each cross validation fold. This can be computationally expensive, as it requires another cross validation pass through the same partitions of the data, but using only the optimal parameters to estimate the loss for each cross validation replicate.
glmetLRC
, a logical to turn on (or off) messages regarding the progress of the training algorithm. For the print
method, if set to FALSE
, it will suppress printing information about the glmnetLRC
object and only invisibly return the results.
print
and plot
methods: an object of class glmnetLRC
(returned by glmnetLRC()
), which contains the optimally-trained elastic-net logistic regression classifier.
coef
, predict
, and extract
methods: an object of class glmnetLRC
(returned by glmnetLRC()
) which contains the optimally-trained elastic-net logistic regression classifier.
tol
are not returned.
newdata
should contain all of the column names that were used to fit the elastic-net logistic regression classifier.
newdata
that contains the true labels, which should be a factor (and this implies newdata
should be a dataframe if truthCol
is provided). Optional.
newdata
that will be kept and returned with the predictions. Optional.
glmnetLRC()
, these are additional arguments to glmnet
in the glmnet
package. Certain arguments of glmnet
are reserved by the glmnetLRC
package and an error message will make that clear if they are used. In particular, arguments that control the behavior of α and λ are reserved. For the plot
method, the “…” are additional arguments to the default S3 method pairs
. And for the print
, coef
, predict
, missingpreds
, and extract
methods, the “…” are ignored.
For a given partition of the training data, cross validation is performed to estimate the optimal values of α (the mixing parameter of the ridge and lasso penalties) and λ (the regularization parameter), as well as the optimal threshold, τ, which is used to dichotomize the probability predictions of the elastic-net logistic regression model into binary outcomes. (Specifically, if the probability an observation belongs to the second level of truthLabels
exceeds τ, it is classified as belonging to that second level). In this case, optimality is defined as the set of parameters that minimize the risk, or expected loss, where the loss function created using lossMatrix
. The expected loss is calculated such that each observation in the data receives equal weight
glmnetLRC()
searches for the optimal values of α and τ by fitting the elastic-net model at the points of the two-dimensional grid defined by alphaVec
and tauVec
. For each value of α, the vector of λ values is selected automatically by glmnet
according to its default arguments. The expected loss is calculated for each (α, λ, τ) triple, and the triple giving rise to the lowest risk designates the optimal model for a given cross validation partition, or cross validation replicate, of the data.
This process is repeated cvReps
times, where each time a different random partition of the data is created using its own seed, resulting in another optimal estimate of (α, λ, τ). The final estimate of (α, λ, τ) is given by the respective medians of those estimates. The final elastic-net logistic regression classfier is given by fitting the regression coefficients to all the training data using the optimal (α, λ, τ).
The methodology is discussed in detail in the online package documentation.
glmnetLRC
, which inherits from classes lognet
and glmnet
. It contains the object returned by glmnet
that has been fit to all the data using the optimal parameters (α, λ, τ). It also contains the following additional elements:
plot
method.
parms
estimateLoss = TRUE
, this element is a data frame with the expected loss for each cross validation replicate
print
: Displays the overall optimized values of (α, λ, τ), with the corresponding degrees of freedom and deviance for the model fit to all the data using the optimzed parameters. If estimateLoss = TRUE
when glmnetLRC()
was called, the mean and standard deviation of the expected loss are also shown. In addition, all of this same information is returned invisibly as a matrix. Display of the information can be suppressed by setting verbose = FALSE
in the call to print
.
plot
: Produces a pairs plot of the tuning parameters (α, λ, τ) and their univariate histograms that were identified as optimal for each of of the cross validation replicates. This can provide a sense of the stability of the estimates of the tuning parameters.
coef
: Calls the predict
method in glmnet
on the fitted glmnet object and returns a named vector of the non-zero elastic-net logistic regression coefficients using the optimal values of α and λ.
predict
: Predict (or classify) new data from an glmnetLRC
object. Returns an object of class LRCpred
(which inherits from data.frame
) that contains the predicted probabilities (Prob
) and class (predictClass
) for each observation. The Prob
column corresponds to the predicted probability that an observation belongs to the second level of truthLabels
. The columns indicated by truthCol
and keepCols
are included if they were requested. The LRCpred
class has two methods: summary.LRCpred
and plot.LRCpred
.
missingpreds
: Identify the set of predictors in a glmnetLRC
object that are not present in newdata
. Returns a character vector of the missing predictor names. If no predictors are missing, it returns character(0)
.
extract
: Extracts the glmnet
object that was fit using the optimal parameter estimates of (α, λ). Returns an object of class “lognet” “glmnet”
that can be passed to various methods available in the glmnet
package.
Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Venzin AM, Smith RD, Sego LH, Tardiff MF, Payne SH. 2014. Signatures for Mass Spectrometry Data Quality. Journal of Proteome Research. 13(4), 2215-2222. http://pubs.acs.org/doi/abs/10.1021/pr401143e
Friedman J, Hastie T, Tibshirani R. 2010. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software. 33(1), 1-22.
# Load the VOrbitrap Shewanella QC data from Amidan et al.
data(traindata)
# Here we select the predictor variables
predictors <- as.matrix(traindata[,9:96])
# The logistic regression model requires a binary response
# variable. We will create a factor variable from the
# Curated Quality measurements. Note how we put "poor" as the
# second level in the factor. This is because the principal
# objective of the classifer is to detect "poor" datasets
response <- factor(traindata$Curated_Quality,
levels = c("good", "poor"),
labels = c("good", "poor"))
# Specify the loss matrix. The "poor" class is the target of interest.
# The penalty for misclassifying a "poor" item as "good" results in a
# loss of 5.
lM <- lossMatrix(c("good","good","poor","poor"),
c("good","poor","good","poor"),
c( 0, 1, 5, 0))
# Display the loss matrix
lM
# Train the elastic-net classifier (we don't run it here because it takes a long time)
glmnetLRC_fit <- glmnetLRC(response, predictors, lossMat = lM, estimateLoss = TRUE,
nJobs = parallel::detectCores())
# We'll load the precalculated model fit instead
data(glmnetLRC_fit)
# Show the optimal parameter values
print(glmnetLRC_fit)
# Show the coefficients of the optimal model
coef(glmnetLRC_fit)
# Show the plot of all the optimal parameter values for each cross validation replicate
plot(glmnetLRC_fit)
# Extract the 'glmnet' object from the glmnetLRC fit
glmnetObject <- extract(glmnetLRC_fit)
# See how the glmnet methods operate on the object
plot(glmnetObject)
# Look at the coefficients for the optimal lambda
coef(glmnetObject, s = glmnetLRC_fit$optimalParms["lambda"] )
# Load the new observations
data(testdata)
# Use the trained model to make predictions about
# new observations for the response variable.
new <- predict(glmnetLRC_fit, testdata, truthCol = "Curated_Quality", keepCols = 1:2)
head(new)
# Now summarize the performance of the model
summary(new)
# And plot the probability predictions of the model
plot(new, scale = 0.5, legendArgs = list(x = "topright"))
# If predictions are made without an indication of the ground truth,
# the summary is necessarily simpler:
summary(predict(glmnetLRC_fit, testdata))
summary.LRCpred
, a summary method for objects of class LRCpred
, produced by the predict
method.
Landon Sego, Alex Venzin
Summarize the predicted probabilities of the classifier, and, if possible, calculate accuracy, sensitivity, specificity, false positive rate, and false negative rate.
## S3 method for class 'LRCpred'
summary(object, ...)
## S3 method for class 'summaryLRCpred'
print(x, ...)
LRCpred
returned by predict.glmnetLRC
.
summaryLRCpred
.
print
methods. Ignored by the summary
method.
summaryLRCpred
object. If truthCol
was provided in the call to predict.glmnetLRC
, the result is a list with the following elements:
truthLabels
argument provided to glmnetLRC
If truthCol
was not provided in the call to predict.glmnetLRC
, the result is a list with the following elements:
print
: Prints a summaryLRCpred
object in a readable format.
See glmnetLRC
for examples.
Landon Sego
## S3 method for class 'LRCpred'
plot(x, pch = c(1, 2), col = c("Blue", "Red"),
scale = 1, seed = 1, parArgs = NULL, legendArgs = NULL,
lineArgs = NULL, textArgs = NULL, ...)
LRCpred
returned by predict.glmnetLRC
.
truthCol
was not specified in the call to predict.glmnetLRC
, only the first element is used. This is passed to plot
.
truthCol
was not specified in the call to predict.glmnetLRC
, only the first element is used. This is passed to plot
.
scale
closer to 0 are useful if most of the probability predictions are piled up near 0 and 1.
par
which is called prior to making the plot.
legend
. If truthCol
was not specified in the call to predict.glmnetLRC
, no legend is drawn.
abline
governing the vertical line that indicates the value of τ.
text
governing the text indicating the value of τ.
plot.default
.
See glmnetLRC
for an example.
Landon Sego
Build an arbitrary loss matrix for discrete classification
lossMatrix(truthLabels, predLabels, lossValues)
## S3 method for class 'lossMat'
print(x, ...)
truthLabels
truthLabels
and predLabels
.
lossMat
print.default
This function checks the inputs and binds the three arguments columnwise into a dataframe.
lossMat
: a dataframe that contains all the information of the loss matrix to be used by in calculating the loss.
# A 2x2 symmetric loss matrix
lossMatrix(c("a","a","b","b"), c("a","b","a","b"), c(0, 1, 5, 0))
# A 3x2 asymmetric loss matrix
lossMatrix(rep(letters[1:3], each = 2), rep(letters[4:5], 3),
c(0, 3, 2, 0, 1, 0))
# An unbalanced loss matrix with a missing element.
# Not sure why one would want to do this.
lossMatrix(c("a","a","b"), c("a","b","b"), c(0, 1, 0))
Landon Sego
A dataframe with 99 observations and 96 variables
Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Venzin AM, Smith RD, Sego LH, Tardiff MF, Payne SH. 2014. Signatures for Mass Spectrometry Data Quality. Journal of Proteome Research. 13(4), 2215-2222. http://pubs.acs.org/doi/abs/10.1021/pr401143e
data(testdata)
str(testdata)
A dataframe with 325 observations and 99 variables
Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Venzin AM, Smith RD, Sego LH, Tardiff MF, Payne SH. 2014. Signatures for Mass Spectrometry Data Quality. Journal of Proteome Research. 13(4), 2215-2222. http://pubs.acs.org/doi/abs/10.1021/pr401143e
data(traindata)
str(traindata)
This object is returned by the call glmnetLRC_fit <- glmnetLRC(response, predictors, lossMat = lM, estimateLoss = TRUE, nJobs = parallel::detectCores())
in the example of glmnetLRC
. It is preserved here in the package because it is time consuming to generate.
A glmnetLRC object returned by glmnetLRC
# Load fitted glmnetLRC model
data(glmnetLRC_fit)