Title: | Optimized Integer Risk Score Models |
---|---|
Description: | Implements an optimized approach to learning risk score models, where sparsity and integer constraints are integrated into the model-fitting process. |
Authors: | Hannah Eglinton [aut, cre], Alice Paul [aut, cph], Oscar Yan [aut], R Core Team [ctb, cph] (Copyright holder of Rinternals.h, R.h, lm.c, Applic.h, statsR.h, glm package), Robert Gentleman [ctb, cph] (Author and copyright holder of Rinternals.h), Ross Ihaka [ctb, cph] (Author and copyright holder of Rinternals.h), Simon Davies [ctb] (Author of glm.fit function (modified in cv_risk_mod.R)), Thomas Lumley [ctb] (Author of glm.fit function (modified in cv_risk_mod.R)) |
Maintainer: | Hannah Eglinton <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.1 |
Built: | 2024-11-27 04:25:45 UTC |
Source: | https://github.com/hjeglinton/riskscores |
The Breast Cancer Wisconsin dataset from the UCI machine learning repository records the measurements from breast tissue biopsies. The outcome of interest is whether the sample was benign or malignant.
breastcancer
A data frame with 683 rows and 10 columns:
1 for malignant, 0 for benign
Clump thickness on an integer scale from 1 to 10
Uniformity of cell size on an integer scale from 1 to 10
Uniformity of cell shape on an integer scale from 1 to 10
Marginal adhesion on an integer scale from 1 to 10
Single epithelial cell size on an integer scale from 1 to 10
Bare nuclei on an integer scale from 1 to 10
Bland chromatin on an integer scale from 1 to 10
Normal nucleoli on an integer scale from 1 to 10
Mitosis on an integer scale from 1 to 10
https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original
Clip values prior to exponentiation to avoid numeric errors.
clip_exp_vals(x)
x |
Numeric vector. |
Input vector x with all values clipped to lie between -709.78 and 709.78.
clip_exp_vals(710)
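The bound of 709.78 sits just below the natural log of the largest finite double, so exponentiating a clipped value stays finite. A minimal base-R sketch of the idea follows; `clip_sketch` is a hypothetical stand-in, not the package's implementation.

```r
# Hypothetical stand-in for the clipping step (not the package's own code).
clip_sketch <- function(x, limit = 709.78) {
  # pmax() raises values below -limit; pmin() lowers values above +limit.
  pmin(pmax(x, -limit), limit)
}

clipped <- clip_sketch(c(-1000, 0, 710))
exp(clipped)          # finite for every element
exp(1000)             # Inf without clipping
```

Values already inside the interval pass through unchanged, so the step only alters inputs that would otherwise overflow or underflow.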
Extracts a vector of model coefficients (both nonzero and zero) from a "risk_mod" object. Equivalent to accessing the beta attribute of a "risk_mod" object.
## S3 method for class 'risk_mod'
coef(object, ...)
object |
An object of class "risk_mod", usually a result of a call to
|
... |
Additional arguments. |
Numeric vector with coefficients.
y <- breastcancer[[1]]
X <- as.matrix(breastcancer[,2:ncol(breastcancer)])
mod <- risk_mod(X, y, lambda0 = 0.01)
coef(mod)
Runs k-fold cross-validation on a grid of lambda0 values. Records class accuracy and deviance for each lambda0. Returns an object of class "cv_risk_mod".
cv_risk_mod( X, y, weights = NULL, beta = NULL, a = -10, b = 10, max_iters = 100, tol = 1e-05, nlambda = 25, lambda_min_ratio = ifelse(nrow(X) < ncol(X), 0.01, 1e-04), lambda0 = NULL, nfolds = 10, foldids = NULL, parallel = FALSE, shuffle = TRUE, seed = NULL )
X |
Input covariate matrix with dimension |
y |
Numeric vector for the (binomial) response variable. |
weights |
Numeric vector of length |
beta |
Starting numeric vector with |
a |
Integer lower bound for coefficients (default: -10). |
b |
Integer upper bound for coefficients (default: 10). |
max_iters |
Maximum number of iterations (default: 100). |
tol |
Tolerance for convergence (default: 1e-5). |
nlambda |
Number of lambda values to try (default: 25). |
lambda_min_ratio |
Smallest value for lambda, as a fraction of
lambda_max (the smallest value for which all coefficients are zero).
The default depends on the sample size ( |
lambda0 |
Optional sequence of lambda values. By default, the function
will derive the lambda0 sequence based on the data (see |
nfolds |
Number of folds, implied if |
foldids |
Optional vector of values between 1 and |
parallel |
If |
shuffle |
Whether order of coefficients is shuffled during coordinate descent (default: TRUE). |
seed |
An integer that is used as argument by |
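To make the roles of `nlambda` and `lambda_min_ratio` concrete, the sketch below builds a grid under the assumption that the values are log-spaced from lambda_max down to `lambda_min_ratio * lambda_max`, a common convention for penalty paths. The value of `lambda_max` is hypothetical, and the package may construct its sequence differently.

```r
# Assumed inputs: lambda_max is a made-up value for illustration only.
lambda_max <- 0.5
nlambda <- 25
lambda_min_ratio <- 1e-4

# Evenly spaced on the log scale, from largest to smallest penalty.
lambda_seq <- exp(seq(log(lambda_max),
                      log(lambda_max * lambda_min_ratio),
                      length.out = nlambda))

range(lambda_seq)
```

Log spacing places more grid points near small penalties, where the selected model tends to change most quickly.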
An object of class "cv_risk_mod" with the following attributes:
results |
Dataframe containing a summary of deviance and accuracy for each
value of |
lambda_min |
Numeric value indicating the |
lambda_1se |
Numeric value indicating the largest |
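The sketch below shows one common way to pick two such values from a cross-validation summary: the penalty with the lowest mean deviance, and the largest penalty whose mean deviance is within one standard error of that minimum. The data frame, its column names, and the one-standard-error reading of `lambda_1se` are assumptions for illustration.

```r
# Hypothetical cross-validation summary: one row per lambda0 value.
cv_summary <- data.frame(
  lambda0  = c(0.001, 0.01, 0.05, 0.1),
  mean_dev = c(0.515, 0.50, 0.51, 0.60),
  se_dev   = c(0.02, 0.02, 0.02, 0.03)
)

# Penalty with the smallest mean deviance.
i_min <- which.min(cv_summary$mean_dev)
lambda_min <- cv_summary$lambda0[i_min]

# Largest penalty whose mean deviance is within one SE of the minimum.
cutoff <- cv_summary$mean_dev[i_min] + cv_summary$se_dev[i_min]
lambda_1se <- max(cv_summary$lambda0[cv_summary$mean_dev <= cutoff])
```

Choosing the larger penalty trades a small loss in fit for a sparser model.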
Runs k-fold cross-validation on a grid of lambda0 values using random warm starts (see risk_mod_random_start). Records class accuracy and deviance for each lambda0. Returns an object of class "cv_risk_mod".
cv_risk_mod_random_start( X, y, weights = NULL, a = -10, b = 10, max_iters = 100, tol = 1e-05, nlambda = 25, lambda_min_ratio = ifelse(nrow(X) < ncol(X), 0.01, 1e-04), lambda0 = NULL, nfolds = 10, foldids = NULL, parallel = FALSE, seed = NULL, nstart = 5 )
X |
Input covariate matrix with dimension |
y |
Numeric vector for the (binomial) response variable. |
weights |
Numeric vector of length |
a |
Integer lower bound for coefficients (default: -10). |
b |
Integer upper bound for coefficients (default: 10). |
max_iters |
Maximum number of iterations (default: 100). |
tol |
Tolerance for convergence (default: 1e-5). |
nlambda |
Number of lambda values to try (default: 25). |
lambda_min_ratio |
Smallest value for lambda, as a fraction of
lambda_max (the smallest value for which all coefficients are zero).
The default depends on the sample size ( |
lambda0 |
Optional sequence of lambda values. By default, the function
will derive the lambda0 sequence based on the data (see |
nfolds |
Number of folds, implied if |
foldids |
Optional vector of values between 1 and |
parallel |
If |
seed |
An integer that is used as argument by |
nstart |
Number of different random starts to try (default: 5). |
Calculates a risk model's accuracy, sensitivity, and specificity given a set of data.
get_metrics( mod, X = NULL, y = NULL, weights = NULL, threshold = NULL, threshold_type = c("response", "score") )
mod |
An object of class |
X |
Input covariate matrix with dimension |
y |
Numeric vector for the (binomial) response variable. |
weights |
Numeric vector of length |
threshold |
Numeric vector of classification threshold values used to calculate the accuracy, sensitivity, and specificity of the model. Defaults to a range of risk probability thresholds from 0.1 to 0.9 by 0.1. |
threshold_type |
Defines whether the |
Data frame with accuracy, sensitivity, and specificity for each threshold.
y <- breastcancer[[1]]
X <- as.matrix(breastcancer[,2:ncol(breastcancer)])
mod <- risk_mod(X, y)
get_metrics(mod, X, y)
get_metrics(mod, X, y, threshold = c(150, 175, 200), threshold_type = "score")
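For readers who want to see how the three metrics relate to a threshold, here is a minimal base-R sketch computed from predicted probabilities. `metrics_sketch`, the toy vectors, and the "predict positive when probability is at least the threshold" rule are assumptions; the sketch ignores observation weights.

```r
# Hypothetical helper: metrics for one probability threshold.
metrics_sketch <- function(y, prob, threshold = 0.5) {
  pred <- as.integer(prob >= threshold)   # assumed classification rule
  tp <- sum(pred == 1 & y == 1)
  tn <- sum(pred == 0 & y == 0)
  fp <- sum(pred == 1 & y == 0)
  fn <- sum(pred == 0 & y == 1)
  data.frame(
    threshold   = threshold,
    accuracy    = (tp + tn) / length(y),
    sensitivity = tp / (tp + fn),   # share of true positives detected
    specificity = tn / (tn + fp)    # share of true negatives detected
  )
}

y_toy    <- c(1, 1, 0, 0, 1)
prob_toy <- c(0.9, 0.4, 0.2, 0.6, 0.8)
res <- metrics_sketch(y_toy, prob_toy, threshold = 0.5)
res
```

Calling the helper over several thresholds and row-binding the results would give one row per threshold.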
Calculates a risk model's deviance, accuracy, sensitivity, and specificity given a set of data and a threshold value.
get_metrics_internal( mod, X = NULL, y = NULL, weights = NULL, threshold = 0.5, threshold_type = c("response", "score") )
mod |
An object of class |
X |
Input covariate matrix with dimension |
y |
Numeric vector for the (binomial) response variable. |
weights |
Numeric vector of length |
threshold |
Numeric classification threshold value used to calculate the accuracy, sensitivity, and specificity of the model (default: 0.5). |
threshold_type |
Defines whether the |
List with deviance (dev), accuracy (acc), sensitivity (sens), and specificity (spec).
Returns the risk probabilities for the provided score value(s).
get_risk(object, score)
object |
An object of class "risk_mod", usually a result of a call to
|
score |
Numeric vector with score value(s). |
Numeric vector with the same length as score.
y <- breastcancer[[1]]
X <- as.matrix(breastcancer[,2:ncol(breastcancer)])
mod <- risk_mod(X, y)
get_risk(mod, score = c(1, 10, 20))
Returns the score(s) for the provided risk probabilities.
get_score(object, risk)
object |
An object of class "risk_mod", usually a result of a call to
|
risk |
Numeric vector with probability value(s). |
Numeric vector with the same length as risk.
y <- breastcancer[[1]]
X <- as.matrix(breastcancer[,2:ncol(breastcancer)])
mod <- risk_mod(X, y)
get_score(mod, risk = c(0.25, 0.50, 0.75))
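Converting a score to a risk and a risk back to a score are inverse operations. The sketch below illustrates that relationship under the assumption that risk is a logistic function of a scaled, shifted score; the values of `gamma` and `intercept` and both helper names are hypothetical, and the package's internal formula may differ.

```r
# Hypothetical fitted quantities; a real model supplies its own values.
gamma <- 0.1        # assumed scalar mapping integer points to log-odds
intercept <- -25    # assumed intercept on the integer-points scale

# Score -> risk: inverse logit of the scaled, shifted score.
risk_from_score <- function(score) plogis(gamma * (intercept + score))

# Risk -> score: logit of the risk, unscaled and unshifted.
score_from_risk <- function(risk) qlogis(risk) / gamma - intercept

risks <- risk_from_score(c(1, 10, 20))
scores <- score_from_risk(risks)   # recovers 1, 10, 20
```

Under these assumptions a risk of 0.50 corresponds to the score at which the shifted score is zero, and the inverse mapping need not return an integer.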
Plots the mean deviance for each lambda0 value tested during cross-validation.
## S3 method for class 'cv_risk_mod'
plot(x, ...)
x |
An object of class "cv_risk_mod", usually a result of a call to
|
... |
Additional arguments affecting the plot produced |
Object of class "ggplot".
Plots the logistic regression curve associated with the integer risk score model, with scores on the x-axis and risk on the y-axis.
## S3 method for class 'risk_mod'
plot(x, score_min = NULL, score_max = NULL, ...)
x |
An object of class "risk_mod", usually a result of a call to
|
score_min |
The minimum score displayed on the x-axis. The default is the minimum score predicted from the model's training data. |
score_max |
The maximum score displayed on the x-axis. The default is the maximum score predicted from the model's training data. |
... |
Additional arguments affecting the plot produced |
Object of class "ggplot".
y <- breastcancer[[1]]
X <- as.matrix(breastcancer[,2:ncol(breastcancer)])
mod <- risk_mod(X, y, lambda0 = 0.01)
plot(mod)
Obtains predictions from risk score models.
## S3 method for class 'risk_mod'
predict(object, newx = NULL, type = c("link", "response", "score"), ...)
object |
An object of class "risk_mod", usually a result of a call to
|
newx |
Optional matrix of new values for |
type |
The type of prediction required. The default ("link") is on the scale of the linear predictor (i.e. log-odds); the "response" type is on the scale of the response variable (i.e. risk probabilities); the "score" type returns the risk score calculated from the integer model. |
... |
Additional arguments. |
Numeric vector of predicted values.
y <- breastcancer[[1]]
X <- as.matrix(breastcancer[,2:ncol(breastcancer)])
mod <- risk_mod(X, y, lambda0 = 0.01)
predict(mod, type = "link")[1]
predict(mod, type = "response")[1]
predict(mod, type = "score")[1]
Fits an optimized integer risk score model using a cyclical coordinate descent algorithm. Returns an object of class "risk_mod".
risk_mod( X, y, gamma = NULL, beta = NULL, weights = NULL, lambda0 = 0, a = -10, b = 10, max_iters = 100, tol = 1e-05, shuffle = TRUE, seed = NULL )
X |
Input covariate matrix with dimension |
y |
Numeric vector for the (binomial) response variable. |
gamma |
Starting value to rescale coefficients for prediction (optional). |
beta |
Starting numeric vector with |
weights |
Numeric vector of length |
lambda0 |
Penalty coefficient for L0 term (default: 0).
See |
a |
Integer lower bound for coefficients (default: -10). |
b |
Integer upper bound for coefficients (default: 10). |
max_iters |
Maximum number of iterations (default: 100). |
tol |
Tolerance for convergence (default: 1e-5). |
shuffle |
Whether order of coefficients is shuffled during coordinate descent (default: TRUE). |
seed |
An integer that is used as argument by |
This function uses a cyclical coordinate descent algorithm to solve the following optimization problem.
These constraints ensure that the model will be sparse and include only integer coefficients.
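The formulation below is a sketch inferred from the arguments documented for risk_mod() (the L0 penalty lambda0, the integer bounds a and b, and the scalar gamma). The symbols n, p, x_i, and beta_0 are notational assumptions, and the exact form used by the package may differ.

```latex
\begin{aligned}
\min_{\gamma,\,\beta_0,\,\beta} \quad
  & -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i\,\gamma\,(\beta_0 + x_i^{\top}\beta)
    - \log\bigl(1 + \exp\bigl(\gamma\,(\beta_0 + x_i^{\top}\beta)\bigr)\bigr) \Bigr]
    + \lambda_0 \sum_{j=1}^{p} \mathbf{1}(\beta_j \neq 0) \\
\text{subject to} \quad
  & a \le \beta_j \le b, \quad \beta_j \in \mathbb{Z}, \quad j = 1, \dots, p, \\
  & \beta_0,\ \gamma \in \mathbb{R}.
\end{aligned}
```

The first term is the mean negative log-likelihood of a logistic model whose linear predictor is the integer score rescaled by gamma. The second term counts nonzero coefficients, which is what drives sparsity.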
An object of class "risk_mod" with the following attributes:
gamma |
Final scalar value. |
beta |
Vector of integer coefficients. |
glm_mod |
Logistic regression object of class "glm" (see stats::glm). |
X |
Input covariate matrix. |
y |
Input response vector. |
weights |
Input weights. |
lambda0 |
Input |
model_card |
Dataframe displaying the nonzero integer coefficients (i.e. "points") of the risk score model. |
score_map |
Dataframe containing a column of possible scores and a column with each score's associated risk probability. |
y <- breastcancer[[1]]
X <- as.matrix(breastcancer[,2:ncol(breastcancer)])
mod1 <- risk_mod(X, y)
mod1$model_card
mod2 <- risk_mod(X, y, lambda0 = 0.01)
mod2$model_card
mod3 <- risk_mod(X, y, lambda0 = 0.01, a = -5, b = 5)
mod3$model_card
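To give a feel for coordinate descent under integer and sparsity constraints, the following is a deliberately simplified base-R sketch. It is not the package's algorithm: it holds gamma fixed, treats the intercept as an integer coefficient, and updates each coordinate by exhaustively trying every integer in [a, b]. The names `objective_sketch` and `integer_cd_sketch` and the simulated data are hypothetical.

```r
# Mean negative log-likelihood plus an L0 penalty on non-intercept terms.
objective_sketch <- function(beta, X, y, gamma, lambda0) {
  eta <- gamma * as.vector(X %*% beta)
  -mean(y * eta - log1p(exp(eta))) + lambda0 * sum(beta[-1] != 0)
}

# Cyclical coordinate search over integers in [a, b]; column 1 of X is
# assumed to be the intercept column.
integer_cd_sketch <- function(X, y, gamma = 0.5, lambda0 = 0,
                              a = -10, b = 10, max_iters = 100) {
  beta <- rep(0, ncol(X))
  candidates <- a:b
  for (iter in seq_len(max_iters)) {
    beta_old <- beta
    for (j in seq_len(ncol(X))) {
      # Evaluate the objective with coordinate j set to each candidate.
      vals <- vapply(candidates, function(v) {
        beta[j] <- v
        objective_sketch(beta, X, y, gamma, lambda0)
      }, numeric(1))
      beta[j] <- candidates[which.min(vals)]
    }
    if (identical(beta, beta_old)) break   # no coordinate changed
  }
  beta
}

set.seed(1)
X_toy <- cbind(1, matrix(rnorm(200), nrow = 100, ncol = 2))
y_toy <- rbinom(100, 1, plogis(as.vector(X_toy %*% c(0, 2, -1))))
beta_hat <- integer_cd_sketch(X_toy, y_toy, lambda0 = 0.01)
```

Because each coordinate update keeps the current value among its candidates, the objective never increases from one sweep to the next, so the loop stops once a full sweep leaves every coefficient unchanged.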
Runs nstart iterations of risk_mod(), each with a different warm start, and selects the best model. Each coefficient start is randomly selected as -1, 0, or 1.
risk_mod_random_start( X, y, weights = NULL, lambda0 = 0, a = -10, b = 10, max_iters = 100, tol = 1e-05, seed = NULL, nstart = 5 )
X |
Input covariate matrix with dimension |
y |
Numeric vector for the (binomial) response variable. |
weights |
Numeric vector of length |
lambda0 |
Penalty coefficient for L0 term (default: 0).
See |
a |
Integer lower bound for coefficients (default: -10). |
b |
Integer upper bound for coefficients (default: 10). |
max_iters |
Maximum number of iterations (default: 100). |
tol |
Tolerance for convergence (default: 1e-5). |
seed |
An integer that is used as argument by |
nstart |
Number of different random starts to try (default: 5). |
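The random-start idea can be sketched in base R as follows: draw each starting coefficient from {-1, 0, 1}, evaluate every start, and keep the best. Here `toy_objective` is a stand-in with a made-up target vector, not the package's objective; in a real run each start would be passed to the fitting routine and the fitted models compared.

```r
set.seed(42)     # for a reproducible illustration
p <- 9           # number of covariates (hypothetical)
nstart <- 5

# Each start draws every coefficient independently from {-1, 0, 1}.
starts <- replicate(nstart,
                    sample(c(-1, 0, 1), p, replace = TRUE),
                    simplify = FALSE)

# Stand-in objective (assumption): squared distance to a made-up target.
target <- c(1, 0, 0, 1, 0, 1, 0, 0, 0)
toy_objective <- function(beta) sum((beta - target)^2)

# Keep the start with the lowest objective value.
objective_values <- vapply(starts, toy_objective, numeric(1))
best_start <- starts[[which.min(objective_values)]]
```

Trying several starts reduces the chance that a single run settles in a poor local solution.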
Returns a vector of fold IDs that preserves class proportions.
stratify_folds(y, nfolds = 10, seed = NULL)
y |
Numeric vector for the (binomial) response variable. |
nfolds |
Number of folds (default: 10). |
seed |
An integer that is used as argument by |
Numeric vector with the same length as y.
y <- rbinom(100, 1, 0.3)
foldids <- stratify_folds(y, nfolds = 5)
table(y, foldids)
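One way to preserve class proportions across folds is to shuffle the rows within each class and deal fold labels round-robin. The base-R sketch below illustrates that approach; `stratify_sketch` is a hypothetical helper and may not match how the package assigns folds.

```r
# Hypothetical helper: stratified fold IDs via per-class round-robin dealing.
stratify_sketch <- function(y, nfolds = 10, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  foldids <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Shuffle this class's rows; indexing avoids sample()'s scalar quirk.
    idx <- idx[sample.int(length(idx))]
    # Deal fold labels 1..nfolds in turn so class counts differ by at most 1.
    foldids[idx] <- rep_len(seq_len(nfolds), length(idx))
  }
  foldids
}

y_toy <- rep(c(0, 1), times = c(70, 30))
fold_toy <- stratify_sketch(y_toy, nfolds = 5, seed = 1)
counts <- table(y_toy, fold_toy)
counts
```

With 70 negatives and 30 positives split into 5 folds, every fold receives 14 negatives and 6 positives.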
Prints text that summarizes "risk_mod" objects.
## S3 method for class 'risk_mod'
summary(object, ...)
object |
An object of class "risk_mod", usually a result of a call to
|
... |
Additional arguments affecting the summary produced. |
Printed text with intercept, nonzero coefficients, gamma, lambda, and deviance.
y <- breastcancer[[1]]
X <- as.matrix(breastcancer[,2:ncol(breastcancer)])
mod <- risk_mod(X, y, lambda0 = 0.01)
summary(mod)