The Goby survival data set contains measurements from experimental manipulations on Elacatinus evelynae and E. prochilos in the US Virgin Islands, 2000-2002 (Wilson 2004). The data are compiled from five experiments, \(x = 1,2,\ldots,5\), collected over three years, and include information on animal density \(d\) and habitat quality \(q\). Goby survival times \(T\) are modelled using the Weibull distribution; the aim of the analysis is to investigate the effect of density and habitat quality on mortality rate. Following the analysis of Bolker (2008, pp. 276-283), we consider a set of 10 candidate models characterised by the dependence of the scale parameter \(s\) on the variables \(x\), \(d\) and \(q\). The most complex model, denoted (\(\mbox{xqdi}\)), is
\[\begin{aligned} T &\sim \mathrm{Weibull}\big(a, s_x(d,q)\big),\\[2mm] s_x(d,q) &= \exp\!\big(\alpha_x + \beta q + (\gamma + \delta q)\,d\big), \end{aligned}\]
with \(s\) depending on \(x\), \(q\), \(d\), and an interaction, \(i\), between \(q\) and \(d\). Models of lower complexity, with various dependencies omitted, are labelled in the same manner, together with (\(\mbox{zero}\)), which denotes the simplest model (a single shape and a single scale parameter). The data set contains 369 rows.
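For instance, under the (\(\mbox{qd}\)) model the experiment-specific intercepts and the interaction are dropped, so the scale reduces to
\[ s(d,q) = \exp(\alpha + \beta q + \gamma d). \]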
We load the data from the R package emdbook
(with thanks to Jackie Wilson for permission to use the data) and make a couple of small preparations:
library(emdbook)
data(GobySurvival)
GobySurvival$day1 <- GobySurvival$d1 - 1  # shift times so the experiment starts at day 0
GobySurvival$day2 <- ifelse(GobySurvival$d2 == 70, Inf, GobySurvival$d2 - 1)  # recode d2 == 70 as Inf, i.e. right-censored
We load a script containing functions to specify, compute and (numerically) optimise the model likelihoods (see GitHub repo).
source("goby_functions.R")
# nll.xqdi computes negative log-likelihood of the full model given data
# nll.pred computes predictive negative log-likelihood for given fitted model and test data
# fit.goby fits all 10 goby survival models using MLE, given data
For example, the function nll.xqdi
returns the negative log-likelihood of the full model (\(\mbox{xqdi}\)):
nll.xqdi
## function (lscale0, lscale.q, lscale.d, lscale.i, lscale.x2, lscale.x3,
## lscale.x4, lscale.x5, lshape)
## {
## lscalediff = c(0, lscale.x2, lscale.x3, lscale.x4, lscale.x5)
## scale = exp(lscale0 + lscalediff[exper] + lscale.q * qual +
## (lscale.d + lscale.i * qual) * density)
## shape = exp(lshape)
## suppressWarnings(-sum(log(pweibull(day2, shape, scale) -
## pweibull(day1, shape, scale))))
## }
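As the final line of nll.xqdi makes explicit, each observation enters the likelihood as interval-censored: the survival time is only known to lie in the interval \((\mbox{day1}_i, \mbox{day2}_i]\), so the negative log-likelihood is
\[ -\sum_i \log\Big\{ F(\mbox{day2}_i\,;\,a,s_i) - F(\mbox{day1}_i\,;\,a,s_i) \Big\}, \]
where \(F\) is the Weibull distribution function. Observations with \(\mbox{day2}_i = \infty\) contribute the right-censored term \(1 - F(\mbox{day1}_i\,;\,a,s_i)\).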
The function fit.goby
specifies all 10 models (by fixing a subset of the parameters to 0) and fits them using the numerical optimiser bbmle::mle2
, returning a list of the fitted objects. For example, we use the function to fit all models to the full data set and print the fitted object for the (\(\mbox{qd}\)) model:
m.fits <- fit.goby(GobySurvival)
m.fits$qd
##
## Call:
## mle2(minuslogl = nll.xqdi, start = startvals.GS, fixed = list(lscale.i = 0,
## lscale.x2 = 0, lscale.x3 = 0, lscale.x4 = 0, lscale.x5 = 0),
## data = data)
##
## Coefficients:
## lscale0 lscale.q lscale.d lscale.i lscale.x2 lscale.x3
## 0.82931401 0.07389531 -0.13319120 0.00000000 0.00000000 0.00000000
## lscale.x4 lscale.x5 lshape
## 0.00000000 0.00000000 -1.04739554
##
## Log-likelihood: -447.39
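The actual implementation is in goby_functions.R in the GitHub repo. As a rough orientation only, a minimal sketch of what fit.goby does might look like the following; here startvals.GS is the list of starting values assumed to be defined in the same script, the model ordering may differ from the real function, and the sketch is named fit.goby.sketch to avoid masking the sourced version:
# Hypothetical sketch of fit.goby (not the actual implementation)
fit.goby.sketch <- function(data) {
  # parameters of the full model that reduced models fix at zero
  optional <- c("lscale.q", "lscale.d", "lscale.i", paste0("lscale.x", 2:5))
  # free optional parameters for each candidate model (lscale0 and lshape are always free)
  free <- list(
    zero = character(0),
    d    = "lscale.d",
    q    = "lscale.q",
    qd   = c("lscale.q", "lscale.d"),
    qdi  = c("lscale.q", "lscale.d", "lscale.i"),
    x    = paste0("lscale.x", 2:5),
    xd   = c("lscale.d", paste0("lscale.x", 2:5)),
    xq   = c("lscale.q", paste0("lscale.x", 2:5)),
    xqd  = c("lscale.q", "lscale.d", paste0("lscale.x", 2:5)),
    xqdi = optional
  )
  lapply(free, function(keep) {
    fix <- setdiff(optional, keep)  # parameters to fix at zero for this model
    bbmle::mle2(minuslogl = nll.xqdi,
                start     = startvals.GS,
                fixed     = if (length(fix)) as.list(setNames(rep(0, length(fix)), fix)) else NULL,
                data      = data)
  })
}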
For later use, it will be handy to have the models’ names and dimensions
m.dims <- sapply(m.fits, function(m) attr(logLik(m), "df"))  # number of fitted parameters
m.names <- names(m.fits)
To apply leave-one-out CV, we define the following function, which uses pbmclapply
(a parallel implementation of lapply)
to compute the pointwise predictive negative log-likelihood (via the nll.pred
function) for each of the 369 LOO data splits. The computation is embarrassingly parallel, so using 30 cores reduces the run time from around 15 minutes to roughly 30 seconds.
MAX_CORES <- 30
goby.loo <- function(data){
  pbmclapply(1:nrow(data), function(i){                        # parallel lapply over data points
    fit.train <- fit.goby(data[-i,])                           # fit each model to the training set
    sapply(fit.train, function(m) nll.pred(m, data[i,])) * 2   # predictive deviance at the test point
  }, mc.cores = MAX_CORES) %>%
    sapply(identity) %>% t %>% as_tibble                       # combine pointwise estimates into a tibble
}
m.loo.pw <- goby.loo(GobySurvival) # 28 secs using 30 cores
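Each entry of the result is a pointwise contribution on the deviance scale: for model \(m\) and held-out observation \(i\),
\[ \mbox{loo}_{i,m} = -2 \log p\!\left(y_i \mid \hat{\theta}_{m,-i}\right), \]
where \(\hat{\theta}_{m,-i}\) is the maximum-likelihood estimate of model \(m\) fitted without observation \(i\); the loo score used below is the sum of these contributions.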
The tibble m.loo.pw
has a column for each model and a row for each data point
sample_n(m.loo.pw,5)
## # A tibble: 5 x 10
## xqdi qdi qd xq d q zero xd x xqd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.998 1.22 1.18 1.01 1.26 1.03 1.10 1.06 1.09 0.954
## 2 0.784 0.599 0.582 0.731 0.696 0.572 0.672 0.900 0.837 0.772
## 3 1.06 1.28 1.32 1.12 1.26 1.15 1.10 1.06 1.09 1.09
## 4 0.729 0.847 0.848 0.875 0.874 1.09 1.10 0.750 0.881 0.727
## 5 0.681 0.797 0.819 0.850 0.874 1.06 1.10 0.750 0.881 0.698
An initial look at the loo scores shows that model qd
has the lowest estimate
m.loo <- m.loo.pw %>% map_dbl(sum)  # sum each column to get the loo score (deviance)
m.min <- m.loo %>% which.min        # index of the lowest-scoring model, for later use
m.loo %>% {.-min(.)}                # delta scores
## xqdi qdi qd xq d q zero xd
## 1.8857119 0.6319059 0.0000000 4.8139682 1.7127340 4.9208779 5.8406776 2.9585067
## x xqd
## 5.5788585 0.9078763
To estimate the uncertainty and correlation of the loo estimates, we compute the deviance for a set of (non-parametric) bootstrap samples:
goby.boot <- function(data, nboot){
  pbmclapply(1:nboot, function(i){
    # resample the data with replacement and refit all 10 models;
    # retry with a fresh resample if any fit fails
    fitted <- FALSE
    while (!fitted){
      GS.boot <- try(fit.goby(sample_n(data, nrow(data), replace = TRUE)))
      fitted  <- sum(sapply(GS.boot, is.character)) == 0
    }
    sapply(GS.boot, function(x) logLik(x) * -2)  # deviance of each fitted model
  }, mc.cores = MAX_CORES) %>%
    sapply(identity) %>% t %>% as_tibble
}
m.boot <- goby.boot(GobySurvival, 1000) # 65 secs using 30 cores
The tibble m.boot
contains a column for each model and a row for each bootstrap sample
sample_n(m.boot,5)
## # A tibble: 5 x 10
## xqdi qdi qd xq d q zero xd x xqd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 848. 857. 857. 868. 860. 881. 883. 852. 869. 849.
## 2 852. 878. 885. 862. 886. 892. 892. 861. 862. 861.
## 3 892. 905. 911. 897. 912. 913. 913. 897. 898. 897.
## 4 849. 858. 860. 858. 862. 868. 870. 853. 859. 851.
## 5 878. 896. 898. 891. 898. 906. 906. 881. 891. 880.
We use m.boot
to compute both the standard errors and the modified (i.e., correlation-adjusted) standard errors; the two differ enormously because the bootstrapped scores are highly correlated across models.
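Concretely, writing \(k^*\) for the model with the smallest loo score, the modified standard error computed below is
\[ \widehat{\mbox{se}}^{\,\mathrm{mod}}_k = \widehat{\mbox{se}}_{k^*}\,\sqrt{1 - \hat{\rho}_{k^*\!,k}}, \]
where \(\hat{\rho}_{k^*\!,k}\) is the bootstrap correlation between the scores of models \(k^*\) and \(k\).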
se_ose <- m.boot %>% map_dbl(sd)                          # bootstrap standard error of each score
se_mod <- se_ose[m.min] * sqrt(1 - cor(m.boot))[m.min, ]  # correlation-adjusted (modified) standard errors
se_ose; se_mod
## xqdi qdi qd xq d q zero xd
## 40.43763 40.53657 40.66256 40.25285 40.62923 40.20120 40.11293 40.49120
## x xqd
## 40.15564 40.54093
## xqdi qdi qd xq d q zero xd
## 4.704860 1.603968 0.000000 5.355562 2.523417 3.798588 4.137909 5.042653
## x xqd
## 5.528947 4.522949
To plot the results and apply the OSE rule, we make a summary table that can be passed to some pre-written plotting functions
# create model summary table
summaryTable <- tibble(model    = m.names,
                       dim      = m.dims[model],
                       score    = m.loo[model],
                       delScore = score - min(score),
                       se_ose   = se_ose[model],
                       se_mod   = se_mod[model]) %>%
  arrange(dim) %>%
  mutate(index = 1:length(dim))
summaryTable
## # A tibble: 10 x 7
## model dim score delScore se_ose se_mod index
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 zero 2 908. 5.84 40.1 4.14 1
## 2 d 3 904. 1.71 40.6 2.52 2
## 3 q 3 907. 4.92 40.2 3.80 3
## 4 qd 4 902. 0 40.7 0 4
## 5 qdi 5 903. 0.632 40.5 1.60 5
## 6 x 6 908. 5.58 40.2 5.53 6
## 7 xq 7 907. 4.81 40.3 5.36 7
## 8 xd 7 905. 2.96 40.5 5.04 8
## 9 xqd 8 903. 0.908 40.5 4.52 9
## 10 xqdi 9 904. 1.89 40.4 4.70 10
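Before plotting, the OSE rule can also be read directly off the table. The helper below is a hypothetical illustration, not part of OSE_functions.R, and assumes one common form of the rule: select the lowest-dimensional model whose score lies within one standard error of the minimum score, using either the raw or the modified standard errors.
# Hypothetical helper (not part of OSE_functions.R): pick the lowest-dimensional
# model whose score is within one standard error of the minimum
ose.pick <- function(tab, se.col) {
  tab %>%
    filter(delScore <= .data[[se.col]]) %>%  # models within one SE of the best score
    slice_min(dim, with_ties = FALSE) %>%    # simplest such model
    pull(model)
}
ose.pick(summaryTable, "se_ose")  # using the raw bootstrap standard errors
ose.pick(summaryTable, "se_mod")  # using the correlation-adjusted standard errors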
Finally, we can plot the results:
source("OSE_functions.R")
ggarrange(plotOSE(summaryTable),
          plotModifiedOSE(summaryTable),
          common.legend = TRUE, legend = "bottom")