Introduction

Load data and specify models

The Goby survival data set contains measurements from experimental manipulations on Elacatinus evelynae and E. prochilos in the US Virgin Islands, 2000-2002 (Wilson 2004). The data are compiled from five experiments, \(x = 1,2,\ldots,5\), collected over three years, and include information on animal density \(d\) and habitat quality \(q\). Survival times \(T\) are modelled using the Weibull distribution; the aim of the analysis is to investigate the effect of density and habitat quality on mortality rate. Following the analysis of Bolker (2008, pp. 276-283), we consider a set of 10 candidate models characterised by the dependence of the scale parameter \(s\) on the variables \(x\), \(d\) and \(q\). The most complex model, denoted (\(\mbox{xqdi}\)), is

\[\begin{eqnarray*} T &\sim& \mbox{Weibull}(a,\, s_x(d,q))\\[2mm] s_x(d,q) &=& \exp\left(\alpha_x + \beta q + (\gamma + \delta q)d\right), \end{eqnarray*}\]

with \(s\) depending on \(x\), \(q\), \(d\), and an interaction, \(i\), between \(q\) and \(d\). Models of lower complexity, with various dependencies omitted, are labelled in the same manner; (\(\mbox{zero}\)) denotes the simplest model, with a single shape and scale parameter. The data set contains 369 rows.
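For example, the (\(\mbox{qd}\)) model drops the experiment effect and the interaction, leaving

\[ s(d,q) = \exp(\alpha + \beta q + \gamma d), \]

while the (\(\mbox{zero}\)) model reduces to a constant scale, \(s = \exp(\alpha)\).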

We load the data from the R package emdbook (with thanks to Jackie Wilson for giving permission for their use) and make a couple of small preparations:

library(emdbook)
data(GobySurvival)

# convert observation days to intervals starting at day 0; rows with d2 == 70
# are treated as right-censored, so the upper end of their interval is Inf
GobySurvival$day1 <- GobySurvival$d1 - 1
GobySurvival$day2 <- ifelse(GobySurvival$d2 == 70, Inf, GobySurvival$d2 - 1)

We load a script containing functions to specify, compute, and (numerically) optimise the model likelihoods (see the GitHub repo).

source("goby_functions.R")
# nll.xqdi computes negative log-likelihood of the full model given data
# nll.pred computes predictive negative log-likelihood for given fitted model and test data
# fit.goby fits all 10 goby survival models using MLE, given data

For example, the function nll.xqdi returns the negative log-likelihood of the full model (\(\mbox{xqdi}\)):

nll.xqdi
## function (lscale0, lscale.q, lscale.d, lscale.i, lscale.x2, lscale.x3, 
##     lscale.x4, lscale.x5, lshape) 
## {
##     lscalediff = c(0, lscale.x2, lscale.x3, lscale.x4, lscale.x5)
##     scale = exp(lscale0 + lscalediff[exper] + lscale.q * qual + 
##         (lscale.d + lscale.i * qual) * density)
##     shape = exp(lshape)
##     suppressWarnings(-sum(log(pweibull(day2, shape, scale) - 
##         pweibull(day1, shape, scale))))
## }
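The likelihood treats each observation as interval-censored: row \(j\) contributes the probability that the survival time falls between \(t_{1,j}\) (day1) and \(t_{2,j}\) (day2), so the function above computes

\[ -\log L = -\sum_{j} \log\left[F(t_{2,j};\, a, s_j) - F(t_{1,j};\, a, s_j)\right], \]

where \(F\) is the Weibull cumulative distribution function; for right-censored observations \(t_{2,j} = \infty\), so \(F(t_{2,j}) = 1\).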

The function fit.goby specifies all 10 models (by fixing a subset of the parameters to 0) and fits them using the numerical optimiser bbmle::mle2, returning a list of the fitted objects. For example, we use the function to fit all models to the full data set and print the fitted object for the (\(\mbox{qd}\)) model:

m.fits <- fit.goby(GobySurvival)
m.fits$qd 
## 
## Call:
## mle2(minuslogl = nll.xqdi, start = startvals.GS, fixed = list(lscale.i = 0, 
##     lscale.x2 = 0, lscale.x3 = 0, lscale.x4 = 0, lscale.x5 = 0), 
##     data = data)
## 
## Coefficients:
##     lscale0    lscale.q    lscale.d    lscale.i   lscale.x2   lscale.x3 
##  0.82931401  0.07389531 -0.13319120  0.00000000  0.00000000  0.00000000 
##   lscale.x4   lscale.x5      lshape 
##  0.00000000  0.00000000 -1.04739554 
## 
## Log-likelihood: -447.39

For later use, it will be handy to have the models’ names and dimensions (numbers of free parameters):

m.dims <- sapply(m.fits, function(m) attr(logLik(m),"df")) # number of free parameters
m.names <- names(m.fits)

Apply LOO

To apply leave-one-out CV, we define the following function, which uses pbmclapply (a parallel implementation of lapply from the pbmcapply package) to compute the pointwise predictive log-likelihood (via nll.pred) for all 369 LOO data splits. The problem is embarrassingly parallel: using 30 cores reduces the computation time from around 15 minutes to about 30 seconds.

MAX_CORES <- 30

goby.loo <- function(data){
  pbmclapply(1:nrow(data), function(i){ # parallel loop over held-out rows
    fit.train <- fit.goby(data[-i,]) # fit all 10 models to the training set
    sapply(fit.train, function(m) nll.pred(m, data[i,]))*2 # pointwise deviance at the test point
  }, mc.cores = MAX_CORES) %>% sapply(identity) %>% t %>% as_tibble # combine (row = point, column = model)
}
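The prediction function nll.pred is defined in goby_functions.R and not shown here. A minimal sketch of what it presumably does, mirroring nll.xqdi above but evaluating the fitted coefficients at the held-out rows (coef on an mle2 object returns the fixed parameters too, as the printed coefficients above show):

# hypothetical sketch of nll.pred: interval-censored Weibull negative
# log-likelihood of a fitted mle2 object, evaluated at new data
nll.pred <- function(m, newdata){
  cf <- coef(m) # includes the parameters fixed at 0
  lscalediff <- c(0, cf["lscale.x2"], cf["lscale.x3"], cf["lscale.x4"], cf["lscale.x5"])
  scale <- exp(cf["lscale0"] + lscalediff[newdata$exper] + cf["lscale.q"]*newdata$qual +
                 (cf["lscale.d"] + cf["lscale.i"]*newdata$qual)*newdata$density)
  shape <- exp(cf["lshape"])
  -sum(log(pweibull(newdata$day2, shape, scale) - pweibull(newdata$day1, shape, scale)))
}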

m.loo.pw <- goby.loo(GobySurvival) # 28 secs using 30 cores

The tibble m.loo.pw has a column for each model and a row for each data point:

sample_n(m.loo.pw,5)
## # A tibble: 5 x 10
##    xqdi   qdi    qd    xq     d     q  zero    xd     x   xqd
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.998 1.22  1.18  1.01  1.26  1.03  1.10  1.06  1.09  0.954
## 2 0.784 0.599 0.582 0.731 0.696 0.572 0.672 0.900 0.837 0.772
## 3 1.06  1.28  1.32  1.12  1.26  1.15  1.10  1.06  1.09  1.09 
## 4 0.729 0.847 0.848 0.875 0.874 1.09  1.10  0.750 0.881 0.727
## 5 0.681 0.797 0.819 0.850 0.874 1.06  1.10  0.750 0.881 0.698

An initial look at the LOO scores shows that model (\(\mbox{qd}\)) has the lowest estimate:

m.loo <- m.loo.pw %>% map_dbl(sum) # sum each model's column to get its LOO score (deviance)
m.min <- m.loo %>% which.min # index of the lowest-scoring model, for later use
m.loo %>% {.-min(.)} # delta scores relative to the best model
##      xqdi       qdi        qd        xq         d         q      zero        xd 
## 1.8857119 0.6319059 0.0000000 4.8139682 1.7127340 4.9208779 5.8406776 2.9585067 
##         x       xqd 
## 5.5788585 0.9078763

Applying the OSE rule

To estimate the uncertainty in, and the correlation between, the LOO estimates, we compute each model's deviance on a set of (non-parametric) bootstrap samples:

goby.boot <- function(data, nboot){
  pbmclapply(1:nboot, function(i){
    fitted <- FALSE
    while (!fitted){ # resample and refit until all 10 models fit without error
      GS.boot <- try(fit.goby(sample_n(data, nrow(data), replace = TRUE)))
      fitted <- sum(sapply(GS.boot, is.character)) == 0
    }
    sapply(GS.boot, function(x) logLik(x)*-2) # deviance of each fitted model
  }, mc.cores = MAX_CORES) %>% sapply(identity) %>% t %>% as_tibble
}

m.boot <- goby.boot(GobySurvival, 1000) # 65 secs using 30 cores 

The tibble m.boot contains a column for each model and a row for each bootstrap sample:

sample_n(m.boot,5)
## # A tibble: 5 x 10
##    xqdi   qdi    qd    xq     d     q  zero    xd     x   xqd
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  848.  857.  857.  868.  860.  881.  883.  852.  869.  849.
## 2  852.  878.  885.  862.  886.  892.  892.  861.  862.  861.
## 3  892.  905.  911.  897.  912.  913.  913.  897.  898.  897.
## 4  849.  858.  860.  858.  862.  868.  870.  853.  859.  851.
## 5  878.  896.  898.  891.  898.  906.  906.  881.  891.  880.

We use m.boot to compute both the standard errors and the modified (i.e., correlation-adjusted) standard errors. The two differ hugely because the models’ scores are strongly correlated across bootstrap samples.
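Concretely, for each model \(k\),

\[ \widehat{\mbox{se}}^{\,\mathrm{mod}}_{k} = \widehat{\mbox{se}}_{\hat{k}} \sqrt{1 - \hat{\rho}_{k\hat{k}}}, \]

where \(\hat{k}\) is the lowest-scoring model and \(\hat{\rho}_{k\hat{k}}\) is the bootstrap correlation between the scores of models \(k\) and \(\hat{k}\).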

se_ose <- m.boot %>% map_dbl(sd) # bootstrap standard error of each model's score
se_mod <- se_ose[m.min]*sqrt(1-cor(m.boot))[m.min,] # correlation-adjusted SE relative to the best model
se_ose; se_mod
##     xqdi      qdi       qd       xq        d        q     zero       xd 
## 40.43763 40.53657 40.66256 40.25285 40.62923 40.20120 40.11293 40.49120 
##        x      xqd 
## 40.15564 40.54093
##     xqdi      qdi       qd       xq        d        q     zero       xd 
## 4.704860 1.603968 0.000000 5.355562 2.523417 3.798588 4.137909 5.042653 
##        x      xqd 
## 5.528947 4.522949

To plot the results and apply the OSE rule, we make a summary table that can be passed to some pre-written plotting functions:

# create model summary table
summaryTable <- tibble(model = m.names, 
                       dim = m.dims[model],
                       score = m.loo[model],
                       delScore = score - min(score),
                       se_ose = se_ose[model],
                       se_mod = se_mod[model]) %>% 
                    arrange(dim) %>% 
                    mutate(index = 1:length(dim))
summaryTable
## # A tibble: 10 x 7
##    model   dim score delScore se_ose se_mod index
##    <chr> <int> <dbl>    <dbl>  <dbl>  <dbl> <int>
##  1 zero      2  908.    5.84    40.1   4.14     1
##  2 d         3  904.    1.71    40.6   2.52     2
##  3 q         3  907.    4.92    40.2   3.80     3
##  4 qd        4  902.    0       40.7   0        4
##  5 qdi       5  903.    0.632   40.5   1.60     5
##  6 x         6  908.    5.58    40.2   5.53     6
##  7 xq        7  907.    4.81    40.3   5.36     7
##  8 xd        7  905.    2.96    40.5   5.04     8
##  9 xqd       8  903.    0.908   40.5   4.52     9
## 10 xqdi      9  904.    1.89    40.4   4.70    10

Finally, we can plot the results:

source("OSE_functions.R")
ggarrange(plotOSE(summaryTable), 
          plotModifiedOSE(summaryTable), 
          common.legend = T, legend = "bottom")
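The functions plotOSE and plotModifiedOSE presumably display the delta scores with their one-standard-error bars, but the rule can also be applied numerically from the table. A minimal sketch, assuming the usual reading of the OSE rule (among models whose score lies within one standard error of the best, choose the lowest-dimensional one):

# hypothetical numerical version of the OSE rule with modified SEs: keep
# models whose delta score is within one (modified) standard error, then
# take the simplest (summaryTable is already sorted by dimension)
summaryTable %>% filter(delScore <= se_mod) %>% slice(1)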

Bolker, Benjamin M. 2008. Ecological Models and Data in R. Princeton University Press.
Wilson, J. A. 2004. “Habitat Quality, Competition, and Recruitment Processes in Two Marine Gobies.” PhD thesis.