Model Selection and Comparison

Author

Pete Henrys

Published

January 17, 2024

Introduction

In this practical, we are going to revisit the models fitted in linear modelling practical session 3a, assess their fit in more detail, and compare different models.

Tree allometry models

Model evaluation

Yesterday, we looked at fitting different models to tree mass data. In that practical session there were multiple variables available for the tree mass model, and the code below fits a model that includes the full list of potential predictors.

Remember to load the libraries and read in the data exactly as in the previous practical session:

library(rstanarm)
library(shinystan)

# read in the tree data and derive basal area from stem diameter (dbh)
df <- read.csv(url("https://raw.githubusercontent.com/NERC-CEH/beem_data/main/trees.csv"))
df$basal_area <- pi * (df$dbh/2)^2

Now we can fit the linear model with multiple predictor variables:

m_multi <- stan_glm(tree_mass ~ basal_area +
  age + species + cultivation + soil_type + tree_height + timber_height, 
  data = df, refresh = 0)

summary(m_multi)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      tree_mass ~ basal_area + age + species + cultivation + soil_type + 
       tree_height + timber_height
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1881
 predictors:   34

Estimates:
                mean   sd     10%    50%    90% 
(Intercept)   -202.9   14.0 -220.6 -202.9 -184.5
basal_area       0.8    0.0    0.8    0.8    0.8
age             -0.3    0.4   -0.7   -0.3    0.2
speciesDF       24.4   14.6    5.2   24.4   42.8
speciesEL        0.7   16.3  -20.2    0.7   21.5
speciesGF      -16.1   13.9  -34.1  -16.2    1.7
speciesJL      -14.2   12.8  -30.7  -14.3    2.1
speciesLC       51.6   28.3   15.4   51.7   87.5
speciesLP       19.2    9.3    7.1   19.4   30.8
speciesNF       43.2   18.0   20.2   43.0   66.1
speciesNS       29.1   10.0   16.3   29.0   42.1
speciesRC      -58.1   24.3  -88.9  -58.1  -27.6
speciesSP        5.3    9.7   -7.3    5.1   18.1
speciesSS       32.5    8.5   21.7   32.5   43.3
speciesWH       13.5   13.2   -3.4   13.5   30.4
cultivation     -0.1    0.8   -1.1   -0.1    0.9
soil_type10     -9.1   17.0  -31.1   -8.9   12.4
soil_type11    -67.3   16.7  -88.7  -67.5  -45.7
soil_type12    -59.2   16.7  -80.7  -59.2  -38.5
soil_type13    -93.1   23.6 -122.6  -93.6  -62.3
soil_type1a     -1.4   24.5  -33.0   -1.7   30.0
soil_type1g     -9.0   14.6  -28.1   -9.0    9.4
soil_type1x    -18.5   18.0  -41.5  -18.8    4.5
soil_type3       1.1    7.3   -8.3    1.0   10.4
soil_type4       3.3    7.8   -6.8    3.3   13.4
soil_type4b     -2.7   10.4  -16.0   -2.6   10.6
soil_type5b    -20.9   11.5  -35.6  -21.0   -6.3
soil_type6     -10.6    6.0  -18.3  -10.6   -3.0
soil_type7       8.9    6.1    1.0    8.9   16.8
soil_type7b   -241.6   12.9 -258.0 -241.5 -225.1
soil_type8       1.6   24.6  -30.3    1.5   32.9
soil_type9      -4.9    8.6  -15.8   -4.8    6.2
tree_height      5.2    2.0    2.7    5.1    7.7
timber_height   13.4    2.1   10.7   13.4   16.0
sigma           65.7    1.1   64.3   65.7   67.0

Fit Diagnostics:
           mean   sd    10%   50%   90%
mean_PPD 325.4    2.2 322.7 325.5 328.2

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.3  1.0  2282 
basal_area    0.0  1.0  4707 
age           0.0  1.0  3543 
speciesDF     0.3  1.0  2410 
speciesEL     0.3  1.0  2422 
speciesGF     0.3  1.0  2160 
speciesJL     0.3  1.0  2291 
speciesLC     0.4  1.0  5329 
speciesLP     0.2  1.0  1483 
speciesNF     0.3  1.0  3059 
speciesNS     0.3  1.0  1523 
speciesRC     0.4  1.0  4308 
speciesSP     0.2  1.0  1574 
speciesSS     0.2  1.0  1305 
speciesWH     0.3  1.0  2307 
cultivation   0.0  1.0  5983 
soil_type10   0.2  1.0  5513 
soil_type11   0.2  1.0  5654 
soil_type12   0.2  1.0  4912 
soil_type13   0.3  1.0  6815 
soil_type1a   0.3  1.0  6381 
soil_type1g   0.2  1.0  4931 
soil_type1x   0.2  1.0  6423 
soil_type3    0.1  1.0  2632 
soil_type4    0.2  1.0  1965 
soil_type4b   0.2  1.0  3089 
soil_type5b   0.2  1.0  2908 
soil_type6    0.1  1.0  1955 
soil_type7    0.1  1.0  2338 
soil_type7b   0.2  1.0  3555 
soil_type8    0.3  1.0  6486 
soil_type9    0.2  1.0  2960 
tree_height   0.0  1.0  2221 
timber_height 0.0  1.0  2016 
sigma         0.0  1.0  3308 
mean_PPD      0.0  1.0  3726 
log-posterior 0.1  1.0  1754 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Your task within this practical is to fit different versions of this model with different terms included (perhaps even polynomial terms) and to evaluate the model fit of each, as in the sketch below.
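For example, one version might drop some terms and another might add a polynomial term. The sketch below is purely illustrative (the object names are mine); any combination of the predictors above is worth trying.

# a reduced model dropping two terms, and a version adding a squared
# basal_area term -- illustrative choices only
m_reduced <- update(m_multi, formula. = . ~ . - cultivation - soil_type)
m_poly <- update(m_multi, formula. = . ~ . + I(basal_area^2))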

To launch shinystan and evaluate the different properties of the model fit, use:

launch_shinystan(m_multi)

Take time to understand the different aspects of the model fit, taking particular notice of the effective sample size and the Monte Carlo standard error.
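If you prefer to inspect these diagnostics programmatically rather than in shinystan, a minimal sketch is below. It assumes the default column names of the summary matrix shown above (mcse, Rhat, n_eff); the cut-off of 400 is just an illustrative rule of thumb.

# pull the summary matrix into a data frame and flag parameters whose
# effective sample size looks low (400 is an illustrative cut-off)
s <- as.data.frame(summary(m_multi))
s[s$n_eff < 400, c("mcse", "Rhat", "n_eff")]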

We can also visualise the fit using a posterior predictive check:

pp_check(m_multi)
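pp_check() also supports other plot types via its plotfun argument; for example, comparing the mean of the observed data with the posterior predictive distribution of the mean, or overlaying a handful of replicated histograms:

pp_check(m_multi, plotfun = "stat", stat = "mean")  # observed mean vs replicated means
pp_check(m_multi, plotfun = "hist", nreps = 5)      # a few replicated datasets as histograms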

Try this for at least 4 or 5 different versions of the model, noting which you believe to be the most appropriate and why.

Model comparison

Now we will use the loo package to formally compare different models.

So let's fit a second model that includes a polynomial of order 2 (a squared term) for age. Our aim is to evaluate this model thoroughly and compare it with the original to understand whether it is a worthwhile improvement.

Fit the model using the following code:

m_multi_2 <- update(m_multi, formula. = . ~ . + I(age^2))

Now run through the checks as above to evaluate this new model. If all seems reasonable, we can consider at this stage whether we believe one model to be “better” than the other. The posterior predictive check is useful for this.
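For example, repeat the posterior predictive check on the new model and compare the plot with the one for m_multi:

pp_check(m_multi_2)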

But we can also compare the models more formally, using both leave-one-out cross-validation (LOO-CV) and the widely applicable information criterion (WAIC).

library(loo)

# run the efficient LOO-CV on the two different models using the loo package
loo_m_1 <- loo(m_multi)
loo_m_2 <- loo(m_multi_2)
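The WAIC can be computed in the same way, and loo_compare() accepts waic objects too. A minimal sketch:

# compute the WAIC for both models (waic() may warn if some pointwise
# estimates are unreliable, in which case loo() is preferred)
waic_m_1 <- waic(m_multi)
waic_m_2 <- waic(m_multi_2)
loo_compare(waic_m_1, waic_m_2)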

Now that we have run the LOO-CV, we can view the Pareto k diagnostic for each observation as it was left out. This helps us to understand the influence of different observations, similar to Cook's distance, which is often used in classical regression.

par(mfrow = c(1,2))
plot(loo_m_1, label_points = TRUE)
plot(loo_m_2, label_points = TRUE)

This plot shows only a very few outlying observations for either model, and so there is nothing to worry about.
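For a numeric summary of the same Pareto k diagnostics, the loo package also provides pareto_k_table():

pareto_k_table(loo_m_1)
pareto_k_table(loo_m_2)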

Finally, we can directly compare the diagnostics from the 2 models to see which is more appropriate.

loo_compare(loo_m_1, loo_m_2)
          elpd_diff se_diff
m_multi_2   0.0       0.0  
m_multi   -85.8      20.7  

In this case, the difference in the expected log pointwise predictive density (ELPD) between the two models is large compared to the standard error of that difference, so we are led to favour the model with the additional squared term, even after taking into account that the second model estimates an additional parameter.

Try comparing other models in the same way, for example as in the sketch below. Do higher-order polynomials continue to improve the fit?
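One way to probe this is to add a cubic term for age and compare all three models at once (the object names here are illustrative):

m_multi_3 <- update(m_multi_2, formula. = . ~ . + I(age^3))
loo_m_3 <- loo(m_multi_3)
loo_compare(loo_m_1, loo_m_2, loo_m_3)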