library(rstanarm)
library(shinystan)
df <- read.csv(url("https://raw.githubusercontent.com/NERC-CEH/beem_data/main/trees.csv"))
df$basal_area <- pi * (df$dbh/2)^2
Model Selection and Comparison
Introduction
In this practical, we are going to go back over the models fitted in the linear modelling practical (session 3a), assess their fit in more detail, and compare different models.
Tree allometry models
Model evaluation
Yesterday, we looked at fitting different models to tree mass data. The practical session offered multiple variables to include in the tree mass model, and the code below fits a model with the full list of potential predictors.
Remember to load the libraries and read in the data exactly as in the previous practical session.
Now we can fit the linear model with multiple predictor variables
m_multi <- stan_glm(tree_mass ~ basal_area + age + species + cultivation +
                      soil_type + tree_height + timber_height,
                    data = df, refresh = 0)
summary(m_multi)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: tree_mass ~ basal_area + age + species + cultivation + soil_type +
tree_height + timber_height
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 1881
predictors: 34
Estimates:
mean sd 10% 50% 90%
(Intercept) -202.9 14.0 -220.6 -202.9 -184.5
basal_area 0.8 0.0 0.8 0.8 0.8
age -0.3 0.4 -0.7 -0.3 0.2
speciesDF 24.4 14.6 5.2 24.4 42.8
speciesEL 0.7 16.3 -20.2 0.7 21.5
speciesGF -16.1 13.9 -34.1 -16.2 1.7
speciesJL -14.2 12.8 -30.7 -14.3 2.1
speciesLC 51.6 28.3 15.4 51.7 87.5
speciesLP 19.2 9.3 7.1 19.4 30.8
speciesNF 43.2 18.0 20.2 43.0 66.1
speciesNS 29.1 10.0 16.3 29.0 42.1
speciesRC -58.1 24.3 -88.9 -58.1 -27.6
speciesSP 5.3 9.7 -7.3 5.1 18.1
speciesSS 32.5 8.5 21.7 32.5 43.3
speciesWH 13.5 13.2 -3.4 13.5 30.4
cultivation -0.1 0.8 -1.1 -0.1 0.9
soil_type10 -9.1 17.0 -31.1 -8.9 12.4
soil_type11 -67.3 16.7 -88.7 -67.5 -45.7
soil_type12 -59.2 16.7 -80.7 -59.2 -38.5
soil_type13 -93.1 23.6 -122.6 -93.6 -62.3
soil_type1a -1.4 24.5 -33.0 -1.7 30.0
soil_type1g -9.0 14.6 -28.1 -9.0 9.4
soil_type1x -18.5 18.0 -41.5 -18.8 4.5
soil_type3 1.1 7.3 -8.3 1.0 10.4
soil_type4 3.3 7.8 -6.8 3.3 13.4
soil_type4b -2.7 10.4 -16.0 -2.6 10.6
soil_type5b -20.9 11.5 -35.6 -21.0 -6.3
soil_type6 -10.6 6.0 -18.3 -10.6 -3.0
soil_type7 8.9 6.1 1.0 8.9 16.8
soil_type7b -241.6 12.9 -258.0 -241.5 -225.1
soil_type8 1.6 24.6 -30.3 1.5 32.9
soil_type9 -4.9 8.6 -15.8 -4.8 6.2
tree_height 5.2 2.0 2.7 5.1 7.7
timber_height 13.4 2.1 10.7 13.4 16.0
sigma 65.7 1.1 64.3 65.7 67.0
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 325.4 2.2 322.7 325.5 328.2
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.3 1.0 2282
basal_area 0.0 1.0 4707
age 0.0 1.0 3543
speciesDF 0.3 1.0 2410
speciesEL 0.3 1.0 2422
speciesGF 0.3 1.0 2160
speciesJL 0.3 1.0 2291
speciesLC 0.4 1.0 5329
speciesLP 0.2 1.0 1483
speciesNF 0.3 1.0 3059
speciesNS 0.3 1.0 1523
speciesRC 0.4 1.0 4308
speciesSP 0.2 1.0 1574
speciesSS 0.2 1.0 1305
speciesWH 0.3 1.0 2307
cultivation 0.0 1.0 5983
soil_type10 0.2 1.0 5513
soil_type11 0.2 1.0 5654
soil_type12 0.2 1.0 4912
soil_type13 0.3 1.0 6815
soil_type1a 0.3 1.0 6381
soil_type1g 0.2 1.0 4931
soil_type1x 0.2 1.0 6423
soil_type3 0.1 1.0 2632
soil_type4 0.2 1.0 1965
soil_type4b 0.2 1.0 3089
soil_type5b 0.2 1.0 2908
soil_type6 0.1 1.0 1955
soil_type7 0.1 1.0 2338
soil_type7b 0.2 1.0 3555
soil_type8 0.3 1.0 6486
soil_type9 0.2 1.0 2960
tree_height 0.0 1.0 2221
timber_height 0.0 1.0 2016
sigma 0.0 1.0 3308
mean_PPD 0.0 1.0 3726
log-posterior 0.1 1.0 1754
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Your task within this practical is to fit different versions of this model with different terms in (perhaps even polynomial terms) and to evaluate the model fit of each.
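As a starting point, one simpler variant might drop the soil and cultivation terms. The particular choice of terms here is illustrative only, not prescribed by the practical:

```r
# a sketch of one possible reduced model; the terms kept here are
# an illustrative choice, not the "right" subset
m_reduced <- stan_glm(tree_mass ~ basal_area + species + tree_height,
                      data = df, refresh = 0)
summary(m_reduced)
```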
To launch shinystan and evaluate the different properties of the model fit, use:
launch_shinystan(m_multi)
Take time to understand the different aspects of the model fit, taking particular notice of the effective sample size and the Monte Carlo standard error.
We can also visualise the fit using the posterior check:
pp_check(m_multi)
Try this for at least 4 or 5 different versions of the model, noting which you believe to be the most appropriate and why.
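Beyond the default density overlay, pp_check() accepts a plotfun argument to show other posterior predictive summaries, using the bayesplot functions bundled with rstanarm. For example:

```r
# compare the observed mean, and the joint mean/sd, to their
# posterior predictive distributions
pp_check(m_multi, plotfun = "stat", stat = "mean")
pp_check(m_multi, plotfun = "stat_2d", stat = c("mean", "sd"))
```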
Model comparison
Now we will use the loo package to formally compare different models.
So let's fit a second model that includes a polynomial of order 2 (a squared term) for age. Our aim is to evaluate this model thoroughly and compare it with the original, to understand whether the squared term is a meaningful improvement.
Fit the model using the following code:
m_multi_2 <- update(m_multi, formula. = . ~ . + I(age^2))
Now run through the checks as above to evaluate this new model. If all seems to be reasonable, we can consider at this stage if we believe one model to be “better” than the other. The posterior predictive check is useful for this.
But we can also compare the models more formally, using both leave-one-out cross-validation (LOO-CV) and the WAIC.
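The WAIC comparison mirrors the LOO workflow: waic() objects can be passed to loo_compare() in the same way. A sketch using the loo package:

```r
library(loo)

# compute WAIC for each model and compare them,
# analogous to the LOO-CV comparison
waic_m_1 <- waic(m_multi)
waic_m_2 <- waic(m_multi_2)
loo_compare(waic_m_1, waic_m_2)
```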
library(loo)
#run the efficient LOO CV on the 2 different models using the loo package
loo_m_1 <- loo(m_multi)
loo_m_2 <- loo(m_multi_2)
Now that we have run the LOO-CV, we can view the diagnostic (the Pareto k estimate) for each observation as it is left out. This helps us understand the influence of different observations, similar to Cook's distance, which is often used in classical regression.
par(mfrow = c(1,2))
plot(loo_m_1, label_points = TRUE)
plot(loo_m_2, label_points = TRUE)
This plot shows only a very few mild outliers for either model, so there is nothing to worry about.
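If you want the diagnostic as numbers rather than a plot, the loo package can tabulate the Pareto k estimates and identify any problematic observations:

```r
# tabulate the Pareto k diagnostic values for the first model
pareto_k_table(loo_m_1)

# indices of any observations with k above 0.7
# (a commonly used threshold for "problematic")
pareto_k_ids(loo_m_1, threshold = 0.7)
```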
Finally, we can directly compare the diagnostics from the 2 models to see which is more appropriate.
loo_compare(loo_m_1, loo_m_2)
elpd_diff se_diff
m_multi_2 0.0 0.0
m_multi -85.8 20.7
In this case, the difference in the expected log pointwise predictive density (elpd) between the two models is large compared to the standard error of that difference, so we are led to favour the model with the additional squared term, even after taking into account that the second model estimates an additional parameter.
Try comparing other models in the same way. Do higher polynomials continue to improve the fit?
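One way to start, as a sketch: add a cubic term with update() and compare all three fits at once, since loo_compare() accepts any number of loo objects:

```r
# add a cubic term in age to the quadratic model
m_multi_3 <- update(m_multi_2, formula. = . ~ . + I(age^3))

# compare all three models in one call
loo_m_3 <- loo(m_multi_3)
loo_compare(loo_m_1, loo_m_2, loo_m_3)
```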