Linear Modelling

Author

Peter Levy

Published

January 18, 2024

The aim of this practical is to estimate the parameters of very simple linear models in the Bayesian way. We will use the ideas from the previous sessions about MCMC and prior distributions. We will show that estimating models in this way can be almost as quick and easy as the standard least-squares approach (e.g. lm). We will use the R package rstanarm, which has familiar syntax but uses an MCMC algorithm “under the hood”, and uses sensible generic prior distributions by default. The application is to a data set on dimensions of individual trees, and how different attributes scale in relation to each other.

Work through this practical at your own pace. You can ask one of the trainers for help at any time. We will reconvene every 15 minutes or so to explore your findings and address common questions and issues.

Tree allometry

Allometry is the study of how the characteristics of living things change with their size1. The use of allometry is widespread in forestry and forest ecology. Allometric relationships are used to estimate attributes of trees that are difficult to measure, such as volume or mass, from more easily measured attributes such as diameter at breast height (dbh). For example, we might want to estimate the total above-ground biomass (and thus carbon sequestration) of a forest. Measuring the actual mass of all the trees non-destructively is impossible. However, we can measure the diameter (and other properties) of the trees, and use this to predict the total biomass with an allometric relationship.

We know there will be allometric relationships from the basic geometry (tall trees are generally wider), but other factors come into play which affect tree growth form and wood density. These include species, spacing between trees, climate and site location. Here, the aim is to model these allometric relationships using linear models.

Tree allometry data

First, we read in some data for UK conifer species (Levy, Hale, and Nicoll 2004). This contains measurements of diameter at breast height (dbh in cm) and total above-ground biomass (tree_mass in kg) for 1901 trees.

Run the following code in R. You can use the clipboard symbol below to copy and paste.

df <- read.csv(url("https://raw.githubusercontent.com/NERC-CEH/beem_data/main/trees.csv"))
names(df)
 [1] "tree_number"   "age"           "species"       "cultivation"  
 [5] "soil_type"     "tree_height"   "timber_height" "dbh"          
 [9] "tree_mass"     "stem_mass"     "taper"         "crown_mass"   
[13] "root_mass"    
dim(df)
[1] 1901   13

It is common practice to transform the diameter at breast height to a cross-sectional area (\(\pi (d/2)^2\), referred to as “basal area”), as this gives a more linear relationship with volume and mass.

df$basal_area <- pi * (df$dbh/2)^2

Plot the relationship between basal area and tree mass using your own code, or that below.

Code
library(ggplot2)
ggplot(df, aes(basal_area, tree_mass)) + geom_point()

Is there a reasonable linear relationship? If so, we can approximate the data with a linear model.

Bayesian Estimation of The Linear Model

The standard linear model uses the equation:

\[y = \alpha + \beta x + \epsilon\]

where in this case, \(y\) is the tree mass, \(x\) is the basal area, \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\epsilon\) is the unexplained residual or “error” term. \(\beta\) expresses the change in tree mass (in kg) with a unit increase in basal area (in cm\(^2\)), so has units of kg/cm\(^2\). \(\alpha\) is the hypothetical mass of a tree with zero basal area; you could interpret this as the mass of a tree just short of breast height (1.3 m). We assume \(\epsilon\) comes from a normal distribution with mean zero and standard deviation \(\sigma\).
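To build intuition for what these parameters do, we can simulate data from this model. The parameter values below are illustrative assumptions, not estimates from the tree data:

```r
# Simulate from y = alpha + beta * x + epsilon with illustrative
# (assumed) parameter values, to see what such data look like
set.seed(1)
alpha <- -60   # intercept, kg
beta  <- 1     # slope, kg per cm^2 of basal area
sigma <- 90    # residual standard deviation, kg
x <- runif(200, min = 0, max = 2000)          # basal area, cm^2
y <- alpha + beta * x + rnorm(200, 0, sigma)  # tree mass, kg
plot(x, y, xlab = "basal area (cm2)", ylab = "tree mass (kg)")
```

Changing sigma in this sketch shows how the scatter around the straight line grows or shrinks.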

We can estimate these model parameters in the Bayesian approach using the R package rstanarm2 with the code below. The function stan_glm uses syntax familiar to R users, with a formula argument (the same as in lm and many other related model-fitting functions). However, it uses a Markov chain Monte Carlo (MCMC) algorithm3 to estimate the parameters (instead of least squares or optimisation algorithms).

Load the libraries

Load the rstanarm library. ggeffects is used later for plotting output.

library(rstanarm)
library(ggeffects)
Fit the model

The syntax below specifies the model: \(\mathrm{tree \_ mass} = \alpha + \beta \times \mathrm{basal \_ area} + \epsilon\) and the data frame to use.

m <- stan_glm(tree_mass ~ basal_area, data = df, iter = 300, refresh = 50)
More iterations

Set refresh to 0 to suppress progress reporting during the MCMC runs. Around 1000 iterations is usually enough for a simple model like this, but because it runs so fast we can afford several thousand (here iter = 20000 per chain, giving 40000 post-warmup draws across the four default chains).

m <- stan_glm(tree_mass ~ basal_area, data = df, iter = 20000, refresh = 0)
Examine the output

In common with other model-fitting functions in R, the \(\beta\) regression coefficient, which acts as a multiplier on basal_area, is denoted simply as basal_area in the output; the \(\alpha\) coefficient is denoted as (Intercept). We can examine the trace plots from the iterations of the MCMC chains:

plot(m, "trace")
Output

and we can now get some summary output for the estimated posterior distribution:

summary(m, digits = 5)
Output

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      tree_mass ~ basal_area
 algorithm:    sampling
 sample:       40000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1901
 predictors:   2

Estimates:
              mean      sd        10%       50%       90%    
(Intercept) -59.14295   3.98772 -64.26226 -59.14383 -54.03419
basal_area    1.05395   0.00942   1.04188   1.05393   1.06609
sigma        87.98912   1.42540  86.18186  87.96906  89.82551

Fit Diagnostics:
           mean      sd        10%       50%       90%    
mean_PPD 324.72927   2.85887 321.07122 324.72221 328.41744

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse    Rhat    n_eff
(Intercept)   0.02027 0.99996 38711
basal_area    0.00005 0.99999 37923
sigma         0.00746 1.00000 36519
mean_PPD      0.01446 0.99998 39111
log-posterior 0.00915 1.00016 17821

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

and the posterior distribution of the parameters:

plot(m, "hist")
Output

We can also explore results of the MCMC via a built-in Shiny app, launched by the launch_shinystan command:

launch_shinystan(m)

This provides a huge amount of detail, much more than you need just now; the key things to check are convergence and the parameter distributions.

Examine the model fit with summary(m) and launch_shinystan(m).

Questions
  1. Have the MCMC chains converged?
    Hints: Visually inspect the trace plots for trends or divergence between chains. Is the Gelman-Rubin statistic Rhat close to 1?
  2. Do the parameter distributions look “sensible”?
  3. On average, how accurate is the model?
    Hints: Check the value of \(\sigma\) (sigma) in the summary output. Examine the posterior distribution of sigma.
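To examine \(\sigma\) more closely, the posterior draws can be extracted directly from the fitted object. A minimal sketch, assuming the model m fitted above:

```r
# Extract the matrix of posterior draws (one column per parameter)
# from the stanreg object m, then summarise sigma
draws <- as.matrix(m)
colnames(draws)                 # "(Intercept)", "basal_area", "sigma"
quantile(draws[, "sigma"], c(0.025, 0.5, 0.975))
hist(draws[, "sigma"], main = "Posterior distribution of sigma")
```

The same approach works for any parameter: draws[, "basal_area"] gives the posterior samples of the slope.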
Check predictions
Plot the predictions yourself or with the code below. The ggpredict function can be used to extract the model predictions and intervals for plotting with ggplot.
Code
df_pred <- as.data.frame(ggpredict(m, terms = "basal_area [0:2400]"))
p <- ggplot(df_pred, aes(x, predicted))
p <- p + geom_ribbon(aes(ymin = conf.low, ymax = conf.high), fill = "pink")
p <- p + geom_line()
p <- p + geom_point(data = df, aes(basal_area, tree_mass))
p

We can create 95% prediction intervals by taking the appropriate quantiles of the posterior predictive samples. Taking a mid-range basal area of 1000 cm\(^2\) (dbh ≈ 35 cm):

y_rep <- posterior_predict(m, newdata = data.frame(basal_area = 1000))
quantile(y_rep, c(0.025, 0.5, 0.975))
     2.5%       50%     97.5% 
 820.0768  994.3886 1166.4670 

We can also check the model predictions by plotting how well they reproduce the observed data. In these plots, \(y\) denotes the observations and \(y_{rep}\) the (replicated) samples from the posterior predictive distribution.

pp_check(m, plotfun = "boxplot", nreps = 10, notch = FALSE)
pp_check(m, plotfun = "hist", nreps = 3)
pp_check(m)
Output

Questions
  1. How satisfactory is the model? How might it be improved?

Further things to explore

Variation among species

Is the model likely to be improved if we account for variation among species?
Code
p <- ggplot(df, aes(basal_area, tree_mass, colour = species))
p <- p + geom_point()
p <- p + stat_smooth(method = "lm")
p
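A natural extension, sketched below (it is an assumption of this sketch, not part of the model fitted above), is to let each species have its own intercept and slope via an interaction term:

```r
# Fit a separate intercept and slope per species via an interaction;
# this model has many more parameters, so fitting takes longer
m_sp <- stan_glm(tree_mass ~ basal_area * species, data = df, refresh = 0)
summary(m_sp, digits = 3)
```

The species-specific slope terms (basal_area:species...) then show how much each species departs from the reference species' allometry.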

Multivariate version

Is the model improved by including additional variables?
Code
m_multi <- stan_glm(tree_mass ~ basal_area +
  age + species + cultivation + soil_type + tree_height + timber_height, 
  data = df, refresh = 0)
summary(m_multi)

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      tree_mass ~ basal_area + age + species + cultivation + soil_type + 
       tree_height + timber_height
 algorithm:    sampling
 sample:       4000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 1881
 predictors:   34

Estimates:
                mean   sd     10%    50%    90% 
(Intercept)   -203.2   13.9 -221.1 -202.7 -185.6
basal_area       0.8    0.0    0.8    0.8    0.8
age             -0.2    0.3   -0.7   -0.2    0.2
speciesDF       23.8   14.9    4.5   23.6   42.8
speciesEL        0.3   16.6  -20.9    0.3   21.7
speciesGF      -16.3   13.9  -34.0  -16.5    1.6
speciesJL      -14.7   13.2  -31.4  -14.8    2.3
speciesLC       51.8   28.3   16.4   51.7   89.1
speciesLP       19.1    9.4    7.0   19.2   31.3
speciesNF       43.3   18.4   20.0   43.5   66.9
speciesNS       28.9   10.1   15.8   29.1   41.9
speciesRC      -58.0   25.8  -91.6  -57.7  -25.2
speciesSP        5.1    9.8   -8.0    5.2   17.5
speciesSS       32.4    8.5   21.4   32.3   43.5
speciesWH       13.7   13.1   -3.2   13.6   29.9
cultivation     -0.1    0.8   -1.0   -0.1    0.9
soil_type10     -8.9   17.0  -31.1   -9.0   12.5
soil_type11    -67.3   16.3  -88.4  -67.2  -46.3
soil_type12    -59.4   17.0  -81.3  -59.5  -38.2
soil_type13    -92.9   23.5 -122.6  -93.5  -62.3
soil_type1a     -1.9   23.9  -31.8   -2.5   29.1
soil_type1g     -8.9   14.1  -27.3   -8.8    9.3
soil_type1x    -18.9   18.2  -42.6  -18.7    4.2
soil_type3       0.8    7.0   -8.2    0.9    9.9
soil_type4       2.9    7.6   -6.7    2.9   12.6
soil_type4b     -2.8   10.0  -15.6   -2.8   10.1
soil_type5b    -21.2   11.2  -35.9  -21.1   -6.6
soil_type6     -10.9    5.7  -18.2  -10.8   -3.6
soil_type7       8.7    6.0    1.0    8.7   16.4
soil_type7b   -242.2   12.9 -259.0 -242.1 -225.7
soil_type8       0.6   24.3  -32.2    1.4   31.7
soil_type9      -5.2    8.3  -16.0   -5.0    5.4
tree_height      5.3    2.1    2.6    5.3    8.0
timber_height   13.2    2.2   10.4   13.3   16.0
sigma           65.6    1.1   64.3   65.6   67.1

Fit Diagnostics:
           mean   sd    10%   50%   90%
mean_PPD 325.4    2.1 322.6 325.4 328.1

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.3  1.0  1667 
basal_area    0.0  1.0  3904 
age           0.0  1.0  3756 
speciesDF     0.4  1.0  1809 
speciesEL     0.4  1.0  2131 
speciesGF     0.3  1.0  1796 
speciesJL     0.3  1.0  2053 
speciesLC     0.4  1.0  4538 
speciesLP     0.3  1.0  1290 
speciesNF     0.3  1.0  3181 
speciesNS     0.3  1.0  1338 
speciesRC     0.4  1.0  4374 
speciesSP     0.3  1.0  1536 
speciesSS     0.2  1.0  1201 
speciesWH     0.3  1.0  1902 
cultivation   0.0  1.0  5079 
soil_type10   0.2  1.0  4840 
soil_type11   0.2  1.0  5612 
soil_type12   0.3  1.0  4482 
soil_type13   0.3  1.0  6550 
soil_type1a   0.3  1.0  5501 
soil_type1g   0.2  1.0  5639 
soil_type1x   0.2  1.0  5368 
soil_type3    0.1  1.0  2811 
soil_type4    0.2  1.0  2034 
soil_type4b   0.2  1.0  3862 
soil_type5b   0.2  1.0  2672 
soil_type6    0.1  1.0  2220 
soil_type7    0.1  1.0  2622 
soil_type7b   0.2  1.0  3183 
soil_type8    0.3  1.0  6053 
soil_type9    0.2  1.0  3056 
tree_height   0.0  1.0  1826 
timber_height 0.1  1.0  1774 
sigma         0.0  1.0  3219 
mean_PPD      0.0  1.0  3909 
log-posterior 0.1  1.0  1432 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Comparison with least-squares fitting

Compare the results with those from a standard least-squares fitting approach. They should give very similar answers in this simple case.

Code
m_lm <- lm(tree_mass ~ basal_area, data = df, na.action = na.exclude)
summary(m_lm)

Call:
lm(formula = tree_mass ~ basal_area, data = df, na.action = na.exclude)

Residuals:
     Min       1Q   Median       3Q      Max 
-1029.19   -29.82     3.11    32.81   591.73 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -59.142595   3.953655  -14.96   <2e-16 ***
basal_area    1.053948   0.009336  112.89   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 87.94 on 1899 degrees of freedom
Multiple R-squared:  0.8703,    Adjusted R-squared:  0.8702 
F-statistic: 1.274e+04 on 1 and 1899 DF,  p-value: < 2.2e-16
hist(resid(m_lm))

predict(m_lm, newdata = data.frame(basal_area = 1000), interval = 'prediction')
       fit      lwr      upr
1 994.8051 821.8989 1167.711
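To compare the two fits directly, we can put the point estimates side by side (for a stanreg object, coef returns the posterior medians). A sketch, assuming m and m_lm fitted as above:

```r
# Posterior medians from the Bayesian fit next to the
# least-squares estimates; they should agree closely here
cbind(bayes = coef(m), least_squares = coef(m_lm))
```

With 1901 observations and weak default priors, the data dominate, so the two sets of estimates are essentially identical.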

Specifying priors

A Bayesian analysis considers what prior knowledge is available about the model and parameters, and combines this with the available data.

We did this above using the default “weakly informative” priors, which specify wide (but not uniform) distributions for all parameters. Setting the prior_PD option makes stan_glm sample from the prior alone, ignoring the data, and prior_summary shows the priors used:

m <- stan_glm(tree_mass ~ basal_area, data = df, prior_PD = TRUE, refresh = 0)
Code
prior_summary(m)
Priors for model 'm' 
------
Intercept (after predictors centered)
  Specified prior:
    ~ normal(location = 325, scale = 2.5)
  Adjusted prior:
    ~ normal(location = 325, scale = 610)

Coefficients
  Specified prior:
    ~ normal(location = 0, scale = 2.5)
  Adjusted prior:
    ~ normal(location = 0, scale = 2.8)

Auxiliary (sigma)
  Specified prior:
    ~ exponential(rate = 1)
  Adjusted prior:
    ~ exponential(rate = 0.0041)
------
See help('prior_summary.stanreg') for more details

Could we do better than this?
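For example, if previous allometric studies suggested a slope of around 1 kg per cm\(^2\) of basal area, we could encode that belief explicitly with rstanarm's prior arguments. The prior values below are illustrative assumptions only:

```r
# Refit with explicit, informative priors (the values here are
# illustrative assumptions, not recommendations)
m_inf <- stan_glm(tree_mass ~ basal_area, data = df,
                  prior           = normal(location = 1, scale = 0.5),  # slope beta
                  prior_intercept = normal(location = 0, scale = 100),  # intercept alpha
                  prior_aux       = exponential(rate = 0.01),           # sigma
                  refresh = 0)
prior_summary(m_inf)
```

Comparing summary(m_inf) with summary(m) shows how much (or, with this much data, how little) the prior shifts the posterior.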

Further reading

References

Levy, P. E., S. E. Hale, and B. C. Nicoll. 2004. “Biomass Expansion Factors and Root: Shoot Ratios for Coniferous Tree Species in Great Britain.” Forestry 77 (5): 421–30. https://doi.org/10.1093/forestry/77.5.421.
West, Geoffrey B., James H. Brown, and Brian J. Enquist. 1997. “A General Model for the Origin of Allometric Scaling Laws in Biology.” Science 276 (5309): 122–26. https://doi.org/10.1126/science.276.5309.122.

Footnotes

  1. Biological scaling relationships apply to a wide range of morphological traits (e.g. brain size and body size), physiological traits (e.g. metabolic rate and body size in animals) and ecological traits. Some general principles seem to apply, probably because of “how essential materials are transported through space-filling fractal networks of branching tubes” (West, Brown, and Enquist 1997).↩︎

  2. “stan” is a computer language, “ARM” is a book: “Applied Regression Modelling”.↩︎

  3. The default algorithm is Hamiltonian Monte Carlo, but the details are not important just now.↩︎