Modelling censored log-normal data in mgcv#

David L Miller
2026-03-11

Here I show how to model censored, log-normally-distributed data. This kind of data often occurs in pollutant modelling, where the concentration we measure is a positive value, has a potentially skewed distribution, and might be censored at some limit of detection. In this case we can model the log of the response as normally distributed, applying the appropriate back-transformation at the end of the analysis. Unfortunately, a naive back-transformation gives a geometric rather than arithmetic mean, which can be difficult to compare to other modelling results.

In this notebook, I’ll show how to perform this modelling using the popular R package mgcv. You can find out more about generalized additive models here. The methods shown here correct for the bias we incur by transforming predictions, ensuring that results are comparable to other modelling that uses the arithmetic mean.

Modelling preamble#

What does it mean to model log-normal data? What we are saying is that we have some variable, \(Y\), which is \(\log_{b}\)-normally[1] distributed, so \(Y\sim\log_{b}N(\mu,\sigma^{2})\), where \(\mu\) and \(\sigma^{2}\) are the mean and variance of \(\log_{b}(Y)\). This is equivalent to defining a variable \(Z=\log_{b}(Y)\), where \(Z\sim N(\mu,\sigma^{2})\) (so \(Y=b^{Z}\)). The log-normal distribution is useful in cases where \(Y\) is defined over the interval \((0,\infty)\) (i.e., \(Y>0\)).
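To make this change of base concrete, here is a minimal simulation sketch (the parameter values are arbitrary):

set.seed(1)
b <- 10; mu <- 0.5; sigma <- 0.25

# simulate Z ~ N(mu, sigma^2), then back-transform to get Y
Z <- rnorm(1e5, mu, sigma)
Y <- b^Z

# taking base-b logs of Y recovers the normal parameters
mean(log(Y, base = b))  # approximately 0.5
sd(log(Y, base = b))    # approximately 0.25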

Many kinds of sensor-collected data are subject to a “limit of detection” (LoD): there is some point below which observations are recorded as zero because the value is smaller than the sensor can detect, and/or observations are arbitrarily high due to a saturation effect. Censored normal distributions can be used in this case. Core to their formulation is the separation of the observations into censored and non-censored groups, where the censored likelihood contributions are evaluations of the cumulative distribution function (CDF) of the normal distribution. Censored normal distributions are implemented in mgcv via the cnorm family.
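As a rough sketch of those likelihood contributions (the values of mu, sigma and the LoD below are illustrative, and this is the idea only, not mgcv’s internal code):

mu <- 0.2; sigma <- 0.5; lod <- 0.196

# an uncensored observation contributes the usual normal density
ll_uncensored <- dnorm(0.8, mu, sigma, log = TRUE)

# a left-censored observation only tells us that y < lod, so it
# contributes the normal CDF evaluated at the limit of detection
ll_censored <- pnorm(lod, mu, sigma, log.p = TRUE)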

Arithmetic versus geometric means#

We can estimate the expectation of \(Y\) (\(\mathbb{E}(Y)\)) by either an arithmetic or geometric mean (denoted \(\mathbb{E}_{A}(Y)\) or \(\mathbb{E}_{G}(Y)\), respectively). These are defined as:

\[\mathbb{E}_{A}(Y) = b^{\mu + \sigma^{2}\ln(b)/2} \quad \text{and} \quad \mathbb{E}_{G}(Y) = b^{\mu},\]

where the geometric “mean” corresponds to the median of the distribution. This difference is shown in the following plot.

[1] Throughout I use \(\log_{b}\) to indicate logarithm to the base \(b\) since these calculations work regardless of the base used. When the distribution is referred to informally (i.e., non-mathematically) I drop the base.

library(ggplot2)

xx <- seq(0, 8, length.out=300)

dat <- data.frame(x = c(xx, xx),
                  dl = c(dlnorm(xx, 0, 1),
                         dlnorm(xx, 1, 1/2)),
                  pars = rep(c("0, 1", "1, 1/2"), c(300,300)))

pardat <- data.frame(pars = rep(c("0, 1", "1, 1/2"), c(2,2)),
                     method = rep(c("Arithmetic", "Geometric"),
                                  2),
                     value = exp(c(0+1/2*1^2, 0,
                               1+1/2*1/2^2, 1)))

ggplot(dat) +
  geom_line(aes(x=x, y=dl, group=pars, colour=pars)) +
  geom_vline(aes(xintercept=value, lty=method, group=pars,
                 colour=pars), data=pardat) +
  labs(x="x", y="Probability density", colour="Parameters",
       lty="Estimator") +
  theme_minimal()
[Figure: log-normal densities for parameters (0, 1) and (1, 1/2), with vertical lines marking the arithmetic and geometric means.]

Similarly, we define the arithmetic and geometric standard deviations as follows:

\[\text{sd}_{A}(Y) = \sqrt{\left(b^{\sigma^{2}\ln(b)} - 1\right) b^{2\mu + \sigma^{2}\ln(b)}} \quad \text{and} \quad \text{sd}_{G}(Y) = b^{\sigma}.\]
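We can sanity-check the mean and standard deviation expressions by simulation (a quick sketch with arbitrary parameter values):

set.seed(2)
b <- 10; mu <- 0.3; sigma <- 0.2
Y <- b^rnorm(1e6, mu, sigma)

# arithmetic mean and sd: simulation versus formula
mean(Y); b^(mu + sigma^2 * log(b) / 2)
sd(Y);   sqrt((b^(sigma^2 * log(b)) - 1) * b^(2 * mu + sigma^2 * log(b)))

# geometric mean and sd: simulation versus formula
exp(mean(log(Y))); b^mu
exp(sd(log(Y)));   b^sigma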

This is all fine in the simple case where we’re just estimating the distribution parameters, with no covariates involved. What about when we want to use covariates for prediction?

Log-normal response in regression-like models#

We may want to model the mean via some linear model, for example when using a GLM or GAM. We decompose the mean as \(\mathbf{X}\boldsymbol{\beta}\), where \(\mathbf{X}\) is an \(n\times p\) design matrix, \(\boldsymbol{\beta}\) is a \(p\)-vector of coefficients to be estimated, and \(n\) is the number of observations.

In practice, one can simply assume that the log-response is normally distributed and fit a GLM-a-like (GAM/GLMM/GAMM/etc.) with a normal distribution. This model is:

\[\log_{b}(Y) \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^{2}),\]

which is clearly modelling the geometric, rather than arithmetic, mean (equivalently, we are modelling the median). This naturally extends to the case of random effects and therefore penalized smoothers (i.e., GLMMs, GAMs and GAMMs).
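A quick simulated sketch of what is at stake (using lm for brevity; the same holds for GAMs):

set.seed(3)
x <- runif(1000)
y <- 10^(1 + x + rnorm(1000, 0, 0.4))

m <- lm(log10(y) ~ x)

# naively back-transforming the prediction estimates the median at x = 0.5
10^predict(m, data.frame(x = 0.5))  # approximately 10^1.5

# whereas the true arithmetic mean at x = 0.5 is larger
10^(1.5 + 0.4^2 * log(10) / 2)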

We can post hoc correct predictions to use the arithmetic mean using a multivariate version of the corrections above. These corrections lead to the following expressions:

\[\hat{\mathbb{E}}_{A}(Y) = g(\hat{\boldsymbol{\beta}}) + \frac{1}{2}\,\text{diag}\left(D^{2}g(\hat{\boldsymbol{\beta}})\,\Sigma\right) \quad \text{and} \quad \hat{\text{sd}}_{A}(Y) = \sqrt{\text{diag}\left(Dg(\hat{\boldsymbol{\beta}})\,\Sigma\,Dg(\hat{\boldsymbol{\beta}})^{\top}\right)},\]

where \(g(\boldsymbol{\beta})=b^{X\boldsymbol{\beta}}\), \(b\) is the base we are working in and \(X\) is the design matrix for our model. \(D\) indicates the derivative operator with respect to \(\boldsymbol{\beta}\), with powers of \(D\) giving that order of derivative. \(\Sigma\) is the posterior covariance matrix for \(\boldsymbol{\beta}\) (see, e.g., Miller (2025) for further explanation of covariance matrices in the context of empirical Bayes modelling).
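These quantities are straightforward to compute from a fitted mgcv model’s lpmatrix. Here is a sketch for a hypothetical fitted gam/bam object fit whose response was \(\log_{10}\)-transformed (the clognorm family used below automates all of this):

# sketch: arithmetic-mean correction for a hypothetical fitted model `fit`
correct_to_arithmetic <- function(fit, newdata, base = 10) {
  Xp <- predict(fit, newdata, type = "lpmatrix")  # design matrix X
  g <- base^drop(Xp %*% coef(fit))                # g(beta) = b^(X beta)
  Sigma <- vcov(fit)                              # posterior covariance of beta

  # sd via diag(Dg Sigma Dg'), where row i of Dg is ln(b) g_i x_i
  Dg <- log(base) * g * Xp
  se <- sqrt(rowSums((Dg %*% Sigma) * Dg))

  # second-order mean term: (1/2) (ln b)^2 g_i x_i' Sigma x_i per row
  quad <- rowSums((Xp %*% Sigma) * Xp)
  data.frame(fitted = g + 0.5 * log(base)^2 * g * quad, se = se)
}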

Analysis of pollutant data#

We now start our analysis. This code is based on examples by Mike Dunbar, illustrating how to fit a GAMM with a censored normal family to water quality time series from multiple sites (Harmonised Monitoring Scheme sites for England), looking at smoothed trends over time and the effects of seasonality. For each determinand we fit a separate model, where site is a random effect and we use two temporal smoothers.

We begin by loading the requisite packages.

library(dplyr)
library(ggplot2)
library(mgcv)
library(tictoc)
library(lubridate)
library(tidyr)
library(purrr)
library(gratia)
library(stringr)

There are some non-CRAN packages required which we can install directly from GitHub, then load:

# for nice scales
remotes::install_github("christyray/sciscales")
library(sciscales)

# for the censored log normal helper functions
remotes::install_github("dill/mgcvUtils")
library(mgcvUtils)

Michael Dunbar at the Environment Agency (EA) pre-prepared the data for this analysis using open data from the EA (extracted using EA APIs).

# created in data_prep.R
load('data/HMS_open_data.rda')
head(samples)
# A tibble: 6 × 15
  sample.samplingPoint.notation sample.samplingPoint.label sample.sampleDateTime
  <chr>                         <chr>                      <dttm>               
1 AN-26M31                      HUNDRED FOOT RIVER EARITH… 2000-01-11 10:16:00  
2 AN-26M31                      HUNDRED FOOT RIVER EARITH… 2000-01-11 10:16:00  
3 AN-26M31                      HUNDRED FOOT RIVER EARITH… 2000-01-11 10:16:00  
4 AN-26M31                      HUNDRED FOOT RIVER EARITH… 2008-05-02 13:30:00  
5 AN-26M31                      HUNDRED FOOT RIVER EARITH… 2008-05-02 13:30:00  
6 AN-26M31                      HUNDRED FOOT RIVER EARITH… 2008-05-02 13:30:00  
# ℹ 12 more variables: determinand.label <chr>, determinand.definition <chr>,
#   determinand.notation <chr>, resultQualifier.notation <chr>, result <dbl>,
#   codedResultInterpretation.interpretation <lgl>,
#   determinand.unit.label <chr>, sample.sampledMaterialType.label <chr>,
#   sample.isComplianceSample <lgl>, sample.purpose.label <chr>,
#   sample.samplingPoint.easting <dbl>, sample.samplingPoint.northing <dbl>

This data has four determinands; for brevity we’ll just look at one: nitrate (where determinand.notation == "0117" in our data).

samples <- samples |>
  filter(determinand.notation == "0117")

Data checking#

Plotting the data, we can see how the limit of detection (red) changes over time.[1]

[1] The censored data is coded in the resultQualifier.notation column with a <.

samples |>
  ggplot() +
  geom_point(aes(x = sample.sampleDateTime,
                 y = result,
                 group = sample.samplingPoint.notation,
                 colour = resultQualifier.notation)) +
  scale_y_log10() +
  labs(x="Date", y="Nitrate", colour="LoD") +
  theme_bw()
[Figure: nitrate results over time on a log scale, with censored observations (at the limit of detection, in red) distinguished by colour.]

If we tabulate the LoD values we can see one is certainly more common than the others:

samples |>
  # only the censored observations
  filter(resultQualifier.notation == '<') |>
  group_by(result) |>
  summarise(n = n())
# A tibble: 9 × 2
   result     n
    <dbl> <int>
1 0.00001     1
2 0.1         1
3 0.192       1
4 0.195       8
5 0.196      90
6 0.5        32
7 0.504       1
8 0.648       1
9 0.996       1

The predominant LoD in the original data is 0.196, although for some earlier data it is 0.5. For simplicity, we use 0.196 in this analysis.

Final data prep#

When we log-transform, we can’t use zero as the lower bound for censored values, so we instead choose a very small value, well below the lowest limit of detection.

nearly.zero <- 0.000001
log.zero <- log10(nearly.zero)

Next, we create some extra columns and ensure categorical variables are factors.

samples <-
  samples |>
  mutate(DATE = date(sample.sampleDateTime),
         year = year(DATE),
         month = month(DATE),
         samplingPoint.F = as.factor(sample.samplingPoint.notation),
         det.code.desc = paste(determinand.notation, determinand.label, sep = '-'),
         DATE_DEC = round(decimal_date(DATE),2),
         MEAS_SIGN = replace_na(resultQualifier.notation, '='),
         log.meas.upper = log10(result + nearly.zero),
         log.meas.lower = if_else(MEAS_SIGN == '<', log.zero, log.meas.upper),
         censored = ifelse(MEAS_SIGN == '<', TRUE, FALSE),
         # for data that weren't artificially censored, the *.original
         # columns won't be populated yet, so we copy them over here
         result.original = result,
         qualifier.original = resultQualifier.notation
         )

Finally, we create the response matrix, which has two columns (if the entries in a row are the same there was no censoring; if they differ, the observation was censored; see ?mgcv::cnorm):

samples$y <- cbind(samples$log.meas.lower, samples$log.meas.upper)
head(samples$y)
          [,1]      [,2]
[1,] 1.1172713 1.1172713
[2,] 0.8656961 0.8656961
[3,] 0.7259117 0.7259117
[4,] 0.7686382 0.7686382
[5,] 0.7558749 0.7558749
[6,] 0.7810370 0.7810370

Note that although we pass a matrix to mgcv as the response, we only have a univariate response variable; the second column merely indicates the censoring to the software.
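As a toy illustration of this coding (with invented values), the first row below is an uncensored observation and the second is censored, its true value lying somewhere between our near-zero lower bound and the LoD:

# toy example of the two-column response coding (invented values)
toy <- cbind(c(0.72, log10(0.000001)),  # lower bounds
             c(0.72, log10(0.196)))     # upper bounds
# row 1: uncensored at 0.72
# row 2: censored on the interval [log10(1e-6), log10(0.196)]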

Modelling#

We can now fit our generalized additive model. We want to fit a model of the following form:

\[\log_{10}(Y_i) \sim N\left(\beta_0 + s(t_i) + s(m_i) + \beta_{z_i},\ \sigma^{2}\right),\]

where \(i\) indexes the observations, which have nitrate level \(Y_i\) on date \(t_i\) in month \(m_i\) at station \(z_i\) (\(\beta_{z}\) is a random intercept for station \(z\)). Generic smoothers are denoted \(s()\). We want to look at any common trend across our monitoring sites, which will be captured by the terms \(s(t_i)\) and \(s(m_i)\). We use \(b=10\) for this modelling, though any scientifically appropriate base would be usable with these functions/methods.

In practice, we fit this model using bam with method fREML, for speed, showing that these methods work with large-data approaches too.

We need to set up knots for the monthly smoother. Note that we just need to specify the first and last knot; mgcv will then fill in evenly-spaced knots in that range. See this useful blog post by Gavin Simpson about cyclic smoothing.

# for month cyclic cubic spline
sknots <- list(month=c(0.5, 12.5))

We can now fit the model with the clognorm family, plus an equivalent model with the uncorrected cnorm family for comparison:

b <- bam(y ~ s(DATE_DEC, bs = 'cr') +
             s(month, bs = 'cc') +
             s(samplingPoint.F, bs = 're'),
         method = 'fREML',
         family = clognorm(base=10),
         knots = sknots,
         data = samples)


b_uncorrected <- bam(y ~ s(DATE_DEC, bs = 'cr') +
                         s(month, bs = 'cc') +
                         s(samplingPoint.F, bs = 're'),
                     method = 'fREML',
                     family = cnorm(),
                     knots = sknots,
                     data = samples)

We can look at the summary output as usual. In particular, we can see that we used the new clognorm family (see the Family: line), which shows the base (10) and the estimated standard deviation (compare to the second summary, from the cnorm model).

summary(b)
Family: clog10norm(0.207) 
Link function: identity 

Formula:
y ~ s(DATE_DEC, bs = "cr") + s(month, bs = "cc") + s(samplingPoint.F, 
    bs = "re")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44305    0.05678   7.803 6.35e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                      edf Ref.df       F p-value    
s(DATE_DEC)         8.782  8.986   64.05  <2e-16 ***
s(month)            6.711  8.000  328.23  <2e-16 ***
s(samplingPoint.F) 50.957 51.000 1260.35  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.433   Deviance explained = 77.4%
fREML =  79415  Scale est. = 1         n = 19596
summary(b_uncorrected)
Family: cnorm(0.207) 
Link function: identity 

Formula:
y ~ s(DATE_DEC, bs = "cr") + s(month, bs = "cc") + s(samplingPoint.F, 
    bs = "re")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44305    0.05678   7.803 6.35e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                      edf Ref.df       F p-value    
s(DATE_DEC)         8.782  8.986   64.05  <2e-16 ***
s(month)            6.711  8.000  328.23  <2e-16 ***
s(samplingPoint.F) 50.957 51.000 1260.35  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.433   Deviance explained = 77.4%
fREML =  79415  Scale est. = 1         n = 19596

We note that there are no differences in the estimated degrees of freedom (edf), fREML score, etc.: the two families fit the same underlying model.

Making predictions#

We want to make some predictions over the date range 2021 to 2025 (no extrapolation into the future). We are interested in patterns with and without the seasonality (month) component. We are not interested in per-site predictions (we’ll exclude that effect when predicting), so for now we just use one site to build the prediction grid.

predgrid <- data.frame(date = seq(ymd("2021-01-01"), ymd("2025-09-30"), by=1),
                       samplingPoint.F = samples$samplingPoint.F[1])
predgrid$DATE_DEC <- decimal_date(predgrid$date)
# get decimal months for interpolation
predgrid$month <- (decimal_date(predgrid$date)-year(predgrid$date)) * 12

In the model-fitting code above we used the clognorm response family, which is provided by the mgcvUtils package. Internally this family fits the same model as cnorm, but it implements a prediction method that corrects the results, provided we use the type="response" argument to predict.

All this means that we can use the usual predict method and get the corrected predictions transparently (provided we ask for results on the response scale). For example:

predgrid$pred <- predict(b, predgrid, type="response",
                         exclude=c("s(samplingPoint.F)", "s(DATE_DEC)"))

We can add a confidence band around this plot by manually calculating the upper and lower bounds, but mgcvUtils also provides a function to do this for you, analogous to gratia’s fitted_values, via the fitted_values_clognorm function[1].

[1] Note that the fitted_values function in gratia will not work with the clognorm family at this time.

fv <- fitted_values_clognorm(b, predgrid, scale="response",
                             exclude="s(samplingPoint.F)")
head(fv)
  .row       date samplingPoint.F DATE_DEC      month  .fitted         .se
1    1 2021-01-01        AN-26M31 2021.000 0.00000000 2.940149 0.002760783
2    2 2021-01-02        AN-26M31 2021.003 0.03287671 2.944586 0.002765074
3    3 2021-01-03        AN-26M31 2021.005 0.06575342 2.948817 0.002769199
4    4 2021-01-04        AN-26M31 2021.008 0.09863014 2.952850 0.002773164
5    5 2021-01-05        AN-26M31 2021.011 0.13150685 2.956694 0.002776974
6    6 2021-01-06        AN-26M31 2021.014 0.16438356 2.960360 0.002780632
  .lower_ci .upper_ci
1  2.903744  2.977011
2  2.908070  2.981561
3  2.912194  2.985900
4  2.916124  2.990037
5  2.919871  2.993982
6  2.923443  2.997743

We can compare the trend from the corrected predictions alongside the uncorrected one. In this case we first must construct the covariance matrix for the transformed standard errors (using the delta method):

fv_gm <- fitted_values(b_uncorrected, predgrid, exclude="s(samplingPoint.F)")

# partial correction: delta-method standard errors for the
# back-transformed (uncorrected) predictions
Xp <- predict(b_uncorrected, predgrid, type = "lpmatrix")
# identity link, so "response" predictions are the linear predictor (log10 scale)
tpred <- 10^predict(b_uncorrected, predgrid, type = "response")
base <- 10
# derivative of 10^(X beta) with respect to beta: ln(10) 10^(X beta) X
Dg <- log(base) * sweep(Xp, 1, tpred, "*")
fv_gm$.se <- sqrt(diag(tcrossprod(Dg %*% vcov(b_uncorrected), Dg))/length(b_uncorrected$y))

crit <- gratia:::coverage_normal(0.95)
fv_gm[[".lower_ci"]] <- fv_gm[[".fitted"]] - (crit * fv_gm[[".se"]])
fv_gm[[".upper_ci"]] <- fv_gm[[".fitted"]] + (crit * fv_gm[[".se"]])

We can now compare these graphically:

ggplot(fv) +
  geom_ribbon(aes(x=DATE_DEC, ymin=10^.lower_ci, ymax=10^.upper_ci),
              fill="blue", alpha=0.4, data=fv_gm) +
  geom_line(aes(x=DATE_DEC, y=10^.fitted), lty=2, data=fv_gm, colour="blue") +
  geom_ribbon(aes(x=DATE_DEC, ymin=.lower_ci, ymax=.upper_ci),
              fill="red", alpha=0.4) +
  geom_line(aes(x=DATE_DEC, y=.fitted), colour="red") +
  labs(x="Date", y="Prediction") +
  theme_minimal()
[Figure: corrected (red) and uncorrected (blue) predicted nitrate trends, each with a confidence band.]

Here the red line and band indicate the corrected predictions with their uncertainty; blue shows the uncorrected predictions. There is not much difference in the mean estimates (the red line overlaps the blue), but there is a considerable reduction in uncertainty.

Conclusion#

We can now correct for the bias incurred by using a geometric mean estimator when we really want an arithmetic mean. To do this we use the clognorm family provided by the mgcvUtils package. Once we’ve fitted the model, we can use the standard predict method or, to save time, fitted_values_clognorm.