
Handling Censored Data

1 Data Below the Limit of Quantification

Oftentimes, we have observations that are below the limit of quantification (BLOQ), meaning that the assay has only been validated down to a certain value (the lower limit of quantification, LLOQ). A paper by Stuart Beal goes through 7 methods of handling this BLOQ data in NONMEM. A more recent paper introduces two methods that inflate the residual error for BLOQ values.

In my experience, I’ve seen M1 (drop any BLOQ values completely), M5 (replace BLOQ values with LLOQ/2), and M6 (in a series of consecutive BLOQ values, impute the BLOQ value closest to the maximum with LLOQ/2 and drop the rest) most commonly in the pharmacometrics world. M3 (treat the BLOQ values as left-censored data) is very close to being technically correct (it is technically correct for lognormal error models) and is the gold standard, producing less-biased results than the above imputation methods, but it often has trouble converging with optimization methods. M4 (treat the BLOQ values as left-censored data and truncated below at 0) is the most technically correct, but I’ve never seen a single usage in practice. I tried it in NONMEM once, and it was a disaster.
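
To make these imputation methods concrete, here is a small made-up example: suppose the LLOQ is 1 and a subject’s concentrations in the terminal phase are 8.2, 3.1, BLOQ, BLOQ, BLOQ. With M1, the three BLOQ records are simply dropped. With M5, each of the three is replaced with LLOQ/2 = 0.5. With M6, only the BLOQ value closest to the maximum (the first of the three here) is replaced with 0.5, and the other two are dropped.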

As mentioned, M3 and M4 suffer from numerical issues with optimization methods. However, there are no such issues when performing MCMC sampling in Stan, so I write all of my models to use M4 if the error model is a distribution that is typically unbounded (Gaussian, Student’s-t, …), and M3 if the error model is naturally bounded below at 0 (lognormal - M3 and M4 are equivalent in this case).
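
For reference, all of the Stan snippets below assume data along these lines. This is just a sketch - the names n, y, bloq, and lloq match the code later in this post, but the exact types and bounds are my own choices:

data{
  
  int<lower = 1> n;                         // number of observations
  array[n] real y;                          // concentrations (the value in BLOQ rows is never used)
  array[n] int<lower = 0, upper = 1> bloq;  // 1 = observation is BLOQ, 0 = observed
  vector<lower = 0>[n] lloq;                // LLOQ for each observation
  
}

The mean mu and the residual standard deviation sigma used below would come from the parameters and transformed parameters blocks as usual.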

2 Treat the BLOQ values as Left-Censored Data (M3)

Instead of tossing out the BLOQ data (M1) or assigning them some arbitrary value (M5-M7), we should keep them in the data set and treat them as left-censored data. This means that the likelihood contribution for observation $y_{ij}$ is calculated differently for observed values than for BLOQ values:

$$
\begin{aligned}
\text{observed data:} &\quad f(y_{ij} \mid \theta_i, \sigma, t_{ij}) \\
\text{BLOQ data:} &\quad F(LLOQ \mid \theta_i, \sigma, t_{ij})
\end{aligned}
$$

where $f(y_{ij} \mid \theta_i, \sigma, t_{ij})$ is the density (pdf) and $F(LLOQ \mid \theta_i, \sigma, t_{ij}) = P(c_{ij} \le LLOQ \mid \theta_i, \sigma, t_{ij})$ is the cumulative distribution function (cdf).
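
On the log scale (which is the scale Stan actually works on), the total likelihood contribution for individual $i$ is then

$$
\log L_i = \sum_{j\,:\,\text{observed}} \log f(y_{ij} \mid \theta_i, \sigma, t_{ij}) \; + \sum_{j\,:\,\text{BLOQ}} \log F(LLOQ \mid \theta_i, \sigma, t_{ij}),
$$

and this sum is exactly what the target += statements in the code below accumulate.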

2.1 Stan Code Example for M3

I’ve mostly written the Stan models on this site to fit with within-chain parallelization, but I’ll demonstrate the concept here with code that doesn’t use it (a sketch of where the same logic goes with within-chain parallelization is at the end of this section).

First, let’s imagine something simple (like in the simple linear regression example) where there are no BLOQs, and we can write the likelihood just as we would write it on paper:

model{ 
  
  // Priors
  ...
  
  // Likelihood
  y ~ normal(mu, sigma);

}

or we could equivalently incorporate the likelihood by using the target += syntax:

model{ 
  
  // Priors
  ...
  
  // Likelihood
  target += normal_lpdf(y | mu, sigma);

}

The above likelihood is vectorized. We can equivalently write it with a for loop:

model{ 
  
  // Priors
  ...
  
  // Likelihood
  for(i in 1:n){
    target += normal_lpdf(y[i] | mu[i], sigma);
  }
  
}

Now, if we have BLOQs, we just use the target += syntax and write the likelihood contribution for each observation, taking into account whether or not it is BLOQ:

model{ 
  
  // Priors
  ...
  
  // Likelihood
  for(i in 1:n){
    if(bloq[i] == 1){
      target += normal_lcdf(lloq[i] | mu[i], sigma);
    }else{
      target += normal_lpdf(y[i] | mu[i], sigma);
    }
  } 

}

These lines implement the math (on the log scale, since Stan calculates the log-posterior).

$$
\texttt{normal\_lcdf}(x) = \log(F(x)), \qquad \texttt{normal\_lpdf}(x) = \log(f(x))
$$

where f() and F() are the normal density and cumulative distribution functions, respectively.
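
Since I mentioned within-chain parallelization earlier, here is a rough sketch of where this same M3 logic would live in that case - inside a partial_sum function in the functions block that gets passed to reduce_sum(). The function name, the argument list, and grainsize are assumptions for illustration; only the censoring logic carries over directly:

functions{
  
  // sketch of a partial_sum function for reduce_sum(); the data are sliced over y
  real partial_sum(array[] real y_slice, int start, int end,
                   vector mu, real sigma,
                   array[] int bloq, vector lloq){
    
    real lp = 0;
    for(i in start:end){
      if(bloq[i] == 1){
        lp += normal_lcdf(lloq[i] | mu[i], sigma);
      }else{
        lp += normal_lpdf(y_slice[i - start + 1] | mu[i], sigma);
      }
    }
    return lp;
    
  }
  
}

...

model{
  
  // Priors
  ...
  
  // Likelihood (grainsize would be passed in through the data block)
  target += reduce_sum(partial_sum, y, grainsize, mu, sigma, bloq, lloq);
  
}

For M4, the body of the if/else would simply be swapped for the truncated versions in the next section.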

3 Treat the BLOQ values as Left-Censored Data and Truncated Below at 0 (M4)

We know that drug concentrations cannot be $< 0$, but the normal distribution has support over $(-\infty, \infty)$, so we will assume a normal distribution truncated below at 0. This will have the effect of limiting the support of our assumed distribution to $(0, \infty)$. Since we’re assuming a truncated distribution, we need to adjust the likelihood contributions of our data:

$$
\begin{aligned}
\text{observed data:} &\quad \frac{f(y_{ij} \mid \theta_i, \sigma, t_{ij})}{1 - F(0 \mid \theta_i, \sigma, t_{ij})} \\
\text{BLOQ data:} &\quad \frac{F(LLOQ \mid \theta_i, \sigma, t_{ij}) - F(0 \mid \theta_i, \sigma, t_{ij})}{1 - F(0 \mid \theta_i, \sigma, t_{ij})}
\end{aligned}
$$
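
Taking logs, the contribution of each observation becomes

$$
\begin{aligned}
\text{observed data:} &\quad \log f(y_{ij} \mid \theta_i, \sigma, t_{ij}) - \log\big(1 - F(0 \mid \theta_i, \sigma, t_{ij})\big) \\
\text{BLOQ data:} &\quad \log\big(F(LLOQ \mid \theta_i, \sigma, t_{ij}) - F(0 \mid \theta_i, \sigma, t_{ij})\big) - \log\big(1 - F(0 \mid \theta_i, \sigma, t_{ij})\big)
\end{aligned}
$$

which is what the Stan code below computes term by term.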

3.1 Stan Code Example for M4

Now that you’ve seen the steps to go from the ~ operator to target += to a for loop in the M3 example above, we’ll just go ahead and write the M4 implementation:

model{ 
  
  // Priors
  ...
  
  // Likelihood
  for(i in 1:n){
    if(bloq[i] == 1){
      target += log_diff_exp(normal_lcdf(lloq[i] | mu[i], sigma),
                             normal_lcdf(0.0 | mu[i], sigma)) -
                normal_lccdf(0.0 | mu[i], sigma);
    }else{
      target += normal_lpdf(y[i] | mu[i], sigma) -
                normal_lccdf(0.0 | mu[i], sigma);
    }
  } 

}

These lines implement the math (on the log scale, since Stan calculates the log-posterior).

$$
\begin{aligned}
\texttt{log\_diff\_exp}(\texttt{normal\_lcdf}(lloq),\ \texttt{normal\_lcdf}(0)) &= \log(F(lloq) - F(0)) \\
\texttt{normal\_lcdf}(x) &= \log(F(x)) \\
\texttt{normal\_lccdf}(x) &= \log(1 - F(x)) \\
\texttt{normal\_lpdf}(x) &= \log(f(x))
\end{aligned}
$$

where f() and F() are the normal density and cumulative distribution functions, respectively.


Footnotes

  1. The code is actually very similar. It’s more a matter of where it goes - if there’s no within-chain parallelization, it goes in the model block. If there is, then it goes into your partial_sum function in the functions block.

  2. Having lloq[i] allows for a dataset where some observations have different LLOQs than others.

  3. For more information, see the Stan documentation for normal_lcdf() and normal_lpdf().

  4. A model that assumes log-normal error has support over $(0, \infty)$, so truncation is not an issue. In that case, M3 and M4 are equivalent. Mathematically, you can see this by noting that $F(0 \mid \theta_i, \sigma, t_{ij}) = 0$.

  5. For observed data with this truncated distribution, we need to “correct” the density so it integrates to 1. Division by $1 - F(0 \mid \theta_i, \sigma, t_{ij})$ has this effect. For the censored data, the numerator is similar to the M3 method, but we must also account for the fact that the value must be $> 0$, hence $P(0 \le y_{ij} \le LLOQ) = F(LLOQ) - F(0)$. The denominator is corrected in the same manner as for the observed data.

  6. For more information, see the Stan documentation for log_diff_exp(), normal_lcdf(), normal_lccdf(), and normal_lpdf().