You may be asked to determine statistical confidence in your industrial hygiene assessment and sampling results. So what do you do if some or many of your results are less than the analytical detection limit (<LOD)? What value do you use? To get started, you would conduct ‘data imputation’ which is essentially replacing missing data with a specific value. And how is that done – how do you know which value to use? Let’s discuss.
There are several methods available, and often used, for statistically analyzing an industrial hygiene dataset that contains values reported as less than the limit of detection. It should be noted that a dataset in which a relatively large number of the values, or the majority of them, are reported as less than the detection limit (<LOD) is referred to as left-censored data.
You may use a value equivalent to half of the limit of detection (LOD/2), or in other words, the limit of detection divided by two. A drawback to this method is that while the limit of detection for the mass of an analyte remains constant from one sample to the next, the sample volume does not, and therefore the detection limit expressed as a concentration (mass divided by volume) may differ from sample to sample.
For example, if you have received laboratory results for Samples 1, 2, 3 and 4 of <0.025 mg/m3, <0.026 mg/m3, 0.055 mg/m3 and <0.024 mg/m3, respectively, you would use the values of 0.0125 mg/m3, 0.013 mg/m3, 0.055 mg/m3 and 0.012 mg/m3 in your calculations (LOD/2).
A second approach is to use a value equivalent to the detection limit divided by the square root of 2 (LOD/SQRT(2)). The results obtained will be similar to those above. If we received the same sample results from the laboratory as above, the values used in statistical calculations for this method would be 0.0177 mg/m3, 0.0184 mg/m3, 0.055 mg/m3 and 0.0170 mg/m3. These values will obviously tend to be slightly higher than those of the method above, as the square root of 2 is roughly 1.4142.
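Both substitution methods can be sketched in a few lines of Python. The values below are the hypothetical laboratory results from the example above; the data layout and function name are illustrative assumptions, not part of any standard.

```python
import math

# Hypothetical laboratory results from the example above (mg/m3).
# None marks a censored (<LOD) value; each entry pairs the measured
# value (or None) with that sample's reported detection limit.
results = [(None, 0.025), (None, 0.026), (0.055, None), (None, 0.024)]

def substitute(results, divisor):
    """Replace censored values with LOD / divisor; keep detected values."""
    return [lod / divisor if value is None else value
            for value, lod in results]

lod_half = substitute(results, 2.0)            # LOD/2 method
lod_sqrt2 = substitute(results, math.sqrt(2))  # LOD/sqrt(2) method

print([round(v, 4) for v in lod_half])   # [0.0125, 0.013, 0.055, 0.012]
print([round(v, 4) for v in lod_sqrt2])  # [0.0177, 0.0184, 0.055, 0.017]
```

Keeping the detection limit attached to each sample, rather than assuming a single LOD for the whole set, accommodates the varying sample volumes noted earlier.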
The U.S. Environmental Protection Agency (EPA) has provided guidance as to which of the two methods to follow for given degrees of variability in the data set in the document Guidelines for Statistical Analysis of Occupational Exposure Data.
In the document, the following guidelines are provided, referring to analysis of non-detect data:
The two methods are:
1) If the geometric standard deviation of the monitoring data set is less than 3.0, non-detectable values should be replaced by the limit of detection divided by the square root of two (L/√2).
2) If the data are highly skewed, with a geometric standard deviation of 3.0 or greater, nondetectable values should be replaced by half the detection limit (L/2).
If 50% or more of the monitoring data are non-detectable, substitution of any value for these data will result in biased estimates of the geometric mean and the geometric standard deviation. If it is necessary to calculate statistics using data sets with such a large proportion of non-detectable data, the potential biases introduced by these calculations should be described when presenting the results of the analyses.
By extension, knowing which method to follow requires an idea of the variability of the dataset and calculation of the geometric standard deviation.
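As a sketch of that decision rule, the following Python computes the geometric mean and geometric standard deviation and applies the EPA cutoff quoted above. The function names are illustrative assumptions; note the chicken-and-egg element in practice, since the GSD must first be estimated (typically from a preliminary substitution) before the final substitution divisor is chosen.

```python
import math

def geometric_stats(values):
    """Geometric mean and geometric standard deviation of positive values."""
    logs = [math.log(v) for v in values]
    n = len(logs)
    mean = sum(logs) / n
    # Sample standard deviation of the log-transformed data
    sd = math.sqrt(sum((x - mean) ** 2 for x in logs) / (n - 1))
    return math.exp(mean), math.exp(sd)

def epa_substitution_divisor(values):
    """EPA rule of thumb: divide the LOD by sqrt(2) when GSD < 3.0,
    and by 2 when the data are highly skewed (GSD >= 3.0)."""
    _, gsd = geometric_stats(values)
    return math.sqrt(2) if gsd < 3.0 else 2.0
```

For example, the dataset [1.0, 2.0, 4.0] has a geometric mean of 2.0 and a GSD of 2.0, so the rule selects the LOD/√2 substitution.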
The American Industrial Hygiene Association (AIHA) has produced a substantial guideline text on conducting exposure assessments, titled Strategies for Assessing and Managing Occupational Exposures. In the text, they provide direction as to which method to follow and similar to the EPA, the method used depends on the variability of the dataset. The AIHA text recommends the following:
For the sake of discussion, four levels of censoring can be defined:
• low — < 20% censored
• medium — 20% to 50% censored
• high — 50% to 80% censored
• severe — 80% to 100% censored.
If less than 20% of the data are censored, the simple substitution methods can be used without much error for generating estimates of the GM and GSD. For 20 to 50% censored data, the log probit regression method will generally result in more accurate parameter estimates provided the sample size is fairly high (i.e., n > 15). The MLE method can be considered a nearly universal method, suitable for low to highly-censored datasets and all but the smallest sample sizes (e.g., n < 10). But as Hornung and Reed(1) warned, parameters estimated from highly or severely censored datasets should not be expected to be very accurate, and should be used with caution.
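The censoring bands above translate directly into a small helper. This is an illustrative sketch, with the thresholds taken from the AIHA levels quoted above.

```python
def censoring_level(n_censored, n_total):
    """Classify the degree of censoring using the AIHA bands."""
    fraction = n_censored / n_total
    if fraction < 0.20:
        return "low"       # simple substitution is acceptable
    elif fraction <= 0.50:
        return "medium"    # log probit regression preferred if n > 15
    elif fraction <= 0.80:
        return "high"      # MLE; treat estimates with caution
    return "severe"        # estimates unlikely to be accurate
```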
The log probit regression analysis can help estimate the degree of variability for censored datasets. Maximum likelihood estimation (MLE) techniques are also beneficial in estimating the variability of datasets but require more involved calculation techniques than simple substitution.
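For those comfortable with the more involved calculations, a minimal MLE sketch for left-censored lognormal data can be written with NumPy and SciPy (assuming both are available). Detected results contribute the lognormal density to the likelihood, while each <LOD result contributes the probability of falling below its detection limit. The function name and data layout are assumptions for illustration, not a standard implementation.

```python
import numpy as np
from scipy import stats, optimize

def censored_lognormal_mle(detects, lods):
    """MLE of (GM, GSD) for left-censored lognormal exposure data.

    detects: measured concentrations at or above the LOD
    lods: detection limits of the censored (<LOD) samples
    """
    log_det = np.log(np.asarray(detects, dtype=float))
    log_lod = np.log(np.asarray(lods, dtype=float))

    def neg_log_likelihood(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)  # parameterize so sigma stays positive
        # Detected values: normal density on the log scale.
        ll = stats.norm.logpdf(log_det, mu, sigma).sum()
        # Censored values: probability of falling below the LOD.
        ll += stats.norm.logcdf(log_lod, mu, sigma).sum()
        return -ll

    start = [log_det.mean(), np.log(log_det.std() + 0.1)]
    res = optimize.minimize(neg_log_likelihood, start, method="Nelder-Mead")
    mu, sigma = res.x[0], np.exp(res.x[1])
    return np.exp(mu), np.exp(sigma)  # geometric mean, geometric SD
```

Unlike simple substitution, this approach uses the information that a censored result lies somewhere below its detection limit, rather than pinning it to a single assumed value.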
If the above methods are not sufficient, it may also be possible to exclude the censored data from analysis, to use a value equivalent to the detection limit itself, or to use a value of zero in calculations. There is less support for these approaches in the literature, so a clear rationale should be documented if one of them is used.
Each guideline on the statistical analysis of censored data carries the same warning: heavily censored datasets yield rough estimates, and the conclusions of any analysis should be taken with a grain of salt. The methods above are, at bottom, practical ways of working with data when you find yourself with a large proportion of censored results.
Care should also be taken when comparing the detection limit against the exposure limit. If the analytical detection limit is likely to be close to, or above, the exposure limit or action limit, collect adequate sample volumes and maximize sample flow rates where possible so that the limit of detection falls well below the exposure limit. If you have already collected the samples before considering how the results will compare to the exposure limit, you may be unable to draw conclusions about potential worker overexposure when results are reported as <LOD and the detection limit exceeds the exposure limit.