Journal Description
Stats is an international, peer-reviewed, open access journal on statistical science, published quarterly online by MDPI. The journal focuses on methodological and theoretical papers in statistics, probability, and stochastic processes, and on innovative applications of statistics in all scientific disciplines, including the biological and biomedical sciences, medicine, business, economics and social sciences, physics, data science, and engineering.
- Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
- High Visibility: indexed within ESCI (Web of Science), Scopus, RePEc, and other databases.
- Rapid Publication: manuscripts are peer-reviewed, with a first decision provided to authors approximately 15.8 days after submission; acceptance to publication takes 3.8 days (median values for papers published in this journal in the second half of 2023).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
Impact Factor: 1.3 (2022); 5-Year Impact Factor: 1.2 (2022)
Latest Articles
Residual Analysis for Poisson-Exponentiated Weibull Regression Models with Cure Fraction
Stats 2024, 7(2), 492-508; https://doi.org/10.3390/stats7020030 - 20 May 2024
Abstract
The use of cure-rate survival models has grown in recent years. Even so, proposals for assessing the goodness of fit of these models have been relatively scarce. Residual analysis, however, can be used to check the adequacy of a fitted regression model. In this context, we provide Cox–Snell residuals for Poisson-exponentiated Weibull regression with cure fraction. We developed several simulations under different scenarios to study the distributions of these residuals. They were applied to a melanoma dataset for illustrative purposes.
Full article
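The Poisson-exponentiated Weibull machinery above is specific to the paper, but the Cox–Snell residual check itself is generic: r = −log Ŝ(t) should behave like a unit-exponential sample when the fitted model is adequate. A minimal sketch of that check, with a plain Weibull fit standing in for the authors' cure-fraction model and no censoring:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
t = 2.0 * rng.weibull(1.5, size=200)          # toy survival times, no censoring

# Stand-in model: plain Weibull MLE (not the paper's cure-fraction model).
c, _, scale = stats.weibull_min.fit(t, floc=0)
surv = stats.weibull_min.sf(t, c, loc=0, scale=scale)

# Cox-Snell residuals: r = -log S_hat(t); ~ Exp(1) under a correct model.
r = -np.log(surv)
ks = stats.kstest(r, "expon")
print(f"residual mean (should be near 1): {r.mean():.3f}; KS p-value: {ks.pvalue:.3f}")
```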
Open Access Case Report
Testing for Level–Degree Interaction Effects in Two-Factor Fixed-Effects ANOVA When the Levels of Only One Factor Are Ordered
by J. C. W. Rayner and G. C. Livingston, Jr.
Stats 2024, 7(2), 481-491; https://doi.org/10.3390/stats7020029 - 15 May 2024
Abstract
In testing for main effects, the use of orthogonal contrasts for balanced designs with the factor levels not ordered is well known. Here, we consider two-factor fixed-effects ANOVA with the levels of one factor ordered and one not ordered. The objective is to extend the idea of decomposing the main effect to decomposing the interaction. This is achieved by defining level–degree coefficients and testing if they are zero using permutation testing. These tests give clear insights into what may be causing a significant interaction, even for the unbalanced model.
Full article
(This article belongs to the Section Statistical Methods)
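The orthogonal-contrast starting point is easy to make concrete. A sketch constructing orthonormal polynomial contrasts (linear, quadratic, cubic) for equally spaced ordered levels via a QR factorization — the standard construction, not the authors' level–degree coefficients or their permutation test:

```python
import numpy as np

def poly_contrasts(k):
    """Orthonormal polynomial contrasts for k equally spaced ordered levels."""
    x = np.arange(1, k + 1, dtype=float)
    V = np.vander(x, k, increasing=True)   # columns 1, x, x^2, ...
    Q, _ = np.linalg.qr(V)                 # orthonormalize the columns
    return Q[:, 1:]                        # drop the constant; linear, quadratic, ...

C = poly_contrasts(4)
print(np.round(C.T @ C, 10))        # identity matrix: contrasts are orthonormal
print(np.round(C.sum(axis=0), 10))  # each contrast sums to zero
```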
Open Access Article
Multivariate Time Series Change-Point Detection with a Novel Pearson-like Scaled Bregman Divergence
by Tong Si, Yunge Wang, Lingling Zhang, Evan Richmond, Tae-Hyuk Ahn and Haijun Gong
Stats 2024, 7(2), 462-480; https://doi.org/10.3390/stats7020028 - 13 May 2024
Abstract
Change-point detection (CPD) is a challenging problem with applications across various real-world domains. The primary objective of CPD is to identify specific time points where the underlying system undergoes transitions between different states, each characterized by its distinct data distribution. Precise identification of change points in time series omics data can provide insights into the dynamic and temporal characteristics inherent to complex biological systems. Many change-point detection methods have traditionally focused on the direct estimation of data distributions. However, these approaches become unrealistic in high-dimensional data analysis. Density ratio methods have emerged as promising approaches for change-point detection, since estimating density ratios is easier than directly estimating the individual densities. Nevertheless, the divergence measures used in these methods may suffer from numerical instability during computation. Additionally, the most popular α-relative Pearson divergence does not measure the dissimilarity between the two data distributions themselves, but between one distribution and a mixture of the two. To overcome the limitations of existing density-ratio-based methods, we propose a novel approach called the Pearson-like scaled-Bregman-divergence-based (PLsBD) density ratio estimation method for change-point detection. Our theoretical studies derive an analytical expression for the Pearson-like scaled Bregman divergence using a mixture measure. We integrate the PLsBD with a kernel regression model and apply a random sampling strategy to identify change points in both synthetic data and real-world high-dimensional genomics data of Drosophila. Our PLsBD method demonstrates superior performance compared to many other change-point detection methods.
Full article
(This article belongs to the Section Statistical Methods)
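For context, the α-relative Pearson divergence the abstract refers to is, in the notation usual in the density-ratio literature (assumed here, not quoted from the paper),

```latex
\mathrm{PE}_\alpha(P \,\|\, Q)
  = \frac{1}{2}\int q_\alpha(x)\left(\frac{p(x)}{q_\alpha(x)} - 1\right)^{2}\mathrm{d}x,
\qquad
q_\alpha(x) = \alpha\, p(x) + (1-\alpha)\, q(x).
```

For α > 0 it compares p not with q itself but with the mixture q_α, which is exactly the limitation the PLsBD construction is designed to overcome.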
Open Access Article
Multivariate and Matrix-Variate Logistic Models in the Real and Complex Domains
by A. M. Mathai
Stats 2024, 7(2), 445-461; https://doi.org/10.3390/stats7020027 - 11 May 2024
Abstract
Several extensions of the basic scalar variable logistic density to the multivariate and matrix-variate cases, in the real and complex domains, are given, where the extended forms end up in extended zeta functions. Several cases of multivariate and matrix-variate Bayesian procedures, in the real and complex domains, are also given. It is pointed out that Gaussian- and Wishart-based matrix-variate distributions in the complex domain have a range of applications in multi-look data from radar and sonar. It is hoped that the distributions derived in this paper will be highly useful in such applications in physics, engineering, statistics and communication problems because, in the real scalar case, a logistic model is seen to be more appropriate than a Gaussian model in many industrial applications. Hence, logistic-based multivariate and matrix-variate distributions, especially in the complex domain, are expected to perform better where Gaussian- and Wishart-based distributions are currently used.
Full article
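For reference, the basic scalar logistic density from which these extensions start is, in standard location–scale notation (μ and b assumed here, not quoted from the paper),

```latex
f(x) = \frac{e^{-(x-\mu)/b}}{b\left(1 + e^{-(x-\mu)/b}\right)^{2}},
\qquad -\infty < x < \infty,\; b > 0.
```

Its tails are heavier than the Gaussian's, which is the property behind the closing claim of the abstract.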
Open Access Brief Report
Bayesian Inference for Multiple Datasets
by Renata Retkute, William Thurston and Christopher A. Gilligan
Stats 2024, 7(2), 434-444; https://doi.org/10.3390/stats7020026 - 10 May 2024
Abstract
Estimating parameters for multiple datasets can be time consuming, especially when the number of datasets is large. One solution is to sample from multiple datasets simultaneously using Bayesian methods such as adaptive multiple importance sampling (AMIS). Here, we use the AMIS approach to fit a von Mises distribution to multiple datasets for wind trajectories derived from a Lagrangian Particle Dispersion Model driven from 3D meteorological data. A posterior distribution of parameters can help to characterise the uncertainties in wind trajectories in a form that can be used as inputs for predictive models of wind-dispersed insect pests and the pathogens of agricultural crops for use in evaluating risk and in planning mitigation actions. The novelty of our study is in testing the performance of the method on a very large number of datasets (>11,000). Our results show that AMIS can significantly improve the efficiency of parameter inference for multiple datasets.
Full article
(This article belongs to the Section Bayesian Methods)
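The AMIS sampler itself is the paper's contribution; a minimal non-Bayesian baseline simply fits each dataset separately by maximum likelihood, which already shows the shape of the many-datasets problem. A sketch with scipy on toy angle data (the datasets and parameter values are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Toy stand-in for many wind-direction datasets (angles in radians).
datasets = [stats.vonmises.rvs(kappa=4.0, loc=0.8, size=150, random_state=rng)
            for _ in range(100)]

fits = []
for angles in datasets:
    # fscale=1: the von Mises scale is not a free parameter on the circle.
    kappa, loc, _ = stats.vonmises.fit(angles, fscale=1)
    fits.append((kappa, loc))

kappas, locs = np.array(fits).T
print(f"median kappa = {np.median(kappas):.2f}, median direction = {np.median(locs):.2f}")
```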
Open Access Article
Contrastive Learning Framework for Bitcoin Crash Prediction
by Zhaoyan Liu, Min Shu and Wei Zhu
Stats 2024, 7(2), 402-433; https://doi.org/10.3390/stats7020025 - 8 May 2024
Abstract
Due to spectacular gains during periods of rapid price increase and unpredictably large drops, Bitcoin has become a popular emergent asset class over the past few years. In this paper, we are interested in predicting crashes of the Bitcoin market. To tackle this task, we propose a framework for deep learning time series classification based on contrastive learning. The proposed framework is evaluated against six machine learning (ML) and deep learning (DL) baseline models, and outperforms them by 15.8% in balanced accuracy. Thus, we conclude that the contrastive learning strategy significantly enhances the model’s ability to extract informative representations, and our proposed framework performs well in predicting Bitcoin crashes.
Full article
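The paper's network architecture is not described in the abstract; the generic ingredient, though, is a contrastive objective of the InfoNCE/NT-Xent type, sketched here in plain NumPy (the temperature and embedding sizes are invented, and this is not the authors' loss verbatim):

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss; rows of z1 and z2 embed two views of the same series."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                    # scaled cosine similarities
    # Positive pairs sit on the diagonal; softmax cross-entropy against them.
    log_denom = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(log_denom - np.diag(sim)))

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 16))
print(info_nce(z + 0.05 * rng.normal(size=z.shape), z))  # low loss: views agree
print(info_nce(rng.normal(size=(32, 16)), z))            # high loss: unrelated
```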
Open Access Article
On Non-Occurrence of the Inspection Paradox
by Diana Rauwolf and Udo Kamps
Stats 2024, 7(2), 389-401; https://doi.org/10.3390/stats7020024 - 24 Apr 2024
Abstract
The well-known inspection paradox or waiting time paradox states that, in a renewal process, the inspection interval is stochastically larger than a common interarrival time having a distribution function F, where the inspection interval is given by the particular interarrival time containing the specified time point of process inspection. The inspection paradox may also be expressed in terms of expectations, where the order is strict, in general. A renewal process can be utilized to describe the arrivals of vehicles, customers, or claims, for example. As the inspection time may also be considered a random variable T with a left-continuous distribution function G independent of the renewal process, the question arises as to whether the inspection paradox inevitably occurs in this general situation, apart from in some marginal cases with respect to F and G. For a random inspection time T, it is seen that non-trivial choices lead to non-occurrence of the paradox. In this paper, a complete characterization of the non-occurrence of the inspection paradox is given with respect to G. Several examples and related assertions are shown, including the deterministic time situation.
Full article
(This article belongs to the Section Applied Stochastic Models)
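The paradox in the opening sentence is easy to verify numerically: for a renewal process, the interval covering a fixed inspection time is longer on average than a typical interarrival time. A Monte Carlo sketch with exponential interarrivals (for which the covering interval has mean about 2, versus 1 for a common interarrival time):

```python
import numpy as np

rng = np.random.default_rng(0)
t_inspect, n_runs = 50.0, 20_000
covering = np.empty(n_runs)
for i in range(n_runs):
    arrivals = np.cumsum(rng.exponential(1.0, size=200))  # renewal epochs, mean gap 1
    k = np.searchsorted(arrivals, t_inspect)              # first arrival after t
    left = arrivals[k - 1] if k > 0 else 0.0
    covering[i] = arrivals[k] - left                      # interval containing t

print(f"mean covering interval = {covering.mean():.3f} (common interarrival mean = 1)")
```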
Open Access Article
New Goodness-of-Fit Tests for the Kumaraswamy Distribution
by David E. Giles
Stats 2024, 7(2), 373-388; https://doi.org/10.3390/stats7020023 - 22 Apr 2024
Abstract
The two-parameter distribution known as the Kumaraswamy distribution is a very flexible alternative to the beta distribution with the same (0,1) support. Originally proposed in the field of hydrology, it has subsequently received a good deal of positive attention in both the theoretical and applied statistics literatures. Interestingly, the problem of testing formally for the appropriateness of the Kumaraswamy distribution appears to have received little or no attention to date. To fill this gap, in this paper, we apply a “biased transformation” methodology to several standard goodness-of-fit tests based on the empirical distribution function. A simulation study reveals that these (modified) tests perform well in the context of the Kumaraswamy distribution, in terms of both their low size distortion and respectable power. In particular, the “biased transformation” Anderson–Darling test dominates the other tests that are considered.
Full article
(This article belongs to the Section Statistical Methods)
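The "biased transformation" modification is the paper's contribution and is not reproduced here; the unmodified EDF route it starts from can be sketched directly, using the closed-form Kumaraswamy CDF F(x) = 1 − (1 − x^a)^b: fit (a, b) by maximum likelihood and check the probability-integral-transform values (a plain Kolmogorov–Smirnov check stands in below for the paper's Anderson–Darling variant):

```python
import numpy as np
from scipy import optimize, stats

def kuma_logpdf(x, a, b):
    """Log-density of the Kumaraswamy(a, b) distribution on (0, 1)."""
    return (np.log(a) + np.log(b) + (a - 1) * np.log(x)
            + (b - 1) * np.log1p(-x**a))

rng = np.random.default_rng(0)
u = rng.uniform(size=300)
x = (1 - (1 - u) ** (1 / 3.0)) ** (1 / 2.0)   # Kumaraswamy(2, 3) via inverse CDF

nll = lambda p: -np.sum(kuma_logpdf(x, *p))
a_hat, b_hat = optimize.minimize(nll, x0=[1.0, 1.0],
                                 bounds=[(1e-3, None)] * 2).x
pit = 1 - (1 - x**a_hat) ** b_hat             # fitted CDF values, ~U(0,1) if adequate
print(stats.kstest(pit, "uniform"))
```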
Open Access Article
Bayesian Mediation Analysis with an Application to Explore Racial Disparities in the Diagnostic Age of Breast Cancer
by Wentao Cao, Joseph Hagan and Qingzhao Yu
Stats 2024, 7(2), 361-372; https://doi.org/10.3390/stats7020022 - 19 Apr 2024
Abstract
A mediation effect refers to the effect transmitted by a mediator intervening in the relationship between an exposure variable and a response variable. Mediation analysis is widely used to identify significant mediators and to make inferences on their effects. The Bayesian method allows researchers to incorporate prior information from previous knowledge into the analysis, deal with the hierarchical structure of variables, and estimate the quantities of interest from the posterior distributions. This paper proposes three Bayesian mediation analysis methods to make inferences on mediation effects: (1) the function of coefficients method; (2) the product of partial difference method; and (3) the re-sampling method. We apply these three methods to explore racial disparities in the diagnostic age of breast cancer patients in Louisiana. We found that African American (AA) patients are diagnosed on average 4.37 years younger than Caucasian (CA) patients (57.40 versus 61.77 years, p < 0.0001). We also found that the racial disparity can be explained by patients’ insurance (12.90%), marital status (17.17%), cancer stage (3.27%), and residential environmental factors, including the percent of the population under age 18 (3.07%) and the environmental factor of intersection density (9.02%).
Full article
(This article belongs to the Section Bayesian Methods)
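The first of the three methods has a familiar skeleton: an indirect (mediation) effect through a mediator is a function of the exposure→mediator coefficient and the mediator→outcome coefficient, e.g. their product in linear models. A sketch with simulated coefficient draws standing in for posterior samples (the data, effect sizes, and posterior spreads are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
exposure = rng.binomial(1, 0.5, n)                  # e.g., group indicator
mediator = 0.8 * exposure + rng.normal(size=n)      # exposure -> mediator (a path)
outcome = 1.5 * mediator + 0.4 * exposure + rng.normal(size=n)  # b + direct paths

a_hat = np.polyfit(exposure, mediator, 1)[0]        # slope of mediator on exposure
X = np.column_stack([np.ones(n), mediator, exposure])
b_hat = np.linalg.lstsq(X, outcome, rcond=None)[0][1]  # mediator coef, exposure held

# Crude "posterior" stand-in: normal draws around the point estimates.
indirect = rng.normal(a_hat, 0.05, 5_000) * rng.normal(b_hat, 0.05, 5_000)
print(np.percentile(indirect, [2.5, 50, 97.5]))     # credible-style interval for a*b
```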
Open Access Article
Combined Permutation Tests for Pairwise Comparison of Scale Parameters Using Deviances
by Scott J. Richter and Melinda H. McCann
Stats 2024, 7(2), 350-360; https://doi.org/10.3390/stats7020021 - 28 Mar 2024
Abstract
Nonparametric combinations of permutation tests for pairwise comparison of scale parameters, based on deviances, are examined. Permutation tests for comparing two or more groups based on the ratio of deviances have been investigated, and a procedure based on Higgins’ RMD statistic was found to perform well, but two other tests were sometimes more powerful. Thus, combinations of these tests are investigated. A simulation study shows a combined test can be more powerful than any single test.
Full article
(This article belongs to the Section Statistical Methods)
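A two-sample sketch of the ingredient being combined: compute each observation's deviance from its own group median (so location effects drop out), then permute the pooled deviances and compare the ratio of group mean deviances — in the spirit of, though not necessarily identical to, Higgins' RMD statistic:

```python
import numpy as np

def perm_scale_test(x, y, n_perm=5_000, seed=0):
    """Permutation p-value for a scale difference via the ratio of mean deviances."""
    rng = np.random.default_rng(seed)
    d = np.concatenate([np.abs(x - np.median(x)), np.abs(y - np.median(y))])
    nx = len(x)
    ratio = lambda d1, d2: max(d1.mean(), d2.mean()) / min(d1.mean(), d2.mean())
    obs = ratio(d[:nx], d[nx:])
    count = sum(ratio(*np.split(rng.permutation(d), [nx])) >= obs
                for _ in range(n_perm))
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
print(perm_scale_test(rng.normal(0, 1, 30), rng.normal(0, 2.5, 30)))  # unequal scales
```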
Open Access Article
A Note on Simultaneous Confidence Intervals for Direct, Indirect and Synthetic Estimators
by Christophe Quentin Valvason and Stefan Sperlich
Stats 2024, 7(1), 333-349; https://doi.org/10.3390/stats7010020 - 20 Mar 2024
Abstract
Direct, indirect and synthetic estimators have a long history in official statistics. While model-based or model-assisted approaches have become very popular, direct and indirect estimators remain the predominant standard and are therefore important tools in practice. This is mainly due to their simplicity, including low data requirements and assumptions, and straightforward inference. With the increasing use of domain estimates in policy, the demands on these tools have also increased. Today, they are frequently used for comparative statistics. This requires appropriate tools for simultaneous inference. We study devices for constructing simultaneous confidence intervals and show that simple tools like the Bonferroni correction can easily fail. In contrast, uniform inference based on max-type statistics, combined with bootstrap methods appropriate for finite populations, works reasonably well. We illustrate our methods with frequently applied estimators of totals and means.
Full article
(This article belongs to the Section Statistical Methods)
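The max-type idea in miniature: instead of widening each interval Bonferroni-style, calibrate one common critical value from the bootstrap distribution of the maximum studentized deviation across all domains. A sketch on invented toy domains (simple i.i.d. resampling, not the finite-population bootstrap the paper uses):

```python
import numpy as np

rng = np.random.default_rng(0)
domains = [rng.normal(mu, 1.0, size=40) for mu in (0.0, 0.5, 1.0, 1.5, 2.0)]
ses = [d.std(ddof=1) / np.sqrt(len(d)) for d in domains]

# Bootstrap the maximum absolute studentized deviation across domains.
B = 2_000
max_t = np.empty(B)
for b in range(B):
    t = [(rng.choice(d, len(d)).mean() - d.mean()) / se
         for d, se in zip(domains, ses)]
    max_t[b] = np.max(np.abs(t))

crit = np.quantile(max_t, 0.95)   # one common critical value for all intervals
print(f"simultaneous 95% CIs: mean ± {crit:.2f}·SE  (Bonferroni z would be 2.58)")
for d, se in zip(domains, ses):
    print(f"[{d.mean() - crit * se:.2f}, {d.mean() + crit * se:.2f}]")
```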
Open Access Article
The Flexible Gumbel Distribution: A New Model for Inference about the Mode
by Qingyang Liu, Xianzheng Huang and Haiming Zhou
Stats 2024, 7(1), 317-332; https://doi.org/10.3390/stats7010019 - 13 Mar 2024
Abstract
A new unimodal distribution family indexed via the mode and three other parameters is derived from a mixture of a Gumbel distribution for the maximum and a Gumbel distribution for the minimum. Properties of the proposed distribution are explored, including model identifiability and flexibility in capturing heavy-tailed data that exhibit different directions of skewness over a wide range. Both frequentist and Bayesian methods are developed to infer parameters in the new distribution. Simulation studies are conducted to demonstrate satisfactory performance of both methods. By fitting the proposed model to simulated data and data from an application in hydrology, it is shown that the proposed flexible distribution is especially suitable for data containing extreme values in either direction, with the mode being a location parameter of interest. Using the proposed unimodal distribution, one can easily formulate a regression model concerning the mode of a response given covariates. We apply this model to data from an application in criminology to reveal interesting data features that are obscured by outliers.
Full article
(This article belongs to the Special Issue Bayes and Empirical Bayes Inference)
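The building block is easy to write down with scipy: a two-component mixture of a max-Gumbel (gumbel_r) and a min-Gumbel (gumbel_l), which is what lets the family lean heavy-tailed in either direction. The shared location used below is a crude stand-in for the paper's mode-based indexing, not its exact parameterization:

```python
import numpy as np
from scipy import stats

def flexible_gumbel_pdf(x, w, loc, s1, s2):
    """Mixture of Gumbel-max and Gumbel-min densities with weight w."""
    return (w * stats.gumbel_r.pdf(x, loc=loc, scale=s1)
            + (1 - w) * stats.gumbel_l.pdf(x, loc=loc, scale=s2))

print(flexible_gumbel_pdf(np.linspace(-8, 8, 9), w=0.7, loc=0.0, s1=1.0, s2=2.0))

# Sampling: draw the component indicator, then that component's Gumbel.
rng = np.random.default_rng(0)
comp = rng.uniform(size=10_000) < 0.7
draws = np.where(comp,
                 stats.gumbel_r.rvs(scale=1.0, size=10_000, random_state=rng),
                 stats.gumbel_l.rvs(scale=2.0, size=10_000, random_state=rng))
print(draws.mean(), draws.std())
```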
Open Access Article
Wilcoxon-Type Control Charts Based on Multiple Scans
by Ioannis S. Triantafyllou
Stats 2024, 7(1), 301-316; https://doi.org/10.3390/stats7010018 - 7 Mar 2024
Abstract
In this article, we establish new distribution-free Shewhart-type control charts based on rank-sum statistics with signaling multiple-scans-type rules. More precisely, two Wilcoxon-type chart statistics are considered in order to formulate the decision rule of the proposed monitoring scheme. To enhance the performance of the new nonparametric control charts, multiple-scans-type rules are activated, making the proposed chart more sensitive in detecting possible shifts of the underlying distribution. The proposed monitoring scheme is appraised with the aid of the corresponding run-length distribution under both in-control and out-of-control cases, and exact formulae for the variance of the run-length distribution and the average run length (ARL) of the proposed monitoring schemes are derived. A numerical investigation shows that the proposed schemes outperform their competitors.
Full article
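The paper derives run-length moments exactly; a brute-force stand-in shows the quantity being computed. The chart below signals when a Wilcoxon–Mann–Whitney statistic comparing each incoming test sample with a Phase I reference sample leaves fixed limits, and the in-control average run length (ARL) is estimated by simulation (the limits are ad hoc and the multiple-scans rules are omitted):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 100)           # Phase I reference sample

def run_length(lcl=100, ucl=400, n_test=5, max_steps=10_000):
    for t in range(1, max_steps + 1):
        test = rng.normal(0, 1, n_test)     # in-control test samples
        u = stats.mannwhitneyu(test, reference).statistic  # mean 250 under control
        if not (lcl <= u <= ucl):
            return t                        # first signal = run length
    return max_steps

arl = np.mean([run_length() for _ in range(500)])
print(f"estimated in-control ARL ≈ {arl:.0f}")
```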
Open Access Article
Cumulative Histograms under Uncertainty: An Application to Dose–Volume Histograms in Radiotherapy Treatment Planning
by Flavia Gesualdi and Niklas Wahl
Stats 2024, 7(1), 284-300; https://doi.org/10.3390/stats7010017 - 6 Mar 2024
Abstract
In radiotherapy treatment planning, the absorbed doses are subject to executional and preparational errors, which propagate to plan quality metrics. Accurately quantifying these uncertainties is imperative for improved treatment outcomes. One approach, analytical probabilistic modeling (APM), presents a highly computationally efficient method. This study evaluates the empirical distribution of dose–volume histogram points (a typical plan metric) derived from Monte Carlo sampling to quantify the accuracy of modeling uncertainties under different distribution assumptions, including Gaussian, log-normal, four-parameter beta, gamma, and Gumbel distributions. Since APM necessitates the bivariate cumulative distribution functions, this investigation also delves into approximations using a Gaussian or an Ali–Mikhail–Haq copula. The evaluations are performed in a one-dimensional simulated geometry and on patient data for a lung case. Our findings suggest that employing a beta distribution offers improved modeling accuracy compared to a normal distribution. Moreover, the multivariate Gaussian model outperforms the copula models on patient data. This investigation highlights the significance of appropriate statistical distribution selection in advancing the accuracy of uncertainty modeling in radiotherapy treatment planning, and extends our understanding of the capacities of analytical probabilistic modeling in this crucial medical domain.
Full article
(This article belongs to the Special Issue Advances in Probability Theory and Statistics)
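The headline comparison can be mimicked on synthetic data: fit a normal and a four-parameter beta (scipy's beta with free location and scale) to samples of a bounded quantity and compare penalized log-likelihoods. The values below are invented stand-ins for a DVH point across error scenarios, not patient data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = 10 + 60 * rng.beta(5, 2, size=400)     # bounded, left-skewed toy samples

norm_ll = np.sum(stats.norm.logpdf(x, *stats.norm.fit(x)))
a, b, loc, scale = stats.beta.fit(x)       # four-parameter beta fit
beta_ll = np.sum(stats.beta.logpdf(x, a, b, loc, scale))

for name, ll, k in [("normal", norm_ll, 2), ("beta (4-param)", beta_ll, 4)]:
    print(f"{name}: logL = {ll:.1f}, AIC = {2 * k - 2 * ll:.1f}")
```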
Open Access Article
Comments on the Bernoulli Distribution and Hilbe’s Implicit Extra-Dispersion
by Daniel A. Griffith
Stats 2024, 7(1), 269-283; https://doi.org/10.3390/stats7010016 - 5 Mar 2024
Abstract
For decades, conventional wisdom maintained that binary 0–1 Bernoulli random variables cannot contain extra-binomial variation. Taking an unorthodox stance, Hilbe actively disagreed, especially for correlated observation instances, arguing that the universally adopted diagnostic Pearson or deviance dispersion statistics are insensitive to a variance anomaly in a binary context, and hence simply fail to detect it. However, having the intuition and insight to sense the existence of this departure from standard mathematical statistical theory, but being unable to effectively isolate it, he classified this particular over-/under-dispersion phenomenon as implicit. This paper explicitly exposes his hidden quantity by demonstrating that the variance inflation/deflation it represents occurs in an underlying predicted beta random variable whose real-number values are rounded to their nearest integers to convert to a Bernoulli random variable, with this discretization masking any materialized extra-Bernoulli variation. In doing so, asymptotics linking the beta-binomial and Bernoulli distributions expose another conventional-wisdom misconception, namely a mislabeling substitution involving the quasi-Bernoulli random variable; this undeniably is not a quasi-likelihood situation. A public bell pepper disease dataset exhibiting conspicuous spatial autocorrelation furnishes empirical examples illustrating various features of this advocated proposition.
Full article
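The core mechanism is directly checkable by simulation: round a predicted beta random variable to the nearest integer and the result is Bernoulli, whose sample variance always matches p(1 − p) no matter how much the underlying beta's variance is inflated or deflated — the discretization hides it. A numerical illustration of the mechanism as described, not of the paper's analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
for a, b in [(2.0, 2.0), (0.5, 0.5), (8.0, 4.0)]:
    beta = rng.beta(a, b, size=200_000)
    bern = np.rint(beta)                    # round to nearest integer: 0 or 1
    p = bern.mean()
    print(f"beta({a}, {b}): var(beta) = {beta.var():.4f}, "
          f"var(bern) = {bern.var():.4f}, p(1-p) = {p * (1 - p):.4f}")
```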
Open Access Article
Two-Stage Limited-Information Estimation for Structural Equation Models of Round-Robin Variables
by Terrence D. Jorgensen, Aditi M. Bhangale and Yves Rosseel
Stats 2024, 7(1), 235-268; https://doi.org/10.3390/stats7010015 - 28 Feb 2024
Abstract
We propose and demonstrate a new two-stage maximum likelihood estimator for parameters of a social relations structural equation model (SR-SEM), using estimated summary statistics as data, together with the uncertainty about those estimates, to obtain robust inferential statistics. The SR-SEM is a generalization of a traditional SEM for round-robin data, which have a dyadic network structure (i.e., each group member responds to or interacts with each other member). Our two-stage estimator is developed using similar logic as previous two-stage estimators for SEM, developed for application to multilevel data and multiple imputations of missing data. We demonstrate our estimator on a publicly available data set from a 2018 publication about social mimicry. We employ Markov chain Monte Carlo estimation of the summary statistics in Stage 1, implemented using the R package rstan. In Stage 2, the posterior mean estimates of the summary statistics are used as input data to estimate SEM parameters with the R package lavaan. The posterior covariance matrix of the estimated summary statistics is also calculated so that lavaan can use it to calculate robust standard errors and test statistics. Results are compared to full-information maximum likelihood (FIML) estimation of SR-SEM parameters using the R package srm. We discuss how differences between estimators highlight the need for future research to establish best practices under realistic conditions (e.g., how to specify empirical Bayes priors in Stage 1), as well as extensions that would make two-stage estimation particularly advantageous over single-stage FIML.
Full article
(This article belongs to the Section Statistical Methods)
Open Access Article
Generation of Scale-Free Assortative Networks via Newman Rewiring for Simulation of Diffusion Phenomena
by Laura Di Lucchio and Giovanni Modanese
Stats 2024, 7(1), 220-234; https://doi.org/10.3390/stats7010014 - 24 Feb 2024
Abstract
By collecting and expanding several numerical recipes developed in previous work, we implement an object-oriented Python code, based on the networkX library, for the realization of the configuration model and Newman rewiring. The software can be applied to any kind of network and “target” correlations, but it is tested with focus on scale-free networks and assortative correlations. In order to generate the degree sequence we use the method of “random hubs”, which gives networks with minimal fluctuations. For the assortative rewiring we use the simple Vazquez-Weigt matrix as a test in the case of random networks; since it does not appear to be effective in the case of scale-free networks, we subsequently turn to another recipe which generates matrices with decreasing off-diagonal elements. The rewiring procedure is also important at the theoretical level, in order to test which types of statistically acceptable correlations can actually be realized in concrete networks. From the point of view of applications, its main use is in the construction of correlated networks for the solution of dynamical or diffusion processes through an analysis of the evolution of single nodes, i.e., beyond the Heterogeneous Mean Field approximation. As an example, we report on an application to the Bass diffusion model, with calculations of the time of the diffusion peak. The same networks can additionally be exported in environments for agent-based simulations like NetLogo.
Full article
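The named ingredients map onto standard NetworkX calls. The sketch below draws a heavy-tailed degree sequence, builds the configuration model, and measures degree assortativity before any rewiring; the paper's "random hubs" sequence generator and the Newman rewiring step itself are its own and are not reproduced:

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
deg = np.clip(rng.zipf(2.5, size=1_000), 1, 50)  # power-law-ish degrees
if deg.sum() % 2:                                # degree sum must be even
    deg[0] += 1

G = nx.configuration_model(deg.tolist(), seed=0)
G = nx.Graph(G)                                  # collapse multi-edges
G.remove_edges_from(nx.selfloop_edges(G))

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print("degree assortativity:", nx.degree_assortativity_coefficient(G))
```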
Open Access Article
New Vessel Extraction Method by Using Skew Normal Distribution for MRA Images
by Tohid Bahrami, Hossein Jabbari Khamnei, Mehrdad Lakestani and B. M. Golam Kibria
Stats 2024, 7(1), 203-219; https://doi.org/10.3390/stats7010013 - 23 Feb 2024
Abstract
Vascular-related diseases pose significant public health challenges and are a leading cause of mortality and disability. Understanding the complex structure of the vascular system and its processes is crucial for addressing these issues. Recent advancements in medical imaging technology have enabled the generation of high-resolution 3D images of vascular structures, leading to a diverse array of methods for vascular extraction. While previous research has often assumed a normal distribution of image data, this paper introduces a novel vessel extraction method that utilizes the skew normal distribution for more accurate probability distribution modeling. The proposed method begins with a preprocessing step to enhance vessel structures and reduce noise in Magnetic Resonance Angiography (MRA) images. The skew normal distribution, known for its ability to model skewed data, is then employed to characterize the intensity distribution of vessels. By estimating the parameters of the skew normal distribution using the Expectation-Maximization (EM) algorithm, the method effectively separates vessel pixels from the background and non-vessel regions. To extract vessels, a thresholding technique is applied based on the estimated skew normal distribution parameters. This segmentation process enables accurate vessel extraction, particularly in detecting thin vessels and enhancing the delineation of vascular edges with low contrast. Experimental evaluations on a diverse set of MRA images demonstrate the superior performance of the proposed method compared to previous approaches in terms of accuracy and computational efficiency. The presented vessel extraction method holds promise for improving the diagnosis and treatment of vascular-related diseases. By leveraging the skew normal distribution, it provides accurate and efficient vessel segmentation, contributing to the advancement of vascular imaging in the field of medical image analysis.
Full article
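The paper estimates the skew normal inside a mixture with EM; a much simpler stand-in conveys the thresholding step: fit scipy's skewnorm to the bright intensities by maximum likelihood and take a quantile of the fitted distribution as the vessel threshold (toy intensities, not MRA data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy intensities: a dark background plus bright, skewed "vessel" pixels.
background = rng.normal(60, 10, size=9_000)
vessels = stats.skewnorm.rvs(a=5, loc=150, scale=30, size=1_000, random_state=rng)
intensities = np.concatenate([background, vessels])

bright = intensities[intensities > np.quantile(intensities, 0.85)]
a, loc, scale = stats.skewnorm.fit(bright)       # MLE stand-in for the paper's EM
threshold = stats.skewnorm.ppf(0.05, a, loc, scale)
mask = intensities > threshold
print(f"threshold = {threshold:.1f}; flagged {mask.mean():.1%} of pixels as vessel")
```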
Open Access Article
Utility in Time Description in Priority Best–Worst Discrete Choice Models: An Empirical Evaluation Using Flynn’s Data
by Sasanka Adikari and Norou Diawara
Stats 2024, 7(1), 185-202; https://doi.org/10.3390/stats7010012 - 19 Feb 2024
Abstract
Discrete choice models (DCMs) are applied in many fields and in the statistical modelling of consumer behavior. This paper focuses on a form of choice experiment, best–worst scaling in discrete choice experiments (DCEs), and the transition probability of a choice of a consumer over time. The analysis was conducted by using simulated data (choice pairs) based on data from Flynn’s (2007) ‘Quality of Life Experiment’. Most of the traditional approaches assume the choice alternatives are mutually exclusive over time, which is a questionable assumption. We introduced a new copula-based model (CO-CUB) for the transition probability, which can handle the dependent structure of best–worst choices while applying a very practical constraint. We used a conditional logit model to calculate the utility at consecutive time points and spread it to future time points under dynamic programming. We suggest that the CO-CUB transition probability algorithm is a novel way to analyze and predict choices in future time points by expressing human choice behavior. The numerical results inform decision making, help formulate strategy and learning algorithms under dynamic utility in time for best–worst DCEs.
Full article
(This article belongs to the Topic Interfacing Statistics, Machine Learning and Data Science from a Probabilistic Modelling Viewpoint)
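The conditional logit backbone is compact: with systematic utilities V_j, the choice probability is a softmax, and a common best–worst (maxdiff) formulation scores the pair (best = i, worst = j) by the utility difference V_i − V_j. A sketch with invented utilities:

```python
import numpy as np

def conditional_logit(V):
    """Choice probabilities P(j) = exp(V_j) / sum_k exp(V_k)."""
    e = np.exp(V - V.max())                 # numerically stabilized softmax
    return e / e.sum()

V = np.array([1.2, 0.4, -0.3, 0.9])         # systematic utilities of 4 alternatives
print(conditional_logit(V))

# Maxdiff-style probability of picking i as best and j as worst (i != j).
i, j = 0, 2
diffs = np.array([V[a] - V[b] for a in range(len(V))
                  for b in range(len(V)) if a != b])
print(np.exp(V[i] - V[j]) / np.exp(diffs).sum())
```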
Open Access Article
Importance and Uncertainty of λ-Estimation for Box–Cox Transformations to Compute and Verify Reference Intervals in Laboratory Medicine
by Frank Klawonn, Neele Riekeberg and Georg Hoffmann
Stats 2024, 7(1), 172-184; https://doi.org/10.3390/stats7010011 - 9 Feb 2024
Cited by 1
Abstract
Reference intervals play an important role in medicine, for instance, for the interpretation of blood test results. They are defined as the central 95% values of a healthy population and are often stratified by sex and age. In recent years, so-called indirect methods for the computation and validation of reference intervals have gained importance. Indirect methods use all values from a laboratory, including the pathological cases, and try to identify the healthy sub-population in the mixture of values. This is only possible under certain model assumptions, i.e., that the majority of the values represent non-pathological values and that the non-pathological values follow a normal distribution after a suitable transformation, commonly a Box–Cox transformation, rendering the parameter λ of the Box–Cox transformation a nuisance parameter for the estimation of the reference interval. Although indirect methods put high effort into the estimation of λ, they arrive at very different estimates for λ, even though the estimated reference intervals are quite coherent. Our theoretical considerations and Monte-Carlo simulations show that overestimating λ can lead to intolerable deviations of the reference interval estimates, whereas λ = 0 usually produces acceptable estimates. For λ close to 1, its estimate has limited influence on the estimate for the reference interval, and with reasonable sample sizes, the uncertainty for the λ-estimate remains quite high.
Full article
(This article belongs to the Special Issue Advances in Probability Theory and Statistics)
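The sensitivity being described is easy to probe: transform with a fixed λ, take mean ± 1.96·SD on the transformed scale (the normality assumption), and map the limits back with the inverse Box–Cox. Varying λ then shows how far the resulting reference interval moves. Synthetic "healthy" values, not laboratory data:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
values = rng.lognormal(mean=4.0, sigma=0.25, size=5_000)  # synthetic healthy values

for lam in [0.0, 0.3, 0.6, 1.0]:
    y = stats.boxcox(values, lmbda=lam)      # Box-Cox transform with fixed lambda
    lo, hi = y.mean() - 1.96 * y.std(), y.mean() + 1.96 * y.std()
    print(f"lambda = {lam:.1f}: RI = "
          f"[{inv_boxcox(lo, lam):.1f}, {inv_boxcox(hi, lam):.1f}]")
```

Here λ = 0 (the log transform) recovers the generating lognormal's central 95% range, while λ = 1 (no transformation) applies normal quantiles to skewed raw data and shifts the interval.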
Topics
Topic in Entropy, Mathematics, Modelling, Stats
Interfacing Statistics, Machine Learning and Data Science from a Probabilistic Modelling Viewpoint
Topic Editors: Jürgen Pilz, Noelle I. Samia, Dirk Husmeier
Deadline: 31 December 2024
Special Issues
Special Issue in Stats
Modern Time Series Analysis II
Guest Editors: Magda Sofia Valério Monteiro, Marco André da Silva Costa
Deadline: 31 May 2024
Special Issue in Stats
Machine Learning and Natural Language Processing (ML & NLP)
Guest Editor: Stéphane Mussard
Deadline: 31 August 2024
Special Issue in Stats
Feature Paper Special Issue: Reinforcement Learning
Guest Editors: Wei Zhu, Sourav Sen, Keli Xiao
Deadline: 30 September 2024