How to Interpret (Frequentist) Statistical Estimates in Medical Research

[This is the English translation of an article originally published on SIAF Community, the Italian scientific platform for forensic and insurance medicine]

Statistical methods are used in medical research to estimate the effects of treatments or health conditions in populations. For example, when testing the effectiveness of a new treatment that has shown very positive results in preclinical animal studies, researchers draw from the target clinical population a small group of patients - called a statistical sample. The ideal goal is to administer the treatment to the patients in the sample to estimate its effectiveness in the entire population. The process by which the effect observed in the sample is “transported” to the wider population is called inference. However, even when the best available inferential methods are used, a perfect transport is impossible in practice. The reasons range from the many uncertainties involved in conducting a clinical study to the variability of conditions under which the treatment is administered in the “real world” (differences in hospital protocols, disparities in resources between clinics, and so on).

For this reason, a statistical estimate is never an absolute truth but rather a reasoned guess about what the effect in the population might be, based on what was observed in the sample. In most cases, what is evaluated is not the effect on individual patients but the average effect in the sample. This measure is called a point estimate and represents the best available guess of the average effect in the entire population. Naturally, the more accurate the study, the more reliable the guess. Nevertheless, every estimate carries a minimum margin of uncertainty - “minimum” meaning that at least that much uncertainty exists, and likely more. This uncertainty can be interpreted as a degree of imprecision in the result, according to the specific statistical method used. The goal is to assess which other possible average effects are reasonably compatible with what was observed in the study, as judged by that method.

Let’s consider a concrete example: suppose we want to estimate the average effect of an antihypertensive treatment in a given population of interest. After administering the treatment to a sample appropriately drawn from that population, we observe an average change in diastolic blood pressure of 0 mmHg. This means that the joint effect of the treatment and all associated procedures (patient adherence to therapy, administration methods, measurements, data collection, etc.) produced a point estimate of 0 mmHg. It is therefore important to note that we are not quantifying the “pure” average effect of the therapy alone, but rather the overall effect of the therapy plus the entire experimental process. This is why we say that a point estimate reflects the whole data-generating process. Caution is thus required: every number we obtain is the result of a context, not of a single isolated phenomenon.
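In its simplest form, the point estimate described above is just the sample mean of the observed changes. A minimal sketch, using invented blood-pressure values chosen so that the changes average to 0 mmHg as in the example:

```python
# Hypothetical sketch: the point estimate of the average treatment effect
# is the sample mean of the observed changes in diastolic blood pressure.
# These values are invented for illustration only.
changes_mmhg = [-4.0, 3.0, -1.0, 5.0, -6.0, 2.0, -2.0, 3.0]

# Point estimate = sample mean of the observed changes (in mmHg)
point_estimate = sum(changes_mmhg) / len(changes_mmhg)
print(f"Point estimate: {point_estimate:.1f} mmHg")
```

Note that this single number summarizes the whole data-generating process (therapy, adherence, measurement, data collection), not the “pure” effect of the therapy alone.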

Having made this essential clarification, we can now ask how precise our point estimate is. The answer depends on the statistical method used to quantify imprecision - often, several defensible options exist. Suppose that, using one such method, we obtain a minimum uncertainty interval ranging from −5 mmHg to +5 mmHg. This means that, according to that method, all average effects between a mean reduction of 5 mmHg and a mean increase of 5 mmHg are reasonably consistent with what was observed in the experiment. In other words, the chosen method tells us: “The best guess is 0 mmHg; however, all the guesses ranging from an average effect of −5 mmHg to +5 mmHg deserve consideration according to my evaluation.”
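One common method of this kind is the normal-approximation interval, computed as point estimate ± 1.96 × standard error. The sketch below uses an assumed standard error of about 2.55 mmHg, chosen purely so that the result mirrors the −5 to +5 mmHg interval in the example; it is one defensible method among several, not the only one:

```python
# Hypothetical sketch of a 95% interval via the normal approximation:
# point estimate ± 1.96 × standard error.
# The standard error below is assumed for illustration so that the
# interval matches the text's example (roughly -5 to +5 mmHg).
point_estimate = 0.0   # mmHg, observed mean change
standard_error = 2.55  # mmHg, assumed value

z = 1.96  # two-sided 95% quantile of the standard normal distribution
lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error
print(f"95% interval: ({lower:.1f}, {upper:.1f}) mmHg")
```

The method, in effect, reports the range of average effects it judges reasonably consistent with the data; a different defensible method (a different model, a different error estimate) could report a different range.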

Traditionally, this interval is called a “confidence” interval. However, the term “confidence” is misleading because it suggests that we should have confidence in the values inside the interval. In reality, scientific confidence in a result requires far more than a statistical estimate: it entails validation of every stage of the study, which in most cases cannot be achieved exhaustively. It is therefore preferable to speak of a compatibility interval: the interval that collects a set of hypotheses about the average effect that are reasonably compatible with what was observed in the experiment, as evaluated by the statistical method employed.

To appreciate how the concept of compatibility is weaker and more moderate than other notions such as confidence, plausibility, or support, consider the following analogy: finding a person at the scene of a crime is equally compatible with the hypothesis of guilt (e.g., being the perpetrator) and the hypothesis of assistance (e.g., rushing to help). To support one hypothesis over the other, more specific evidence is needed than mere compatibility. Similarly, claiming the presence or absence of a causal effect requires much stronger evidence than a simple statistical estimate showing an interval of hypotheses reasonably compatible with the data. Such evidence includes careful methodological validation and assessments of biological plausibility.

For a more detailed discussion and clarification of what “reasonably compatible” means, see the following works:

Rovetta, A., Mansournia, M. A., Stovitz, S. D., Adams, W. M., & Greenland, S. (2025). Interpreting p values and interval estimates based on practical relevance: guidance for the sports medicine clinician. British journal of sports medicine, bjsports-2024-109357. https://doi.org/10.1136/bjsports-2024-109357

Rovetta, A., Piretta, L., & Mansournia, M. A. (2025). p-Values and confidence intervals as compatibility measures: guidelines for interpreting statistical studies in clinical research. The Lancet regional health. Southeast Asia, 33, 100534. https://doi.org/10.1016/j.lansea.2025.100534

Vitale, A., Mansournia, M. A., & Rovetta, A. (2025). Why is p-Value Controversial? Cardiovascular and interventional radiology, 10.1007/s00270-025-04139-y. https://doi.org/10.1007/s00270-025-04139-y (free text here: https://rdcu.be/exqjn)

For an in-depth discussion of P-values as measures of compatibility, S-values as measures of incompatibility (refutational information), and likelihood, see the following work:

Rafi, Z., & Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise. BMC medical research methodology, 20(1), 244. https://doi.org/10.1186/s12874-020-01105-9
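The S-value transform discussed in that work is simple to compute: s = −log₂(p) converts a P-value into bits of refutational information, interpretable as the surprise of seeing all heads in s tosses of a fair coin. A minimal sketch with a few example P-values:

```python
import math

# S-value (surprisal): s = -log2(p), in bits of refutational information.
# For example, p = 0.25 carries exactly 2 bits, no more surprising than
# getting two heads in two fair coin tosses.
for p in [0.25, 0.05, 0.005]:
    s = -math.log2(p)
    print(f"p = {p:<6} -> S = {s:.2f} bits")
```

Thus p = 0.05 carries only about 4.3 bits of information against the tested hypothesis, which helps explain why a single “significant” result is weak evidence on its own.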


