[This is the English translation of an article originally published on SIAF Community, the Italian scientific platform for forensic and insurance medicine]
Suppose we flip a coin and ask what the probability of getting heads is. Our experience and what we learned in school suggest that the answer is 50%, since there is one “favorable” case (heads) out of two possible outcomes (heads or tails).
This is a very common form of reification: confusing a theoretical abstraction - in this case, the probabilistic model that assigns a 50% chance to heads - with an empirical property of the real world, as if that 50% were an intrinsic characteristic of the coin or of the physical act of flipping it. Contrary to what one might think, this almost automatic answer already assumes a very specific statistical model, based on at least two main assumptions about the process that generates the outcomes (where an outcome is obtaining 'heads' or 'tails').
1. Assumption of the “competent manufacturer”: the coin is not physically unbalanced (that is, it is not rigged).
2. Assumption of the “honest” flipper: the way a balanced coin is flipped does not systematically favor heads or tails.
Once these two essential assumptions - that is, our model - are satisfied, we can claim with “reasonable” confidence that the process producing the outcomes is sufficiently random. This is because we expect the physical process of flipping the coin, which includes micro-differences in the applied force, release point, and the coin’s interactions with the air and the surface it lands on, to behave in an “unpredictable” way on each individual flip but overall symmetrically, without systematically favoring one side and without depending meaningfully on the previous outcome. Under these conditions, the theoretical frequency 1/2 can serve as a basis for formulating “plausible” probabilistic predictions about what will happen after many flips: overall, we expect that about 50% of the outcomes will be heads.
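As a minimal sketch of this expectation, the simulation below (the function name, seed, and checkpoints are illustrative choices, not part of the original argument) draws independent flips with probability 1/2 and reports the running frequency of heads; under the model, that frequency should settle near 50% as the number of flips grows.

```python
import random

def running_frequency_of_heads(n_flips: int, p_heads: float = 0.5, seed: int = 42):
    """Simulate n_flips of a coin with P(heads) = p_heads and return the
    running proportion of heads at a few selected checkpoints."""
    rng = random.Random(seed)
    heads = 0
    checkpoints = {}
    for i in range(1, n_flips + 1):
        heads += rng.random() < p_heads  # True counts as 1
        if i in (10, 100, 1_000, 10_000, 100_000):
            checkpoints[i] = heads / i
    return checkpoints

if __name__ == "__main__":
    # Under the fair-coin model, the running frequency drifts toward 0.5
    # as the number of flips grows.
    for n, freq in running_frequency_of_heads(100_000).items():
        print(f"after {n:>6} flips: observed frequency of heads = {freq:.3f}")
```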
In model terms, the parameter representing the probability of heads equals 1/2; if the model is at least approximately adequate, we expect the observed frequency of heads to approach that value as the number of flips increases. However, even an excellent agreement between data and model does not demonstrate that the model is correct; it only indicates that, so far, the observations do not contradict it [1]. In fact, those same observations could also be explained by other mechanisms. For example, a skilled magician could flip a rigged coin in such a way as to produce roughly 50% heads and 50% tails. In that case, the data would appear perfectly compatible with the fair-coin model, even though they result from an entirely different process. Or consider obtaining a perfectly alternating sequence of heads and tails (heads, then tails, then heads, then tails, and so on). This scenario would fully satisfy the numerical condition that 50% of the outcomes are heads; nonetheless, such a pattern would raise strong doubts about whether the flips are truly unpredictable on each individual outcome and essentially independent.
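The alternating-sequence point can be made concrete with a small sketch: the deterministic sequence below satisfies the 50%-heads criterion exactly, yet a crude dependence check (the proportion of flips that repeat the previous outcome, an informal diagnostic chosen here purely for illustration) immediately separates it from genuinely independent flips.

```python
import random

def heads_frequency(seq):
    """Proportion of 'H' in a sequence of 'H'/'T' outcomes."""
    return seq.count("H") / len(seq)

def repetition_rate(seq):
    """Proportion of flips that repeat the immediately preceding outcome.
    For independent fair flips this hovers around 0.5; a perfectly
    alternating sequence gives exactly 0.0."""
    repeats = sum(seq[i] == seq[i - 1] for i in range(1, len(seq)))
    return repeats / (len(seq) - 1)

if __name__ == "__main__":
    rng = random.Random(0)
    random_seq = ["H" if rng.random() < 0.5 else "T" for _ in range(10_000)]
    alternating_seq = ["H", "T"] * 5_000  # H, T, H, T, ...

    for name, seq in [("random flips", random_seq), ("alternating", alternating_seq)]:
        print(f"{name:>12}: heads = {heads_frequency(seq):.3f}, "
              f"repetition rate = {repetition_rate(seq):.3f}")
```

Passing the first check while failing the second is exactly the situation described above: data numerically compatible with the model, yet generated by a very different process.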
The coin example shows that, even in the simplest cases, concepts such as 'expected probability' or 'statistical estimate' are not “properties of reality”: they depend heavily on logical-mathematical deductions grounded in a set of assumptions about how that reality behaves. Unfortunately, everyday experience leads us to bury those assumptions in a latent cognitive layer, to the point of forgetting them [2]. After all, in contexts that are so repetitive, it would be impractical to restate the assumptions underlying the model every time. As a result, our mental processing leads us to focus (almost) exclusively on the main expectation (in this case, the 50% probability), while the reasons why we formed that expectation fade into the background until they disappear.
This is why, when someone asks us the probability of getting heads when flipping a coin, we instinctively answer “50%” without requesting further details: to make life easier, we omit the most important part of our reasoning, namely the assumptions that make that answer meaningful. Sadly, this same unconscious mechanism has affected the biomedical sciences for at least a century, leading to major distortions in both the production and interpretation of scientific evidence, as well as to the adoption of statistical rituals lacking methodological foundation [3]. Many describe this issue as one of the most pervasive and persistent in the history of science [4]. Indeed, any method used to estimate an effect (such as a drug’s effectiveness) is reliable only to the extent that the assumptions supporting it are reliable [1]. For example, the “standard model” of a randomized clinical trial, at the time of analysis, takes for granted that randomization worked adequately, that no systematic deviations occurred before or after patient enrollment, and that no relevant interactions were overlooked. Since such assumptions are far from guaranteed in practice [1,4], a substantial part of research should consist of two activities: on the one hand, identifying a family of models that describes as faithfully as possible the process that generates the data, that is, the real phenomenon with all its uncertainties; on the other hand, making explicit the uncertainties that cannot be modeled. However, this rarely happens in a thorough way [2-4].
In fact, some even speak of “objective data” or “self-evident data”, expressions that reflect not only severe reification but also a deep ignorance of the foundations of statistical science or a strong ideological commitment (the intent to construct skewed narratives to support a cause seen as beneficial to oneself or one’s social group). As widely documented in the literature, statistical estimates are the product of analysts' choices and actions; and often, different analysts can obtain very different, or even sharply contrasting, results starting from the same data set (even when competence, honesty, neutrality, and transparency are held constant) [1,4]. This occurs because, in most situations, multiple plausible models exist - or at least several models that are reasonably compatible with the phenomenon being described [4]. The most widely publicized example in recent years involves 246 biologists divided into 173 teams, who produced highly divergent or even contradictory estimates while analyzing the same data set [5]. But that is not all: the data themselves are the product of the choices and actions taken in designing and executing the experiment [1,6]. This point is so crucial that modern epidemiology often describes treatment effects through the concept of a joint intervention: the biological effect of the treatment combined with the specific experimental context in which it is assessed [4,7].
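To make the “same data, different analysts” point concrete, here is a purely hypothetical sketch (unrelated to the study cited above, with all numbers invented for illustration) of how two defensible analytic choices applied to one dataset yield different estimates: one analyst compares treated and untreated units directly, another first stratifies on a baseline factor that influences both treatment and outcome.

```python
import random

def simulate_dataset(n=20_000, seed=1):
    """Simulated observational data: a binary baseline factor influences both
    the chance of receiving treatment and the outcome (purely hypothetical)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        risk_factor = rng.random() < 0.4            # baseline covariate
        p_treat = 0.7 if risk_factor else 0.3       # confounded assignment
        treated = rng.random() < p_treat
        p_outcome = 0.10 + 0.20 * risk_factor + 0.05 * treated
        outcome = rng.random() < p_outcome
        rows.append((risk_factor, treated, outcome))
    return rows

def risk(rows, treated_value, risk_factor_value=None):
    """Outcome proportion among units with the given treatment status (and,
    optionally, the given level of the baseline factor)."""
    selected = [o for rf, t, o in rows
                if t == treated_value
                and (risk_factor_value is None or rf == risk_factor_value)]
    return sum(selected) / len(selected)

if __name__ == "__main__":
    data = simulate_dataset()

    # Analyst A: crude comparison, ignoring the baseline factor.
    crude = risk(data, True) - risk(data, False)

    # Analyst B: stratify on the baseline factor, then combine the two
    # stratum-specific differences weighted by stratum size (standardization).
    adjusted = 0.0
    for rf in (False, True):
        weight = sum(1 for r, _, _ in data if r == rf) / len(data)
        adjusted += weight * (risk(data, True, rf) - risk(data, False, rf))

    print(f"Analyst A (crude)    : risk difference = {crude:+.3f}")
    print(f"Analyst B (adjusted) : risk difference = {adjusted:+.3f}")
    # The simulated treatment effect is +0.05 in each stratum; the crude
    # estimate is inflated because treated units more often carry the risk factor.
```

Neither analyst is incompetent or dishonest; the two estimates simply encode different assumptions about which variables belong in the model.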
In conclusion, we must remember that science is not an objective discipline but a social subsystem heavily influenced by “human” factors such as economics, politics, and ideology [8,9]. Therefore, qualities like honesty, neutrality, and transparency are as important as competence and serve as substitutes for the impossible demand for objectivity [1,4,10,11]. In this context, acknowledging the limitations of statistics (and of methodology more broadly, including protocols and guidelines) is an act of responsibility aimed at safeguarding public health and scientific credibility [9]. Results - including those of clinical trials - are not definitive demonstrations but, at best (when there is careful investigation of causal mechanisms), reasoned bets [1,4,9].
Quoting George Box: “All models are wrong, but some are useful.”
References
1. Rovetta, A., Mansournia, M. A., Stovitz, S. D., Adams, W. M., & Greenland, S. (2025). Interpreting p values and interval estimates based on practical relevance: guidance for the sports medicine clinician. British Journal of Sports Medicine, bjsports-2024-109357. Advance online publication. https://doi.org/10.1136/bjsports-2024-109357
2. McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon Statistical Significance. The American Statistician, 73(sup1), 235–245. https://doi.org/10.1080/00031305.2018.1527253
3. Gigerenzer, G. (2018). Statistical rituals: The replication delusion and how we got there. Advances in Methods and Practices in Psychological Science, 1(2), 198–218. https://doi.org/10.1177/2515245918771329
4. Greenland, S. (2025). Statistical Methods: Basic Concepts, Interpretations, and Cautions. In: Ahrens, W., Pigeot, I. (eds) Handbook of Epidemiology. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6625-3_54-1
5. Oza, A. (2023). Reproducibility trial: 246 biologists get different results from same data sets. Nature, 622(7984), 677–678. https://doi.org/10.1038/d41586-023-03177-1
6. Greenland, S. (2022). The causal foundations of applied probability and statistics. In Probabilistic and causal inference: The works of Judea Pearl (pp. 605-624). Association for Computing Machinery. https://doi.org/10.1145/3501714.3501747
7. Dahabreh, I. J., & Hernán, M. A. (2019). Extending inferences from a randomized trial to a target population. European Journal of Epidemiology, 34(8), 719–722. https://doi.org/10.1007/s10654-019-00533-2
8. Hennig, C. (2010). Mathematical models and reality: A constructivist perspective. Foundations of Science, 15, 29–48. https://doi.org/10.1007/s10699-009-9167-x
9. Bann, D., Courtin, E., Davies, N. M., & Wright, L. (2024). Dialling back ‘impact’ claims: researchers should not be compelled to make policy claims based on single studies. International Journal of Epidemiology, 53(1), dyad181. https://doi.org/10.1093/ije/dyad181
10. Greenland, S. (2012). Transparency and disclosure, neutrality and balance: shared values or just shared words? Journal of Epidemiology and Community Health, 66(11), 967–970. https://doi.org/10.1136/jech-2011-200459
11. Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(4), 967–1033. https://doi.org/10.1111/rssa.12276