On the various definitions of P-value
Reading one of the many insightful posts by Professor Andrew Gelman on the Columbia University blog (see https://statmodeling.stat.columbia.edu/2023/04/14/4-different-meanings-of-p-value-and-how-my-thinking-has-changed), I developed a series of thoughts that - to the best of my current knowledge and abilities - may be useful.
Original comments, part 1
Andrew Gelman: <<Definition 1. p-value(y) = Pr(T(y_rep) >= T(y) | H), where H is a “hypothesis,” a generative probability model, y is the observed data, y_rep are future data under the model, and T is a “test statistic,” some pre-specified function of data. [...]>> April 14, 2023 9:14 AM
Christian Hennig: <<By the way, one misinterpretation that bugs me and seems to come up often is the idea that there is a “true” unobserved p-value potentially different from the observed one in case the model doesn’t hold. Not so. The p-value measures the relation between the data and a specified model, and it does this regardless of whether the model is “true” or not.>> April 16, 2023 8:55 AM
Andrew Gelman: <<Sander Greenland calls this sort of thing a “descriptive” p-value, capturing the idea that the p-value can be understood as a summary of the discrepancy or divergence of the data from H according to some measure, ranging from 0 = completely incompatible to 1 = completely compatible. [...] A p-value from Description 4 is unambiguously defined from existing formulas so is a clear data summary even if it can’t easily be interpreted as a probability in the context of the problem at hand. [...]>> April 14, 2023 9:14 AM
My comments, part 1
I find that Definition 1 carries interpretative risks. Specifically, instead of H, I believe it would be more appropriate to substitute M = A + H, where H is the target hypothesis and A is the set of background statistical assumptions (a notation used by Sander Greenland; see https://doi.org/10.1111/sjos.12625, pp. 26-27). This makes explicit what Christian argues: Indeed, the P-value measures the compatibility between the prediction of an entire model M and the experimental test statistic. Whether the model M is "true" or "false" in practice is not a concern of the P-value, which assumes a priori that both H and A are correct.
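The point can be made concrete with a minimal Monte Carlo sketch of Definition 1, computed under the full model M = A + H. The model, test statistic, and sample sizes below are hypothetical illustrations, not anything from the original discussion: A assumes i.i.d. Normal(mu, 1) observations, H sets mu = 0, and T is the absolute sample mean.

```python
import random
import statistics

def simulated_p_value(y, n_rep=10_000, seed=0):
    """Monte Carlo p-value for Definition 1: Pr(T(y_rep) >= T(y) | M).

    Hypothetical model M = A + H:
      A (background assumptions): observations are i.i.d. Normal(mu, 1).
      H (target hypothesis): mu = 0.
    Test statistic T: absolute value of the sample mean.
    """
    rng = random.Random(seed)
    n = len(y)
    t_obs = abs(statistics.fmean(y))
    count = 0
    for _ in range(n_rep):
        # y_rep: future data generated under the entire model M, not under H alone.
        y_rep = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(statistics.fmean(y_rep)) >= t_obs:
            count += 1
    return count / n_rep
```

Note that every replicated data set is drawn assuming both A and H hold: The resulting p-value says nothing about whether M is "true", only how compatible the observed statistic is with M's predictions.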
I partially agree with what Christian says, but I am inclined to share it only if the public clearly distinguishes the statistical-mathematical plane from the empirical-scientific one: Indeed, his point essentially concerns the internal consistency of a mathematical approach, which is capable of a sort of "self-description." A similar situation is found in general relativity with the concept of "general covariance" or "invariance under coordinate diffeomorphisms." The inherent risk, in my opinion, is the same: That the reader may attempt to draw practical conclusions from mere abstract flawlessness. In physics, the perfect relationship between coordinates and the metric tensor does not correspond to the relationship between coordinates and the observable universe (space-time); similarly, in frequentist statistics, the perfect relationship between P-values and the mathematical model does not correspond to the relationship between P-values and scientific hypotheses.

Especially in contexts where stakeholders are highly exposed to costs and risks (e.g., public health), I suggest presenting the P-value as an effective descriptor only after a thorough examination of the underlying assumptions and meticulous execution of the experimental protocols. Returning to Christian's point, in this sense it is justifiable to distinguish observed P-values from unobserved ones: Calling R the set of real conditions (i.e., the unknown ideal model that perfectly describes the behavior of "chance" in reality), the central question is "how closely does the observed P-value = Pr(t ≥ t(y) | A + H) approach the unobserved P-value = Pr(t ≥ t(y) | R + H)?"
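The gap between Pr(t ≥ t(y) | A + H) and Pr(t ≥ t(y) | R + H) can be illustrated with a small simulation. Everything below is a hypothetical construction: The assumed background A takes the observations as i.i.d. Normal(0, 1), while the "real conditions" R are taken (arbitrarily, for illustration) to be i.i.d. Laplace(0, 1), a heavier-tailed and higher-variance distribution; the observed statistic t_obs is likewise an arbitrary illustrative value.

```python
import random

def p_from_reference(t_obs, draws):
    """Estimate Pr(t >= t_obs) from Monte Carlo draws of the test statistic."""
    return sum(d >= t_obs for d in draws) / len(draws)

def abs_mean(rng, n, draw):
    """Test statistic t: absolute sample mean of n observations from `draw`."""
    return abs(sum(draw(rng) for _ in range(n)) / n)

rng = random.Random(42)
n, n_rep = 10, 20_000

# Assumed background A: observations are i.i.d. Normal(0, 1).
normal_draws = [abs_mean(rng, n, lambda r: r.gauss(0.0, 1.0))
                for _ in range(n_rep)]

# Hypothetical "real conditions" R: i.i.d. Laplace(0, 1), generated as the
# difference of two Exponential(1) variables (heavier tails, variance 2).
laplace_draws = [abs_mean(rng, n, lambda r: r.expovariate(1.0) - r.expovariate(1.0))
                 for _ in range(n_rep)]

t_obs = 0.8  # an illustrative observed value of the test statistic
p_observed = p_from_reference(t_obs, normal_draws)     # Pr(t >= t(y) | A + H)
p_unobserved = p_from_reference(t_obs, laplace_draws)  # Pr(t >= t(y) | R + H)
```

Under this setup the observed p-value is markedly smaller than the unobserved one, so a reader taking the observed value as an empirical statement would overstate the evidence: The discrepancy comes entirely from A failing to match R, not from anything about H.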
Finally, Andrew argues that the divergence-descriptive P-value is not easily reconciled with the P-value as a probabilistic parameter. In this regard, I believe that surprisal - and hence the concept of information - serves as a perfect bridge between these two descriptions: Indeed, comparing the statistical outcome to a random event like fairly flipping a coin provides a clear idea of the amount of probabilistic information that the observed P-value entails (e.g., if s = 4, the result is as surprising as obtaining 4 consecutive heads when fairly flipping a coin 4 times). Moreover, the researcher is also perfectly aware that flipping a coin does not represent the scientific phenomenon they are investigating, i.e., the separation between the mathematical-statistical plane and the empirical-scientific one is clear.
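The surprisal bridge is just s = -log2(p), Greenland's S-value measured in fair-coin-flip units; a minimal helper makes the coin-flip reading mechanical:

```python
import math

def s_value(p):
    """Shannon surprisal (S-value) of a p-value: s = -log2(p).

    s is the information against the model M in fair-coin-flip units:
    p = 0.0625 gives s = 4, i.e., the result is as surprising as
    obtaining 4 consecutive heads in 4 fair coin flips.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)
```

For instance, the conventional p = 0.05 corresponds to s ≈ 4.3 bits, barely more surprising than 4 heads in a row; this is precisely the kind of calibration that keeps the mathematical-statistical plane visibly separate from the empirical-scientific one.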
Original comments, part 2
Martha K. Smith: <<One “misunderstanding” that bugs me is when I read a statement like, “This difference is statistically important”. I assume that what was going on in the author’s mind is that in ordinary language, “significant” and “important” are synonyms. [...]>> April 15, 2023 5:51 PM
Sander Greenland: <<The field of statistics bears the responsibility for having co-opted ordinary words to label technical concepts only distantly related to the ordinary language meanings [...]>> April 15, 2023 7:05 PM
My comments, part 2
I fully agree and would like to add more. In my humble opinion, a potential modern fault in the statistical environment is failing to define the concept of "statistical significance" unambiguously. For example, the Cambridge Dictionary of Statistics (4th ed.) adopts this expression several times without ever explicitly explaining it. Even the great authors of the 1900s often omitted the adjective "statistical," leading the reader (and perhaps themselves?) to confuse the mathematical domain with the empirical one.
In addition, one of the most serious problems - in my view - is the use of the adjective "statistical" in the context of Neyman-Pearson. In the neo-Fisherian framework, it is indeed possible to refer to statistical significance as the "amount" of statistical evidence against the null/target hypothesis (which is not without its problems but is certainly sensible or, at least, defensible). On the contrary, in the Neyman-Pearson framework, the concept of significance is not statistical, as it is unrelated to the outcome of the statistical test: It is a merely decision-frequentist significance, i.e., it is about adopting a rule of behavior in an attempt to limit the total number of false positives to a certain preset threshold over numerous equivalent replications.

After all, I ask myself, how can the result of a statistical test be “statistically (non-)significant” based on a decision criterion that provides no information on the significance (meaning/importance/relevance) of that statistical result? Thus, in my opinion, there is no way to conclude anything about that type of “long-run” significance in single studies. In other words, not reaching the level of decision-frequentist significance according to the Neyman-Pearson rule of behavior in no way means that the result is non-significant (regardless of the type of significance considered): It is only an intermediate step in a long-run process that tries to keep the frequency of Type I errors (and, possibly, Type II errors) in check. Therefore, adding the term "statistical" in this context - in my opinion - does not help.
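The long-run character of the Neyman-Pearson rule can be sketched with a simulation. The setup is entirely hypothetical: Each "study" draws n i.i.d. Normal(0, 1) observations (so the null hypothesis mu = 0 is true by construction, with sigma = 1 known) and "rejects" whenever the z-statistic exceeds the two-sided critical value for the preset alpha.

```python
import random
import statistics

def long_run_rejection_rate(alpha=0.05, n=30, n_studies=20_000, seed=1):
    """Hypothetical sketch of the Neyman-Pearson rule of behavior.

    Every simulated study draws n i.i.d. Normal(0, 1) observations, so the
    null hypothesis mu = 0 holds in each one, and "rejects" when the
    z-statistic exceeds the two-sided critical value for the preset alpha.
    """
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(n_studies):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = statistics.fmean(y) * n ** 0.5  # sigma = 1 assumed known (part of A)
        if abs(z) > z_crit:
            rejections += 1
    # The long-run false-positive frequency is held near alpha, yet no single
    # rejection carries a statement about that one study's "significance."
    return rejections / n_studies
```

The rejection frequency converges to alpha over the whole sequence of studies, which is exactly what the rule of behavior promises, and exactly all it promises: Each individual rejection is an act of decision, not a measure of significance.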