Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance

Willem M Otte; Christiaan H Vinkers; Philippe C Habets; David G P van IJzendoorn; Joeri K Tijdink

doi:https://doi.org/10.1371/journal.pbio.3001562

Research output: Contribution to journal › Article › Academic › peer-review

15 Citations (Scopus)

Abstract

The power of language to modify the reader's perception of interpreting biomedical results cannot be underestimated. Misreporting and misinterpretation are pressing problems in randomized controlled trials (RCT) output. This may be partially related to the statistical significance paradigm used in clinical trials centered around a P value below 0.05 cutoff. Strict use of this P value may lead to strategies of clinical researchers to describe their clinical results with P values approaching but not reaching the threshold to be "almost significant." The question is how phrases expressing nonsignificant results have been reported in RCTs over the past 30 years. To this end, we conducted a quantitative analysis of English full texts containing 567,758 RCTs recorded in PubMed between 1990 and 2020 (81.5% of all published RCTs in PubMed). We determined the exact presence of 505 predefined phrases denoting results that approach but do not cross the line of formal statistical significance (P < 0.05). We modeled temporal trends in phrase data with Bayesian linear regression. Evidence for temporal change was obtained through Bayes factor (BF) analysis. In a randomly sampled subset, the associated P values were manually extracted. We identified 61,741 phrases in 49,134 RCTs indicating almost significant results (8.65%; 95% confidence interval (CI): 8.58% to 8.73%). The overall prevalence of these phrases remained stable over time, with the most prevalent phrases being "marginally significant" (in 7,735 RCTs), "all but significant" (7,015), "a nonsignificant trend" (3,442), "failed to reach statistical significance" (2,578), and "a strong trend" (1,700). The strongest evidence for an increased temporal prevalence was found for "a numerical trend," "a positive trend," "an increasing trend," and "nominally significant." 
In contrast, the phrases "all but significant," "approaches statistical significance," "did not quite reach statistical significance," "difference was apparent," "failed to reach statistical significance," and "not quite significant" decreased over time. In a randomly sampled subset of 29,000 phrases, 11,926 corresponding P values were manually identified, of which 68.1% ranged between 0.05 and 0.15 (CI: 67. to 69.0; median 0.06). Our results show that RCT reports regularly contain specific phrases describing marginally nonsignificant results to report P values close to but above the dominant 0.05 cutoff. The fact that the prevalence of the phrases remained stable over time indicates that this practice of broadly interpreting P values close to a predefined threshold remains prevalent. To enhance responsible and transparent interpretation of RCT results, researchers, clinicians, reviewers, and editors may reduce the focus on formal statistical significance thresholds and stimulate reporting of P values with corresponding effect sizes and CIs and focus on the clinical relevance of the statistical difference found in RCTs.
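The reported prevalence of 8.65% (95% CI: 8.58% to 8.73%) can be reproduced from the two counts given in the abstract (49,134 RCTs containing at least one phrase, out of 567,758 RCTs analyzed). The following is a minimal sketch that assumes a normal-approximation (Wald) interval for a proportion; the abstract does not state which interval method the authors used, and the function name `wald_ci` is purely illustrative.

```python
import math


def wald_ci(successes, total, z=1.96):
    """Return (proportion, lower, upper) using the normal approximation.

    Assumption: a Wald interval with z = 1.96 for a 95% level. The source
    abstract does not say which interval method was actually applied.
    """
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


# Counts taken from the abstract: RCTs with a phrase / all RCTs analyzed.
prevalence, lower, upper = wald_ci(49_134, 567_758)
print(f"{100 * prevalence:.2f}% (95% CI: {100 * lower:.2f}% to {100 * upper:.2f}%)")
```

With a sample this large, other common interval methods for a proportion would be expected to give nearly the same bounds, so the choice of the Wald interval here is a convenience rather than a claim about the original analysis.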

Original language	English
Article number	e3001562
Pages (from-to)	e3001562
Journal	PLoS biology
Volume	20
Issue number	2
DOIs	https://doi.org/10.1371/journal.pbio.3001562
Publication status	E-pub ahead of print - 18 Feb 2022

Access to Document

https://doi.org/10.1371/journal.pbio.3001562

Cite this

Otte, W. M., Vinkers, C. H., Habets, P. C., van IJzendoorn, D. G. P., & Tijdink, J. K. (2022). Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance. PLoS biology, 20(2), Article e3001562. Advance online publication. https://doi.org/10.1371/journal.pbio.3001562

@article{a3f53261bbe6406dbb13c3abe00ada1f,

title = "Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance",

abstract = "The power of language to modify the reader's perception of interpreting biomedical results cannot be underestimated. Misreporting and misinterpretation are pressing problems in randomized controlled trials (RCT) output. This may be partially related to the statistical significance paradigm used in clinical trials centered around a P value below 0.05 cutoff. Strict use of this P value may lead to strategies of clinical researchers to describe their clinical results with P values approaching but not reaching the threshold to be {"}almost significant.{"} The question is how phrases expressing nonsignificant results have been reported in RCTs over the past 30 years. To this end, we conducted a quantitative analysis of English full texts containing 567,758 RCTs recorded in PubMed between 1990 and 2020 (81.5% of all published RCTs in PubMed). We determined the exact presence of 505 predefined phrases denoting results that approach but do not cross the line of formal statistical significance (P < 0.05). We modeled temporal trends in phrase data with Bayesian linear regression. Evidence for temporal change was obtained through Bayes factor (BF) analysis. In a randomly sampled subset, the associated P values were manually extracted. We identified 61,741 phrases in 49,134 RCTs indicating almost significant results (8.65%; 95% confidence interval (CI): 8.58% to 8.73%). The overall prevalence of these phrases remained stable over time, with the most prevalent phrases being {"}marginally significant{"} (in 7,735 RCTs), {"}all but significant{"} (7,015), {"}a nonsignificant trend{"} (3,442), {"}failed to reach statistical significance{"} (2,578), and {"}a strong trend{"} (1,700). 
The strongest evidence for an increased temporal prevalence was found for {"}a numerical trend,{"} {"}a positive trend,{"} {"}an increasing trend,{"} and {"}nominally significant.{"} In contrast, the phrases {"}all but significant,{"} {"}approaches statistical significance,{"} {"}did not quite reach statistical significance,{"} {"}difference was apparent,{"} {"}failed to reach statistical significance,{"} and {"}not quite significant{"} decreased over time. In a randomly sampled subset of 29,000 phrases, 11,926 corresponding P values were manually identified, of which 68.1% ranged between 0.05 and 0.15 (CI: 67. to 69.0; median 0.06). Our results show that RCT reports regularly contain specific phrases describing marginally nonsignificant results to report P values close to but above the dominant 0.05 cutoff. The fact that the prevalence of the phrases remained stable over time indicates that this practice of broadly interpreting P values close to a predefined threshold remains prevalent. To enhance responsible and transparent interpretation of RCT results, researchers, clinicians, reviewers, and editors may reduce the focus on formal statistical significance thresholds and stimulate reporting of P values with corresponding effect sizes and CIs and focus on the clinical relevance of the statistical difference found in RCTs.",

author = "Otte, {Willem M} and Vinkers, {Christiaan H} and Habets, {Philippe C} and {van IJzendoorn}, {David G P} and Tijdink, {Joeri K}",

note = "Publisher Copyright: {\textcopyright} 2022 Otte et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.",

year = "2022",

month = feb,

day = "18",

doi = "10.1371/journal.pbio.3001562",

language = "English",

volume = "20",

pages = "e3001562",

journal = "PLoS biology",

issn = "1544-9173",

publisher = "Public Library of Science",

number = "2",

}

Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance. / Otte, Willem M; Vinkers, Christiaan H; Habets, Philippe C et al.
In: PLoS biology, Vol. 20, No. 2, e3001562, 18.02.2022, p. e3001562.

TY - JOUR

T1 - Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance

AU - Otte, Willem M

AU - Vinkers, Christiaan H

AU - Habets, Philippe C

AU - van IJzendoorn, David G P

AU - Tijdink, Joeri K

N1 - Publisher Copyright: © 2022 Otte et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

PY - 2022/2/18

Y1 - 2022/2/18

N2 - The power of language to modify the reader's perception of interpreting biomedical results cannot be underestimated. Misreporting and misinterpretation are pressing problems in randomized controlled trials (RCT) output. This may be partially related to the statistical significance paradigm used in clinical trials centered around a P value below 0.05 cutoff. Strict use of this P value may lead to strategies of clinical researchers to describe their clinical results with P values approaching but not reaching the threshold to be "almost significant." The question is how phrases expressing nonsignificant results have been reported in RCTs over the past 30 years. To this end, we conducted a quantitative analysis of English full texts containing 567,758 RCTs recorded in PubMed between 1990 and 2020 (81.5% of all published RCTs in PubMed). We determined the exact presence of 505 predefined phrases denoting results that approach but do not cross the line of formal statistical significance (P < 0.05). We modeled temporal trends in phrase data with Bayesian linear regression. Evidence for temporal change was obtained through Bayes factor (BF) analysis. In a randomly sampled subset, the associated P values were manually extracted. We identified 61,741 phrases in 49,134 RCTs indicating almost significant results (8.65%; 95% confidence interval (CI): 8.58% to 8.73%). The overall prevalence of these phrases remained stable over time, with the most prevalent phrases being "marginally significant" (in 7,735 RCTs), "all but significant" (7,015), "a nonsignificant trend" (3,442), "failed to reach statistical significance" (2,578), and "a strong trend" (1,700). The strongest evidence for an increased temporal prevalence was found for "a numerical trend," "a positive trend," "an increasing trend," and "nominally significant." 
In contrast, the phrases "all but significant," "approaches statistical significance," "did not quite reach statistical significance," "difference was apparent," "failed to reach statistical significance," and "not quite significant" decreased over time. In a randomly sampled subset of 29,000 phrases, 11,926 corresponding P values were manually identified, of which 68.1% ranged between 0.05 and 0.15 (CI: 67. to 69.0; median 0.06). Our results show that RCT reports regularly contain specific phrases describing marginally nonsignificant results to report P values close to but above the dominant 0.05 cutoff. The fact that the prevalence of the phrases remained stable over time indicates that this practice of broadly interpreting P values close to a predefined threshold remains prevalent. To enhance responsible and transparent interpretation of RCT results, researchers, clinicians, reviewers, and editors may reduce the focus on formal statistical significance thresholds and stimulate reporting of P values with corresponding effect sizes and CIs and focus on the clinical relevance of the statistical difference found in RCTs.

AB - The power of language to modify the reader's perception of interpreting biomedical results cannot be underestimated. Misreporting and misinterpretation are pressing problems in randomized controlled trials (RCT) output. This may be partially related to the statistical significance paradigm used in clinical trials centered around a P value below 0.05 cutoff. Strict use of this P value may lead to strategies of clinical researchers to describe their clinical results with P values approaching but not reaching the threshold to be "almost significant." The question is how phrases expressing nonsignificant results have been reported in RCTs over the past 30 years. To this end, we conducted a quantitative analysis of English full texts containing 567,758 RCTs recorded in PubMed between 1990 and 2020 (81.5% of all published RCTs in PubMed). We determined the exact presence of 505 predefined phrases denoting results that approach but do not cross the line of formal statistical significance (P < 0.05). We modeled temporal trends in phrase data with Bayesian linear regression. Evidence for temporal change was obtained through Bayes factor (BF) analysis. In a randomly sampled subset, the associated P values were manually extracted. We identified 61,741 phrases in 49,134 RCTs indicating almost significant results (8.65%; 95% confidence interval (CI): 8.58% to 8.73%). The overall prevalence of these phrases remained stable over time, with the most prevalent phrases being "marginally significant" (in 7,735 RCTs), "all but significant" (7,015), "a nonsignificant trend" (3,442), "failed to reach statistical significance" (2,578), and "a strong trend" (1,700). The strongest evidence for an increased temporal prevalence was found for "a numerical trend," "a positive trend," "an increasing trend," and "nominally significant." 
In contrast, the phrases "all but significant," "approaches statistical significance," "did not quite reach statistical significance," "difference was apparent," "failed to reach statistical significance," and "not quite significant" decreased over time. In a randomly sampled subset of 29,000 phrases, 11,926 corresponding P values were manually identified, of which 68.1% ranged between 0.05 and 0.15 (CI: 67. to 69.0; median 0.06). Our results show that RCT reports regularly contain specific phrases describing marginally nonsignificant results to report P values close to but above the dominant 0.05 cutoff. The fact that the prevalence of the phrases remained stable over time indicates that this practice of broadly interpreting P values close to a predefined threshold remains prevalent. To enhance responsible and transparent interpretation of RCT results, researchers, clinicians, reviewers, and editors may reduce the focus on formal statistical significance thresholds and stimulate reporting of P values with corresponding effect sizes and CIs and focus on the clinical relevance of the statistical difference found in RCTs.

UR - http://www.scopus.com/inward/record.url?scp=85125682579&partnerID=8YFLogxK

U2 - https://doi.org/10.1371/journal.pbio.3001562

DO - 10.1371/journal.pbio.3001562

M3 - Article

C2 - 35180228

SN - 1544-9173

VL - 20

SP - e3001562

JO - PLoS biology

JF - PLoS biology

IS - 2

M1 - e3001562

ER -
