envelope external link

The Bad Luck of Improper Data Interpretation

An article and news summary is out in Science this week with a bold claim: that two-thirds of all cancers are due to baseline mutagenesis intrinsic to cell division, and not environmental factors or genetics. This is based on observed correlation between the number of stem cell divisions and incidence of cancer in various tissues. Certainly, such a conclusion would have immense consequences—it would emphasize treatment strategies over those of prevention, and refocus efforts away from understanding environmental toxins. Sadly, this conclusion is based on a frightening variety of errors in interpretation and basic math.

Most specifically, the two-thirds figure comes from the correlation coefficient between stem cell divisions and cancer incidence. While it is the case that the former explains 65% of the variation in cancer incidence between tissues, this does not translate to a percentage of cancer cases. This data is plotted on a log-log axis, and so distance along the plot is not linear. As the cancers with clear environmental factors are more common, the 65% claim is surely much lower.

Second, while this correlation might explain variation between tissues, it does not suggest the source of mutagenesis. Any factor that had similar effects throughout all tissues would vary this plot on the y-axis but have no effect on the correlation. Notably, as the data is plotted on a log axis, even for tissue-specific toxins many fold changes in the incidences of these cancers would still have no effect on the conclusions of this study.

Additional failures of interpretation in this study suggest little understanding of the data analysis involved. For example, k-means clustering a single variable lends little insight, and with outliers on either end is sure to form two groups with separation near the center of the range. This provides no evidence of there being two “classes” of cancer.

This article seems to be the product of lax peer review and pressure to over-interpret data to boost public interest. Both of these provide short-term gain to those involved but in the long run corrupt the scientific literature and erode public trust in science. Don’t do it!