
A more formal plea for ambitious AI benchmarks in cancer research

A close friend inspired me to turn my recent blog post into a response to an RFI from the National Cancer Institute. Since these responses never become public, and I am interested in others’ thoughts on some of these ideas, here is my response.


In response to the Request for Information, I offer the following input on the development of priority artificial intelligence benchmarks for cancer research.

What are AI-relevant use cases or tasks in cancer research and care that could be advanced through the availability of high-quality benchmarks?

We must aim far beyond tasks that are already largely solvable, such as image segmentation or treatment extraction from electronic health records. The true need is for benchmarks that define currently unachievable scientific goals. Here are some high-priority AI-relevant use cases in cancer research and care that would greatly benefit from novel benchmarks:

These are use cases where benchmarks are not merely scarce; they are nonexistent, because the underlying tasks require a level of biological understanding and predictive power we do not currently possess. Focusing on such challenges will ensure that AI development is aimed at generating novel biological capabilities, not just automating what we can already do.

What are the desired characteristics of benchmarks for these use cases, including but not limited to considerations of quality, utility, and availability?

The most critical characteristic of a benchmark should be its ability to define a quantifiable, testable, and currently unsolved scientific problem. Its utility will not be in comparing a dozen similar models on an existing dataset, but in providing a clear "North Star" for the entire field, compelling us to create models with entirely new predictive powers. This framework necessitates the adoption of masked or sequestered testing sets as a standard practice. By keeping evaluation data hidden, we can ensure objective, unbiased assessment of model performance, a crucial guardrail against the self-deception and publication bias that can otherwise hinder true progress. In many cases, the benchmark will define a task for which the necessary training data has not yet been collected, thereby spurring new experimental work as an integral part of the solution.
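
To make the sequestered-testing idea concrete, here is a minimal Python sketch of how such an evaluation harness might work. Everything here is hypothetical and for illustration only: the class name, the submission budget, and the accuracy metric are assumptions, not any existing NCI system.

```python
# Minimal sketch of a sequestered-evaluation harness (hypothetical; it
# illustrates the idea only and is not an existing benchmark API).
from dataclasses import dataclass, field


@dataclass
class SequesteredBenchmark:
    """Holds hidden ground-truth labels; exposes only aggregate scores."""
    hidden_labels: dict[str, int]   # kept on the benchmark side, never released
    max_submissions: int = 5        # budget to deter overfitting to the leaderboard
    _used: int = field(default=0, init=False)

    def score(self, predictions: dict[str, int]) -> float:
        """Return overall accuracy without revealing per-case outcomes."""
        if self._used >= self.max_submissions:
            raise RuntimeError("submission budget exhausted")
        self._used += 1
        correct = sum(
            predictions.get(case_id) == label
            for case_id, label in self.hidden_labels.items()
        )
        return correct / len(self.hidden_labels)


# A team submits predictions and sees only the aggregate number,
# never which individual cases it got right or wrong.
benchmark = SequesteredBenchmark(
    hidden_labels={"case_1": 1, "case_2": 0, "case_3": 1}
)
print(benchmark.score({"case_1": 1, "case_2": 1, "case_3": 1}))  # ~0.67
```

A real benchmark would of course need authentication, audit logs, and task-appropriate metrics, but the essential guardrail is the same: developers never see the evaluation labels, only an aggregate score, and a limited submission budget keeps them from implicitly fitting to the hidden set.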

Along these lines, some specific desired characteristics for benchmarks include:

What datasets currently exist that could contribute to or be adapted for benchmarking? Please include information about their size, annotation, availability, as well as AI use cases they could support.

While numerous datasets currently exist, the very fact that they are already available makes them insufficient for creating benchmarks. These existing resources are useful for training. However, determining whether a model is effective requires out-of-sample validation data that the developers have never seen. Furthermore, the very process of tackling a benchmark should involve the generation of new experimental knowledge. The paradigm should be less about fitting models to existing data and more about using models to generate bold, testable hypotheses.

What are the biggest barriers to creating and/or using benchmarks in cancer research and care?

The greatest barrier to creating and using meaningful benchmarks is a fundamental lack of consensus on the long-term goals for AI in biology. Without a shared understanding of what we are trying to achieve, efforts will remain scattered and focused on incremental advances. This is compounded by the immense difficulty and expense of generating the novel experimental data required for true, out-of-sample validation, which incentivizes a culture of self-validation and overly optimistic reporting. Ultimately, the field is hampered by a focus on automating existing data analysis rather than pursuing genuinely new scientific capabilities. Establishing ambitious, common challenges through benchmarks would be an effective way to overcome this inertia, fostering an open exchange of ideas and creating a framework where funding can be directed toward efforts that demonstrably push the boundaries of science.

Please provide any additional information you would like to share on this topic.

I hope my core message has come across: benchmarks should focus on tasks that we know are not currently possible without major scientific advancement. This means moving beyond incremental improvements on existing tasks. To achieve this, the NCI could consider:

Ultimately, the goal should be to shift the focus from simple automation to using AI to achieve new biological and clinical understanding and capabilities that were previously out of reach. This strategic shift is crucial for AI to truly revolutionize cancer research and care.

I worry that the NCI will be convinced, once again, that the excitement around AI is a reason for big data collection efforts. It is not at all clear to me that we need more data, or what that data would be if we do. We are drowning in data. We need to set ambitious goals that can be measured through hard benchmarks we believe are not currently solvable, and then provide the modeling resources and incentives to understand why they cannot be solved. Only then should we go collect more data.