A more formal plea for ambitious AI benchmarks in cancer research
A close friend inspired me to turn my recent blog post into a response to a Request for Information (RFI) from the National Cancer Institute. Because these responses never become public, and I am interested in others’ thoughts on some of these ideas, I am sharing my response here.
In response to the Request for Information, I offer the following input on the development of priority artificial intelligence benchmarks for cancer research.
What are AI-relevant use cases or tasks in cancer research and care that could be advanced through the availability of high-quality benchmarks?
We must aim far beyond tasks that are already largely solvable, such as image segmentation or treatment extraction from electronic health records. The true need is for benchmarks that define currently unachievable scientific goals. Here are some high-priority AI-relevant use cases in cancer research and care that would greatly benefit from novel benchmarks:
- Predicting multi-cellular tissue response to localized perturbation: This goes beyond single-cell or bulk analysis. For example: “If you perturb gene X in cell population A within a specific tumor microenvironment, what specific and quantifiable changes would you expect to see in the gene expression, signaling pathways, and phenotypic behavior of neighboring cell populations B and C, and how would this collectively impact tumor growth/metastasis in vivo?” This requires understanding complex intercellular communication and emergent properties of tissues.
- Predicting cancer incidence from molecular measurements of patient state: Patients present with widespread molecular changes upon diagnosis, including reprogramming of their immune system. This task could aim to predict which patients in high-risk groups will be diagnosed with cancer, of which type, and when, given readily accessible molecular measurements such as transcriptomics of their peripheral blood. Significant advancement here would aid the development of both predictive diagnostics and, potentially, preventative therapies.
- Identifying optimal multi-modal therapeutic interventions for an individual patient's complex tumor state: This is not just predicting response to a single drug, but identifying the combination and sequence of therapies (e.g., specific chemotherapies, immunotherapies, targeted therapies, radiation, surgery) that will lead to a defined positive outcome (e.g., complete remission, prolonged progression-free survival, minimal side effects) based on the patient's comprehensive molecular, cellular, and clinical profile. This moves beyond broad patient cohorts to truly individualized predictions of therapeutic efficacy and toxicity. It could be evaluated by providing models with molecular information about the tumor alongside the sequence of therapies received, and asking them to predict the masked long-term survival (a minimal scoring sketch follows this list).
- Predicting long-term systemic impact of cancer and its treatment on patient physiology and quality of life: This benchmark would focus on integrated, whole-organism modeling. For example: “Given a patient's initial cancer diagnosis and treatment plan, predict the trajectory of specific organ function (e.g., cardiac, renal, neurological), immune system state, and patient-reported quality of life metrics over 5 years, accounting for potential late effects of treatment and disease progression.” This requires integrating diverse data types and understanding inter-organ dependencies.
- Forecasting cancer evolution and emergence of resistance mechanisms under specific treatment regimens: Instead of merely detecting existing resistance, the benchmark would be: "Given a patient's tumor molecular profile at diagnosis and a proposed treatment regimen, predict the specific genetic and phenotypic alterations the tumor will acquire, and estimate the timeframe for this resistance to emerge in vivo." This requires dynamic, predictive models of evolutionary trajectories.
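To make the masked-survival evaluation described in the therapy-selection use case above concrete, here is a minimal, purely illustrative sketch of how submissions might be scored against sequestered outcomes using a concordance index. The file layout, column names, and choice of metric are my own assumptions for illustration, not an existing NCI pipeline.

```python
"""Illustrative sketch: scoring masked long-term survival predictions.

Hypothetical setup: teams submit a CSV with columns `patient_id` and
`predicted_risk` (higher = shorter expected survival); the benchmark
organizers hold a sequestered CSV with the true `time_to_event` (months)
and `event` (1 = event observed, 0 = censored), which is never released.
"""
import pandas as pd


def concordance_index(time, event, risk):
    """Harrell's C-index: the fraction of comparable patient pairs that the
    predicted risks order correctly (higher risk -> earlier observed event)."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if time[j] > time[i]:  # patient j outlived patient i's event time
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def score_submission(submission_csv: str, sequestered_csv: str) -> float:
    """Organizer-side scoring: model developers never see the hidden outcomes."""
    preds = pd.read_csv(submission_csv)    # patient_id, predicted_risk
    truth = pd.read_csv(sequestered_csv)   # patient_id, time_to_event, event
    merged = truth.merge(preds, on="patient_id", how="inner")
    return concordance_index(
        merged["time_to_event"].tolist(),
        merged["event"].astype(bool).tolist(),
        merged["predicted_risk"].tolist(),
    )
```

A concordance index of 0.5 corresponds to random ordering and 1.0 to perfect ranking; the point is only that the target is numerical and is computed solely by whoever holds the sequestered outcomes.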
These are use cases where benchmarks are not merely scarce; they are non-existent because they require a level of biological understanding and predictive power we do not currently possess. Focusing on such challenges will ensure that AI development is aimed at generating novel biological capabilities, not just automating what we can already do.
What are the desired characteristics of benchmarks for these use cases, including but not limited to considerations of quality, utility, and availability?
The most critical characteristic of a benchmark should be its ability to define a quantifiable, testable, and currently unsolved scientific problem. Its utility will not be in comparing a dozen similar models on an existing dataset, but in providing a clear "North Star" for the entire field, compelling us to create models with entirely new predictive powers. This framework necessitates the adoption of masked or sequestered testing sets as a standard practice. By keeping evaluation data hidden, we can ensure objective, unbiased assessment of model performance, a crucial guardrail against the self-deception and publication bias that can otherwise hinder true progress. In many cases, the benchmark will define a task for which the necessary training data has not yet been collected, thereby spurring new experimental work as an integral part of the solution.
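As a concrete (and entirely hypothetical) illustration of what a masked or sequestered testing set implies in practice, the sketch below shows the organizer-side protocol: inputs are public, labels never leave the organizers, teams receive only an aggregate score, and submissions are rate-limited so the hidden set cannot be probed by repeated guessing. The class, metric, and submission cap are assumptions for illustration only.

```python
"""Minimal sketch of a sequestered ("masked") test-set protocol (hypothetical)."""
from dataclasses import dataclass, field


@dataclass
class SequesteredBenchmark:
    hidden_labels: dict                 # sample_id -> true outcome; never released
    max_submissions: int = 5            # cap to limit overfitting to the hidden set
    _counts: dict = field(default_factory=dict)  # team -> submissions used

    def evaluate(self, team: str, predictions: dict) -> float:
        """Return a single aggregate score, never per-sample correctness."""
        used = self._counts.get(team, 0)
        if used >= self.max_submissions:
            raise RuntimeError(f"{team} has exhausted its submission budget")
        self._counts[team] = used + 1
        correct = sum(
            predictions.get(sample_id) == label
            for sample_id, label in self.hidden_labels.items()
        )
        return correct / len(self.hidden_labels)
```

The accuracy metric here is a stand-in for whatever quantitative endpoint a given benchmark defines; the essential design choices are that the evaluation data stay hidden and that feedback is coarse enough to guard against self-deception.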
Along these lines, some specific desired characteristics for benchmarks include:
- Defining a currently unachievable task: The benchmark must target problems that are not trivially solved by existing statistical methods and require significant advancements in AI and systems biology.
- Clear, quantifiable metrics for success: Analogous to AlphaFold's protein structure prediction, there must be objective, numerical ways to assess model performance. From the examples above, this could involve:
- Quantifiable changes in gene expression and protein levels in specific cell types within a tissue.
- Measurable reduction in tumor volume, number of metastases, or time to recurrence in in vivo models.
- Specific and measurable improvements in organ function or patient-reported outcomes.
- Accuracy in predicting specific resistance mutations or pathways.
- Requiring novel, out-of-sample data for validation: This is crucial to combat publication bias. The validation dataset must be kept separate and unknown to the model developers during training and development. This promotes true generalizability.
- Masked testing sets: A portion of the evaluation data should be held secret, accessible only for submitting models and objectively assessing performance. This ensures unbiased evaluation.
- Biological and clinical relevance: The benchmarks should address questions that, if solved, would genuinely advance our understanding of cancer biology or significantly improve patient care, rather than automating existing clinical or scientific tasks.
- Multi-modal and multi-scale data integration: The problems often span genomic, proteomic, imaging, clinical, and physiological data, requiring models to integrate information across different biological scales (molecular to organismal).
- Testable against measurements: The benchmark should be designed such that its proposed solution can eventually be verified through experimental or clinical measurements, even if those measurements are currently challenging to obtain. This encourages the collection of new experimental data and mechanistic understanding.
- Promoting open exchange of ideas: The benchmark framework should encourage competition and collaboration, fostering an environment where different techniques can be rigorously compared and insights shared. For example, the AI community regularly publishes both approaches that advance performance and approaches that do not work.
- Incentivizing major scientific advancement: The challenges should be significant enough to warrant substantial research effort and potentially large prizes, as seen in the protein structure prediction field.
What datasets currently exist that could contribute to or be adapted for benchmarking? Please include information about their size, annotation, availability, as well as AI use cases they could support.
While numerous datasets currently exist, the very fact that they are already available makes them insufficient for creating benchmarks. These existing resources are useful for training. However, determining whether a model is effective requires out-of-sample validation data that has never been seen by the developers. Furthermore, the process of tackling a benchmark should itself involve the generation of new experimental knowledge. The paradigm should be less about fitting models to existing data and more about using models to generate bold, testable hypotheses.
What are the biggest barriers to creating and/or using benchmarks in cancer research and care?
The greatest barrier to creating and using meaningful benchmarks is a fundamental lack of consensus on the long-term goals for AI in biology. Without a shared understanding of what we are trying to achieve, efforts will remain scattered and focused on incremental advances. This is compounded by the immense difficulty and expense of generating the novel experimental data required for true, out-of-sample validation, which incentivizes a culture of self-validation and overly optimistic reporting. Ultimately, the field is hampered by a focus on automating existing data analysis rather than pursuing genuinely new scientific capabilities. Establishing ambitious, common challenges through benchmarks would be an effective way to overcome this inertia, foster an open exchange of ideas, and create a framework where funding can be directed toward efforts that demonstrably push the boundaries of science.
Please provide any additional information you would like to share on this topic.
I hope that my core message has come across: benchmarks should focus on tasks that we know are not currently possible without major scientific advancement. This means moving beyond incremental improvements on existing tasks. To achieve this, the NCI could consider:
- Convening a multi-disciplinary working group: Bring together leading cancer biologists, clinicians, systems biologists, and data scientists to collectively define these "grand challenge" benchmarks, much as CASP was established for protein structure prediction.
- Funding dedicated benchmark development consortia: Support groups specifically tasked with curating existing data, generating new experimental data for masked testing sets, and developing the infrastructure for benchmark competitions.
- Establishing "AI X-Prize" style challenges: Offer large prizes for models that meet predefined performance thresholds on these highly ambitious, currently unsolved problems in cancer. This would incentivize innovation and attract talent.
- Funding novel modeling approaches: Once a benchmark is established, the NCI should support competitive applications proposing different modeling or experimental approaches to the challenge. If the benchmarks define challenges of great significance, then even proposals that make incremental progress toward them are of value.
- Prioritizing funding for model validation: Encourage and fund independent validation efforts, especially those involving prospective data collection, to ensure rigor.
- Fostering data sharing infrastructure: Invest in secure, federated learning platforms or data enclaves that allow AI models to be trained and tested on real-world cancer data while maintaining patient privacy.
- Encouraging mechanistic interpretability in benchmarks: While performance is key, future benchmarks could also incorporate metrics or requirements for models to provide biologically plausible and interpretable insights, not just black-box predictions.
Ultimately, the goal should be to shift the focus from simple automation to using AI to achieve new biological and clinical understanding and capabilities that were previously out of reach. This strategic shift is crucial for AI to truly revolutionize cancer research and care.
I worry that the NCI will be convinced that the excitement around AI is a reason for yet another round of big data collection efforts. It is not at all clear to me that we need more data, or what that data would be if we did. We are drowning in data. We need to set ambitious goals, measured through hard benchmarks that we believe are not currently solvable, and then provide modeling resources and incentives to understand why they cannot yet be solved. Only then should we go collect more data.