Systems biology and AI in biology need to define their goals through benchmarks
There’s a great deal of talk these days about building “foundation models” of cells and employing large-scale AI in biology. The idea is that these models could simulate cells, understand their various states, and predict how perturbations might affect cellular responses. However, many of the demonstrated capabilities of current foundation models can already be achieved, or even surpassed, by simple statistical models. If these complex models merely automate tasks that existing data science tools can already handle, what is their true purpose? Unlike software development, where automating repetitive tasks has value, solving a biological question once doesn’t necessitate constant re-automation or rediscovery. We truly need models with novel capabilities, not just more complex ways to do what we can already do.
A significant concern surrounding these new modeling approaches is the issue of validation and the potential for publication bias. Defining truly new capabilities for a model often requires collecting novel, out-of-sample data, which is a resource-intensive undertaking. This limits the number of groups that can both develop and rigorously validate their models, and leaves the very people developing these models as the ones validating them. We’re all familiar with publication bias, and it’s likely that the literature will present the most optimistic picture of how well these models function. Splashy initial claims of success, which garner significant attention, will overshadow more careful and rigorous validations that inevitably highlight the models’ limitations.
Compounding these issues is a fundamental lack of common definitions and goals within the field. There is no shared understanding of what these large-scale biological modeling efforts should accomplish. Do we need a model that can translate individual cells’ gene expression between species, or one that can identify novel cell populations? Are these tasks on the road to curing a disease? We would greatly benefit from a collective undertaking to define the capabilities we currently lack and that these models should strive to achieve. Without this clarity, efforts risk becoming disparate and unfocused.
These observations collectively point to a critical need for developing goals defined through benchmarks. The field needs to come together and define specific, currently unachievable tasks that computational models could solve. These benchmarks don’t necessarily require closed datasets; in some cases, solving them might necessitate the collection of new experimental data and knowledge. The protein structure field and AlphaFold serve as an excellent example: the field provided a standardized task—predicting protein structure from sequence—with clear, quantifiable metrics for success. This undertaking remained effectively unsolved for decades and required both new experimental data and mechanistic understanding of protein folding. We need a similar framework for systems biology and AI models that aim to represent biology computationally.
We need ways to benchmark model performance in a testable manner. For instance, consider these benchmark tasks: if you perturb a gene within one cell population of a tissue, what changes would you expect to see in the neighboring cells? Or, which perturbations, in which cell populations, would you expect to affect overall tissue function? These are specific questions that could be tested against measurements to form the basis of a quantifiable framework.
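To make this concrete, here is a minimal sketch of how such a perturbation benchmark might be scored, assuming both the model’s predictions and the held-out measurements are expressed as log fold changes in gene expression per cell population. The function name, data layout, and metrics are hypothetical choices for illustration, not a proposal for any specific benchmark.

```python
import numpy as np
from scipy.stats import spearmanr

def score_perturbation_prediction(predicted: np.ndarray,
                                  measured: np.ndarray) -> dict:
    """Compare predicted vs. measured responses to a perturbation.

    Both arrays are (n_cell_populations, n_genes) matrices of
    log fold changes relative to the unperturbed tissue.
    """
    assert predicted.shape == measured.shape
    # Per-population rank correlation: does the model capture the
    # direction and relative magnitude of expression changes?
    correlations = [
        spearmanr(predicted[i], measured[i])[0]
        for i in range(predicted.shape[0])
    ]
    # Mean absolute error: how far off are the predicted magnitudes?
    mae = float(np.abs(predicted - measured).mean())
    return {"mean_spearman": float(np.mean(correlations)), "mae": mae}

# Hypothetical usage: 5 neighboring cell populations, 2,000 genes.
rng = np.random.default_rng(0)
measured = rng.normal(size=(5, 2000))
predicted = measured + rng.normal(scale=0.5, size=(5, 2000))
print(score_perturbation_prediction(predicted, measured))
```

The point is not these particular metrics; it is that once the task is stated this precisely, any model’s claim can be checked against the same measurements.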
Furthermore, we should adopt the practice of masked testing sets, a highly successful approach within the AI community. This involves keeping a portion of the evaluation data secret, allowing researchers to submit their models and objectively assess their performance against this hidden data. This ensures a truly unbiased assessment of a model’s capabilities, similar to how protein structure prediction models were evaluated. These benchmarks have proven their value time and again; without this objective assessment, researchers can fool themselves about how well their models actually perform.
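As a simple illustration of the idea (not a prescription for any particular evaluation platform), a masked test set can be as little as a scoring service that holds the measurements server-side and only ever returns an aggregate score to submitters. Everything below, including the class name and scoring metric, is a hypothetical sketch.

```python
import hashlib
import numpy as np

class MaskedBenchmark:
    """Minimal held-out evaluation: the hidden labels stay with the
    organizers; submitters only receive an aggregate score."""

    def __init__(self, hidden_labels: np.ndarray):
        self._hidden = hidden_labels  # never shared with submitters

    def evaluate(self, team: str, predictions: np.ndarray) -> dict:
        if predictions.shape != self._hidden.shape:
            raise ValueError("prediction shape does not match the task")
        # Score: mean squared error against the hidden measurements.
        mse = float(np.mean((predictions - self._hidden) ** 2))
        # Fingerprint the submission so results are auditable later.
        digest = hashlib.sha256(predictions.tobytes()).hexdigest()[:12]
        return {"team": team, "mse": mse, "submission_id": digest}

# Hypothetical usage: organizers hold the labels, teams submit predictions.
rng = np.random.default_rng(1)
benchmark = MaskedBenchmark(hidden_labels=rng.normal(size=(100, 50)))
submission = rng.normal(size=(100, 50))
print(benchmark.evaluate("team_a", submission))
```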
Until we establish common challenges and rigorous benchmarking methods that allow us to assess the success of our modeling endeavors, the field will continue to see an explosion of different techniques, all claiming superiority in various ways, while languishing without a common scientific “language” or objective framework for comparison. Testing against a common goal would allow the field to make progress through the open exchange of ideas. Funders could support competitive efforts to make progress, or establish large prizes to incentivize teams to pass certain thresholds in performance.
Defining a set of benchmarks that evaluate the new and unique capabilities of these models is absolutely crucial for making meaningful progress in large-scale biological modeling. Again, I’m not talking about the sort of benchmarks we already see, which compare existing methods on mostly solved tasks. I mean benchmarks that strike at the heart of what “solving” biology would mean, and tasks that we know are currently not possible without major scientific advancement. By the way, I think that defining these goals will reveal that current approaches are at the wrong level of abstraction. We don’t need a foundation model of cells; we need a foundation model of a person, of an organism, of how cells and tissues are organized across the body. That’s a much bigger undertaking, but by defining what it is we are trying to accomplish, we can focus on what will advance the science.