Home · Ameyer.me

Systems biology and AI in biology need to define their goals through benchmarks

June 25, 2025

There’s a great deal of talk these days about building “foundation models” of cells and employing large-scale AI in biology. The idea is that these models could simulate cells, understand their various states, and predict how perturbations might affect cellular responses. However, many of the demonstrated capabilities of current foundation models can already be achieved, or even surpassed, by simple statistical models. If these complex models merely automate tasks that existing data science tools can already handle, what is their true purpose? Unlike software development, where automating repetitive tasks has value, solving a biological question once doesn’t necessitate constant re-automation or rediscovery. We truly need models with novel capabilities, not just more complex ways to do what we can already do.

A significant concern surrounding these new modeling approaches is the issue of validation and the potential for publication bias. Defining truly new capabilities for a model often requires collecting novel, out-of-sample data, which is a resource-intensive undertaking. This limits the number of groups that can both develop and rigorously validate their models, and leaves the very people developing these models as the ones validating them. We’re all familiar with publication bias, and it’s likely that the literature will present the most optimistic picture of how well these models function. Splashy, initial claims of success, which garner significant attention, will overshadow more careful and rigorous validations that inevitably highlight the models’ limitations.

Compounding these issues is a fundamental lack of common definitions and goals within the field. There is no shared understanding of what these large-scale biological modeling efforts should accomplish. Do we need a model that can translate individual cells’ gene expression between species, or one that can identify novel cell populations? Are these tasks on the road to curing a disease? We would greatly benefit from a collective undertaking to define the capabilities we currently lack and that these models should strive to achieve. Without this clarity, efforts risk becoming disparate and unfocused.

These observations collectively point to a critical need for goals defined through benchmarks. The field needs to come together and define specific, currently unachievable tasks that computational approaches could solve. These benchmarks don’t necessarily require closed datasets; in some cases, solving them might necessitate the collection of new experimental data and knowledge. The protein structure field and AlphaFold serve as an excellent example: the field provided a standardized task, predicting protein structure from sequence, with clear, quantifiable metrics for success. This task remained effectively unsolved for decades, and required both new experimental data and mechanistic understanding of protein folding. We need a similar framework for systems biology and AI models that aim to represent biology computationally.

We need ways to benchmark model performance in a testable manner. For instance, consider these benchmark tasks: if you perturb a gene within one cell population of a tissue, what changes would you expect to see in the neighboring cells? Or, which perturbations, in which cell populations, would you expect to affect overall tissue function? These are specific questions that could be tested against measurements to form the basis of a quantifiable framework.

Furthermore, we should adopt the practice of masked testing sets, a highly successful approach within the AI community. This involves keeping a portion of the evaluation data secret, so that researchers submit their models and have their performance assessed objectively against this hidden data. This ensures an unbiased assessment of a model’s capabilities, similar to how protein structure prediction models were evaluated. These benchmarks have proven their value time and again; without this objective assessment, researchers can easily fool themselves about the performance of their models.
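To make the idea of a masked testing set concrete, here is a minimal sketch in Python of how such an evaluation might be organized. Everything in it is an illustrative assumption rather than an existing tool: the names `HiddenBenchmark`, `mean_baseline`, and `leaderboard` are hypothetical, the root-mean-square error metric is just one possible choice, and the toy data stands in for real measurements. The organizer holds the hidden data, participants only ever see a score, and every submission is ranked alongside a simple statistical baseline.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A "model" here is any function mapping input features to one predicted value.
Model = Callable[[Sequence[float]], float]


class HiddenBenchmark:
    """Hypothetical organizer-side harness: hidden data stays private, only a score leaves."""

    def __init__(self, inputs: Sequence[Sequence[float]], targets: Sequence[float]):
        if len(inputs) != len(targets) or len(targets) == 0:
            raise ValueError("inputs and targets must be non-empty and the same length")
        self._inputs = [list(x) for x in inputs]
        self._targets = list(targets)

    def score(self, model: Model) -> float:
        """Root-mean-square error on the hidden set (assumed metric; lower is better)."""
        squared = [(model(x) - t) ** 2 for x, t in zip(self._inputs, self._targets)]
        return math.sqrt(sum(squared) / len(squared))


def mean_baseline(train_targets: Sequence[float]) -> Model:
    """Simple statistical baseline: always predict the mean of the public training targets."""
    mean = sum(train_targets) / len(train_targets)
    return lambda _features: mean


def leaderboard(benchmark: HiddenBenchmark,
                submissions: Dict[str, Model]) -> List[Tuple[str, float]]:
    """Score every submission on the hidden data and rank from best to worst."""
    scores = {name: benchmark.score(model) for name, model in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Toy data (made up for illustration): the response is twice the first feature.
    public_targets = [2.0, 4.0, 6.0]
    hidden = HiddenBenchmark([[1.5], [2.5], [3.5]], [3.0, 5.0, 7.0])

    ranking = leaderboard(hidden, {
        "baseline_mean": mean_baseline(public_targets),
        "submitted_model": lambda features: 2.0 * features[0],
    })
    for name, rmse in ranking:
        print(f"{name}: RMSE = {rmse:.3f}")
```

Ranking the baseline alongside the submissions is a deliberate choice in this sketch: a complex model that cannot beat the mean predictor on hidden data has not demonstrated a new capability.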

Until we establish common challenges and rigorous benchmarking methods that allow us to assess the success of our modeling endeavors, the field will continue to see an explosion of different techniques, all claiming superiority in various ways, while languishing without a common scientific “language” or objective framework for comparison. Testing against a common goal would allow the field to make progress through the open exchange of ideas. Funders could support competitive efforts to make progress, or establish large prizes to incentivize teams to pass certain thresholds in performance.

Defining a set of benchmarks that evaluate the new and unique capabilities of these models is absolutely crucial for making meaningful progress in large-scale biological modeling. Again, I’m not talking about the sort of benchmarks we already see, which compare existing methods on mostly solved tasks. I mean benchmarks that strike at the heart of what “solving” biology would mean, built on tasks that we know are currently not possible without major scientific advancement. I also think that defining these goals will reveal that current approaches are at the wrong level of abstraction. We don’t need a foundation model of cells; we need a foundation model of a person, of an organism, of how cells and tissues are organized across the body. That’s a much bigger undertaking, but by defining what we are trying to accomplish, we can focus on what will advance the science.

Caring for science that is on life support

May 20, 2025

While uncertain, all signs point to a prolonged period of significantly reduced funding for science. Even if the current administration were to disappear in four years, funding agencies move slowly in the best of times, and substantial damage has been inflicted on their ability to accomplish even basic functions.

This funding crisis is causing incalculable harm; nothing in this post is going to change that. I have already seen the damage: so many talented trainees have had their career plans disrupted or have chosen to go to Canada or the U.K. for their graduate training. It’s infuriating to think about the scale of waste in an entire nation abandoning investments in its most talented and dedicated youth, who represent its future innovation potential. Longer term, I fear for this country. Scientific advancements are at the core of our economic, political, and military might, and we are abandoning the people and institutions that generate this progress.

This challenging situation necessitates a re-evaluation of our research strategies: what can be accomplished within these new constraints, and how, both to minimize harm and to take advantage of the situation wherever possible. Massive disruptions usually mean rethinking your overall strategy and plans, and this situation is no different. Maintaining scientific progress will be as important as ever if we are to continue addressing critical problems.

This situation has led me to ponder what sort of questions I can still pursue with much less funding; I think that the answer, at least in my case, is quite a lot. Many of the most impactful scientific breakthroughs have come through the cheapest, simplest experiments, and an immense privilege of my current position is that funding almost exclusively supports students and laboratory materials, not my salary. Moreover, we are living with an absolute deluge of data from omics and large-scale studies that few ever seem to have time to revisit, and I think there are many opportunities to use these data for discovery¹. My day-to-day tasks would change quite a lot: I would do all the experiments, data analysis, and paper writing myself. My lab’s work would rely more heavily on reanalysis of existing data, with a key experiment mixed in. However, would our intellectual impact—the novelty of the findings, the influence on subsequent research, the creation of new knowledge—be reduced? I am not sure it would².

I also wonder what has been lost by the chase for funding. As graduate students, my peers and I had only a vague understanding of how our work was funded. This left us with the freedom to consider what might be possible, without putting a dollar amount next to ideas from the start. As I have gotten more senior, I have seen these big ideas get mashed into the specific aims of proposals, and discussions even start from the perspective of what the funder would want. The cost of chasing funding has been so immense, both in the time committed to writing proposal after proposal, and in limiting the bounds of what is possible to what is fundable. As the fraction of funded proposals has gone from 25%, to 10%, to single percentage points, the cost of time per funding dollar has exploded.

This has really got me thinking lately. If there is a version of a lab with the same impact using less funding, why shouldn’t I choose this approach? There are several potential downsides I can imagine:

Under this change, our lab would have far fewer trainees. University research serves both to uncover new knowledge and to train students. Having fewer trainees may be good or bad³.
I anticipate that the intellectual impact of our work would be the same, but I am not sure this would be the perception of my field. For better or worse, ideas in biomedical research are often judged by simplistic rules, which can sometimes overshadow the intellectual rigor or novelty of the work. “Is there an in vivo model?” “This is just a reanalysis of existing data.” These responses do impact the perceived value of our work. At the same time, fighting for funding is often a big part of why we are forced to conform to these rules. Maybe there is value in being freed from these constraints.
Some questions that we can currently ask through larger-scale experiments would become inaccessible. However, there are also questions that I can’t pursue right now, as a result of my time and focus being pulled away to fund and support a larger group. We might lose the ability to conduct large-scale screening projects, but gain the capacity for deep, focused theoretical work or the development of novel computational methods.

In the end, we will have to choose how to adapt in this moment, and none of us can predict exactly what the future will bring.

¹ I increasingly feel that integrating data across scales is the key challenge of our time.
² Part of my judgement here is tied to AI and the cost/benefit of graduate training. When it comes to computational work, it has become easier to do more yourself, and I expect that this trend will continue.
³ There are many inconsistent and conflicting roles of graduate education that warrant their own post one day.