Systems biology and AI in biology need to define their goals through benchmarks
There’s a great deal of talk these days about building “foundation models” of cells and employing large-scale AI in biology. The idea is that these models could simulate cells, understand their various states, and predict how perturbations might affect cellular responses. However, many of the demonstrated capabilities of current foundation models can already be achieved, or even surpassed, by simple statistical models. If these complex models merely automate tasks that existing data science tools can already handle, what is their true purpose? Unlike software development, where automating repetitive tasks has value, solving a biological question once doesn’t necessitate constant re-automation or rediscovery. We truly need models with novel capabilities, not just more complex ways to do what we can already do.
A significant concern surrounding these new modeling approaches is the issue of validation and the potential for publication bias. Defining truly new capabilities for a model often requires collecting novel, out-of-sample data, which is a resource-intensive undertaking. This limits the number of groups that can both develop and rigorously validate their models, and leaves the very people developing these models as the ones validating them. We’re all familiar with publication bias, and it’s likely that the literature will present the most optimistic picture of how well these models function. Splashy initial claims of success, which garner significant attention, will overshadow more careful and rigorous validations that inevitably highlight the models’ limitations.
Compounding these issues is a fundamental lack of common definitions and goals within the field. There is no shared understanding of what these large-scale biological modeling efforts should accomplish. Do we need a model that can translate individual cells’ gene expression between species, or one that can identify novel cell populations? Are these tasks on the road to curing a disease? We would greatly benefit from a collective undertaking to define the capabilities we currently lack and that these models should strive to achieve. Without this clarity, efforts risk becoming disparate and unfocused.
These observations collectively point to a critical need for developing goals defined through benchmarks. The field needs to come together and define specific, currently unachievable tasks that computational models could solve. These benchmarks don’t necessarily require closed datasets; in some cases, solving them might necessitate the collection of new experimental data and knowledge. The protein structure field and AlphaFold serve as an excellent example: the field provided a standardized task—predicting protein structure from sequence—with clear, quantifiable metrics for success. This undertaking remained effectively unsolved for decades and required both new experimental data and mechanistic understanding of protein folding. We need a similar framework for systems biology and AI models that aim to represent biology computationally.
We need ways to benchmark model performance in a testable manner. For instance, consider these benchmark tasks: if you perturb a gene within one cell population of a tissue, what changes would you expect to see in the neighboring cells? Or, which perturbations, in which cell populations, would you expect to affect overall tissue function? These are specific questions that could be tested against measurements to form the basis of a quantifiable framework.
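To make this concrete, here is a minimal sketch of how such a perturbation benchmark might be scored, assuming both the model’s predictions and the held-out measurements are expressed as log fold changes in gene expression per cell population. The function name, data layout, and metrics are hypothetical choices for illustration, not a proposal for any specific benchmark.

```python
import numpy as np
from scipy.stats import spearmanr

def score_perturbation_prediction(predicted: np.ndarray,
                                  measured: np.ndarray) -> dict:
    """Compare predicted vs. measured responses to a perturbation.

    Both arrays are (n_cell_populations, n_genes) matrices of
    log fold changes relative to the unperturbed tissue.
    """
    assert predicted.shape == measured.shape
    # Per-population rank correlation: does the model capture the
    # direction and relative magnitude of expression changes?
    correlations = [
        spearmanr(predicted[i], measured[i])[0]
        for i in range(predicted.shape[0])
    ]
    # Mean absolute error: how far off are the predicted magnitudes?
    mae = float(np.abs(predicted - measured).mean())
    return {"mean_spearman": float(np.mean(correlations)), "mae": mae}

# Hypothetical usage: 5 neighboring cell populations, 2,000 genes.
rng = np.random.default_rng(0)
measured = rng.normal(size=(5, 2000))
predicted = measured + rng.normal(scale=0.5, size=(5, 2000))
print(score_perturbation_prediction(predicted, measured))
```

The point is not these particular metrics; it is that once the task is stated this precisely, any model’s claim can be checked against the same measurements.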
Furthermore, we should adopt the practice of masked testing sets, a highly successful approach within the AI community. This involves keeping a portion of the evaluation data secret, allowing researchers to submit their models and objectively assess their performance against this hidden data. This ensures a truly unbiased assessment of a model’s capabilities, similar to how protein structure prediction models were evaluated. These benchmarks have proven their value time and again; without this objective assessment, researchers can fool themselves about how well their models actually perform.
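As a simple illustration of the idea (not a prescription for any particular evaluation platform), a masked test set can be as little as a scoring service that holds the measurements server-side and only ever returns an aggregate score to submitters. Everything below, including the class name and scoring metric, is a hypothetical sketch.

```python
import hashlib
import numpy as np

class MaskedBenchmark:
    """Minimal held-out evaluation: the hidden labels stay with the
    organizers; submitters only receive an aggregate score."""

    def __init__(self, hidden_labels: np.ndarray):
        self._hidden = hidden_labels  # never shared with submitters

    def evaluate(self, team: str, predictions: np.ndarray) -> dict:
        if predictions.shape != self._hidden.shape:
            raise ValueError("prediction shape does not match the task")
        # Score: mean squared error against the hidden measurements.
        mse = float(np.mean((predictions - self._hidden) ** 2))
        # Fingerprint the submission so results are auditable later.
        digest = hashlib.sha256(predictions.tobytes()).hexdigest()[:12]
        return {"team": team, "mse": mse, "submission_id": digest}

# Hypothetical usage: organizers hold the labels, teams submit predictions.
rng = np.random.default_rng(1)
benchmark = MaskedBenchmark(hidden_labels=rng.normal(size=(100, 50)))
submission = rng.normal(size=(100, 50))
print(benchmark.evaluate("team_a", submission))
```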
Until we establish common challenges and rigorous benchmarking methods that allow us to assess the success of our modeling endeavors, the field will continue to see an explosion of different techniques, all claiming superiority in various ways, while languishing without a common scientific “language” or objective framework for comparison. Testing against a common goal would allow the field to make progress through the open exchange of ideas. Funders could support competitive efforts to make progress, or establish large prizes to incentivize teams to pass certain thresholds in performance.
Defining a set of benchmarks that evaluate the new and unique capabilities of these models is absolutely crucial for making meaningful progress in large-scale biological modeling. Again, I’m not talking about the sort of benchmarks we already see, which compare existing methods on mostly solved tasks. I mean benchmarks that strike at the heart of what “solving” biology would mean, and tasks that we know are currently not possible without major scientific advancement. By the way, I think that defining these goals will reveal that current approaches are at the wrong level of abstraction. We don’t need a foundation model of cells; we need a foundation model of a person, of an organism, of how cells and tissues are organized across the body. That’s a much bigger undertaking, but by defining what it is we are trying to accomplish, we can focus on what will advance the science.