
Folding our knowledge in with the data—where systems biology could be headed

Last year in our machine learning/data analysis class, we had a bit of extra time to step back and return to why we were there in the first place. Regardless of the method, machine learning gives us a toolbox of functions with varying degrees of flexibility and rigidity. By applying a method, we get a space of functions:

\[f(x, \beta)\]

parameterize it according to some measure of value:

\[\arg\max_{\beta} \textrm{Value}(f(x, \beta))\]

and are left with a function having (hopefully) useful properties:

\[f(x, \hat{\beta})\]
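
As a minimal sketch of these three steps, assume the simplest possible function space (a straight line) and take "value" to be the negative sum of squared errors. The data, the `value` function, and the closed-form `fit` are all illustrative choices, not anything specific to our class material:

```python
# Hypothetical toy data: y is roughly 2*x + 1 with small deviations.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]


def f(x, beta):
    """The space of functions: every (intercept, slope) pair is one member."""
    intercept, slope = beta
    return intercept + slope * x


def value(beta):
    """Judge of value: negative sum of squared errors (higher is better)."""
    return -sum((y - f(x, beta)) ** 2 for x, y in zip(xs, ys))


def fit():
    """argmax over beta of value(beta), using the closed form for a line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return (y_mean - slope * x_mean, slope)


# beta_hat picks one function out of the space; f(x, beta_hat) is the result.
beta_hat = fit()
print(beta_hat, value(beta_hat))
```

Swapping in a different `f` or a different `value` changes the method, but the three steps stay the same.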

Modeling provides benefits beyond prediction, including hypothesis testing, communication, interpretation, and visualization. With these competing goals in mind we roughly organized the various methods in our toolbox:

Figure: Our toolkit of models.

This helped us identify a couple of features of the landscape. First, our choices sit along the front of a tradeoff between explainability/interpretability and prediction performance/flexibility. This is not a new idea; much has been written about this tradeoff. Second, explainable/interpretable models make strong assumptions about the structure of the data.

Before looking ahead, it is helpful to recognize some of the developments that have made it possible to move out in the direction of prediction/flexibility. Namely, the continued march of computational performance has worked alongside computational tools to enable flexibly defined, high-parameter models. These tools include a resurgence in autodifferentiation, parallel/vectorized evaluation, and probabilistic programming languages to coordinate it all.

However, while very high-parameter models [1] have given us accurate predictions, especially with ever-growing training data, they are poor at extrapolation and interpretation. These two properties are especially critical to understanding biological systems: we are essentially always data starved, and biological systems are complex enough that even our highest-throughput experiments do not comprehensively sample every possible intervention we could make [2]. In other words, we can take lots of pictures of stop signs to teach a model to identify stop signs, but we almost always build models of cells to predict things we can’t measure or haven’t measured yet. We have to be able to see into the uncharted territory. Can your model identify stop signs if it has never seen one before?

So where does extrapolation come from? It comes from the inflexibility built into the model structure we choose. For example, ordinary least squares models are quite inflexible, and that inflexibility forces predictions in which each variable extrapolates along a linear relationship. Even flexible models, while constrained by the data within the training range, rely heavily on model inflexibility when extrapolating to entirely new predictions. However, the rigidity we have encoded to date has largely been tied to making a problem numerically tractable. For example, linear assumptions are often chosen because they give us models that we can reasonably solve, but we don’t expect much of biology to be linear.
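
To make this concrete, here is a small sketch under made-up assumptions: the "true" system is a saturating curve, and we fit three models to the same five training points. A straight line and a nearest-neighbor predictor stand in for an inflexible and a flexible model; the third model simply assumes the saturating form. None of these are models from our work, just the smallest examples I could think of:

```python
def truth(x):
    """Hypothetical saturating response, standing in for a biological system."""
    return x / (1.0 + x)


# Training data only covers x between 0 and 2.
train_x = [0.0, 0.5, 1.0, 1.5, 2.0]
train_y = [truth(x) for x in train_x]


def fit_line(xs, ys):
    """Ordinary least squares for a single variable (closed form)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return (y_mean - slope * x_mean, slope)


def predict_line(x, params):
    intercept, slope = params
    return intercept + slope * x


def predict_nearest(x, xs, ys):
    """1-nearest-neighbor: flexible inside the data, flat outside of it."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ys[nearest]


def predict_saturating(x, k):
    """Rigid structure that encodes what we believe: x / (k + x)."""
    return x / (k + x)


def fit_saturating(xs, ys):
    """Coarse grid search over k, minimizing squared error."""
    candidates = [0.1 * i for i in range(1, 51)]
    return min(
        candidates,
        key=lambda k: sum((y - predict_saturating(x, k)) ** 2 for x, y in zip(xs, ys)),
    )


line_params = fit_line(train_x, train_y)
k_hat = fit_saturating(train_x, train_y)

x_new = 10.0  # far outside the training range
line_pred = predict_line(x_new, line_params)
nearest_pred = predict_nearest(x_new, train_x, train_y)
saturating_pred = predict_saturating(x_new, k_hat)
print(truth(x_new), line_pred, nearest_pred, saturating_pred)
```

At x = 10 the line keeps climbing (to roughly 3.3), the nearest-neighbor model just repeats the last training value (about 0.67), and only the model with the right structure lands on the true value (about 0.91). In each case the data did not decide the answer; the structure did.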

However, our new computational tools remove much of this constraint. Through autodifferentiation and probabilistic languages, we can now write semi-flexible functions of essentially any structure and still parameterize them reasonably efficiently. The key is flexibility where you don’t have prior knowledge and rigidity where you do.

We did this in a very simple way with our first paper on antibody responses. We expect antibodies to bind following the laws of binding kinetics. Then, without much expectation about how this binding relates to cell response, we allow a flexible statistical relationship. Yuan et al. and Ma et al. also use a similar idea of encoding the structural information we know (the dynamics and hierarchy of biology, respectively) into a deep learning model.
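
The general shape of this idea can be sketched in a few lines. To be clear, this is not the model from the paper: the single-site equilibrium binding equation, the two receptors, their dissociation constants, the weights, and the data are all hypothetical, chosen only to show a rigid mechanistic layer feeding a flexible statistical one:

```python
# Hypothetical dissociation constants for two receptors (same units as conc).
KD = (1.0, 10.0)


def bound_fraction(conc, kd):
    """Rigid part: single-site equilibrium binding, conc / (conc + kd)."""
    return conc / (conc + kd)


def features(conc):
    """Predicted binding to each receptor at a given ligand concentration."""
    return [bound_fraction(conc, kd) for kd in KD]


def predict(conc, weights):
    """Flexible part: response as a weighted sum of predicted binding."""
    return sum(w * b for w, b in zip(weights, features(conc)))


def fit_weights(concs, responses):
    """Least-squares weights via the 2x2 normal equations (no intercept)."""
    rows = [features(c) for c in concs]
    a11 = sum(r[0] * r[0] for r in rows)
    a12 = sum(r[0] * r[1] for r in rows)
    a22 = sum(r[1] * r[1] for r in rows)
    b1 = sum(r[0] * y for r, y in zip(rows, responses))
    b2 = sum(r[1] * y for r, y in zip(rows, responses))
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Hypothetical, noise-free training data generated from known weights.
TRUE_WEIGHTS = (2.0, -1.0)
train_concs = [0.1, 0.3, 1.0, 3.0, 10.0]
train_resp = [predict(c, TRUE_WEIGHTS) for c in train_concs]

weights = fit_weights(train_concs, train_resp)

# Predict at a concentration ten times beyond anything in the training data.
print(weights, predict(100.0, weights))
```

The binding layer carries no free parameters here, so it extrapolates according to what we believe about binding, while the weights are left free because we have no strong expectation about how binding maps to response.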

If I had to guess where machine learning will take systems biology, it is in this direction, providing a balance of flexibility and rigidity. Neural ODEs will lend themselves nicely to biology, a dynamic system for which we have tools to measure both dynamic responses and whether components interact. In fact, the authors summarize my point nicely: scientific knowledge [should be] encoded in structure, not data points. We will realize that the tools in our toolbox above are different views of the same solution, and learn how to encode our prior knowledge in model structure as well as parameters. Finally, I think we will refocus our efforts on how far a model can extrapolate, rather than how accurately it can predict.

  1. Call them deep if you must. 

  2. Sure, single-cell methods can provide enormous throughput, but we ultimately care about what happens in a whole organism, and often in a whole person.