I'm Aaron Meyer: a bioengineer, cyclist, and nerd.

### Folding our knowledge in with the data—where systems biology could be headed

Last year in our machine learning/data analysis class, we had a bit of extra time to take a step back and return to why we were there in the first place. Regardless of the method, machine learning provides us a toolbox of functions with varying amounts of flexibility and rigidity. By applying a method, we get a space of functions:

$f(x, \beta)$

parameterize it with some judge of value:

$\arg\max_{\beta} \textrm{Value}(f(x, \beta))$

and are left with a function having (hopefully) useful properties:

$f(x, \hat{\beta})$
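As a minimal sketch of these three steps, consider a one-parameter family $f(x, \beta) = \beta x$ with negative squared error standing in for the judge of value. The data, the candidate grid, and the function names here are made up purely for illustration.

```python
# Step 1: a space of functions, indexed by beta (hypothetical choice: a line through the origin).
def f(x, beta):
    return beta * x

# Step 2: a judge of value. Higher is better, so we use the negative sum of squared errors.
def value(beta, xs, ys):
    return -sum((f(x, beta) - y) ** 2 for x, y in zip(xs, ys))

def fit(xs, ys, candidates):
    """arg max of value over a grid of candidate betas."""
    return max(candidates, key=lambda b: value(b, xs, ys))

# Made-up data, roughly y = 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]

candidates = [i / 100 for i in range(0, 401)]  # beta in [0, 4] in steps of 0.01
beta_hat = fit(xs, ys, candidates)

# Step 3: the parameterized function, ready to use.
def f_hat(x):
    return f(x, beta_hat)

print(beta_hat, f_hat(4.0))
```

A grid search is used only to keep the arg max explicit; any optimizer would play the same role.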

Modeling provides benefits beyond prediction, including hypothesis testing, communication, interpretation, and visualization. With these competing goals in mind we roughly organized the various methods in our toolbox:

This helped us identify a couple features of the landscape. First, our choices sit along some front of a tradeoff between explainability/interpretability and prediction performance/flexibility. This is not a new idea—much has been written about this tradeoff. Second, explainable/interpretable models make strong assumptions/expectations about the structure of the data.

Before looking ahead, it is helpful to recognize some of the developments that have made it possible to move out in the direction of prediction/flexibility. Namely, the continued march of computational performance has worked alongside computational tools to enable flexibly-defined, high-parameter models. This includes a resurgence in autodifferentiation tools, parallel/vectorized evaluation, and probabilistic languages to coordinate it all.

However, while very high parameter models[1] have given us accurate predictions, especially with ever-growing training data, these models are poor at extrapolation and interpretation. These two properties are especially critical to understanding biological systems; measurements are essentially always data starved, and the complexities of biological systems are such that even our highest-throughput experiments do not comprehensively sample every possible intervention we could make[2]. In other words, we can take lots of pictures of stop signs to teach a model to identify stop signs, but we almost always build models of cells to predict things we can’t measure or haven’t measured yet. We have to be able to see into the uncharted territory. Could your model identify stop signs if it had never seen one before?

So where does extrapolation come from? It comes from the inflexibility built into the model structure we choose. For example, ordinary least squares models are quite inflexible, and their inflexibility forces predictions where each variable extrapolates based on a linear relationship. Even flexible models, while restricted by the data within the training range, rely heavily on model inflexibility when extrapolating to entirely new predictions. However, the rigidity we encode to date has largely been tied to making a problem numerically tractable. For example, linear assumptions are often chosen because they give us models that we can reasonably solve, but we don’t expect much of biology to be linear.
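To make the point concrete, here is a small sketch with two models trained on the same points from a saturating curve. The curve $y = x/(1+x)$, the training range, and the nearest-neighbor model are all assumptions chosen for illustration. Both models fit reasonably inside the training range, but what they say far outside it is dictated entirely by their structure.

```python
def truth(x):
    # Hypothetical saturating relationship; bounded above by 1.
    return x / (1.0 + x)

def ols_fit(xs, ys):
    """Ordinary least squares for a line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def nearest_neighbor_predict(xs, ys, x_new):
    """A very flexible model: return the response of the closest training point."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x_new))
    return ys[i]

xs = [0.0, 0.5, 1.0, 1.5, 2.0]  # training range
ys = [truth(x) for x in xs]

slope, intercept = ols_fit(xs, ys)
x_far = 10.0  # well outside the training range

linear_pred = slope * x_far + intercept             # keeps climbing along the line
nn_pred = nearest_neighbor_predict(xs, ys, x_far)   # flat: repeats the last training value

print(linear_pred, nn_pred, truth(x_far))
```

Neither extrapolation comes from the data; the line overshoots because it must stay linear, and the nearest-neighbor model plateaus because it can only repeat what it has seen.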

However, our new computational tools allow us to perform efficient parameterization with programs of nearly any structure. Through autodifferentiation and probabilistic languages, we can now write semi-flexible functions of essentially any structure and perform reasonably efficient parameterization. The key is flexibility where you don’t have prior knowledge and rigidity where you do.

We did this in a very simple way with our first paper on antibody responses. We expect antibodies to bind following the laws of binding kinetics. Then, without much expectation about how this binding relates to cell response, we allow a flexible statistical relationship. Yuan et al. and Ma et al. also use a similar idea of encoding the structural information we know (the dynamics and hierarchy of biology, respectively) into a deep learning model.
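The following is a toy sketch of that split, not the model from the paper. The rigid part is a single-site binding curve with one unknown affinity, and a plain linear map stands in for the flexible statistical relationship to keep the example short. All names, concentrations, and the synthetic data are hypothetical.

```python
def bound_fraction(ligand, kd):
    """Rigid, mechanistic piece: single-site equilibrium binding."""
    return ligand / (ligand + kd)

def ols(xs, ys):
    """Statistical piece: least-squares line, returning slope, intercept, and squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    sse = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def fit_semi_flexible(ligands, responses, kd_grid):
    """Scan the mechanistic parameter; refit the statistical map at each value."""
    best = None
    for kd in kd_grid:
        bound = [bound_fraction(l, kd) for l in ligands]
        slope, intercept, sse = ols(bound, responses)
        if best is None or sse < best[0]:
            best = (sse, kd, slope, intercept)
    return best[1], best[2], best[3]

# Synthetic data generated with kd = 2.0 and response = 3 * bound + 1 (made up).
ligands = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0]
responses = [3.0 * bound_fraction(l, 2.0) + 1.0 for l in ligands]

kd_grid = [i / 10 for i in range(1, 101)]
kd_hat, slope_hat, intercept_hat = fit_semi_flexible(ligands, responses, kd_grid)
print(kd_hat, slope_hat, intercept_hat)
```

In practice the statistical piece could be something far more flexible, and the whole program could be fit by gradient-based or probabilistic tools rather than a grid scan.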

If I had to guess where machine learning will take systems biology, it is in this direction, providing a balance of flexibility and rigidity. Neural ODEs will lend themselves nicely to biology as a dynamic system where we have tools to measure dynamic responses and whether components interact. In fact, the authors summarize my point nicely: scientific knowledge [should be] encoded in structure, not data points. We will realize that the tools in our toolbox above are different views of the same solution, and learn how to encode our prior knowledge in model structure as well as parameters. Finally, I think we will re-focus our efforts on how far a model can extrapolate, rather than how accurately it can predict.

1. Call them deep if you must.

2. Sure, single-cell methods can provide enormous throughput, but we ultimately care about what happens in a whole organism, and often in a whole person.

### Linus Pauling on involvement in politics

I have to admit I didn’t know Linus Pauling’s second Nobel was the Peace Prize. In addition to his scientific pursuits, Pauling helped bring about the treaty banning tests of atomic explosives in the atmosphere. Though written six decades ago, it seems eerily relevant to discussions about scientists’ role in politics today:

The excerpt is from The Eighth Day of Creation, a comprehensive history of molecular biology’s beginnings.

### An Approach (and Template) for Reproducible Proposal Writing

This past year came with a significant increase in the number and complexity of funding proposals I put together. The process has been a bit of a learning experience, particularly in how important organization can be to larger writing projects. While I think the standard approach is to fight with Word, I’ve taken the opportunity to try to adopt some software tools (including LaTeX and Git) to make the process a bit more reproducible and straightforward. A few lessons I’ve taken away from the year follow below; if you would like to replicate the workflow I apply, please take a look at the template I start from on GitHub.

• Codify good practice. When putting together large documents I find it challenging to maintain consistency. For example, I would like acronyms to be introduced once, subsequently used throughout, and included in the glossary. The glossary package can seamlessly handle this for me. I would like all citations to be of the supercite variety and so overload the cite command. Aims and tasks end up subtly edited throughout the process of writing and proofing, and so these phrases are defined in commands. Make the computer work for you.
• Ensure final tweaking steps will be seamless. Inevitably, a few tweaks will be necessary to ensure the Specific Aims page is, in fact, a page, or that a figure isn’t hanging off the end of a page split. The main place where I adjust my practice in order to accomplish this is wrapping each figure into a command. That way, when I need to move a figure down, it’s just a single line making the trip.
• Use git as your track changes. A versioning system allows you to concentrate only on the document before you. What I’ve found works for me is to make commits roughly once a day, or when I hand the document to anyone else for proofing/approval. That way, I can continue to work on a document while someone else is looking through it, and when they provide edits I can instantly jump back to the point they saw. While folks in biology are most used to a Word document with track changes[1], I tell people they can provide me comments however they find most convenient, and many are most comfortable scribbling on a hard copy.
• Separate style and content. A style file seems redundant since it all can just go in the preamble of a document but, at least for me, it helps to enforce that those are not settings I should be tweaking when trying to assemble the content.
• Respect your readers by being boring. If the purpose of your document is to convey information as efficiently as possible, eschew complications that do not directly serve that purpose. Those font ligatures and bright colors aren’t going to help your readers if they distract them from learning what you want to tell them.

Obviously, these tips won’t give you high-quality science to talk about, but they have helped me streamline the process of getting that material onto a page without errors.

1. I also find track changes very distracting, so versioning being handled in a separate tool is helpful.

### Decoding cancer cells' molecular communication using systems biology

An essay I wrote for a contest. I didn’t win, but this serves as (I hope) a nice summary of the questions we’re trying to answer, particularly for any TAM-interested folks.

Multicellular organisms require communication and coordination both within and between cells. Cellular information sharing occurs through a variety of components including cytokines, growth factors, and hormones, which diffuse across the cell membrane or use transmembrane carriers. These signals operate in concert, and so a relaxation of reductionism in the form of systems biology can more completely capture their effects. Receptor tyrosine kinases (RTKs) are a class of these receptors which transduce extracellular information to modulate and coordinate intracellular processes. Cancer is in part a breakdown of intercellular regulation, and cancer cells frequently utilize RTKs to drive many hallmarks of the disease.

Accordingly, targeting RTKs has been therapeutically effective in a subset of tumors. The ultimate benefit of these therapies, however, is limited by resistance. Resistance occurs through a panoply of mechanisms, including mutation of the drug target to block the effect of therapy, amplification of the drug target to overcome inhibition, and “bypass” switching to alternative pathways not targeted by therapy. In the case of RTK-targeted therapies, often non-targeted RTKs may become activated to provide bypass resistance. One RTK in particular, AXL, while not accompanied by oncogenic mutations, has frequently been identified as a resistance mechanism to targeted therapies. AXL potently drives metastatic dissemination of cancer cells at the same time, and so its activation is especially dire. Despite its importance as identified through genetic studies, little was known about the signaling function of AXL. Therefore, as a graduate student in Douglas Lauffenburger and Frank Gertler’s laboratories at MIT, I became interested in defining how AXL signals, to identify when and where targeting the receptor might be effective.

Information transmission through AXL. Left) Triple-negative breast carcinoma cells frequently co-express AXL and EGFR. Transactivation of AXL by EGFR serves to amplify a subset of pathways downstream of both RTKs. This amplification results in qualitatively distinct EGFR signaling and drives cell invasion. Right) The activity of TAM (Tyro3, AXL, MerTK) receptors depends upon interaction of their ligands with phosphatidylserine. Signal is transduced from lipid to ligand and receptor by constraining the diffusion of ligand-receptor complexes, leading to dimerization.

RTKs frequently act in concert to drive specific phenotypic outcomes. When cells develop resistance to RTK-targeted agents, it can involve co-activation of receptors, and inhibiting combinations of these receptors is then required to overcome resistance. However, the role of these combinations—where multiple receptors are simultaneously important to cancer cell survival—perplexed me. The combination must provide something not available through either receptor alone. We wondered whether RTK co-activation may play a role in redirecting cell response to extracellular growth factor cues. Examining the response of breast carcinoma cells to EGF, I found that EGFR transactivates AXL. Varying the amount of EGFR activation with and without AXL present showed that this serves to quantitatively amplify the activation of certain pathways, producing a qualitatively distinct signaling response. This pattern of activation potently promoted the migration response to EGF over direct EGFR signaling itself. Using a new experimental approach of chemically cross-linking receptors to one another and then quantifying the pairs of cross-linked receptors in parallel, I identified that diffusional proximity of receptor pairs was predictive of their cross-talk capacity. Thus, RTK co-activation not only can lead to therapeutic resistance but endows cells with novel phenotypic traits, and cross-talking receptors are closely localized. Through their communication, the sum of RTK activation was greater than the individual receptor parts.

But how was the activity of AXL itself regulated? RTKs are often, on a most basic level, growth factor concentration sensors. Taking advantage of this observation, RTK signaling is most commonly studied by removing growth factors from cell culture to reduce a receptor’s activity, reintroducing the growth factor after a period of time, then measuring the resulting dynamic responses. This basic experiment frequently does not work for AXL, however; adding the AXL ligand Gas6 leads to no measurable phosphorylation response on its own. To explain this perplexing lack of response, I built a mathematical model of the receptor’s binding processes and fit it to our experiments lacking the expected activation. From there, it was clear: ligand was bound to the receptor, but the receptor-ligand complexes had to be brought together more tightly somehow for activation. AXL’s ligand, Gas6, simultaneously binds to a lipid, phosphatidylserine (PS), which is normally found only inside cells but is exposed during processes such as apoptosis, T cell activation, and photoreceptor turnover. The importance of PS for activating AXL has been recognized since 1997, almost as long as its ligand has been known, yet how the lipid functions to promote activation has remained elusive. Rather than PS activating AXL through some conformational change transmitted through the ligand-receptor complex, our model pointed toward its role being to shepherd AXL-Gas6 complexes together. Indeed, by varying the amount of PS added to culture, we could see a biphasic response where very high concentrations of PS in fact inhibited AXL activation. As predicted by the model, PS moieties served as a molecular corral for diffusing receptor-ligand complexes, thus promoting activation of the receptor within limited spots on the cell surface. In a sense, the AXL-Gas6 complex itself is the receptor, one that senses spots of PS presentation. To sense this important lipid, our cells have constructed this elegant, diffusion-driven sensor of ligand localization that is robust to changes in ligand concentration.
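A cartoon calculation can show why a corralling mechanism would be biphasic. This is emphatically not the model from the study; it is a back-of-the-envelope toy with invented assumptions: a fixed pool of complexes is captured onto PS patches with simple saturable binding, captured complexes spread evenly across patches, and activation scales with pairwise encounters within a patch.

```python
def pair_signal(ps, complexes=1.0, k_half=1.0):
    """Toy activation signal as a function of PS amount (arbitrary units).

    Assumptions (all hypothetical):
      - captured complexes = complexes * ps / (ps + k_half)
      - complexes per patch = captured / ps
      - pair encounters summed over patches = ps * (captured / ps) ** 2
    """
    captured = complexes * ps / (ps + k_half)
    return captured ** 2 / ps

# Scan PS over four orders of magnitude, from 0.01 to 100.
ps_values = [0.01 * 10 ** (i / 10) for i in range(41)]
signals = [pair_signal(p) for p in ps_values]

peak_ps = ps_values[signals.index(max(signals))]
print(peak_ps, signals[0], max(signals), signals[-1])
```

With too little PS, few complexes are captured; with too much, the captured complexes are diluted across many patches and rarely meet. In this toy the signal peaks where the PS amount equals `k_half`.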

These studies provided a first quantitative analysis of how AXL functions within cells. Later studies have continued to elucidate the role of AXL in cancer including how and when it mediates resistance to targeted therapies in other tumor types. AXL and the TAM receptor family to which it belongs have accrued considerable interest due to their widespread roles in the immune system. The molecular features that drive their activation will help to understand where, how, and when these receptors are activated. In cancer, the microenvironmental changes that lead to activation of AXL will allow us to identify the patients who will benefit from targeted therapies against the receptor.

More generally, this work shows the essential role systems-level studies will play uncovering the etiology of and therapeutic opportunities in complex diseases. In each case, focusing on a single manipulation or measurement would have prevented us from learning how these systems of molecular processes function. Strikingly, even a single receptor-ligand pair can pass information from the extracellular environment to a cell in a subtle and complex manner. We will need the systems biology toolbox on even these relatively focused scales to decipher how molecular and cellular components function together to produce cells, tissues, and ourselves.

### Author Post: On Why We Build Models

The work from the final portion of my Ph.D. thesis is online today in Cell Systems, a brand new journal from Cell Press. Though nascent, the journal has already published exciting studies on the geospatial distribution of bacteria in cities, using CRISPR-Cas9 to rapidly engineer yeast metabolic pathways, and programming synthetic circuits in gut microbiota[1].

In it, we use differential equation modeling to understand how AXL (and very likely the other TAM receptor tyrosine kinases) senses phosphatidylserine (PtdSer)-presenting debris, a long-understood core function of the family. This was by far the most challenging undertaking of my Ph.D., from utilizing the computational techniques to carefully designing the experimental measurements at each stage. The experience has taught me an enormous amount about the purpose and power of systems biology[2].

Shou et al. very recently described it best:

When scientists want to explain some aspect of nature, they tend to make observations of the natural world or collect experimental data, and then extract regularities or patterns from these observations and data, possibly using some form of statistical analysis. Characterizing these regularities or patterns can help scientists to generate new hypotheses, but statistical correlations on their own do not constitute understanding. Rather, it is when a mechanistic explanation of the regularities or patterns is developed from underlying principles, while relying on as few assumptions as possible, that a theory is born. A scientific theory thus provides a unifying framework that can explain a large class of empirical data. A scientific theory is also capable of making predictions that can be tested experimentally. Moreover, a theory can be refined in the light of new experimental data, and then be used to make new predictions, which can also be tested: over time this cycle of prediction, testing and refinement should result in a more robust and quantitative theory. Thus, the union of empirical and quantitative theoretical work should be a hallmark of any scientific discipline.

In a sense, kinetic rate equation models are fundamentally different from most data-driven approaches. These models of molecular systems make very few assumptions about underlying processes, meaning that we can not only learn from models that reproduce a behavior but often also from the ones that “break” and can’t fit the data. Relying only on experimental results doesn’t shield you from assumptions; in biology, experimental designs often rely on the underlying assumption that any one component of an organism has a unimodal relationship to the phenotype we observe. This is in part because the most simple (and often only feasible) experiments in one’s empirical toolbox are knockdown and/or overexpression, along with qualitative biochemical analyses. Biological systems are complex and nonlinear in their behavior though, and this initial view can quickly break down. For certain scales, a kinetic model can, in essence, be used as a scientific theory. Developed from underlying principles of rate kinetics and explaining the data we observe with as few assumptions as possible, it provides a unified framework for communication and further testing of our current understanding.
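For readers who have not seen one, a kinetic rate equation model can be as small as the sketch below: one ligand binding one receptor under mass-action kinetics, integrated with a hand-written Runge-Kutta step. The rate constants and concentrations are arbitrary placeholders, not values from the study.

```python
def dcdt(c, ligand, r_total, k_f, k_r):
    """Mass-action rate for complex C: free receptor binds ligand, complex dissociates."""
    return k_f * ligand * (r_total - c) - k_r * c

def rk4_integrate(c0, t_end, dt, **params):
    """Classic fourth-order Runge-Kutta integration of dC/dt from t = 0 to t_end."""
    c = c0
    for _ in range(int(round(t_end / dt))):
        k1 = dcdt(c, **params)
        k2 = dcdt(c + 0.5 * dt * k1, **params)
        k3 = dcdt(c + 0.5 * dt * k2, **params)
        k4 = dcdt(c + dt * k3, **params)
        c += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return c

# Placeholder parameters in arbitrary units; ligand is assumed to be in excess (constant).
params = dict(ligand=2.0, r_total=1.0, k_f=1.0, k_r=0.5)

c_end = rk4_integrate(0.0, t_end=10.0, dt=0.01, **params)

# The same principles give the steady state analytically: C = R_total * L / (L + Kd).
kd = params["k_r"] / params["k_f"]
c_expected = params["r_total"] * params["ligand"] / (params["ligand"] + kd)
print(c_end, c_expected)
```

Every term corresponds to a physical process, which is why a model like this is informative both when it fits and when it cannot.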

In the case of TAM receptors, manipulation by the addition or removal of ligand, receptor, or PtdSer has produced many observations. Some seemingly conflict with the previous mental model of particular factors being simply “activating” or “repressive.” The new theory/model that we propose, while being more complex, is maximally simple for the phenomena we wish to explain. With this new model, we can see how the previous mental model could be misleading, as complex, nonlinear relationships exist with respect to the timescale of one’s assay and factors such as PtdSer. That isn’t to say it is correct–it will always be wrong. In the near term, while we model only one receptor, AXL, the other TAM receptors MerTK and Tyro3 have critical roles both in normal physiology and cancer, and our understanding of how these receptors are similar or different is just beginning to be assembled. As TAM-targeted therapies are developed and evaluated in vivo, these models will help us understand how they work and develop even better therapies.

We need better methods at every step of this type of modeling, from construction and parameterization to understanding predictions[3]. Biology is complex, and mechanistic models such as these quickly become intractable on larger scales. Our study introduces the additional complexity of spatial scale, which makes each step of the process, as well as the corresponding experimental techniques, considerably more challenging. In the long term, I believe spatial organization of signaling will prove to be a critical component of understanding many cellular processes. We are going to need systems techniques to understand them.

1. I am incredibly excited by this journal’s creation. Systems biology has lacked a true “home” even as it has matured into an established field, and I am enthusiastic this will be it.

2. Ironically, I had to be convinced that such a project would be challenging enough to be interesting. After all, we learned about ODE modeling in undergraduate classes, it’s been applied for decades, and I would only be considering two proteins! Surely in the era of big data, a few rate parameters could be thrown together in a weekend and output some simulations of receptor activation. And rather than simulate a phenomenon, why not just measure it experimentally?

3. Of course there are huge efforts to develop better tools for these purposes, but these remain difficult problems. Notable promising directions are rule-based models (such as BioNetGen) and brave attempts to accelerate Markov Chain Monte Carlo (e.g. DREAM). Truly rigorous modeling still remains a challenge even for the computationally adept, however. It would be wonderful to have a rule-based framework that handled rigorous parameterization, spatial modeling through finite differences, compile-time detailed balance, and automatic differentiation (since all of these systems are stiff of course), but such a tool would be a considerable computational undertaking.

### Principal Components Explained Visually

Principal components analysis is an essential tool when working with multidimensional data, yet it can be difficult to develop an intuitive understanding of what the method is doing. This tool from Victor Powell really helps picture the process, as it lets you push and pull points around and immediately see the effect on the data output.
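For those who prefer to read the arithmetic, here is a minimal two-dimensional sketch of what the method computes: center the data, form the covariance matrix, and take its eigenvectors. The five points are made up, and the closed-form eigen-decomposition only works for the 2×2 case shown.

```python
import math

def pca_2d(points):
    """Return (first principal direction, fraction of variance it explains) for 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2.0
    half_gap = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    lam1, lam2 = mid + half_gap, mid - half_gap

    # Eigenvector for the larger eigenvalue (assumes b != 0, true for this data).
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam1 / (lam1 + lam2)

# Made-up points scattered tightly around the line y = x.
points = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
direction, explained = pca_2d(points)
print(direction, explained)
```

The first component points along the diagonal the points were scattered around, and nearly all of the variance lies along it.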

### The Bad Luck of Improper Data Interpretation

An article and news summary are out in Science this week with a bold claim: that two-thirds of all cancers are due to baseline mutagenesis intrinsic to cell division, and not environmental factors or genetics. This is based on an observed correlation between the number of stem cell divisions and incidence of cancer in various tissues. Certainly, such a conclusion would have immense consequences: it would emphasize treatment strategies over those of prevention, and refocus efforts away from understanding environmental toxins. Sadly, this conclusion is based on a frightening variety of errors in interpretation and basic math.

Most specifically, the two-thirds figure comes from the correlation coefficient between stem cell divisions and cancer incidence. While it is the case that the former explains 65% of the variation in cancer incidence between tissues, this does not translate to a percentage of cancer cases. This data is plotted on a log-log axis, and so distance along the plot is not linear. As the cancers with clear environmental factors are more common, the true fraction of cases is surely much lower than the 65% claimed.

Second, while this correlation might explain variation between tissues, it does not suggest the source of mutagenesis. Any factor that had similar effects throughout all tissues would vary this plot on the y-axis but have no effect on the correlation. Notably, as the data is plotted on a log axis, even for tissue-specific toxins many fold changes in the incidences of these cancers would still have no effect on the conclusions of this study.
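The first half of this argument is easy to check numerically. In the sketch below, the division counts and incidences are invented numbers, not the study's data. Multiplying every tissue's incidence by the same factor only shifts the points along the log y-axis, so the correlation is untouched.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented values for four hypothetical tissues.
divisions = [1e6, 1e8, 1e10, 1e12]
incidence = [1e-5, 3e-4, 1e-3, 5e-2]

log_div = [math.log10(d) for d in divisions]
log_inc = [math.log10(i) for i in incidence]
r_baseline = pearson(log_div, log_inc)

# A hypothetical factor that raises incidence tenfold in every tissue.
log_inc_shifted = [math.log10(10.0 * i) for i in incidence]
r_shifted = pearson(log_div, log_inc_shifted)

print(r_baseline, r_shifted)
```

A tenfold increase in every cancer would leave the plot's correlation, and therefore the "two-thirds" figure, exactly where it was.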

Additional failures of interpretation in this study suggest little understanding of the data analysis involved. For example, k-means clustering a single variable lends little insight, and with outliers on either end is sure to form two groups with separation near the center of the range. This provides no evidence of there being two “classes” of cancer.
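The clustering point can be demonstrated with a few lines. The sketch below runs a plain two-cluster k-means on evenly spaced values, which by construction contain no classes at all. The data and the choice to start the centroids at the extremes are assumptions for illustration.

```python
def kmeans_1d(values, iters=100):
    """Two-cluster k-means (Lloyd's algorithm) on a single variable."""
    centroids = [min(values), max(values)]  # start at the two extremes
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for v in values:
            idx = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            groups[idx].append(v)
        new_centroids = [sum(g) / len(g) for g in groups]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups

# Evenly spaced values from 0 to 10: no cluster structure whatsoever.
values = [float(v) for v in range(11)]
centroids, groups = kmeans_1d(values)

boundary = sum(centroids) / 2.0           # where the split between groups falls
midrange = (min(values) + max(values)) / 2.0
print(centroids, boundary, midrange)
```

The algorithm dutifully returns two groups split near the middle of the range, because returning two groups is all it can do when asked for two.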

This article seems to be the product of lax peer review and pressure to over-interpret data to boost public interest. Both of these provide short-term gain to those involved but in the long run corrupt the scientific literature and erode public trust in science. Don’t do it!

### Early Independence

Science doesn’t simply happen when money is spent; it requires immense effort, creative ideas, and dedicated time from well-trained scientists. Many factors can threaten these other requirements, such as funding instability and the aging of scientists. The average age at which an investigator receives their first R01, the mainstay grant of biomedical research, is now well into the mid-40s. This drives many of the most talented individuals out of biomedical research, and curtails the benefit we as investors in biomedical research derive from those who remain, by limiting their ability to perform independent science during some of their most creative years.

To begin to address this one problem, the NIH has begun an experiment, the Early Independence Award, funding young investigators immediately after their Ph.D. so that they may undertake their own research independently. I’m excited to be officially joining this experiment, and hope you’ll see exciting work from the Meyer lab at MIT soon.
