Data-efficient Machine Learning of Biomolecules

SCC and Helmholtz AI, in cooperation with FZJ and DLR, have published a study in Communications Biology that proposes how modern and classical deep machine learning methods can be combined in a data-efficient manner.

Example system with correct predictions in green and incorrect predictions in yellow, main strand in turquoise and bases in purple


Life is determined at the cellular level by various biomolecules. They constitute the machinery of living organisms and play a crucial role in the functioning of each cell. Machine learning is increasingly being used to study their function and related structure. Members of the Multiscale Biomolecular Simulation research group and the Helmholtz AI team, in cooperation with Forschungszentrum Jülich and the German Aerospace Center (DLR), have now proposed a method to combine modern and classical deep machine learning methods to build models even in data-poor scenarios.

The researchers use a deep learning approach to predict spatial neighborhoods between RNA building blocks, called nucleotides. The idea resembles a LEGO model: when individual bricks are replaced in one location, the neighboring bricks must adjust so that the entire structure still fits together. The BARNACLE model proposed in the study applies this idea to RNA: nucleotides that are spatially close together in an RNA molecule are also more likely to mutate together during evolution, and it is precisely these emergent mutation patterns that the model looks for. Training combines self-supervised pre-training on large amounts of sequence data with efficient use of the scarce structural data. With this approach, BARNACLE showed a significant improvement over established classical statistical approaches as well as over other neural networks. The study also shows that the method is transferable to related tasks with similar data constraints.
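
The idea that spatially close nucleotides tend to mutate together can be illustrated with a small sketch. The example below is not the BARNACLE model and not code from the study. It uses mutual information between the columns of a sequence alignment, a simple classical covariation measure chosen here only for illustration. The function names and the four-sequence toy alignment are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    High values mean the two positions vary together across sequences,
    which is the kind of co-mutation signal described in the text.
    """
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def rank_column_pairs(alignment):
    """Return all column pairs sorted by mutual information, highest first."""
    length = len(alignment[0])
    scores = {
        (i, j): column_mutual_information(alignment, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment (invented for illustration): positions 0 and 3
# always change together (G-C, C-G, A-U, U-A), positions 1 and 2 never change.
TOY_ALIGNMENT = [
    "GAAC",
    "CAAG",
    "AAAU",
    "UAAA",
]

if __name__ == "__main__":
    for pair, score in rank_column_pairs(TOY_ALIGNMENT):
        print(pair, round(score, 3))
```

In this toy alignment the pair of positions 0 and 3 receives the highest score, because the two positions always change together, while the conserved positions 1 and 2 carry no covariation signal. A measure like this needs many aligned sequences to be reliable on real data. According to the article, BARNACLE instead learns such patterns through self-supervised pre-training on sequence data and then makes efficient use of the few available structures.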

The results of this study were published in the paper "RNA Contact Prediction by Data Efficient Deep Learning" in the journal Communications Biology.



Achim Grindler