AI in Bioinformatics: From Genomic Variants to New Proteins

Olga works in bioinformatics, building neural networks for geneticists and biologists. The core question is what artificial intelligence can do in this field today.

There are now many applications of AI in biology. For instance, a neural network can assess a photo of a child to estimate the likelihood of a rare genetic condition. Face2Gene, an application developed by the company FDNA, demonstrates this: upload a photo, and the system analyzes facial features to guide clinicians toward which genes to investigate for an accurate diagnosis.

New neural networks have started helping bioinformatics teams understand how a mutation in a region that does not code for a protein might still affect disease. People often ask what a non-coding mutation means. The genome has two main parts: segments that code for proteins, and non-coding regions that were long dismissed as junk DNA. It turns out this non-coding DNA plays important roles, including gene regulation.

Mutations in non-coding regions are tricky to interpret because it is not always possible to predict their effect on disease. Not every genome change is harmful. In professional circles, the term mutation is often replaced with variant, which can be benign, pathogenic, or of unknown clinical significance. Most variants do not cause disease.

Guidelines exist for judging how variants in protein-coding regions affect health.

How can neural networks assist with that?

The team developed a neural network called DeepCT that suggests the role a mutation may play across different cell types. The human genome is the same in every cell, yet cells become kidneys, hearts, or brains because a different set of genes is turned on or off in each cell type. For example, genes critical to heart muscle activity are not active in brain cells.

DeepCT is not a finished clinical tool, but it can forecast the potential role of mutations. Its predictions still require validation. Google’s DeepMind has a neural-network tool called Enformer that predicts gene expression in a broad sense, not only a gene’s link to specific proteins. Enformer addresses whether a mutation affects protein synthesis and whether a gene will be active in cells.

Are there similar efforts by researchers elsewhere?

Such tools inspired the team to build its own analogue, DeepCT. The field already has language models for DNA, analogous to those used in natural language processing, but existing networks were often closed or less effective. The team therefore trained its own model, named GENA, which is openly accessible to scientists worldwide. It can learn patterns from short DNA sequences and extend those insights to longer ones.

Can GENA analyze longer segments? If DNA analysis depends on context, what determines the length of the sequence used?

Sequence length matters because context matters. The longer the sequence provided, the more patterns the model can identify. The aim is to extend the window the network can analyze. The first version of GENA handles about 3,000 nucleotides, while a newer architecture accepts up to 24,000 nucleotides. Both versions are publicly available, and researchers have downloaded them over a thousand times in recent weeks.
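To make the window limit concrete, here is a minimal Python sketch of how a sequence longer than a model's context might be cut into overlapping windows before analysis. It is an illustration only, not the team's pipeline: the function name `split_into_windows` and the `overlap` parameter are hypothetical, and the window is counted in nucleotides as in the figures above, whereas a real model's limit may be defined in tokens.

```python
def split_into_windows(sequence, window=3000, overlap=500):
    """Cut a DNA string into overlapping windows of at most `window` letters.

    Returns a list of (start_offset, subsequence) pairs. The overlap keeps
    some shared context between neighbouring windows (an assumed choice).
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")

    step = window - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append((start, sequence[start:start + window]))
        if start + window >= len(sequence):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


# Hypothetical 10,000-letter sequence, used only for illustration.
toy_sequence = "ACGT" * 2500
short_context = split_into_windows(toy_sequence, window=3000)
long_context = split_into_windows(toy_sequence, window=24000)
print(len(short_context), len(long_context))
```

With the smaller window the toy sequence has to be processed in several pieces, while the larger window takes it in one pass, which is why a longer context lets the model see patterns that span distant positions.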

There is also an in-house architecture called RMT (Recurrent Memory Transformer), developed by AIRI in collaboration with colleagues from the Moscow Institute of Physics and Technology. RMT handles long text sequences, potentially of unlimited length, and has shown success with sequences of 1 to 2 million characters in certain tasks. In genomic experiments, researchers are exploring inputs of more than 24,000 letters and preparing a formal publication.

Has GENA revealed new patterns linked to mutations?

Yes. GENA highlights regions of biological significance. Even when literature offers no statements about a genome region’s effects, the network may uncover previously unknown signals. Such discoveries warrant careful study to gradually reveal more about the genome.

The team continues to explore new directions. One avenue is applying GENA to metagenomic data from bat droppings to see if it can distinguish diverse viral and bacterial genomes in a sample. This work remains exploratory.

Are there neural networks predicting the likelihood of genetic diseases in a fetus?

These networks are still under development. In the meantime, genetic counseling remains the standard recommendation for couples planning a family. A geneticist can assess the risk of rare monogenic diseases based on family history and testing, independent of neural networks. If a high-risk pathogenic mutation is found in healthy parents, options like IVF with preimplantation genetic testing exist. Today, embryos can be screened for chromosomal abnormalities or broader genome profiles using established algorithms, with neural networks beginning to contribute to more comprehensive analyses.

Practices differ around the world. In Russia, a team at a research institute has developed a distinct embryo-screening method. It combines sequencing with analysis of how chromosomes are arranged in the nucleus, potentially revealing how genes interact to regulate each other. This line of work aims for more complete genomic insight.

How do these networks support virology research, such as coronavirus studies?

Neural networks are being used to model viral mutations and to explore how vaccines might need adaptation for new strains. The focus is often on the S-protein (the spike protein), a key component that interacts with human cells. Understanding how epitopes (antibody binding sites) shift with mutations helps evaluate vaccine effectiveness. A collaboration produced a neural network that analyzes a viral protein sequence and maps epitope locations. If mutations alter those epitopes, vaccine updates may be required.
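Once epitope locations have been mapped, checking whether a new strain touches them is a simple interval comparison. The Python sketch below illustrates that final step only; it does not reproduce the neural network. The function name `affected_epitopes`, the epitope names, and all coordinates are made up for illustration and do not describe any real virus.

```python
def affected_epitopes(epitopes, mutated_positions):
    """Report which epitopes contain at least one mutated residue.

    epitopes: dict mapping a name to (start, end), 1-based and inclusive.
    mutated_positions: iterable of 1-based residue positions that changed.
    Returns a dict mapping each affected epitope to its mutated positions.
    """
    hits = {}
    for name, (start, end) in epitopes.items():
        inside = sorted(p for p in mutated_positions if start <= p <= end)
        if inside:
            hits[name] = inside
    return hits


# Hypothetical epitope map and mutation list (not real coordinates).
predicted = {"epitope_A": (10, 20), "epitope_B": (40, 55)}
new_strain_mutations = {15, 30, 55}
print(affected_epitopes(predicted, new_strain_mutations))
```

In this toy case the mutation at position 30 falls outside both mapped sites, while the other two land inside an epitope and would be the ones to examine when judging whether a vaccine needs updating.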

There is also work on glycans — sugar-like structures that cloak proteins — which can hinder antibody access. These insights help assess immunogenicity and appraise whether a candidate antibody-based drug could work against a virus. While initially designed for coronaviruses, the approach now aims to study a broader range of pathogens.

Efforts are also underway to build a framework for designing completely new biologically active proteins. The team is exploring a similar model for this purpose; the work is at an early stage, with potential future applications across biology.
