Using LLMs to Learn About AlphaFold

Deep Gan Team
Feb 2, 2024

Learning the Paper and the Tools

To expand our knowledge of deep learning, we dove into the AlphaFold model and the biology behind it. As deep learning practitioners, we could follow parts of the model, but we were unfamiliar with the biology and chemistry concepts that shaped its design. So as we learned, we used the LLM tools now available to help us understand the model and the literature we were reading. This article was not written by an LLM, but we used LLMs to help us understand the ideas better.

ChatGPT and Bard

One of our main goals was to become familiar with the general concepts and key ideas introduced by AlphaFold even without a background in biology. Being new to the field, we were unfamiliar with many of the concepts and much of the terminology in the AlphaFold paper. We found that ChatGPT and Bard provided an excellent starting point for learning these concepts. Looking ahead to future projects in other unfamiliar fields, we feel that these AI tools could become even more useful to people trying to learn new concepts when a human expert is unavailable.

Data

The inputs to the AlphaFold network are the protein's amino acid sequence, multiple sequence alignments (MSAs), and pairwise features. The amino acid sequence is the fundamental information that dictates how the protein will fold. The output is the predicted protein structure, given mostly as 3D coordinates of the protein's atoms, together with a confidence estimate for the prediction. These 3D structures can be rendered visually.
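
To make the sequence input concrete, here is a minimal sketch of turning an amino acid sequence into numbers with one-hot encoding. It is an illustration under our own assumptions, not AlphaFold's actual featurization: the `one_hot` helper, the alphabet ordering, and the example sequence are all made up for this post.

```python
# The 20 standard amino acids as one-letter codes (ordering chosen arbitrarily here).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Return one row per residue with a single 1 in that residue's column."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        rows.append(row)
    return rows

# Hypothetical 8-residue sequence, used only to show the resulting shape.
encoded = one_hot("MKTAYIAK")
print(len(encoded), len(encoded[0]))
```

Each residue becomes a row of 20 numbers, so a protein of length L becomes an L x 20 table that a neural network can consume.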

It’s also important to note that AlphaFold does not simulate the physical folding process in order to predict a protein's structure. Instead, it treats structure prediction as a machine learning problem, learning patterns from data to predict the final structure. The physical shape of a protein matters because it helps determine how the protein functions.

You can download structures predicted by AlphaFold from the AlphaFold Protein Structure Database here: https://www.alphafold.ebi.ac.uk/download
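
Files from the AlphaFold Protein Structure Database in PDB format use fixed-width ATOM records, and the per-residue confidence score (pLDDT) is stored in the B-factor column. The sketch below reads the C-alpha coordinates and that column from a few toy records. The `ca_records` helper and the numbers in `PDB_LINES` are invented for illustration; a real project would more likely use an established parser.

```python
# Toy ATOM records in the fixed-width PDB layout (all values are made up).
PDB_LINES = [
    "ATOM      1  CA  MET A   1      -8.901   4.127  -0.555  1.00 52.31",
    "ATOM      2  CA  LYS A   2      -5.432   5.010   1.250  1.00 91.80",
    "ATOM      3  N   THR A   3      -2.100   6.300   2.000  1.00 70.00",
]

def ca_records(lines):
    """Return (residue number, (x, y, z), B-factor) for each C-alpha atom."""
    records = []
    for line in lines:
        # Columns 13-16 hold the atom name; skip everything that is not a C-alpha.
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        res_num = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        b_factor = float(line[60:66])  # AlphaFold DB files store pLDDT here
        records.append((res_num, xyz, b_factor))
    return records

records = ca_records(PDB_LINES)
for res_num, xyz, plddt in records:
    print(res_num, xyz, plddt)
```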

Alignment

AlphaFold relies on aligning the target protein with related proteins. One of the tools it uses for this is Multiple Sequence Alignment (MSA), which lines up sequences from different organisms so that similar regions can be found among them. AlphaFold also builds a pair representation: a matrix with one entry for every pair of residues in the protein. Such a matrix can be read as a graph whose nodes are residues and whose directed edges carry the pairwise information. The neural network then processes this matrix.
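
The sketch below shows, on a toy alignment, what an MSA looks like as data and how a residue-by-residue matrix can be derived from it. Note the substitution: AlphaFold learns its pair representation inside the network, whereas this example uses mutual information between alignment columns, a classic hand-crafted co-evolution statistic, purely to illustrate the idea of a pairwise matrix. The sequences and function names are made up.

```python
from collections import Counter
from math import log2

# Toy alignment: each row is one sequence, each column one aligned position.
MSA = [
    "MKTAY",
    "MKSAY",
    "MRTGY",
    "MRSGY",
]

def column(msa, i):
    return [seq[i] for seq in msa]

def conservation(msa):
    """Fraction of sequences sharing the most common residue in each column."""
    n = len(msa)
    return [Counter(column(msa, i)).most_common(1)[0][1] / n
            for i in range(len(msa[0]))]

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    ci, cj = Counter(column(msa, i)), Counter(column(msa, j))
    cij = Counter(zip(column(msa, i), column(msa, j)))
    return sum((c / n) * log2((c / n) / ((ci[a] / n) * (cj[b] / n)))
               for (a, b), c in cij.items())

def pair_matrix(msa):
    """Symmetric L x L matrix with one entry per pair of positions."""
    length = len(msa[0])
    return [[mutual_information(msa, i, j) for j in range(length)]
            for i in range(length)]

pairs = pair_matrix(MSA)
print(conservation(MSA))
print(pairs[1][3], pairs[1][2])
```

In this toy alignment, positions 1 and 3 always change together (K with A, R with G), so their entry in the matrix is high, while positions 1 and 2 vary independently and score zero. Signals like this hint at residues that may be in contact in the folded protein.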

Evoformer

Image from the AlphaFold 2 paper in Nature (https://www.nature.com/articles/s41586-021-03819-2)

The AlphaFold 2 model uses a transformer-based block called the Evoformer. One of the differences between AlphaFold 1 and AlphaFold 2 is that AlphaFold 1 is primarily based on CNNs while AlphaFold 2 uses transformers. An Evoformer block takes in an MSA representation (features of similar known amino acid sequences from other organisms) and a pair representation (features describing the relationship between each pair of residues, such as how far apart they are). The MSA representation undergoes self-attention to extract features of the interactions between amino acids. This information is then used to update the pair representation. Finally, a series of triangle updates refines the pair representation so that it stays consistent with Euclidean geometry; for example, the distances among any three residues should satisfy the triangle inequality.
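
As a rough sketch of how information flows inside one block, the NumPy code below implements two simplified steps: an outer-product-mean update that carries MSA information into the pair representation, and an "outgoing" triangle multiplicative update that recomputes the edge between residues i and j from the edges (i, k) and (j, k) for every third residue k. This is a simplification under stated assumptions: the weights are random, the sizes are arbitrary, and the layer normalization, gating, and attention layers described in the paper are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, n_res, c_msa, c_pair, c_hidden = 4, 6, 8, 5, 3  # toy sizes

msa = rng.normal(size=(n_seq, n_res, c_msa))    # MSA representation
pair = rng.normal(size=(n_res, n_res, c_pair))  # pair representation

def outer_product_mean(msa, w_a, w_b, w_out):
    """MSA -> pair: average outer products of projected columns over sequences."""
    a = msa @ w_a  # (n_seq, n_res, c_hidden)
    b = msa @ w_b  # (n_seq, n_res, c_hidden)
    outer = np.einsum("sic,sjd->ijcd", a, b) / msa.shape[0]
    flat = outer.reshape(outer.shape[0], outer.shape[1], -1)
    return flat @ w_out  # (n_res, n_res, c_pair)

def triangle_multiply_outgoing(pair, w_a, w_b):
    """Edge (i, j) is rebuilt from edges (i, k) and (j, k), summed over k."""
    a = pair @ w_a  # (n_res, n_res, c_pair)
    b = pair @ w_b
    return np.einsum("ikc,jkc->ijc", a, b)

# Random stand-ins for learned weights.
w_a = rng.normal(size=(c_msa, c_hidden))
w_b = rng.normal(size=(c_msa, c_hidden))
w_out = rng.normal(size=(c_hidden * c_hidden, c_pair))
w_ta = rng.normal(size=(c_pair, c_pair))
w_tb = rng.normal(size=(c_pair, c_pair))

# Residual-style updates: each step adds its result to the pair representation.
pair_after_msa = pair + outer_product_mean(msa, w_a, w_b, w_out)
tri = triangle_multiply_outgoing(pair_after_msa, w_ta, w_tb)
pair_out = pair_after_msa + tri
print(pair_out.shape)
```

The key point is the index pattern in the second `einsum`: every pair (i, j) is updated by looking at a third residue k, which is what lets the network reason about triangles of residues rather than isolated pairs.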

Training

The AlphaFold architecture can be trained to high accuracy using only supervised learning on data from the Protein Data Bank (PDB). To improve accuracy further, AlphaFold also uses an approach similar to noisy-student self-distillation: a trained network predicts structures for sequences that have no experimental structure, and the confident predictions are added to the training data. The final training set therefore combines labeled and unlabeled data.
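
The self-distillation idea can be sketched in a few lines: run a trained model on unlabeled sequences, keep only the confident predictions, and mix them with the labeled data. Everything named below is hypothetical (`build_distillation_set`, `toy_predict`, the 0.8 threshold, the placeholder strings); the paper's actual filtering and mixing procedure is more involved.

```python
def build_distillation_set(predict, unlabeled_sequences, min_confidence):
    """Keep (sequence, predicted structure) pairs whose confidence clears the threshold."""
    kept = []
    for seq in unlabeled_sequences:
        structure, confidence = predict(seq)
        if confidence >= min_confidence:
            kept.append((seq, structure))
    return kept

# Hypothetical stand-in for a trained model. The "structure" is a placeholder
# string, and confidence grows with length only to keep the example deterministic.
def toy_predict(seq):
    return f"predicted({seq})", min(1.0, len(seq) / 10)

labeled = [("MKTAYIAKQR", "experimental(MKTAYIAKQR)")]
unlabeled = ["MKT", "GSHMLEDPVAA", "ACDEFGHIK"]

pseudo_labeled = build_distillation_set(toy_predict, unlabeled, min_confidence=0.8)
training_set = labeled + pseudo_labeled  # mix experimental and predicted structures
print(len(training_set))
```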

Prediction

The structure module converts the network's intermediate representations into a 3D protein structure. AlphaFold predicts 3D coordinates for each heavy (non-hydrogen) atom in the protein.
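
One way to picture this step: give each residue its own rigid frame (a rotation and a translation) and map atoms from that local frame into global 3D space. The sketch below does this for backbone atoms only. The local coordinates are approximate, illustrative values, the two frames are chosen by hand rather than predicted, and the function names are our own.

```python
import numpy as np

# Approximate backbone atom positions in a residue's local frame (angstroms).
# Illustrative values; the C-alpha atom sits at the frame origin.
LOCAL_BACKBONE = np.array([
    [-0.5, 1.4, 0.0],  # N
    [0.0, 0.0, 0.0],   # CA
    [1.5, 0.0, 0.0],   # C
])

def rotation_z(angle):
    """Rotation matrix for a counterclockwise turn about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def place_backbone(rotations, translations):
    """Apply x_global = R @ x_local + t to every atom of every residue."""
    # rotations: (n_res, 3, 3), translations: (n_res, 3) -> (n_res, 3 atoms, 3)
    rotated = np.einsum("rij,aj->rai", rotations, LOCAL_BACKBONE)
    return rotated + translations[:, None, :]

# Two hand-picked frames standing in for the network's predictions.
rotations = np.stack([rotation_z(0.0), rotation_z(np.pi / 2)])
translations = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0]])

coords = place_backbone(rotations, translations)
print(coords.shape)
```

Because each residue moves as a rigid body, bond lengths inside a residue are preserved no matter which rotation and translation are applied.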

In the paper, PyMOL is used to visualize the structures. PyMOL lets users rotate, zoom, and explore a structure in 3D, highlighting specific residues, interacting regions, and potential binding sites.

Conclusion

In this blog post, we covered some of the basic concepts used by AlphaFold. We also saw how new AI technologies such as ChatGPT and Bard can help introduce us to new and unfamiliar concepts.

Deep Gan Team

We’re a team of Machine Learning Engineers exploring and researching deep learning technologies