Shlok Shah, 8 March 2021

On 30 November 2020, Google’s DeepMind announced the results of AlphaFold2, a machine learning program widely credited with transforming biology by solving the ‘protein folding problem’, a challenge that had stumped scientists for over 50 years.

To understand what the protein folding problem is, we first need a basic grasp of the structure of proteins.

Proteins are chains of amino acids, and the chains have no fixed length. They fold up in stages: from the primary structure (the polypeptide chain itself), through secondary and tertiary structure, to the final three-dimensional shape, which for proteins built from several chains is called the quaternary structure. There are 22 amino acids known to be built into proteins in living organisms, 20 of which are specified directly by the standard genetic code, and they can be strung together in different orders to form a protein. All amino acids have the same basic structure; the only difference is the variable group (also called the side chain), which distinguishes each amino acid from the others. Due to interactions between these variable groups along the chain, each protein folds into a highly specific structure, specialised to a certain function.

Many of the molecules the human body needs in order to function, such as enzymes and many hormones, are proteins. An example of a highly specific protein in humans is haemoglobin, which is found in red blood cells. The shape of haemoglobin is specialised so that haem groups (structures containing the iron ions to which oxygen attaches) slot into it precisely, allowing oxygen to be transported to all parts of the body.

So, what is the protein folding problem? It is the question of how the primary string of amino acids determines the final three-dimensional structure of a protein. Christian Anfinsen, who won the Nobel Prize in Chemistry in 1972, proposed that the amino acid sequence alone determines this final structure.

For this reason, a quest began to predict the final structure of a protein from its initial polypeptide chain. This is such a difficult challenge because of the vast number of shapes a protein could theoretically fold into: an estimated 1.3×10¹³⁰ possible conformations. To put the sheer size of that number in perspective, there are only an estimated 1×10⁸⁰ atoms in the universe.
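To get a feel for the gap between these two numbers, the short Python sketch below divides one by the other and then estimates how long an exhaustive search would take. The search speed is a purely hypothetical figure chosen for illustration; it is not a measurement of any real computer or of how proteins actually fold.

```python
# Figures quoted in the text above.
conformations = 13 * 10**129   # 1.3 x 10^130 possible folds (the article's estimate)
atoms_in_universe = 10**80     # estimated number of atoms in the universe

# How many possible folds there are for every single atom in the universe.
folds_per_atom = conformations // atoms_in_universe

# ASSUMPTION (illustrative only): a search that tries 10^15 folds per second.
tries_per_second = 10**15
seconds_per_year = 60 * 60 * 24 * 365

# Years needed to try every fold one by one at that hypothetical speed.
years_to_try_all = conformations // (tries_per_second * seconds_per_year)

print(f"folds per atom: {folds_per_atom:.1e}")
print(f"years to try every fold: {years_to_try_all:.1e}")
```

Even at this generous made-up speed, trying every fold in turn would take unimaginably longer than the age of the universe, which is why prediction has to rely on something smarter than brute force.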

AlphaFold2 is a computer program that was entered into CASP (Critical Assessment of Protein Structure Prediction), a competition first held in 1994 in which research groups test the structure prediction programs they have developed. AlphaFold2 itself was trained on a public data set of just 170,000 protein structures. Its predecessor, AlphaFold, was announced as the winner of the 2018 competition, achieving a score of 58 GDT (Global Distance Test, a measure from 0 to 100 of how close the computationally predicted structure of a protein is to its true structure). This was significantly better than any algorithm produced by any other group over the previous 24 years, and a total of 18 points better than the winner of the 2016 competition.
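To make the GDT score more concrete, here is a minimal Python sketch of the commonly used GDT_TS variant: the average, over distance cutoffs of 1, 2, 4 and 8 ångströms, of the percentage of residues whose predicted position lies within that cutoff of the true position. The sketch assumes the two structures have already been laid on top of each other and that each residue is represented by a single point; the real CASP evaluation also searches for the best superposition, which is left out here. The function name `gdt_ts` and the coordinates are invented for illustration.

```python
import math

def gdt_ts(predicted, actual, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS on a 0-100 scale.

    ASSUMPTION: both structures are already superimposed and each residue is
    represented by a single (x, y, z) point, measured in angstroms.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty structures of equal length")

    # Distance between the predicted and true position of each residue.
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]

    # Percentage of residues within each cutoff, then the average of those.
    percentages = [
        100.0 * sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return sum(percentages) / len(percentages)

# Made-up four-residue example with errors of 0.5, 1.5, 3.0 and 10.0 angstroms.
actual = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (21.4, 0.0, 0.0)]

score = gdt_ts(predicted, actual)
print(score)  # 56.25: the average of 25%, 50%, 75% and 75%
```

A perfect prediction scores 100 under this measure, so a score above 90 means that almost every residue sits very close to its true position.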

Google’s DeepMind substantially improved on the first version of AlphaFold and entered the result, AlphaFold2, into the 2020 CASP competition. There it achieved a score greater than 90 GDT, and many have therefore deemed the protein folding problem solved. AlphaFold2’s overwhelming success can be attributed in large part to the machine learning methods it used.

The two most common types of machine learning are supervised learning and unsupervised learning. Supervised learning is the process of training an algorithm on labelled inputs, each paired with a known output. The goal is that, after being exposed to a large data set during training, the algorithm can correctly categorise a new and unfamiliar input.
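As a toy illustration of supervised learning in general (not of AlphaFold2's own method), the Python sketch below classifies a new input by copying the label of the closest labelled example, a technique known as nearest-neighbour classification. The data points, the labels and the function name `classify` are all invented for this example.

```python
import math

# Labelled training data: (input, label) pairs. All values are invented.
training_data = [
    ((1.0, 1.2), "small"),
    ((0.8, 1.0), "small"),
    ((5.0, 5.5), "large"),
    ((5.2, 4.9), "large"),
]

def classify(new_input, examples):
    """Return the label of the training example closest to new_input."""
    _, nearest_label = min(
        examples, key=lambda example: math.dist(example[0], new_input)
    )
    return nearest_label

# Two inputs the algorithm has never seen before.
print(classify((0.9, 1.1), training_data))  # small
print(classify((4.8, 5.1), training_data))  # large
```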

Unsupervised learning, on the other hand, is where the algorithm sifts through unlabelled data and identifies patterns in order to group it. AlphaFold2 used a combination of these, known as semi-supervised learning, which can give greater accuracy when only a small labelled data set is available, making it well suited to the CASP competition.
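The idea of combining labelled and unlabelled data can be sketched with a simple technique called self-training. Starting from a handful of labelled examples, the algorithm repeatedly takes the unlabelled point closest to anything already labelled, gives it that neighbour's label, and adds it to the labelled set. This is a toy illustration only and should not be read as AlphaFold2's actual training procedure; the data and the function name `self_train` are invented.

```python
import math

def self_train(labelled, unlabelled):
    """Grow a labelled set by pseudo-labelling unlabelled points.

    Each round, the unlabelled point that lies closest to any labelled point
    is given that point's label and moved into the labelled set.
    """
    labelled = list(labelled)
    remaining = list(unlabelled)
    while remaining:
        # Find the (unlabelled point, labelled example) pair that is closest together.
        point, (_, label) = min(
            ((u, example) for u in remaining for example in labelled),
            key=lambda pair: math.dist(pair[0], pair[1][0]),
        )
        labelled.append((point, label))
        remaining.remove(point)
    return labelled

# Only two labelled examples, plus unlabelled points (all values invented).
labelled = [((0.0, 0.0), "A"), ((10.0, 10.0), "B")]
unlabelled = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0),
              (5.0, 5.0), (6.0, 6.0), (9.0, 9.0)]

result = dict(self_train(labelled, unlabelled))
print(result[(6.0, 6.0)])  # A
print(result[(9.0, 9.0)])  # B
```

The point (6, 6) is closer to the original "B" example than to the original "A" example, yet it ends up labelled "A", because the unlabelled points form a chain leading back to "A". The unlabelled data has revealed a grouping that the two labelled examples alone could not show.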

This revolutionary program is considered a major breakthrough in both biology and computer science, and it could be used in thousands of applications.

There are many diseases that take millions of lives each year because we do not yet have enough information to cure them. An example of this is trypanosomiasis, which has a mortality rate of close to 100% when left untreated. This sub-Saharan African disease kills up to 50,000 people each year. Little is known about the structures of the proteins that make up the parasite that causes this disease, but AlphaFold2 could be used to find them, which could, in turn, accelerate the development of a drug. Normally, it can take over a decade and millions of dollars to produce a drug, but this revolutionary program could make drug discovery more efficient.

Another exciting potential use of AlphaFold2 is the design of proteins and enzymes that break down plastic waste in the ocean into simpler organic compounds, as well as proteins that capture carbon from the atmosphere, reducing the effects of global warming.

