Using Machine Learning to Assign Gene Ontology
Posted 11th May 2018 by Jane Williams
It is a common situation in biology: we have identified a gene or a set of genes that are likely to be important for the system we are studying, but we have no clear explanation for why they are important or how they influence the system. For example, in plant genomics, we are often interested in a specific trait and have a genomic locus or a list of candidate genes that might affect this trait. The question is: how can we prioritise these genes in order to proceed to experimental validation?
To this end, it would be extremely useful if we knew the functions of our candidate genes. For instance, if a gene codes for an enzyme, what type of reaction does this enzyme catalyse? And in which pathways might this reaction take place?
Gene Ontology and lack of knowledge in plants
The Gene Ontology (GO) is a set of terms that describe gene and protein function in a systematic way. The GO divides “function” into three aspects: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). Unfortunately, for non-model species, GO annotations of genes are very scarce. In 2017, there were over 16,000 UniProt entries from Arabidopsis thaliana with at least one GO function assigned to them. For rice and tomato, this number drops below 8,000, and for many other interesting crops, such as potato and wheat, below 100. Note that even these numbers overstate our knowledge, as having one or a few GO annotations for a gene does not mean that we have discovered all of its functions.
Experimental vs Computational Function Discovery
How can we discover the unknown functions of genes? One way would be to perform biological experiments. However, these experiments are often very laborious, costly and time-consuming, and are typically focused on a single gene or a single function. Alternatively, we could have a computer algorithm that takes a gene as input and uses information from existing, experimentally derived GO annotations to output predictions for the functions of that gene. Such algorithms can provide GO annotations on a genome-wide scale in a matter of hours, or at most days, for essentially any species.
Computational methods are not accurate enough
The potential advantages of computational function prediction have led to the development of a plethora of such algorithms. However, as community-wide benchmarks have shown, the accuracy of these algorithms is still far from perfect.
Why is that?
In a joint project between the Pattern Recognition and Bioinformatics group of the Delft University of Technology in the Netherlands and Keygene, a Dutch agricultural biotechnology company, we sought to understand the causes of this apparent “failure” of all these different function prediction methods. We hypothesised that perhaps the problem in its current formulation is too difficult and this is why it is hard to obtain accurate predictions.
The number of functions that the models have to check is vast (over 40,000 in BP alone), and the functions are often similar to each other. To understand why this is an issue, consider the following analogy: you are shown the picture in Figure 1 and are asked whether the object in the figure is an orange or an apple.
Certainly, you can tell it is an orange and not an apple without having to think about it at all.
Now imagine that you are shown the same picture for the first time and you are asked whether the object is an orange, a sanguine, a tangerine, a bergamot, a kumquat, a grapefruit, a lemon, a lime or a citron. This is arguably a more challenging problem because you have more categories to choose from and fruit from these categories often look similar to each other. Something similar holds for automatic function prediction. There are a lot of GO terms to choose from and these terms often describe similar functions.
Simplifying the problem
To address these issues we developed a method to reduce the size of the problem and hopefully make it easier. At the core of our method is a well-known technique in machine learning called Principal Component Analysis, or PCA. PCA is often used to reduce the number of dimensions in a problem. For instance, we can describe human beings using their height and their weight (two variables/dimensions). These two variables are not independent, though. The taller a person is, the more they are expected to weigh (on average). In other words, a person’s height tells you something about their weight. Very roughly, PCA exploits such correlations between variables and finds new variables that carry (almost) the same information by combining the current ones. In the height-weight example, you could think of describing people using their Body Mass Index, which is a combination of height and weight. As a result, we have reduced the number of variables/dimensions from two to one. Moreover, PCA removes noise from the data and maintains the essential, “core” information.
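The height-weight example can be sketched in a few lines of Python. The data below are made up for illustration (any strongly correlated pair of variables behaves similarly):

```python
# Sketch of PCA on the height/weight example. The data are synthetic:
# weight is generated to be correlated with height, plus some noise.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)               # cm
weight = 0.9 * height - 80 + rng.normal(0, 5, 200)   # kg, correlated with height
X = np.column_stack([height, weight])

# Reduce two correlated variables to one combined variable per person.
pca = PCA(n_components=1)
z = pca.fit_transform(X)

# Because height and weight are strongly correlated, the single new
# variable retains most of the variance in the original two.
print(pca.explained_variance_ratio_)  # close to 1
```

The one retained component plays the role of the BMI-like “combined” variable from the text, found automatically from the correlations rather than defined by hand.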
How it works
The same principle can be applied to GO terms. We represent every GO term using one variable and apply PCA to obtain a smaller set of “intermediate” variables that contain the essential functional information. Then, we can easily adapt any existing automatic function prediction method to make predictions on the new variables. When we “feed” a new gene to our model, it gives us predictions for the intermediate variables, but these are hard to interpret. Conveniently, our reduction is constructed in such a way that we can easily transform these predictions into predictions for GO terms, which are more human-readable.
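The pipeline above can be illustrated schematically. This is not the authors' actual code; the features, matrix sizes and the choice of ridge regression as the base predictor are placeholder assumptions made for the sketch:

```python
# Schematic of predicting on PCA-reduced label variables and mapping back.
# Y: binary gene-by-GO-term annotation matrix; X: gene features.
# Both are random placeholders here, just to show the shapes and steps.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))                      # 100 genes, 20 features
Y = (rng.random((100, 500)) < 0.05).astype(float)   # 500 GO terms, sparse labels

# 1. Compress the 500 GO-term variables into 30 intermediate variables.
pca = PCA(n_components=30)
Z = pca.fit_transform(Y)

# 2. Train any multi-output predictor on the intermediate variables.
model = Ridge().fit(X, Z)

# 3. For a new gene: predict intermediate scores, then map them back
#    to one score per original GO term via the inverse transform.
x_new = rng.normal(size=(1, 20))
z_pred = model.predict(x_new)
go_scores = pca.inverse_transform(z_pred)
print(go_scores.shape)  # (1, 500)
```

Step 3 is what makes the intermediate variables tolerable: the predictor only ever sees 30 targets, yet the final output is still a score for every GO term.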
This concept of predicting using intermediate variables from PCA was first introduced in the machine learning community by Tai and Lin. We have adapted this method using a weighted version of PCA. Without getting too technical, our method (SEM) takes into account the correlations between the GO terms – as regular PCA does – but also their similarities on a semantic level, i.e. how similar the concepts they refer to are.
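The exact SEM formulation is in the authors' paper; the snippet below is only a generic illustration of how pairwise semantic similarities between labels could enter a PCA-style reduction. The similarity matrix and the element-wise weighting scheme are assumptions of this sketch, not the authors' method:

```python
# Generic sketch: a similarity-weighted PCA-like reduction of GO-term labels.
# S would come from a semantic similarity measure over the GO graph;
# here it is a random positive semi-definite placeholder.
import numpy as np

rng = np.random.default_rng(2)
n_terms = 50
Y = (rng.random((200, n_terms)) < 0.1).astype(float)  # placeholder annotations

A = rng.random((n_terms, n_terms))
S = A @ A.T                      # symmetric, PSD placeholder similarity matrix
S /= S.max()

# Weight the label covariance by the semantic similarities (one simple
# choice; the Schur product of two PSD matrices remains PSD), then take
# the top eigenvectors as the reduced directions.
cov = np.cov(Y, rowvar=False)
weighted = S * cov
eigvals, eigvecs = np.linalg.eigh(weighted)
components = eigvecs[:, ::-1][:, :10]     # top-10 directions
Z = (Y - Y.mean(axis=0)) @ components     # reduced label representation
print(Z.shape)  # (200, 10)
```

The intuition matches the text: terms that are both correlated in the annotations and semantically close get merged preferentially into the same intermediate variable.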
We tested three existing function prediction models on proteins from A. thaliana, as it is the only plant for which we have enough annotations to make a robust evaluation. We compared the performance of the models under three conditions: the original models trained on GO terms, their counterparts that operate on the variables found by PCA, and those that use the variables from our SEM method. The results are shown in Figure 2. All three tested models are improved by the use of dimensionality reduction.
|Figure 2: Performance (y axis) of three models (blue, red, yellow) in their original form, combined with PCA and combined with SEM. Performance is measured using the semantic distance, where lower values correspond to better performance.|
In conclusion, we are still far from achieving our goal, which is to obtain highly accurate predictions for the functions of plant genes. We have, however, used knowledge from computer science and machine learning to make the problem a little easier.
Stavros Makrodimitris is currently a PhD candidate in the Pattern Recognition and Bioinformatics group at Delft University of Technology.
Stavros Makrodimitris will be speaking at the 6th Plant Genomics & Gene Editing Congress: Europe. Find out more about the event here.