Utilising Machine Learning Models to Transform Plant Genomics
Posted 25th February 2019 by Joshua Sewell
Machine learning is widely used in everyday technology such as Google Assistant, Amazon’s Alexa and Apple’s Siri, and to filter spam out of your email inbox. Rapid developments in deep learning enable us to solve image classification problems – e.g. “does this picture contain a dog or a cat?”. Artificial intelligence is also expected to replace many human jobs – in the darkest views, it might even pose an existential threat to the human race.
What repercussions do these developments have for genome-related research and in particular plant genomics?
Defining the relationship between AI, ML and DL
To get started, it is useful to discuss how these terms, “deep learning”, “artificial intelligence” and “machine learning”, actually relate to each other. A widely accepted view of their relationship makes clear what we are talking about (see figure 1).
Artificial intelligence (AI) is the broadest of these three terms – it refers to the simulation of human intelligence processes by a computer: learning, reasoning and self-correction. Machine learning (ML) is a subset of AI – it refers to computational models that perform a specific task without explicit instructions, learning instead from data in order to make predictions or decisions. Finally, deep learning (DL) is a subset of ML – it refers to ML algorithms based on multi-layered (“deep”) neural networks, which are very good at specific tasks such as image analysis.
In the rest of this piece, I will write a few thoughts on the use of machine learning in science, which relates to a talk I will give in May 2019 at the 7th Plant Genomics and Gene Editing Congress in Rotterdam; this talk partially relates to our recent paper on using ML to predict cross-over patterns in plant genomes.
Machine learning in Plant Genomics
ML has been used extensively in genome bioinformatics. The fact that these algorithms learn from data, rather than starting by positing an explicit detailed model, makes biology very amenable to machine learning in many cases. Recently, there has been renewed interest in the use of machine learning in bioinformatics and computational biology, inspired by the successes of deep learning in various technological applications, in particular those related to image analysis.
Examples of models which have been developed over the last few years to predict and analyse genomic properties include DeepBind and DeepSEA. These and other examples indicate that for certain applications in genomic analysis, deep learning shows clear promise.
How is all of this relevant for plant genomics? One observation is that the large-scale datasets needed to train the above-mentioned deep learning applications are not yet available for plants. Even so, one would expect that in the near future DL will also be applied to various genomics-related questions in plant science.
I would speculate that one aspect of plant genomics which could be addressed is how to deal with multiple species simultaneously. Deep learning approaches might prove able to support comparative genomics analyses, or the transfer of knowledge from a model plant to a crop of interest.
Apart from these thoughts for the future, machine learning is already being applied extensively in plant genomics. One can think of various types of analyses of expression or sequence data, aimed at predicting the functions of genes or the differential effects of gene expression on a certain trait.
ML and predicting cross-over patterns
We analysed sets of cross-over locations from four different plant species, and trained a model to predict these locations based on DNA sequence. To do so, the DNA sequence was labelled with various descriptors (e.g. short sequence motifs, but also predicted DNA shape, etc.). Depending on the species, the models showed reasonable to good performance. Our next step was to look at which features the models used to make these predictions. Comparing these features between species indicated some interesting similarities, but also differences.
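To give a feel for what “labelling DNA sequence with descriptors” can look like, here is a minimal Python sketch that turns a sequence window into counts of short motifs (k-mers). This is only an illustration of one possible descriptor type: the function name `kmer_counts`, the choice of k and the example window are my own assumptions, not the actual descriptors or data used in the paper.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a DNA sequence.

    Returns one entry per possible A/C/G/T k-mer, so every window
    yields a feature vector of the same length. K-mers containing
    other symbols (e.g. N) are ignored.
    """
    sequence = sequence.upper()
    # Fixed feature order: every possible k-mer over the DNA alphabet.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: observed.get(kmer, 0) for kmer in all_kmers}


# Hypothetical example window, not taken from real data.
window = "ATGCGATATATGC"
features = kmer_counts(window, k=2)
print(features["AT"], features["GC"], features["CC"])  # prints: 4 2 0
```

Feature vectors like this, computed for windows with and without an observed cross-over, could then be passed to a classifier of one's choice.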
This type of ‘feature ranking’ is an aspect of machine learning which is often overlooked – in particular in many of the deep learning approaches mentioned above, emphasis is very much on how good the model predictions are, and not on how the model is able to make its predictions.
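One generic way to rank features is permutation importance: shuffle one feature at a time and measure how much the model's accuracy drops. The Python sketch below illustrates that idea under stated assumptions. It is not necessarily the ranking method used in our paper, and the toy data and the `toy_model` function are hypothetical stand-ins for real descriptors and a trained classifier.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving all other features untouched.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
rows = [[5, 1], [6, 0], [7, 1], [8, 0], [1, 1], [2, 0], [0, 1], [3, 0]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def toy_model(row):
    """Stand-in for a trained classifier: it looks only at feature 0."""
    return 1 if row[0] >= 4 else 0


ranking = permutation_importance(toy_model, rows, labels)
print(ranking)
```

In this toy case the second feature gets an importance of exactly zero, because the stand-in model never looks at it, while shuffling the first feature clearly hurts accuracy. With real models, correlated features can share or mask importance, so rankings like this should be interpreted with some care.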
Depending on what approach one uses, it is possible to ‘open the black box’ and learn from the patterns detected in the data by the algorithm. In particular for applications related to molecular biology, I feel that attention to such ‘transparent’ models is warranted. Such models hold the promise of gaining insight, if not by the computer, at least by us humans.
Aalt-Jan van Dijk is Assistant Professor in Plant Systems Biology at Wageningen University, Netherlands. He will be speaking at the 7th Plant Genomics & Gene Editing Congress: Europe.
The agenda for the upcoming 7th Plant Genomics & Gene Editing Congress: Europe is available for download. It features three tracks dedicated to Plant Genome Engineering, Plant Phenotyping & Bioinformatics, and Plant & Soil Microbiomes.