ABSTRACT

Sequence classification is a common task in modern virus bioinformatics research: DNA, RNA, or protein sequences are either filtered for certain properties, or the properties of a given sequence are to be determined. The task is highly diverse. Prior knowledge about the data and the amount of usable data differ from project to project, as does the classification task itself. An additional difficulty is that, even today, for most biological questions, especially in virology, we lack a set of measurable properties (features) that consistently explain our observations. Here, we introduce machine learning for viral sequence classification. Together with the reader, we build a deep neural network (DNN) pipeline that classifies the host of an influenza A virus from its genome sequence with high accuracy. This result may be somewhat surprising: despite years of research, no known set of properties leads to highly accurate predictions, and currently exceptions are often found more frequently than new features. Deep learning can automatically identify a trainable set of features and their dependencies, with higher predictive power than previous approaches. This work may serve as a starting point to encourage researchers in virology to use machine learning. Using viral host prediction as an example, we discuss classical pitfalls related to data quantity and quality.