ABSTRACT

Viruses are present in every ecosystem on Earth and often vastly outnumber their cellular hosts. Global phage diversity is itself very broad: it includes both RNA and DNA genomes, with sizes ranging from a few kilobases to half a megabase, and a high proportion of the genes in each phage are unique to it and not detected in any other genome. Long-read metagenomics holds great promise for exploring phage diversity, including the recovery of hypervariable regions of phage genomes, which often fail to assemble from short reads. Two main approaches have been proposed to identify known and novel phage genomes in an assembled metagenome, and variations on these two approaches are implemented in multiple tools. Gene prediction in phage genomes is challenging for conventional whole-genome prokaryotic gene finders such as Glimmer or Prodigal because of several phage-specific features: a short genome length, more frequent overlapping genes, possible programmed frameshifts, and an unknown genetic code.