ABSTRACT

The International Nucleotide Sequence Database Collaboration (INSDC) comprises the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. Data are exchanged daily among these three databases, and sequence information is interconnected via a detailed data linkage system. BioSample (Europe) and NCBI BioSample/BioProject (USA) are archives for metadata, that is, data about the origin, storage methods, and location of specimens used in academic or industrial research. The Sequence Read Archive (SRA) stores data from next-generation sequencing (NGS) approaches. Many SRA records are linked to BioSamples and BioProjects, so it is easy to navigate back and forth among these three databases. The Transcriptome Shotgun Assembly (TSA) database is an INSDC database for storing downstream assemblies of transcriptomic SRA data. Hundreds of databases for the storage and retrieval of archived sequence information exist independently of INSDC. In 2019, the journal Nucleic Acids Research (NAR; Oxford Journals) published its 26th annual issue on bioinformatics databases (Galperin et al., 2019). The INSDC websites at NCBI are perhaps the most heavily used in genomics. One major caveat about the many independent databases concerns their ephemeral nature. By contrast, NIH GenBank and the other INSDC databases are more than likely going to be stable for many decades.