Algorithms to Store and Retrieve Two-Dimensional (2D) Chemical Structures

doi:10.1201/9781420082999-6

ABSTRACT

Since molecules are three-dimensional (3D) and lack any intrinsic ordering in their chemical formulas, it becomes necessary to supply a set of linear rules to any computer system designed for storing and retrieving chemical structures. Ideally, such a notation would not only retain important knowledge about a molecule’s 3D structure but also contain a mechanism to distinguish between molecules. As we saw in the previous chapter, graphs are dimensionless or zero-dimensional (0D) objects but can also be considered as one-dimensional (1D), two-dimensional (2D), 3D, or higher-dimensional objects in various methods of structural representation. Thus, 0D or constitutional descriptors such as molecular weight and atom counts are defined using local molecular information. 1D notations for structures and reactions include linear representations such as SMILES (Simplified Molecular Input Line Entry Specification) [1,2], WLN (Wiswesser Line Notation) [3,4], SLN (SYBYL Line Notation) [5,6], and InChI (the IUPAC International Chemical Identifier, https://www.iupac.org/inchi). Molecular graphs can be represented in two dimensions as chemical diagrams such that a vertex corresponds to the (x, y) coordinates and type of an atom, and an edge corresponds to a bond type. They can also be extended to three dimensions such that a vertex contains information about (x, y, z) atomic coordinates instead. 3D structures can then be generated by further incorporating knowledge of bond lengths, bond angles, and dihedral angles. In this chapter we illustrate some methods and algorithms for the storage, retrieval, and manipulation of 2D representations of chemical structures, while 3D representation is treated in Chapter 3.
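The 2D molecular graph described above can be illustrated with a minimal Python sketch. It stores each vertex as an element symbol plus (x, y) drawing coordinates and each edge as a pair of atom indices plus a bond order, then derives two 0D descriptors (atom counts and molecular weight) from the graph. The class names, the formaldehyde coordinates, and the small atomic-weight table are assumptions made for illustration only; they are not taken from the chapter or from any particular toolkit.

```python
from collections import Counter
from dataclasses import dataclass

# Approximate standard atomic weights (g/mol); an illustrative subset only.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


@dataclass(frozen=True)
class Atom:
    """A vertex: atom type plus 2D diagram coordinates."""
    element: str
    x: float
    y: float


@dataclass(frozen=True)
class Bond:
    """An edge: two 0-based atom indices and a bond order (1, 2, or 3)."""
    begin: int
    end: int
    order: int


@dataclass
class MolGraph2D:
    """Hypothetical container for a 2D molecular graph."""
    atoms: list
    bonds: list

    def atom_counts(self):
        # 0D descriptor: how many atoms of each element are present.
        return Counter(atom.element for atom in self.atoms)

    def molecular_weight(self):
        # 0D descriptor: sum of atomic weights over all atoms.
        return sum(ATOMIC_WEIGHT[atom.element] for atom in self.atoms)

    def neighbors(self, index):
        # Topology: indices of atoms bonded to the atom at `index`.
        found = []
        for bond in self.bonds:
            if bond.begin == index:
                found.append(bond.end)
            elif bond.end == index:
                found.append(bond.begin)
        return found


# Formaldehyde (H2C=O) with explicit hydrogens; coordinates are arbitrary
# drawing positions chosen for this example, not measured geometry.
formaldehyde = MolGraph2D(
    atoms=[
        Atom("C", 0.0, 0.0),
        Atom("O", 0.0, 1.2),
        Atom("H", -0.9, -0.5),
        Atom("H", 0.9, -0.5),
    ],
    bonds=[Bond(0, 1, 2), Bond(0, 2, 1), Bond(0, 3, 1)],
)

counts = formaldehyde.atom_counts()
weight = formaldehyde.molecular_weight()
carbon_neighbors = sorted(formaldehyde.neighbors(0))
```

Note that the descriptors depend only on the atom list, whereas `neighbors` uses the edges; the (x, y) coordinates affect neither, since they serve only to draw the diagram.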

The information contained in molecular graphs can be transmitted to and from a computer in several ways for the purpose of manipulating chemical compounds and reactions. A chemoinformatics application must be able to identify molecules of interest from the geometric and topological information passed to it. This can be accomplished by representing a molecule using line notation (1D) or as a connection table (2D and 3D). Linear notation is a compact and efficient system that employs alphanumeric characters and conventions for common molecular features such as bond types, ring systems, aromaticity, and chirality. The connection table is a set of lines specifying individual atoms and bonds and can be created as a computer- and human-readable text file.
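As a rough illustration of the connection-table idea, the following Python sketch writes a molecule to a plain-text table and reads it back. The layout used here (a counts line, then one line per atom, then one line per bond with 1-based atom numbers) is a deliberately simplified, hypothetical format invented for this example; it is not the MDL molfile or any other standard. For comparison, the same molecule, formaldehyde, is written in SMILES line notation as C=O.

```python
# Formaldehyde with explicit hydrogens: (element, x, y) per atom and
# (atom_i, atom_j, bond_order) per bond, using 0-based indices in memory.
# Coordinates are arbitrary drawing positions chosen for illustration.
ATOMS = [("C", 0.0, 0.0), ("O", 0.0, 1.2), ("H", -0.9, -0.5), ("H", 0.9, -0.5)]
BONDS = [(0, 1, 2), (0, 2, 1), (0, 3, 1)]


def write_table(atoms, bonds):
    """Serialize atoms and bonds to the hypothetical text layout."""
    lines = [f"{len(atoms)} {len(bonds)}"]  # counts line
    for element, x, y in atoms:
        lines.append(f"{element} {x:.4f} {y:.4f}")  # atom block
    for i, j, order in bonds:
        lines.append(f"{i + 1} {j + 1} {order}")  # bond block, 1-based
    return "\n".join(lines)


def read_table(text):
    """Parse the hypothetical text layout back into atoms and bonds."""
    lines = text.strip().splitlines()
    n_atoms, n_bonds = (int(token) for token in lines[0].split())

    atoms = []
    for line in lines[1:1 + n_atoms]:
        element, x, y = line.split()
        atoms.append((element, float(x), float(y)))

    bonds = []
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        i, j, order = (int(token) for token in line.split())
        bonds.append((i - 1, j - 1, order))  # back to 0-based indices
    return atoms, bonds


table_text = write_table(ATOMS, BONDS)
parsed_atoms, parsed_bonds = read_table(table_text)
```

The table spends several lines on what the line notation expresses in three characters, but it keeps the 2D coordinates and makes every atom and bond individually addressable, which is the trade-off between the two representations described above.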