About MolDB

tdudgeon · 7 December 2022 10:01

This topic is a guide to MolDB, the molecular database that can be used to generate inputs for ligand and target based virtual screening and other work.

Overview

MolDB replaces the sharded molecule system. It provides similar capabilities but is accessible from multiple Squonk projects.

MolDB is implemented as a PostgreSQL relational database into which molecular data is loaded and from which selected molecules are extracted for use in virtual screening.

Population of the MolDB database typically operates as follows:

1. Load vendor databases into the database
2. Calculate molecular properties that are useful for filtering molecules suitable for screening
3. Enumerate molecular forms of those molecules (microstates, tautomers, stereoisomers)
4. Generate low energy 3D conformers of those enumerated forms

Following that, molecular forms or low energy 3D conformers suitable for virtual screening can be extracted. More details are provided below.

Database model

Simplified database model (only most relevant columns shown):

|--------------|         |--------------|         |--------------|
| supply       |         | file         |         | library      |
|--------------|---------|--------------|---------|--------------|
| smiles       |*        | name         |*        | name         |
| code         |         |              |         |              |
|--------------|         |--------------|         |--------------|
        |*
        |
        |
        |
|--------------|         |--------------|         |--------------|
| molecule     |        *| enumeration  |        *| conformer    |
|--------------|---------|--------------|---------|--------------|
| smiles       |         | code         |         | coords       |
| mol          |         | smiles       |         | energy       |
| <mol props>  |         | coords       |         | energy_delta |
|--------------|         |--------------|         |--------------|

The purpose of the tables is as follows.

table	description	key columns
library	an individual vendor library e.g. ‘MolPort Available Compounds’	name: library name
file	a file that was loaded (some libraries are composed of multiple files)	name: file name
supply	an individual molecule from a file (as specified by the supplier, i.e. not standardised), one record for each entry in the file	smiles: molecule SMILES; code: supplier’s code
molecule	individual standardised molecules (many rows in supply can refer to the same row in molecule)	smiles: standardised molecule SMILES; mol: RDKit molecule with cartridge index (1); mol props: a column for each property
enumeration	the enumerated forms of each molecule	code: enumerated form; smiles: enumerated SMILES; coords: 3D coordinates (2)
conformer	the low energy conformers of each enumerated form	coords: 3D coordinates (2); energy: conformer energy; energy_delta: difference in energy (3)

Notes:
(1) The RDKit cartridge allows substructure searching, but may not be enabled for a particular database.
(2) 3D coordinates are specified using ChemAxon extended SMILES (cxsmiles) where the coordinates are defined in the “comments” string. The full cxsmiles of an enumerated form is created by combining the smiles and coords columns, with a space separating them. The full cxsmiles of a conformer is created by combining the smiles of the corresponding enumerated form and coords column of the conformer, with a space separating them.
(3) The energy_delta value is the difference in energy between the conformer and the lowest energy conformer of that enumerated form.
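
Notes (2) and (3) can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database rather than PostgreSQL, and the id and enumeration_id columns, the coordinate values and the energies are all assumptions made for illustration; they are not the real MolDB schema or data. It also assumes that the coords column holds the pipe-delimited block, so that joining smiles and coords with a space gives a complete cxsmiles.

```python
import sqlite3

def demo_conformer_rows():
    """Return (cxsmiles, energy, energy_delta) for each conformer in a toy database."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Hypothetical, heavily simplified versions of two MolDB tables.
        CREATE TABLE enumeration (id INTEGER PRIMARY KEY, smiles TEXT, coords TEXT);
        CREATE TABLE conformer (id INTEGER PRIMARY KEY, enumeration_id INTEGER,
                                coords TEXT, energy REAL);
        INSERT INTO enumeration VALUES (1, 'CO', '|(0.0,0.0,0.0;1.4,0.0,0.0)|');
        INSERT INTO conformer VALUES (1, 1, '|(0.0,0.0,0.0;1.4,0.1,0.0)|', 10.0);
        INSERT INTO conformer VALUES (2, 1, '|(0.0,0.0,0.0;1.3,0.2,0.1)|', 12.5);
    """)
    rows = db.execute("""
        SELECT e.smiles || ' ' || c.coords AS cxsmiles,   -- note (2)
               c.energy,
               c.energy - (SELECT MIN(c2.energy) FROM conformer c2
                           WHERE c2.enumeration_id = c.enumeration_id)
                   AS energy_delta                        -- note (3)
        FROM conformer c JOIN enumeration e ON e.id = c.enumeration_id
        ORDER BY c.energy
    """).fetchall()
    db.close()
    return rows
```

The lowest energy conformer of each enumerated form always ends up with an energy_delta of zero, and the conformer cxsmiles reuses the SMILES of its parent enumerated form.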

Using MolDB

Access to MolDB is through Squonk Jobs that are located in the virtual-screening GitHub repository. In the Squonk Data Manager you will find these jobs in the moldb collection.

Loading molecules

See job docs.

A vendor file, supplied as a text file containing molecules in SMILES format, can be loaded into MolDB. This is done using the moldb-load-library job. Note that this job may not be accessible to you and might need to be run by an administrator of the system. If so, contact the administrator when you want a new library to be loaded.

The moldb-load-library job takes a vendor file as input and allows you to load it into the database, populating the library, file, supply and molecule tables with data. See the docs for the job for specific details.

Calculating molecular properties

See job docs.

The molecule table has columns for a range of molecular properties that can be used to filter the molecules. These are calculated after the library has been loaded to avoid calculating properties for molecules that have already had them calculated. To do this run the moldb-calc-props job. This extracts a specified number of molecules from the database, calculates the molecular properties and then loads those properties into the database. The job may have to be run multiple times if more molecules need their properties calculated than are processed in a single run (the count option for the job).
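
As a rough guide to how many runs that might mean, the arithmetic is a ceiling division. This is only a sketch: runs_needed is a hypothetical helper, not part of any MolDB job, and the numbers in the usage note are made up.

```python
def runs_needed(pending: int, count: int) -> int:
    """Return how many job runs are needed to process `pending` molecules
    when each run handles at most `count` of them."""
    if pending < 0 or count <= 0:
        raise ValueError("pending must be >= 0 and count must be > 0")
    # Integer ceiling division, avoiding floating point rounding.
    return (pending + count - 1) // count
```

For instance, runs_needed(250_000, 100_000) gives 3: two full runs and a third that picks up the remaining 50,000 molecules. The same reasoning applies to the count option of the enumeration and conformer generation jobs.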

Like loading molecules, this step probably has to be done by the administrator, and you can usually assume that it has already been performed.

The properties calculated are:

number of heavy atoms (hac)
number of rotatable bonds (rotb)
number of rings (rings)
number of aromatic rings (aro_rings)
number of chiral centres (chiral_centres)
number of undefined chiral centres (undefined_chiral_centres)
number of SP3 hybridised carbon atoms (sp3)
Crippen logP (logp)
topological polar surface area (tpsa)

These properties are calculated using the RDKit toolkit.

Selection criteria

When you are selecting inputs for virtual screening (ligand based or target based) you would normally have a good idea of the characteristics of the molecules you are looking for, most particularly for the size of the molecules (heavy atom count property), but often for other properties like the number of rings or the number of chiral centres. Using the molecular properties that have been calculated you should therefore decide on a set of molecular property filters that you want to use and apply them consistently to the subsequent steps.

Enumerating molecules and generating 3D conformers is a computationally expensive process (largely because of the huge number of molecules involved). We therefore take the approach that these should only be performed when needed, and for any molecule, should only be done once. Thus, once you have chosen your molecular property filters you need to run a job that enumerates all the molecules that conform to your filters that have not yet been enumerated. And if you want low energy conformers you must also do the same to generate the conformers. If you are lucky then all molecules will already have been processed and these steps will be very fast, but if not then these steps can take a long time to complete.

To generate your filters you need to create a specification file. This is a simple text file that lists the different filters you want to apply. You can also specify substructure searches e.g. if you want to select reactants compatible with a reaction. You can have different specification files for different purposes e.g. each virtual screening target will probably want its own set of filters. Then, when executing any of the jobs that use a set of filters, you specify the particular specification file you want to use.

An example specification file (by default named specification.txt) might look like this:

sss = [C:1](=[O:2])-[OD1]
min_hac = 16
max_hac = 24
max_rotb = 5
max_logp = 4
max_chiral_centres = 2

The meaning of those terms should be obvious. The names are as listed above and are prefixed with min_ or max_. Values are integers except for logp and tpsa which can be decimal numbers. These terms are converted into filters using less than or equal (<=) or greater than or equal (>=) operators. For example, if you specify the filter min_hac = 14 then this is effectively the term hac >= 14.

The term sss is used to specify a SMARTS expression that is used as a substructure search (SSS) filter using the RDKit cartridge (the mol column in the molecule table). This can be specified multiple times, in which case the SSS filters are combined with OR logic (in all other cases AND logic is used).
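
To make the conversion concrete, here is a minimal sketch of how such a file could be turned into a SQL filter. It is not the code the jobs actually use: parse_specification and to_where_clause are hypothetical names, the rotb to rot_bonds column mapping is taken from the example report in this guide (other column names are assumed to match the property names), and the substructure term uses the RDKit cartridge's `@>` operator with a `qmol` cast, on the assumption that the jobs do something similar.

```python
# Property name -> column name. Only rot_bonds is known from the example
# report; every other property is assumed to use its own name as the column.
COLUMN_NAMES = {"rotb": "rot_bonds"}

def parse_specification(text):
    """Parse 'key = value' lines into (filter terms, list of SMARTS strings)."""
    terms, smarts = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, because SMARTS can contain '=' itself.
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "sss":
            smarts.append(value)
        elif key.startswith(("min_", "max_")):
            prop = key[4:]
            terms[key] = float(value) if prop in ("logp", "tpsa") else int(value)
        else:
            raise ValueError(f"unrecognised term: {key}")
    return terms, smarts

def to_where_clause(terms, smarts):
    """Combine property filters with AND and substructure filters with OR."""
    clauses = []
    for key, value in terms.items():
        op = ">=" if key.startswith("min_") else "<="
        prop = key[4:]
        clauses.append(f"{COLUMN_NAMES.get(prop, prop)} {op} {value}")
    if smarts:
        # Illustration only: real code should pass SMARTS as bound parameters
        # rather than formatting them into the SQL string.
        sss = " OR ".join(f"mol @> '{s}'::qmol" for s in smarts)
        clauses.append(f"({sss})")
    return " AND ".join(clauses)
```

Applied to the example file above, the property part of the result reads `hac >= 16 AND hac <= 24 AND rot_bonds <= 5 AND logp <= 4.0 AND chiral_centres <= 2`, followed by a single parenthesised substructure term.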

These next steps describe how to use this specification file.

Enumerating molecular forms

See job docs.

This is done using the moldb-enumerate-mols job. You specify the specification file you want to apply and the number of molecules to process, and this job will then extract up to that many molecules that pass those filters, generate the possible microstates (charge forms), tautomers and stereoisomers (only of undefined chiral centres) and load them into the enumeration table. This is potentially a slow process, depending on how many molecules that pass your filters have not yet been enumerated. If there are a large number (more than the count option specified) you may need to run this multiple times to generate forms for all the molecules that pass your filters.

You don’t have to run this step before extracting the molecules to screen, but if you don’t run it you may be missing large numbers of candidate molecules.

Whether you are able to do molecule enumeration depends on your access rights. You may only be able to use molecules that have already been enumerated.

Generating low energy 3D conformers

See job docs.

This is done with the moldb-gen-confs job which generates a number of low energy conformers for each enumerated form from the previous step.

Again, you must specify your specification file with your required filters. Enumerated forms that need conformer generation will be extracted, have conformers generated and those conformers loaded into the conformer table.

The same caveats apply: you need to run this to completion to ensure you are able to extract all candidate molecules, and you need to have already run the moldb-enumerate-mols job to completion first (with the same filters).

Whether you are able to do conformer generation depends on your access rights. You may only be able to use molecules that have already had conformers generated.

Finding out what needs doing

See job docs.

It will not immediately be clear to you which molecules need enumerating and conformer generation, as in some cases this will be done by another person. To assess the state of the system you can run the moldb-analyse job which runs a set of analyses on the database and creates a report with a summary of the status.

Again, the moldb-analyse job can take a specification file which allows it to do extra analysis to check the situation for the molecules you are interested in (those that pass your filters). If you don’t specify a specification file then the report is more basic.

A typical report file looks like this:

Database: postgresql://postgres:********@localhost/postgres
Date: 2022-11-08 16:03:08.126628
Filter terms: {'min_hac': 24, 'max_hac': 26, 'max_rotb': 5, 'max_logp': 4.0, 'max_chiral_centres': 2}
Filter SQL:  hac >= 24 AND hac <= 26 AND rot_bonds <= 5 AND chiral_centres <= 2 AND logp <= 4.0
Database analysis:
  library: chemspace, molport
           took 0.0006234645843505859s
  molecule: 4846137 rows, 538007 pass filters, 50 need property calculation
            took 0.34273648262023926s
  enumeration - all data: 18503940 rows, 1411113 molecules enumerated, 3435024 need enumeration
             with filter: 3312155 rows, 183642 molecules enumerated, 354365 need enumeration
             took 11.46164083480835s
  conformer - all data: 206492 rows, 18493820 enumerations need conformer generation
           with filter: 5213 rows, 3312035 enumerations need conformer generation
           took 4.264547348022461s

The first few lines list some basic information and your filters. The more interesting bit is the Database analysis section. This has a number of sections with details of the contents of the different database tables (see above).

library: this lists the different libraries that have been loaded, in this case chemspace and molport.

molecule: this lists the molecules that have been loaded from those libraries. In this case 4,846,137 have been loaded (some might be present in multiple libraries and in different forms in those libraries), 538,007 pass the filters in your specification and 50 need property calculation (those cannot pass the filters as there is no data to filter on).

enumeration: the first line tells you that there are 18,503,940 enumerated forms which come from 1,411,113 molecules, and that 3,435,024 molecules still need to be enumerated. The second line tells you the same numbers but for molecules that pass the filters in your specification. So in this case 183,642 molecules have been enumerated but 354,365 still need to be enumerated, i.e. you should do more enumeration if you want to have a full set of data.

conformer: This takes the same form as for the enumeration table. In this case you will see that very few molecules have had 3D conformers generated.

It is important to note that this analysis can take a long time to run as the amount of data to process is very large. This is especially the case for analysis of conformers. There are options for skipping the enumeration or conformer analysis. Only run those parts of the analysis if you are interested in that data.

Also note that there will nearly always be a small number of molecules that cannot be enumerated or have 3D conformers generated. This is because they may not be valid molecules, or cannot have realistic conformers created (e.g. incorrectly defined stereochemistry in cage-like structures). So do not expect the number that need attention to drop to zero, just a relatively small number.
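
If you want to track progress across several reports, the count lines are easy to pull apart. The following is a minimal sketch that assumes the report layout shown above stays stable; parse_counts and fraction_enumerated are hypothetical helpers, not part of the moldb-analyse job, and they are only meant for the lines that hold counts.

```python
import re

# A count is a number followed by a label that runs up to the next comma.
COUNT_RE = re.compile(r"(\d+) ([A-Za-z][A-Za-z ]*)")

def parse_counts(line):
    """Extract {label: count} pairs from one count line of the report."""
    _, _, body = line.partition(":")
    return {label.strip(): int(num) for num, label in COUNT_RE.findall(body)}

def fraction_enumerated(counts):
    """Fraction of molecules already enumerated, from one enumeration line."""
    done = counts["molecules enumerated"]
    todo = counts["need enumeration"]
    return done / (done + todo)
```

Applied to the "with filter" enumeration line of the report above, parse_counts returns the three counts under the labels rows, molecules enumerated and need enumeration. The enumerated and outstanding molecules add up to the 538,007 that pass the filters, and fraction_enumerated shows that roughly a third of them have been enumerated so far.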

Extracting molecules

See job docs.

You can use the moldb-extract-molecules job to extract a set of molecules to a file for further work.
This will just contain the SMILES of the molecules that have been loaded. You will specify a specification file with the filters that you want to apply. The data will look like this:

O=C(CCc1c[nH]c2ccccc12)OCc1ccccc1       1       CSMB00000000002 chemspace
O=C(CCN1C(=O)c2ccccc2C1=O)OCc1ccccc1    4       CSMB00000000014 chemspace
O=c1oc2ccccc2n1Cc1ccccc1        7       CSMB00000000021,MolPort-000-205-613,MolPort-000-205-613 chemspace,molport,molport
CC(NC(=O)c1ccccc1)C(=O)OCc1ccccc1       8       CSMB00000000023 chemspace
CCCCc1nc2ccccc2n1CC#N   14      CSMB00000000036 chemspace
N#CCN1C(=O)NC(c2ccccc2)C1=O     15      CSMB00000000037 chemspace
Cc1nc(OCc2ccccc2)c2ccccc2n1     19      MolPort-002-800-674,CSMB00000000048     molport,chemspace
O=C1CCC(C(=O)OCc2ccccc2)=NN1    21      CSMB00000000058 chemspace
COC(=O)c1cc(O)c2ccccc2c1O       25      CSMB00000000073,MolPort-022-386-978     chemspace,molport

This is tab separated text: the first column is the molecule SMILES, the second its database ID (the value is not significant and will differ between databases), the third the vendor codes (comma separated) and the fourth the library names (comma separated).
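
Reading this format back is straightforward. The sketch below uses a hypothetical helper, parse_molecule_line, and assumes that the vendor codes and library names are parallel lists (the first code belongs to the first library, and so on), which is how the sample rows above appear to be laid out.

```python
def parse_molecule_line(line):
    """Split one tab separated line of extracted molecules into named fields."""
    smiles, mol_id, codes, libraries = line.rstrip("\n").split("\t")
    return {
        "smiles": smiles,
        "id": int(mol_id),
        # Pair each vendor code with the library it came from (assumed parallel).
        "suppliers": list(zip(codes.split(","), libraries.split(","))),
    }
```

For the third row of the sample, this yields one chemspace entry and two molport entries for the same molecule.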

Extracting enumerated forms

See job docs.

Once the molecular forms are enumerated to your satisfaction you can extract them from the database e.g. as inputs to a docking run. Do this using the moldb-extract-enums job. Again this takes a specification file with your filters. The data can be extracted either as an SD-file or as tab separated text using ChemAxon extended SMILES (including 3D coordinates). The text format data looks like this:

COc1ccccc1CNC(=O)c1cc2cccc3c2n1CCS3 |(3.68592,0.731492,-2.84601;3.14071,-0.00897012,-1.76092;3.87368,-0.0624024,-0.603712;5.05909,0.645502,-0.378643;5.72192,0.545828,0.847366;5.20394,-0.255531,1.85969;4.02363,-0.964424,1.64624;3.35323,-0.885241,0.413658;2.07135,-1.65663,0.223073;0.922207,-0.800762,0.395907;-0.340539,-1.33164,0.51044;-0.529502,-2.54027,0.612926;-1.41446,-0.352371,0.480796;-1.35814,1.0263,0.664994;-2.67443,1.55765,0.614267;-3.22836,2.84385,0.758018;-4.6183,3.00299,0.674159;-5.45075,1.90594,0.449046;-4.91667,0.625234,0.291324;-3.52667,0.486527,0.385015;-2.75025,-0.648827,0.287425;-3.34243,-1.9558,0.0370756;-4.64833,-1.8805,-0.756387;-5.9128,-0.811053,0.0211639)|  1199797 2512502 B       COc1ccccc1CNC(=O)c1cc2cccc3c2n1CCS3     MolPort-007-761-131     molport
COc1ccccc1CN=C(O)c1cc2cccc3c2n1CCS3 |(-4.65386,-0.273205,1.99763;-3.62464,0.426876,1.30454;-3.70872,0.465085,-0.0650903;-4.65089,-0.237069,-0.824408;-4.63563,-0.160729,-2.2196;-3.67628,0.610277,-2.86764;-2.73328,1.31198,-2.12026;-2.74438,1.25685,-0.715711;-1.69732,2.03356,0.0487065;-0.379686,1.41306,-0.00681427;-0.0493363,0.494128,0.819522;-0.853151,-0.0304245,1.7648;1.2602,-0.108597,0.849298;1.78126,-0.968051,1.81229;3.13364,-1.25918,1.49878;4.15129,-2.01564,2.11014;5.42023,-2.06089,1.51695;5.67956,-1.37159,0.331755;4.67639,-0.630516,-0.296703;3.41792,-0.590497,0.316016;2.2685,0.0634605,-0.0794853;2.22098,0.832256,-1.31479;3.20383,0.332552,-2.37578;4.93628,0.274373,-1.79391)| 1199798 2512502 T       COc1ccccc1CNC(=O)c1cc2cccc3c2n1CCS3     MolPort-007-761-131     molport
COc1ccccc1C[N-]C(=O)c1cc2cccc3c2n1CCS3 |(-3.44429,1.93788,-1.9724;-4.29889,0.791716,-1.93345;-4.15032,0.151445,-0.726943;-5.08132,0.4577,0.272587;-5.01441,-0.179425,1.51155;-4.02767,-1.13588,1.74365;-3.11521,-1.45667,0.734912;-3.15119,-0.815022,-0.5192;-2.19119,-1.23948,-1.62152;-0.828904,-1.39828,-1.10193;-0.0293204,-0.336602,-1.2261;-0.220818,0.663731,-1.91051;1.18473,-0.514434,-0.424188;1.51557,-1.52458,0.478309;2.76644,-1.23272,1.07874;3.56519,-1.85346,2.0585;4.77133,-1.25108,2.44086;5.19159,-0.0578281,1.84949;4.42389,0.55836,0.85744;3.21197,-0.048428,0.507269;2.25887,0.352924,-0.409909;2.42995,1.55595,-1.205;3.8985,1.8935,-1.46301;4.89819,2.06798,0.0640146)|     1199799   2512502 M       COc1ccccc1CNC(=O)c1cc2cccc3c2n1CCS3     MolPort-007-761-131     molport
Cc1ccccc1Nc1c(-c2ccc(Cl)cc2)c(=O)c1=O |(-2.45465,0.426788,2.41544;-2.96162,0.529278,1.00098;-4.32261,0.823891,0.801085;-4.85307,0.908292,-0.483211;-4.02943,0.701269,-1.58163;-2.67415,0.415456,-1.39692;-2.11113,0.337443,-0.114571;-0.744276,0.0169468,0.0885164;0.10651,-0.635112,-0.673757;1.45006,-0.859605,-0.714661;2.71137,-0.444987,0.0643614;2.65958,0.65375,0.925193;3.80058,1.03298,1.63743;4.98534,0.314627,1.48468;6.38783,0.783215,2.36299;5.04011,-0.778605,0.622236;3.90105,-1.16,-0.0921494;1.38725,-1.73467,-1.87263;2.11772,-2.37958,-2.58197;-0.0855325,-1.52418,-1.85895;-0.961051,-1.96242,-2.56503)|        1199800 2513730 B       Cc1ccccc1Nc1c(-c2ccc(Cl)cc2)c(=O)c1=O   MolPort-007-762-583       molport

The first column is the ChemAxon extended SMILES, which has the format smiles |coords|, with a space after the SMILES and the coordinates given as the ‘comment’ between pipe symbols. RDKit, ChemAxon and CDK toolkits will be able to read this. The second field is the enumerated row ID, the third its corresponding molecule ID, the fourth the type of enumerated form (B = base molecule, T = tautomer, M = microstate, C = stereoisomer), the fifth the SMILES of the base molecule which was enumerated, the sixth the supplier codes and the seventh the library name.
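
If you need the fields without a cheminformatics toolkit, a record can be split with plain string handling. This is a minimal sketch: parse_enumeration_line is a hypothetical helper, it assumes the fields are separated by single tabs as described above, and it assumes the coordinate block holds one x,y,z triple per atom separated by semicolons, as in the sample records.

```python
FORM_TYPES = {"B": "base molecule", "T": "tautomer",
              "M": "microstate", "C": "stereoisomer"}

def parse_enumeration_line(line):
    """Split one tab separated record into named fields and per-atom coordinates."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    cxsmiles, enum_id, mol_id, form, base_smiles, codes, library = fields
    # The SMILES and the |(...)| coordinate block are separated by a space.
    smiles, _, coords = cxsmiles.partition(" ")
    triples = coords.strip().strip("|").strip("()").split(";")
    xyz = [tuple(float(v) for v in triple.split(",")) for triple in triples]
    return {
        "smiles": smiles,
        "xyz": xyz,
        "enumeration_id": int(enum_id),
        "molecule_id": int(mol_id),
        "form": FORM_TYPES[form],
        "base_smiles": base_smiles,
        "codes": codes.split(","),
        "library": library,
    }
```

For real work a toolkit is the better choice, since it also validates the SMILES and ties each coordinate to its atom.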

If generating SD-file format then the same information is basically available as SD-file fields.

Extracting 3D conformers

See job docs.

The low energy 3D conformers can be extracted using the moldb-extract-confs job.
The process is very similar to extracting the enumerated forms, except that currently only SD-file format is supported (hope to add cxsmiles soon).