Chemistry Toolkit Rosetta Wiki

Perform a substructure search on SDF file and report the number of false positives

21pages on
this wiki
Add New Page
Talk0 Share

Take a SMILES query and an SDF file; compute fingerprints for the given query and molecules from the SDF file; perform a substructure search (using the fingerprints for screening); report the total amount of molecules in SDF file; report the number of substructure matches and the number of false positives (i.e. molecules that passed the screening phase but failed the substructure matching phase).

This task originated from a topic on BlueObelisk.


The sample database will be gzip'ed SD file of the benzodiazepine data set. The query structure will be defined as "C1C=C(NC=O)C=CC=1". The output will look like
%d total
%d matched
%d false positives


#include "base_cpp/scanner.h"
#include "molecule/molecule.h"
#include "molecule/sdf_loader.h"
#include "molecule/molfile_loader.h"
#include "molecule/molecule_arom.h"
#include "molecule/molecule_fingerprint.h"
#include "molecule/smiles_loader.h"
#include "molecule/molecule_substructure_matcher.h"
#include "base_c/bitarray.h"
int main (void)
   MoleculeFingerprintParameters parameters;
   memset(&parameters, 0, sizeof(parameters));
   parameters.ord_qwords = 25; // default value in Bingo
   int fpsize = parameters.fingerprintSize();
   Molecule query;
   Array<byte> queryfp;
      BufferScanner scanner("C1C=C(NC=O)C=CC=1");
      SmilesLoader loader(scanner);
      loader.loadMolecule(query, true);
      MoleculeFingerprintBuilder builder(query, parameters);
      queryfp.copy(builder.get(), fpsize);
   int total = 0;
   int passed = 0;
   int matched = 0;
      FileScanner scanner("benzodiazepine.sdf.gz");
      SdfLoader loader(scanner);
      int cnt = 0;
      while (!loader.isEOF())
         Molecule mol;
         BufferScanner molscan(;
            MolfileLoader molload(molscan);
            molload.loadMolecule(mol, false);
         catch (Exception &e)
         MoleculeFingerprintBuilder builder(mol, parameters);
         if (bitTestOnes(queryfp.ptr(), builder.get(), fpsize))
            MoleculeSubstructureMatcher matcher(mol);
            if (matcher.find())
   catch (Exception &e)
      fprintf(stderr, "error: %s\n", e.message());
      return -1;
   printf("%d total\n%d matched\n%d false positives\n", total, matched, passed - matched);
   return 0;


  1. Unpack 'graph' and 'molecule' projects into some folder
  2. Create 'utils' folder nearby
  3. Paste the above code into utils/false_positives.cpp file
  4. Compile the file using the following commands:
    $ cd graph; make CONF=Release32; cd ..
    $ cd molecule; make CONF=Release32; cd ..
    $ cd utils
    $ gcc false_positives.cpp -o false_positives -O3 -m32 -I.. -I../common ../molecule/dist/Release32/GNU-Linux-x86/libmolecule.a ../graph/dist/Release32/GNU-Linux-x86/libgraph.a -lpthread -lstdc++
  5. Run the program like that:
    $ ./false_positives
Expected result:
12386 total
8836 matched
1 false positives


set ss [ens create C1C=C(NC=O)C=CC=1 smarts]
set fh [molfile open benzodiazepine.sdf.gz]
prop setparam E_SCREEN extended 2
set nscreen_passed [molfile scan $fh "structure ~= $ss" count]
set nss_passed [molfile scan $fh "structure >= $ss" count]
puts [format "%d total" [molfile count $fh]]
puts [format "%d matched" $nscreen_passed]
puts [format "%d false positives" [expr $nscreen_passed-$nss_passed]]

This is a rather unrealistic example. Cactvs does not perform screening on files where the complete structure needs to be read first in order to compute these - in that case, a direct substructure match is in all likelihood faster. So in this example we perform a screen-only search and a full substructure search on the same file, and then compare the results. For query-optimized files, the screen statistics and other performance data can be read after a "molfile scan $fh" formatted as a dictionary with a "molfile get $fh scandata" - but in this case, the dictionary entry for the number of read structures is the same as the total record count, and the number of processed screens zero.

The result:

12386 total
9219 matched
382 false positives


nscreen_passed=f.scan('structure ~= {}'.format(ss),'count')
nss_passed=f.scan('structure >= {}'.format(ss),'count')
print("{:d} total".format(f.count()))
print("{:d} matched".format(nscreen_passed))
print("{:d} false positives".format(nscreen_passed-nss_passed)) 

Once more, the output is identical to the Tcl version.

Ad blocker interference detected!

Wikia is a free-to-use site that makes money from advertising. We have a modified experience for viewers using ad blockers

Wikia is not accessible if you’ve made further modifications. Remove the custom ad blocker rule(s) and the page will load as expected.