Wikia

Chemistry Toolkit Rosetta Wiki

Heavy atom counts from an SD file

Talk2
21pages on
this wiki

For each record from the benzodiazepine file, print the total number of heavy atoms in each record (that is, exclude hydrogens). You must get the data from the connection table and not from the tag data. (In particular, don't use PUBCHEM_HEAVY_ATOM_COUNT .)

The output is one output line per record, containing the count as an integer. If at all possible, show how to read directly from the gzip'ed input SD file. The first 5 heavy atom counts are 21, 24, 22, 22 and 22. A copy of the expected output is available for reference.

Some of the hydrogens in the data set have atomic weights of 2 or 3, which may cause difficulties for toolkits which convert 1H hydrogens to implicit form while leaving 2H and 3H to explicit form. Please make sure your output agrees with the reference output.

Suggestion: Please add a couple of '$$$$' SD data lines to the file, to trip those naive implementations which think that a '$$$$' is automatically a record separator!

CDK/JavaEdit

Save the code to "countHeavyAtoms.java" then do

javac countHeavyAtoms.java
   java -cp ./:$HOME/cdk-1.2.4.1.jar countHeavyAtoms > countHeavyAtoms.out
import java.io.*;
import org.openscience.cdk.interfaces.IChemObjectBuilder;
import org.openscience.cdk.io.iterator.IteratingMDLReader;
import org.openscience.cdk.Molecule;
import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.interfaces.IAtom;
 
public class countHeavyAtoms {	
    public static void main(String[] args) {	
        FileReader sdfile = null;
        try {
            /* CDK does not automatically understand gzipped files */
            sdfile = new FileReader(new File("benzodiazepine.sdf"));
        } catch (FileNotFoundException e) {
            System.err.println("benzodiazepine.sdf not found");
            System.exit(1);
        }
        IteratingMDLReader mdliter = new IteratingMDLReader(sdfile,
                                            DefaultChemObjectBuilder.getInstance()
        while (mdliter.hasNext()) {
            Molecule mol = (Molecule) mdliter.next();
            int numHeavies = 0;
            for (IAtom atom : mol.atoms()) {
                if (atom.getAtomicNumber() > 1) {
                    numHeavies++;
                }
            }
            System.out.println(numHeavies);
        }
    }
}

CDK/GroovyEdit

Save the code to "countHeavyAtoms.groovy" then do:

 groovy countHeavyAtoms.groovy > countHeavyAtoms.out
@GrabResolver(
  name='idea',
  root='http://ambit.uni-plovdiv.bg:8083/nexus/content/repositories/thirdparty/'
)
@Grapes([
  @Grab(
    group='org.openscience.cdk',
    module='cdk-io',
    version='1.4.11'
  ),
  @Grab(
    group='org.openscience.cdk',
    module='cdk-silent',
    version='1.4.11' 
  )
])
 
import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.io.*;
import org.openscience.cdk.io.iterator.*;
import org.openscience.cdk.silent.*;
import org.openscience.cdk.tools.manipulator.*;
import java.util.zip.GZIPInputStream;
 
iterator = new IteratingMDLReader(
  new GZIPInputStream(
    new File("benzodiazepine.sdf.gz").newInputStream()
  ),
  SilentChemObjectBuilder.getInstance()
)
iterator.setReaderMode(IChemObjectReader.Mode.STRICT)
while (iterator.hasNext()) {
  mol = iterator.next()
  heavyAtomCount = 0
  for (atom in mol.atoms()) {
    if (1 == atom.atomicNumber || "H" == atom.symbol) {
      // do not count hydrogens
    } else {
      heavyAtomCount++
    }
  }
  println heavyAtomCount
}

CDK/Cinfony/JythonEdit

from cinfony import cdk
 
manip = cdk.cdk.tools.manipulator.AtomContainerManipulator()
for mol in cdk.readfile("sdf", "benzodiazepine.sdf"):
    print len(manip.getHeavyAtoms(mol.Molecule))

Chemkit/C++Edit

#include <iostream>
 
#include <chemkit/atom.h>
#include <chemkit/foreach.h>
#include <chemkit/molecule.h>
#include <chemkit/moleculefile.h>
 
using namespace chemkit;
 
int main(void)
{
    MoleculeFile file("benzodiazepine.sdf.gz");
    bool ok = file.read();
    if(!ok){
        std::cerr << "Error: " << file.errorString() << std::endl;
        return -1;
    }
 
    foreach(const Molecule *molecule, file.molecules()){
        int heavyAtomCount = 0;
 
        foreach(const Atom *atom, molecule->atoms()){
            if(!atom->isTerminalHydrogen()){
                heavyAtomCount++;
            }
        }
 
        std::cout << heavyAtomCount << std::endl;
    }
 
    return 0;
}

Indigo/C++Edit

#include "base_cpp/scanner.h"
#include "molecule/molecule.h"
#include "molecule/sdf_loader.h"
#include "molecule/molfile_loader.h"
 
int main (void)
{
   const char *filename = "benzodiazepine.sdf"; // you can use ".sdf.gz" as well
 
   try
   {
      FileScanner scanner(filename);
      SdfLoader loader(scanner);
 
      while (!loader.isEOF())
      {
         loader.readNext();
         Molecule mol;
         BufferScanner molscan(loader.data);
         MolfileLoader molload(molscan);
         molload.loadMolecule(mol, false);
 
         int heavy_count = 0;
         for (int i = mol.vertexBegin(); i != mol.vertexEnd(); i = mol.vertexNext(i))
            if (mol.getAtom(i).label > ELEM_H)
               heavy_count++;
 
         cout << heavy_count << endl;
      }
   }
   catch (Exception &e)
   {
      cerr << "error: " << e.message() << endl;
      return -1;
   }
 
   return 0;
}

Instructions:

  1. Unpack 'graph' and 'molecule' projects into some folder
  2. Create 'utils' folder nearby
  3. Paste the above code into utils/count_heavy.cpp file
  4. Compile the file using the following commands:
    $ cd graph; make CONF=Release32; cd ..
    $ cd molecule; make CONF=Release32; cd ..
    $ cd utils
    $ gcc count_heavy.cpp -o count_heavy -O3 -m32 -I.. -I../common ../molecule/dist/Release32/GNU-Linux-x86/libmolecule.a ../graph/dist/Release32/GNU-Linux-x86/libgraph.a -lpthread -lstdc++
  5. Run the program like that:
    $ ./count_heavy

Indigo/PythonEdit

from indigo import Indigo
indigo = Indigo()
 
for mol in indigo.iterateSDFile("benzodiazepine.sdf.gz"):
    print mol.countHeavyAtoms()

OpenBabel/PythonEdit

import openbabel as ob
 
obconversion = ob.OBConversion()
obconversion.SetInFormat("sdf")
 
obmol = ob.OBMol()
notatend = obconversion.ReadFile(obmol, "benzodiazepine.sdf.gz")
while notatend:
    print obmol.NumHvyAtoms()
    # reuse the molecule for 0.3 seconds of savings vs. making an new one
    obmol.Clear()
    notatend = obconversion.Read(obmol)

OpenBabel/PybelEdit

This builds on and simplifies OpenBabel

import pybel
 
for mol in pybel.readfile("sdf", "benzodiazepine.sdf.gz"):
    print mol.OBMol.NumHvyAtoms()

OpenBabel/PythonEdit

import openbabel as ob
 
obconversion = ob.OBConversion()
obconversion.SetInFormat("sdf")
 
obmol = ob.OBMol()
notatend = obconversion.ReadFile(obmol, "benzodiazepine.sdf.gz")
while notatend:
    print obmol.NumHvyAtoms()
    # reuse the molecule for 0.3 seconds of savings vs. making an new one
    obmol.Clear()
    notatend = obconversion.Read(obmol)

OpenBabel/RubabelEdit

require 'rubabel'
Rubabel.foreach("benzodiazepine.sdf.gz") {|mol| puts mol.ob.num_hvy_atoms }

OpenEye/PythonEdit

from openeye.oechem import *
 
ifs = oemolistream()
ifs.open("benzodiazepine.sdf.gz")
for mol in ifs.GetOEGraphMols():
    print OECount(mol, OEIsHeavy())
    # Can't use
    #    print mol.NumAtoms()
    # because the input structure contains explicit hydrogens.


RDKit/PythonEdit

I could also have iterated over the atoms and only counted the non-hydrogen ones, but I doubt that would be any faster or elegant.

from rdkit import Chem
 
# RDKit does not seem to support gzip'ed SD files
suppl = Chem.SDMolSupplier("benzodiazepine.sdf")
heavy_patt = Chem.MolFromSmarts("[!#1]")
 
for mol in suppl:
    print len(mol.GetSubstructMatches(heavy_patt))

The method "mol.GetNumAtoms()" is supposed to report only the non-hydrogen counts, but actually reports the number of explicit atoms in the molecular graph.


RDKit/Cinfony/PythonEdit

from cinfony import rdk
 
for mol in rdk.readfile("sdf", "benzodiazepine.sdf"):
    print mol.calcdesc(['HeavyAtomCount'])['HeavyAtomCount']

Unix command-lineEdit

It is possible to do this on the Unix command-line using a single (very long) line. To make it easier to understand I've written it using an awk script named 'heavy_atom_counts.awk'. Use it as:

gzcat benzodiazepine.sdf.gz | awk -f heavy_atom_counts.awk
# This file is named "heavy_atom_counts.awk"
# Check track of the line number in the record
{lineno += 1}

# Get the atom counts from the record
lineno == 4 { num_atoms = int(substr($0, 1, 3)); num_hydrogens = 0}

# Count the number of hydrogens
4 < lineno && lineno <= 4+num_atoms && substr($0, 32, 2) == "H " {num_hydrogens++}

# And the end of the record, print the number of heavy atoms and reset
/^\$\$\$\$/ {print num_atoms - num_hydrogens; lineno = 0}


Cactvs/TclEdit

molfile loop "benzodiazepine.sdf.gz" eh {
   puts [ens get $eh E_HEAVY_ATOM_COUNT]
}

The toolkit can read compressed files directly. The loop construct automatically discards the structure read for each record, so this works on arbitrarily big files. Also, the format of the input file can be anything which can be identified automatically, so this is not limited to SD files.

Cactvs/PythonEdit

for eh in Molfile('benzodiazepine.sdf.gz'):
    print(eh.E_HEAVY_ATOM_COUNT)

The Python language version has the same capabilities as the Tcl variant.

Around Wikia's network

Random Wiki