Wikia

Chemistry Toolkit Rosetta Wiki

Working with SD tag data

Talk0
21pages on
this wiki

An SD file contains a set of records. Each record has a connection table and set of fields which together describe the molecule. This is call the 'molfile'. A record also contains a set of data fields often called "tag data" which in its simplest and most common form looks like

> <PUBCHEM_NIST_INCHIKEY>
PUMYFTJOWAJIKF-UHFFFAOYSA-N

> <PUBCHEM_XLOGP3>
3.8

In this case the tag PUBCHEM_XLOGP3 has the value 3.8.

The tag data is often used as a way to store extra information about the molecule. This can be supplier information, additional identifiers, an encoded fingerprint, or many other possibilities.

This task show how to use a toolkit to get and set tag data by implementing Lipinski's Rule of Five where the inputs come from the data data and the result goes into a new tag field named RULE5.

Lipinski's rule says that, in general, an orally active drug has no more than one violation of the following criteria:

  • Not more than 5 hydrogen bond donors (nitrogen or oxygen atoms with one or more hydrogen atoms)
  • Not more than 10 hydrogen bond acceptors (nitrogen or oxygen atoms)
  • A molecular weight under 500 daltons
  • An octanol-water partition coefficient log P of less than 5


ImplementationEdit

The input file is SD file from the benzodiazepine data set, available directly from [1]. Every record contains the tags PUBCHEM_CACTVS_HBOND_DONOR, PUBCHEM_CACTVS_HBOND_ACCEPTOR and PUBCHEM_MOLECULAR_WEIGHT, and most of the records contain the tag PUBCHEM_XLOGP3. Here is an example of those tag data fields:

> <PUBCHEM_CACTVS_HBOND_ACCEPTOR>
4
> <PUBCHEM_CACTVS_HBOND_DONOR>
1
> <PUBCHEM_XLOGP3>
1.2
> <PUBCHEM_MOLECULAR_WEIGHT>
359.20938

The program must create a new SD file which is the same as the input file but with a new tag data field named "RULE5". This must be "1" if the record passes Lipinski's rule, "0" if it does not, and "no logP" if the PUBCHEM_XLOGP3 field is missing.

"The same" deserves some clarification. Most toolkits do not preserve the SD input precisely. For example, the "CTfile Formats" specification says that some fields are obsolete or unneeded, or are only appropriate to query and reaction formats. Different toolkits handle these differently, and it's okay for these fields to differ. There are also places where the amount of whitespace isn't fully specified, and different toolkits give different answers.

It's expected that a toolkit will modify those fields. The exception will be those solutions which manipulate the SD file directly as text and don't go through a molecule object and back again.

Line #2 of each record is supposed to include the name of the program which generated the output and include time stamp, although it may also be blank. Some solutions will keep the line as-is, some will put the toolkits's name and current timestamp in its place, and still others will force it to be blank. Please note what the solution does.

A solution must not reorder or modify the atom and bond information nor change the contents of any existing tag. It should only add a single new tag to the output. If it does anything else, please note the exception.

OpenBabel/PybelEdit

import pybel
output = pybel.Outputfile("sdf", "RULE5.sdf")
for mol in pybel.readfile("sdf", "benzodiazepine.sdf.gz"):
    if "PUBCHEM_XLOGP3" not in mol.data:
        mol.data["RULE5"] = "no logP"
    else:
        count = (int(mol.data["PUBCHEM_CACTVS_HBOND_DONOR"]) <= 5) + \
                (int(mol.data["PUBCHEM_CACTVS_HBOND_ACCEPTOR"]) <= 10) + \
                (float(mol.data["PUBCHEM_MOLECULAR_WEIGHT"]) <= 500) + \
                (float(mol.data["PUBCHEM_XLOGP3"]) <= 5)
        if count >= 3:
            mol.data["RULE5"] = "1"
        else:
            mol.data["RULE5"] = "0"
    output.write(mol)
output.close()

This updates the #2 line of each molfile to mention "OpenBabel" and the current timestamp.

OpenBabel/RubabelEdit

require 'rubabel'
File.open("RULE5.sdf", 'w') do |out|
  Rubabel.foreach("benzodiazepine.sdf.gz") do |mol|
    mol.data["RULE5"] = 
      if mol.data.key?('PUBCHEM_XLOGP3')
        num_true = [
          mol.data["PUBCHEM_CACTVS_HBOND_DONOR"].to_i <= 5,
          mol.data["PUBCHEM_CACTVS_HBOND_ACCEPTOR"].to_i <= 10,
          mol.data["PUBCHEM_MOLECULAR_WEIGHT"].to_f <= 500,
          mol.data["PUBCHEM_XLOGP3"].to_f <= 5 
        ].count(true)
        if num_true >= 3 then "1" 
        else "0"
        end
      else
        "no logP"
      end
    out.print mol.write(:sdf)
  end
end
# NOTE for wiki: This updates the #2 line of each molfile to mention "OpenBabel" and the current timestamp, as it ought to.

OpenEye/PythonEdit

from openeye.oechem import *
 
ifs = oemolistream()
ofs = oemolostream()
from rdkit import Chem
suppl = Chem.SDMolSupplier('benzodiazepine.sdf')
for mol in suppl:
    if mol is None: continue
suppl = Chem.SDMolSupplier('benzodiazepine.sdf')
writer = Chem.SDWriter('benzodiazepine.r5.sdf')
for mol in suppl:
    if mol is None: continue
    if not mol.HasProp('PUBCHEM_XLOGP3'):
        mol.SetProp('RULE5','no logp')
    else:
        rule5=0
        if int(mol.GetProp('PUBCHEM_CACTVS_HBOND_DONOR')) > 5:
            rule5+=1
        if int(mol.GetProp('PUBCHEM_CACTVS_HBOND_ACCEPTOR')) > 10:
            rule5+=1
        if float(mol.GetProp('PUBCHEM_MOLECULAR_WEIGHT')) > 500.0:
            rule5+=1
        if float(mol.GetProp('PUBCHEM_XLOGP3')) > 5.0:
            rule5+=1
        if rule5>1:
            mol.SetProp('RULE5','0')
        else:
            mol.SetProp('RULE5','1')
        w.write(mol)
 
 
ifs.open("benzodiazepine.sdf.gz")
ofs.open("RULE5.sdf")
 
def num_Lipinski_failures(mol):
    failures = 0
    if int(OEGetSDData(mol, "PUBCHEM_CACTVS_HBOND_DONOR")) > 5:
        failures += 1
    if int(OEGetSDData(mol, "PUBCHEM_CACTVS_HBOND_ACCEPTOR")) > 10:
        failures += 1
    if float(OEGetSDData(mol, "PUBCHEM_MOLECULAR_WEIGHT")) > 500.0:
        failures += 1
    if float(OEGetSDData(mol, "PUBCHEM_XLOGP3")) > 5.0:
        failures += 1
    return failures
 
for mol in ifs.GetOEGraphMols():
    if not OEHasSDData(mol, "PUBCHEM_XLOGP3"):
        rule5 = "no logP"
    elif num_Lipinski_failures(mol) > 1:
        rule5 = "0"
    else:
        rule5 = "1"
    OESetSDData(mol, "RULE5", rule5)
    OEWriteMolecule(ofs, mol)

This updates the #2 line of each molfile to mention "OEChem" and the current timestamp.

Unix/PythonEdit

Because the tag data fields are line oriented, this is a problem that can be solved with standard Unix tools and without a chemistry toolkit. Here's one solution, but while it does work I don't recommend it as a general approach.

infile = open("benzodiazepine.sdf", "U")
outfile = open("RULE5.sdf", "w")
 
# Helper function. Use to copy and use
# the first data line of a tag I need.
def next_line():
    line = next(infile)
    outfile.write(line)
    return line
 
# Initialize for the first record
seen_xlogp = False
failures = 0
 
for line in infile:
    if line == "$$$$\n":
        # Write the RULE5 tag data before ending the record
        if not seen_xlogp:
            outfile.write("> <RULE5>\nno logP\n\n")
        elif failures > 1:
            outfile.write("> <RULE5>\n0\n\n")
        else:
            outfile.write("> <RULE5>\n1\n\n")
        outfile.write(line)
 
        # reset for the next record
        seen_xlogp = False
        failures = 0
        continue
 
    # Copy the input line to the output line
    outfile.write(line)
    if line[:1] == ">":
        # Check for tag data fields
        if "<PUBCHEM_CACTVS_HBOND_DONOR>" in line:
            failures += int(next_line()) > 5
        elif "<PUBCHEM_CACTVS_HBOND_ACCEPTOR>" in line:
            failures += int(next_line()) > 10
        elif "<PUBCHEM_MOLECULAR_WEIGHT>" in line:
            failures += float(next_line()) > 500.0
        elif "<PUBCHEM_XLOGP3>" in line:
            failures += float(next_line()) > 5.0
            seen_xlogp = True

RDKit/PythonEdit

from rdkit import Chem
suppl = Chem.SDMolSupplier('benzodiazepine.sdf')
writer = Chem.SDWriter('benzodiazepine.r5.sdf')
for mol in suppl:
    if mol is None: continue
    if not mol.HasProp('PUBCHEM_XLOGP3'):
        mol.SetProp('RULE5','no logp')
    else:
        rule5=0
        if int(mol.GetProp('PUBCHEM_CACTVS_HBOND_DONOR')) > 5:
            rule5+=1
        if int(mol.GetProp('PUBCHEM_CACTVS_HBOND_ACCEPTOR')) > 10:
            rule5+=1
        if float(mol.GetProp('PUBCHEM_MOLECULAR_WEIGHT')) > 500.0:
            rule5+=1
        if float(mol.GetProp('PUBCHEM_XLOGP3')) > 5.0:
            rule5+=1
        if rule5>1:
            mol.SetProp('RULE5','0')
        else:
            mol.SetProp('RULE5','1')
    writer.write(mol)
writer.flush()

This updates the #2 line of each molfile to mention "RDKit".


SDF toolkitEdit

The SDF toolkit has been programmed to handle the SD tag data. No changes are made to the connection table or the other SD tags.

Standard input and output are used.


#!/usr/bin/perl

require MDL_sdf ;


while( my $mol = MDL_sdf_non_parsed_molecule->readFromInput() ) {	
	my $logP = $mol->get("PUBCHEM_XLOGP3");
	my $rule5 = "no logp";
	
	if(defined($logP)) {
		my $count = $mol->get("PUBCHEM_CACTVS_HBOND_DONOR") > 5;
		$count += $mol->get("PUBCHEM_CACTVS_HBOND_ACCEPTOR") > 10;
		$count += $mol->get("PUBCHEM_MOLECULAR_WEIGHT") > 500;
		$count += $logP > 5;

		$rule5 = ($count>=3 ? 1:0);
	
	}
	$mol->set("RULE5", $rule5);
	$mol->write()
}

Cactvs/TclEdit

prop create E_MY_LIPINSKY datatype string originalname RULE5
set fh1 [molfile open benzodiazepine.sdf.gz]
set fh2 [molfile open stdout w format sdf]
molfile loop $fh1 eh {
  if {"E_PUBCHEM_XLOGP3" ni [molfile get $fh1 fields]} {
     ens set $eh E_MY_LIPINSKY "no logp"
  } else {
     ens set $eh E_MY_LIPINSKY [ens expr $eh {((PUBCHEM_CACTVS_HBOND_DONOR<=5)+ \
       (PUBCHEM_CACTVS_HBOND_ACCEPTOR<=10)+(PUBCHEM_MOLECULAR_WEIGHT<=500)+(PUBCHEM_XLOGP3<=5))>=3}]
  }
  molfile set $fh2 writelist [concat [molfile get $fh1 fields] E_MY_LIPINSKY]
  molfile write $fh2 $eh
}

Above solution re-encodes the atoms and bonds block, but has the advantage that it works directly on gzip'ed files.

Below is an alternative solution with a byte-exact copy of the input records minus the end-of-record marker, which are then added by direct writing of the additional field to the output channel. The disadvantage is that since the I/O module needs to backspace, this only works (with reasonable performance) on uncompressed input data. Nevertheless, the capability to output a byte-exact copy of the leading original input record is a major advantage not provided by other toolkits, given the many obscure SGROUP data fields etc. that could be present in the input.

This solution additionally takes advantage of property expressions. All required property data associated with the structure object is automatically looked up via its primary name, or an alias such as the SD file tag. If the property data is not yet present, an attempt is made to compute it.

The record copy and line output commands automatically use the EOL characters configured on the structure file output channel.

set fh1 [molfile open benzodiazepine.sdf]   
set fh2 [molfile open stdout w format sdf writeend none]
molfile loop $fh1 eh {
    if {"E_PUBCHEM_XLOGP3" ni [molfile get $fh1 fields]} {
        set v "no logp"                    
    } else { 
        set v [ens expr $eh {((PUBCHEM_CACTVS_HBOND_DONOR<=5)+\
          (PUBCHEM_CACTVS_HBOND_ACCEPTOR<=10)+\
          (PUBCHEM_MOLECULAR_WEIGHT<=500)+(PUBCHEM_XLOGP3<=5))>=3}]  
    } 
    molfile copy $fh1 $fh2 1 -1
    molfile putline $fh2 "> <RULE5>" $v "" {$$$$}
}

But nobody stops us from opening a file more than once. Using this technique, we can combine byte-exact header copy and reading from compressed input:

set fh1a [molfile open benzodiazepine.sdf.gz]   
set fh1b [molfile open benzodiazepine.sdf.gz]   
set fh2 [molfile open stdout w format sdf writeend none]
molfile loop $fh1a eh {
    if {"E_PUBCHEM_XLOGP3" ni [molfile get $fh1a fields]} {
        set v "no logp"                    
    } else { 
        set v [ens expr $eh {((PUBCHEM_CACTVS_HBOND_DONOR<=5)+\
          (PUBCHEM_CACTVS_HBOND_ACCEPTOR<=10)+\
          (PUBCHEM_MOLECULAR_WEIGHT<=500)+(PUBCHEM_XLOGP3<=5))>=3}]  
    } 
    molfile copy $fh1b $fh2 1
    molfile putline $fh2 "> <RULE5>" $v "" {$$$$}
}

Cactvs/PythonEdit

And here the Python variant of the last solution:

fh1a=Molfile('benzodiazepine.sdf.gz')
fh1b=Molfile('benzodiazepine.sdf.gz')
fh2=Molfile('stdout','w',{'format':'sdf','writeend':'none'})
for eh in fh1a:
    if ('E_PUBCHEM_XLOGP3' not in fh1a.fields):
        v='no logp'
    else:
        v=eh.expr('((PUBCHEM_CACTVS_HBOND_DONOR<=5)+\
        (PUBCHEM_CACTVS_HBOND_ACCEPTOR<=10)+\
        (PUBCHEM_MOLECULAR_WEIGHT<=500)+(PUBCHEM_XLOGP3<=5))>=3')
    fh1b.copy(fh2,1)
    fh2.putline('> <RULE5>',v,'','$$$$')

C#/RegexpEdit

    private static String RECORD_END = "$$$$";
    private static String LINEFEED = "\n";
 
    public static void ImportSDF(SqlString sdf_path)
    {
 
        StreamReader sdfreader = new StreamReader(new FileStream(sdf_path.ToString(), FileMode.Open, FileAccess.Read));
 
        String line = "";
        String moltext = "";
        Dictionary<String, String> columns = new Dictionary<string,string>();
 
        while ((line = sdfreader.ReadLine()) != null)
        {
            if (line.Trim().Equals(RECORD_END))
            {
 
                String cols = "";
                String vlus = "";
                foreach (KeyValuePair<String, String> kv in mol2dict(moltext))
                {
                    cols += "[" + kv.Key + "] ,";
                    vlus += "'" + kv.Value.Replace("'", "''") + "' ,";
                    // DO something, insert into database/ write file/ ...
                }
                moltext = "";
                continue;
            }
            moltext += line + LINEFEED;
        }
    }
 
    private static Dictionary<String, String> mol2dict(String moltext)
    {
        List<String> fields = new List<string>();
        Dictionary<String, String> moldict = new Dictionary<string, string>();
 
        Regex regex_fields = new Regex("^> *?<(.*?)>", RegexOptions.IgnoreCase);
        Regex regex_attrs = new Regex("^> *?<(.*?)>.*\n([^>]*)", RegexOptions.Multiline);
        Regex regex_moltext = new Regex("^(.*?M\\s+END)", RegexOptions.Singleline);
 
        Match m0 = regex_moltext.Match(moltext);
        moldict["mol_text"] = m0.Groups[1].ToString();
 
        MatchCollection mc1 = regex_attrs.Matches(moltext);
        for (int i = 0; i < mc1.Count; i++)
        {
            fields.Add(mc1[i].Groups[1].ToString());
            moldict[mc1[i].Groups[1].ToString().Trim()] = mc1[i].Groups[2].ToString().Trim();
        }
 
        return moldict;
 
    }

CDK/Groovy Edit

# newlines are lost from PUBCHEM_COORDINATE_TYPE and PUBCHEM_BONDANNOTATIONS
 
@GrabResolver(
  name='idea',
  root='http://ambit.uni-plovdiv.bg:8083/nexus/content/repositories/thirdparty/'
)
@Grapes([
  @Grab(
    group='org.openscience.cdk',
    module='cdk-io',
    version='1.4.11'
  ),
  @Grab(
    group='org.openscience.cdk',
    module='cdk-silent',
    version='1.4.11' 
  )
])
 
import org.openscience.cdk.io.*;
import org.openscience.cdk.io.iterator.*;
import org.openscience.cdk.silent.*;
import java.util.zip.GZIPInputStream;
 
iterator = new IteratingMDLReader(
  new GZIPInputStream(
    new File("benzodiazepine.sdf.gz")
      .newInputStream()
  ),
  SilentChemObjectBuilder.getInstance()
)
writer = new SDFWriter(
  new FileWriter("RULES5.sdf")
)
while (iterator.hasNext()) {
  mol = iterator.next()
  if (mol.getProperty("PUBCHEM_XLOGP3") == null) {
    mol.setProperty("RULE5", "no logP")
  } else {
    ruleCount = 0;
    if (Integer.valueOf(mol.getProperty("PUBCHEM_CACTVS_HBOND_ACCEPTOR")) <= 10) ruleCount++
    if (Integer.valueOf(mol.getProperty("PUBCHEM_CACTVS_HBOND_DONOR")) <= 5) ruleCount++
    if (Double.valueOf(mol.getProperty("PUBCHEM_MOLECULAR_WEIGHT")) <= 500.0) ruleCount++
    if (Double.valueOf(mol.getProperty("PUBCHEM_XLOGP3")) <= 5.0) ruleCount++
    mol.setProperty("RULE5", ruleCount >= 3  ? "1" : "0")
    writer.write(mol)
  }
}
writer.close()

Around Wikia's network

Random Wiki