Chemistry Toolkit Rosetta Wiki

Ertl, Rohde, and Selzer (J. Med. Chem., 43:3714-3717, 2000) published an algorithm for fast calculation of molecular polar surface area (PSA). Part of it involves summing up partial surface values based on fragment contributions. Each fragment corresponds to a SMARTS match.


The goal of this task is to get an idea of how to do a set of SMARTS matches when the data comes in from an external table. In this case it's a data table from TJ O'Donnell's CHORD chemistry extension for PostgreSQL, listed at http://www.gnova.com/book/tpsa.tab and available for use here with permission. Each line in the file contains three tab-separated fields. The first line is the header. Each of the other lines defines a fragment contribution. The first field is the partial surface area contribution for each match of the SMARTS pattern in the second field. The last field is a comment. Note that the first SMARTS definition contains a typo: it should be "[N+0;H0;D1;v3]" instead of "[N0;H0;D1;v3]".
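The file layout described above can be read without any chemistry toolkit. The following is a minimal sketch, assuming the three-column layout from the description. The sample rows, their values, the header names, and the helper name read_patterns are all made up for illustration and are not taken from tpsa.tab; only the typo correction comes from the text above.

```python
import io

# Illustrative stand-in for tpsa.tab. The values, comments and header names
# are invented; only the tab-separated three-column layout is from the task.
SAMPLE = (
    "value\tsmarts\tcomment\n"
    "1.0\t[N0;H0;D1;v3]\tfirst row, carrying the known typo\n"
    "2.0\t[O;H1]\tsecond row\n"
)

# Known correction for the first SMARTS definition in the table.
TYPO_FIXES = {"[N0;H0;D1;v3]": "[N+0;H0;D1;v3]"}

def read_patterns(handle):
    """Return a list of (value, smarts) pairs from a tab-separated table."""
    patterns = []
    next(handle)  # skip the header line
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        value, smarts = fields[0], fields[1]  # the comment field is ignored
        smarts = TYPO_FIXES.get(smarts, smarts)
        patterns.append((float(value), smarts))
    return patterns

patterns = read_patterns(io.StringIO(SAMPLE))
print(patterns)
```

For the real file, the same function could be given an open file handle, for example read_patterns(open("tpsa.tab")). The SMARTS strings would then still need to be compiled by whichever toolkit is in use.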

To compute the topological polar surface area (for purposes of this task) of a given structure, take the sum over all fragment contributions, weighted by the number of times that fragment matches.

Implementation

Write a function or method named "TPSA" which gets its data from the file "tpsa.tab". The function should take a molecule record as input, and return the TPSA value as a float. Use the function to calculate the TPSA of "CN2C(=O)N(C)C(=O)C1=C2N=CN1C". The answer should be 61.82, which agrees exactly with Ertl's online TPSA tool but not with PubChem's value of 58.4.
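As a sanity check on the expected answer, the weighted sum can be done by hand. This sketch is not a toolkit implementation. The per-fragment values are recalled from Ertl's published table and should be checked against tpsa.tab, and the match counts assume Daylight-style aromaticity, under which both rings of the test structure are aromatic. The fragment labels are descriptive names only, not SMARTS from the file.

```python
# Hand count for CN2C(=O)N(C)C(=O)C1=C2N=CN1C, assuming both rings are aromatic.
# Each entry maps a descriptive label to (contribution, number of matches).
# The contributions are assumed values; verify them against tpsa.tab.
fragments = {
    "aromatic N with a substituent": (4.93, 3),
    "aromatic N with two ring neighbors": (12.89, 1),
    "carbonyl O": (17.07, 2),
}

# Sum of contributions, each weighted by its match count.
total = sum(value * count for value, count in fragments.values())
print(round(total, 2))
```

Because the result is a float, a comparison against 61.82 is safer with a tolerance or after rounding than with exact equality.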

Indigo/Python

import sys
import collections
import indigo
 
indigo = indigo.Indigo()

# Some place to store the pattern definitions
Pattern = collections.namedtuple("Pattern", ["value", "subsearch"])
patterns = []
 
# Get the patterns from the tpsa.tab file, ignoring the header line
for line in open("tpsa.tab").readlines()[1:]:
    # Extract the fields
    value, smarts, comment = line.split("\t")
 
    subsearch = indigo.loadSmarts(smarts)
 
    # Store for later use
    patterns.append( Pattern(float(value), subsearch) )
 
# Helper function to count how many times a substructure matches
def count_matches(subsearch, mol):
    return indigo.countSubstructureMatches(subsearch, mol)
 
def TPSA(mol):
    "Compute the topological polar surface area of a molecule"
    return sum(count_matches(pattern.subsearch, mol)*pattern.value
                   for pattern in patterns)
 
# Test it with the reference structure
mol = indigo.loadMolecule("CN2C(=O)N(C)C(=O)C1=C2N=CN1C")
print(TPSA(mol))

OpenBabel/Rubabel

require 'rubabel'
lines = IO.readlines("tpsa.tab")
header = lines.shift
@patterns = lines.map {|line| line.chomp.split("\t") }

def TPSA(mol)
  @patterns.inject(0.0) {|s,p| s + p[0].to_f * mol.matches(p[1], false).size }
end

puts TPSA( Rubabel["CN2C(=O)N(C)C(=O)C1=C2N=CN1C"] )

OpenEye/Python

from openeye.oechem import *
import collections

# Some place to store the pattern definitions
Pattern = collections.namedtuple("Pattern", ["value", "subsearch"])
patterns = []

# Get the patterns from the tpsa.tab file, ignoring the header line
for line in open("tpsa.tab").readlines()[1:]:
    # Extract the fields
    value, smarts, comment = line.split("\t")

    # Use the SMARTS to define a subsearch object
    subsearch = OESubSearch(smarts)

    # Store for later use
    patterns.append( Pattern(float(value), subsearch) )

# Helper function to count how many times a substructure matches
def count_matches(subsearch, mol):
    return sum(1 for match in subsearch.Match(mol))

def TPSA(mol):
    "Compute the topological polar surface area of a molecule"
    return sum(count_matches(pattern.subsearch, mol)*pattern.value
                   for pattern in patterns)

# Test it with the reference structure
mol = OEGraphMol()
OEParseSmiles(mol, "CN2C(=O)N(C)C(=O)C1=C2N=CN1C")
print(TPSA(mol))

RDKit/Python

from rdkit import Chem
import collections

# Some place to store the pattern definitions
Pattern = collections.namedtuple("Pattern", ["value", "subsearch"])
patterns = []

# Get the patterns from the tpsa.tab file, ignoring the header line
for line in open("tpsa.tab").readlines()[1:]:
    # Extract the fields
    value, smarts, comment = line.split("\t")

    # Use the SMARTS to define a subsearch object
    subsearch = Chem.MolFromSmarts(smarts)

    # Store for later use
    patterns.append( Pattern(float(value), subsearch) )

# Helper function to count how many times a substructure matches
def count_matches(subsearch, mol):
    return len(mol.GetSubstructMatches(subsearch))

def TPSA(mol):
    "Compute the topological polar surface area of a molecule"
    return sum(count_matches(pattern.subsearch, mol)*pattern.value
                   for pattern in patterns)

# Test it with the reference structure
mol = Chem.MolFromSmiles("CN2C(=O)N(C)C(=O)C1=C2N=CN1C")
print(TPSA(mol))

Cactvs/Tcl

set cactvs(aromaticity_model) daylight
set eh [ens create CN2C(=O)N(C)C(=O)C1=C2N=CN1C]
set tpsa 0.0
table loop [table read tpsa.tab] row {
    lassign $row v smarts
    set tpsa [expr $tpsa+[match ss -charge 1 -mode distinct $smarts $eh]*$v]
}
puts $tpsa

The table reader needs no detailed instructions - it automatically and correctly analyzes the structure of the parameter file.

We need to switch the aromaticity model to the decidedly weird Daylight definition to get the requested result. Cactvs by default does not think that exocyclic keto groups are compatible with aromaticity. With its own model, the result is a familiar 58.44 (and that is no coincidence).
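The 58.44 figure can be accounted for with a little arithmetic. The sketch below assumes that the non-Daylight model leaves the six-membered ring non-aromatic, so that its two methylated nitrogens match a tertiary aliphatic nitrogen fragment instead of an aromatic one. The two contribution values are recalled from Ertl's published table and are assumptions to check against tpsa.tab; the constant names are made up for illustration.

```python
# Assumed per-atom contributions; verify against tpsa.tab.
AROMATIC_N_SUBSTITUTED = 4.93  # N counted as aromatic (Daylight model)
ALIPHATIC_N_TERTIARY = 3.24    # the same N counted as non-aromatic

daylight_tpsa = 61.82          # value requested by the task
reclassified_atoms = 2         # the two N atoms in the six-membered ring

# Swap the contribution of each reclassified nitrogen.
other_tpsa = daylight_tpsa - reclassified_atoms * (
    AROMATIC_N_SUBSTITUTED - ALIPHATIC_N_TERTIARY
)
print(round(other_tpsa, 2))
```

Under these assumptions the whole difference between the two reported values comes from how two nitrogen atoms are classified, not from the carbonyl oxygens.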

Cactvs/Python

cactvs['aromaticity_model']='daylight'
e=Ens('CN2C(=O)N(C)C(=O)C1=C2N=CN1C')
tpsa=0.0
for row in Table.Read('tpsa.tab'):
   tpsa +=match('ss',row[1],e,charge=True,mode='distinct')*row[0]
print(tpsa)