FANDOM


A number of molecular descriptors are based on how many times a given SMARTS pattern is uniquely found in a structure. For example, the CACTVS substructure key 122 is set if there are ">= 2 any ring size 3". That query can be written in SMARTS as "*1**1", but of course each ring of size 3 will match 6 times because of symmetry.

Most toolkits have have a way to find all matches for a given SMARTS and a way to find all unique matches for a given SMARTS. "Unique" here means that no two different matches will have the same set of matched atoms. The point of this task is to show how that's done.

ImplementationEdit

Given the SMILES structure "C1CC12C3(C24CC4)CC3" (which is PubChem CID 141640), how many times does "*1**1" match the structure and how many times does the same SMARTS match the structure uniquely?

The answers should be 24 and 4, respectively.

OpenBabel/PybelEdit

import pybel
mol = pybel.readstring("smi", "C1CC12C3(C24CC4)CC3")
smarts = pybel.Smarts("*1**1")
 
# pybel doesn't have a direct way to get the non-unique matches, so
# use the lower-level OpenBabel API directly
smarts.obsmarts.Match(mol.OBMol)
num_matches = sum(1 for indicies in smarts.obsmarts.GetMapList())
 
num_unique_matches = len(smarts.findall(mol))
 
print "number of matches:", num_matches
print "number of unique matches:", num_unique_matches

OpenBabel/RubabelEdit

require 'rubabel'
mol = Rubabel["C1CC12C3(C24CC4)CC3"]
[false, true].map {|uniq| puts mol.matches("*1**1", uniq).size }

OpenEye/PythonEdit

from openeye.oechem import *
 
mol = OEGraphMol()
OEParseSmiles(mol, "C1CC12C3(C24CC4)CC3")
pat = OESubSearch("*1**1")
 
num_matches = sum(1 for match in pat.Match(mol))
num_unique_matches = sum(1 for match in pat.Match(mol, True))
 
print "number of matches:", num_matches
print "number of unique matches:", num_unique_matches

RDKit/PythonEdit

from rdkit import Chem
mol = Chem.MolFromSmiles('C1CC12C3(C24CC4)CC3')
patt = Chem.MolFromSmarts('*1**1')
 
print "number of matches:", len(mol.GetSubstructMatches(patt,uniquify=False))
print "number of unique matches:", len(mol.GetSubstructMatches(patt))

Cactvs/TclEdit

set eh [ens create C1CC12C3(C24CC4)CC3]
puts "Number of matches: [match ss -mode all *1**1 $eh]"
puts "Number of atom set unique matches: [match ss -mode distinct *1**1 $eh]"
puts "Number of topologically unique matches: [match ss -mode unique *1**1 $eh]"

The results are 24, 4 and 2. "Unique" in Cactvs match nomenclature means that not only the matched atoms cannot be the same set, but they also must be topologically distinct from any other match set. The "distinct" match mode simply checks whether a different set of structure atoms was matched (the Rosetta problem), which is a far simpler task.

Cactvs/PythonEdit

The same in Python:

e=Ens('C1CC12C3(C24CC4)CC3')
print('Number of matches:',match('ss','*1**1',e,mode='all'))
print('Number of atom set unique matches:',match('ss','*1**1',e,mode='distinct'))
print('Number of topologically unique matches:',match('ss','*1**1',e,mode='unique'))

CDK/Groovy Edit

The substructure searching code in the CDK is based on a edge matching algorithm, limiting it such that cyclopropane and isobutane cannot be distinguished. Hence, the workaround:

import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.smiles.*;
import org.openscience.cdk.smiles.smarts.*;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
 
SmilesParser sp = new SmilesParser(SilentChemObjectBuilder.getInstance());
atomContainer = sp.parseSmiles("C1CC12C3(C24CC4)CC3");
querytool = new SMARTSQueryTool("*1**1");
 
found = querytool.matches(atomContainer);
if (found) {
  mappings = querytool.getMatchingAtoms()
  hits = 0
  for (int i = 0; i < mappings.size(); i++) {
    atomIndices = mappings.get(i);
    if (atomIndices.size() == 3) {
      // work around the cyclopropane / isobutane equivalence
      hits++
    }
  }
  println "hits: $hits"
 
  mappings = querytool.getUniqueMatchingAtoms()
  uniqueHits = 0
  for (int i = 0; i < mappings.size(); i++) {
    atomIndices = mappings.get(i);
    if (atomIndices.size() == 3) {
      // work around the cyclopropane / isobutane equivalence
      uniqueHits++
    }
  }
  println "unique hits: $uniqueHits"
}

Ad blocker interference detected!


Wikia is a free-to-use site that makes money from advertising. We have a modified experience for viewers using ad blockers

Wikia is not accessible if you’ve made further modifications. Remove the custom ad blocker rule(s) and the page will load as expected.