The gzip-compressed "benzodiazepine" SD file is used in many of the code examples. It is the result of a SMARTS substructure search of PubChem on 19 December 2009. The query SMARTS was "C2(CN=C(C1=CC(=CC=C1N2[*])[*])C3=CC=CC=C3[*])=[*]", which matches the core with connections to four different R-groups. The result contains 12,386 records and is available as a compressed SD file and as a compressed SMILES file containing the isomeric SMILES and PubChem ID for each record.
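
As a minimal sketch of reading the compressed SMILES file, the following uses only the Python standard library. It assumes each line holds an isomeric SMILES, then whitespace, then the PubChem ID; that layout, the function name `read_smiles_records`, and the filename in the usage note are assumptions for illustration, not details taken from the data set itself.

```python
import gzip


def read_smiles_records(path):
    """Yield (smiles, pubchem_id) pairs from a gzip-compressed SMILES file.

    Assumed layout (not confirmed by the data set): one record per line,
    with the SMILES first, then whitespace, then the PubChem ID.
    """
    with gzip.open(path, "rt", encoding="utf-8") as infile:
        for line in infile:
            fields = line.split(None, 1)
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            smiles, pubchem_id = fields
            yield smiles, pubchem_id.strip()
```

If the layout assumption holds, `sum(1 for _ in read_smiles_records("benzodiazepine.smi.gz"))` (a hypothetical filename) should report the 12,386 records mentioned above.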

If you try to match that SMARTS pattern against all of the structures in the data set, you will see that it fails for some of the records. For example, some compounds have no match for the "=[*]" term and others are in aromatic form. I believe PubChem does this because it is a general-purpose tool, not one meant for cheminformatics specialists who care about the specific aromatic representation and the exact meaning of the SMARTS string.
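
One way to see how many records fail is to tally matches and non-matches. The sketch below is deliberately toolkit-agnostic: `tally_matches` and its `matches_query` argument are hypothetical names, and the caller is assumed to supply a function that runs the SMARTS match in whichever cheminformatics toolkit is in use.

```python
def tally_matches(records, matches_query):
    """Split record IDs into those that match the query and those that do not.

    `records` is an iterable of (structure, record_id) pairs.
    `matches_query` is a caller-supplied callable (hypothetical here) that
    takes one structure and returns True if the SMARTS pattern matches it.
    """
    matched, unmatched = [], []
    for structure, record_id in records:
        if matches_query(structure):
            matched.append(record_id)
        else:
            unmatched.append(record_id)
    return matched, unmatched
```

The `unmatched` list then gives the IDs to inspect by hand. Keep in mind that the result depends on how the chosen toolkit perceives aromaticity, which is exactly the representation issue described above.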

