Saturday, 6 May 2017

Sharp Tools for Java Refactoring: Byte Code Analysis

I'm currently refactoring parts of the CDK core classes. As part of this I often need to find specific patterns/idioms that need to be changed. Whilst source code inspections and an IDE can make this task easy sometimes the tools aren't quite sharp enough.

I needed to find all occurrences of a reference (instead of value) comparison on a particular class. In Java there is no operator overload and so you can have situations like this:
Integer a = new Integer(25);
Integer b = new Integer(25);
if (a == b) {} // false
if (a.equals(b)) {} // true
I mentioned operating overloading but it's more subtle and is more about comparing reference vs. value comparison. In C/C++ we can have similar behaviour:
int aval = 25, bval = 25;
int *a = &aval;
int *b = &bval;
if (a == b) {} // false
if (*a == *b) {} // true
Most IDE's and code inspection programs will warn about common occurrences (for example Integer) but I wanted to find places where the CDK's classes were used like this. A simple text grep will find some but will have false positives and negatives requiring lots of manual checking. Daniel suggested the well known FindBugs might be able help.

Rather than analyze source code like PMD and Checkstyle, FindBugs analyses Java byte code with a set of defaults detectors to find often subtle but critical mistakes. FindBugs can be configured with custom detectors (see here), however the inspection I needed (RC: Suspicious reference comparison to constant) was almost there. After digging around in the source code I found you can provide a list of custom classes to detect. However it took a bit of trial and error to get what I needed.

First up we turn off all inspections except for the one we're looking for (we need to fix many others reported but I was looking for something specific). To do this we create an XML config that will only run the specific inspection (RC for Reference Comparison):
findbugs-include.xml
<?xml version="1.0" encoding="UTF-8"?>
<FindBugsFilter>
  <Match>
    <Bug code="RC"/>
  </Match>
</FindBugsFilter>
We then run findbugs with this configuration and provide the frc.suspicious property.
Running findbugs
$> findbugs -textui \
            -include findbugs-include.xml \
            -property "frc.suspicious=org.openscience.cdk.interfaces.IAtom" \
            base/*/target/cdk-*-2.0-SNAPSHOT.jar 
This produces an accurate report of all the places the references are compared. Here's a sample:
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getOther(IAtom)  At Bond.java:[line 253]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getConnectedAtom(IAtom)  At Bond.java:[line 265]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.getConnectedAtoms(IAtom)  At Bond.java:[line 281]
H C RC: Suspicious comparison of org.openscience.cdk.interfaces.IAtom references in org.openscience.cdk.Bond.contains(IAtom)  At Bond.java:[line 300]
...

Monday, 3 April 2017

CDK AtomContainer's are Slow - Lets fix that

The core class for molecule representation in CDK is the AtomContainer. The AtomContainer uses an edge-list data structure for storing the underlying connection table (see The Right Representation for the Job).

Essentially this edge-list representation is efficient in space. Atoms can be shared between and belong to multiple AtomContainers. Therefore querying connectivity (is this atom connected to this other atom) is linear time in the number of bonds.

The inefficiency of the AtomContainer can really sting. If someone was to describe Morgan's relaxation algorithm you may implement it like Code 1. The algorithm looks reasonable however it will run much slower than you expected. You may expect the runtime of this algorithm to be ~N2 but it's actually ~N3. I've annotated with XXX where the extra effort creeps in.
Code 1 - Naive Morgan-like Relaxation (AtomContainer/AtomIter)
// Step 1. Algorithm body
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
  next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) { // 0..numAtoms
  for (int j = 0; j < mol.getAtomCount(); j++) {     // 0..numAtoms
    IAtom atom = mol.getAtom(j);
    // XXX: linear traversal! 0..numBonds
    for (IBond bond : mol.getConnectedBondsList(atom)) {
      IAtom nbr = bond.getConnectedAtom(atom); 
      // XXX: linear traversal! 0..numAtoms avg=numAtoms/2
      next[j] += prev[mol.getAtomNumber(nbr)]; 
    }
  }o
  System.arraycopy(next, 0, prev, 0, next.length);
}

A New Start: API Rewrite?


Ultimately to fix this problem correctly, would involve changing the core AtomContainer representation, unfortunately this would require an API change, optimally I think adding the constraint that atoms/bonds can not be in multiple molecules would be needed**. This would be a monumental change and not one I can stomach right now.

Existing Trade Off: The GraphUtil class


In 2013 I added the GraphUtil class for converting an AtomContainer to a more optimal adjacency list (int[][]) that was subsequently used to speed up many algorithms including: ring finding, canonicalisation, and substructure searching. Each time one of these algorithm is invoked with an IAtomContainer the first step is to build the adjacency list 2D array.

Code 2 - GraphUtil usage
IAtomContainer mol = ...;
int[][]        adj = GraphUtil.toAdjList(mol);

// optional with lookup map to bonds
EdgeToBondMap  e2b = EdgeToBondMap.withSpaceFor(mol);
int[][]        adj = GraphUtil.toAdjList(mol, e2b);

Although useful the usage of GraphUtil is somewhat clunky requiring passing around not just the adjacency list but the original molecule and the EdgeToBondMap if needed.
Code 3 - GraphUtil Depth First Traversal
void visit(IAtomContainer mol, int[][] adj, EdgeToBondMap bondmap, int beg, int prev) {
  mol.getAtom(beg).setFlag(CDKConstants.VISITED, true);
  for (int end : adjlist[beg]) {
    if (end == prev)
      continue;
    if (!mol.getAtom(end).getFlag(CDKConstants.VISITED))
      visit(mol, adj, bondmap, end, beg);
    else
      bondmap.get(beg, end).setIsInRing(true); // back edge
  }
}

Using the GraphUtil approach has been successful but due to the clunky-ness I've not felt comfortable exposing the option of passing these through to public APIs. It was only ever meant as an internal optimisation to be hidden from the caller. Beyond causing unintentional poor performance (Code 1) what often happens in a workflow is GraphUtil is invoked multiple times. A typical use case would be matching multiple SMARTS against one AtomContainer.

A New Public API: Atom and Bond References


I wanted something nicer to work with and came up with the idea of using object composition to extend the existing Atom and Bond APIs with methods to improve performance and connectivity checks.

Essentially the idea is to provide two classes, and AtomRef and BondRef that reference a given atom or bond in a particular AtomContainer. An AtomRef knows about the original atom it's connected bonds and the index, the BondRef knows about the original bond, it's index and the AtomRef for the connected atoms. The majority of methods (e.g. setSymbol, setImplicitHydrogenCount, setBondOrder) are passed straight through to the original atom. Some methods (such as setAtom on IBond) are blocked as being unmodifiable.

Code 4 - AtomRef and BondRef structure
class AtomRef implements IAtom {
  IAtom         atm;
  int           idx;
  List<BondRef> bnds;
}

class BondRef implements IBond {
  IBond         bnd;
  int           idx;
  AtomRef       beg, end;
}

We can now re-write the Morgan-like relaxation (Code 1) using AtomRef and BondRef. The scaling of this algorithm is now ~N2 as you would expect.
Code 5 - Morgan-like Relaxation (AtomRef/AtomIter)
// Step 1. Initial up front conversion cost
AtomRef[] arefs = AtomRef.getAtomRefs(mol);

// Step 2. Algorithm body
int[]   prev  = new int[mol.getAtomCount()];
int[]   next  = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
  next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) {
  for (AtomRef aref : arefs) {
    int idx = aref.getIndex();
    for (BondRef bond : aref.getBonds()) {
      next[idx] += prev[bond.getConnectedAtom(aref).getIndex()];
    }
  }
  System.arraycopy(next, 0, prev, 0, next.length);
}   

The depth first implementation also improves in readability and only requires two arguments.
Code 6 - AromRef Depth First (AtomRef/AtomFlags)
// Step 1. Initial up front conversion cost
void visit(AtomRef beg, BondRef prev) {
  beg.setFlag(CDKConstants.VISITED, true);
  for (BondRef bond : beg.getBonds()) {
    if (bond == prev)
      continue;
    AtomRef nbr = bond.getConnectedAtom(beg);
    if (!nbr.getFlag(CDKConstants.VISITED))
      visit(nbr, bond);
    else
      bond.setIsInRing(true); // back edge
  }
} 


Benchmark


I like the idea of exposing the AtomRef and BondRef to public APIs. I wanted to check that the trade-off in calculating and using the AtomRef/BondRef vs the current internal GraphUtil. To test this I wrote a benchmark that implements some variants of a Depth First Search and Morgan-like algorithms. I varied the algorithm implementations and whether I used, IAtomContainer, GraphUtil, or AtomRef.

The performance was measured over ChEMBL 22 and averaged the run time performance over 1/10th (167,839 records). You can find the code on GitHub (Benchmark.java). Each algorithm computes a checksum to verify the same work is being done. Here are the raw results: depthfirst.tsv, and relaxation.tsv.


Depth First Traversal


A Depth first traversal is a linear time algorithm. I tested eight implementations that varied the graph data structure and whether I used an external visit array or atom flags to mark visited atoms. When looking just at initialisation time the AtomRef creation is about the same as GraphUtil. There was some variability between the different variants but I couldn't isolate where the different came from (maybe GC/JIT related). The runtime of the AtomRef was marginally slower than GraphUtil. Both were significantly faster (18-20x) than the AtomContainer to do the traversal. When we look at the total run-time (initialisation+traversal) we see that even for a linear algorithm, the AtomRef (and GraphUtil) were ~3x faster. Including the EdgeToBondMap adds a significant penalty.




Graph Relaxation


A more interesting test is a Morgan-like relaxation, as a more expensive algorithm (N2) it should emphasise any difference between the AtomRef and GraphUtil. The variability in this algorithm is whether we relax over atoms (AtomIter - see Code 1/5) or bonds (BondIter). We see a huge variability in AtomContainer/AtomIter implementation. This is because the algorithm is more susceptible to difference in input (molecule) size.



Clearly the AtomContainer/AtomIter is really bad (~80x slower). Excluding this results shows that as expected the AtomRef/AtomIter is slower than the GraphUtil/AtomIter equivalent (~2x slower). However because the AtomRef has a richer syntax, we can do a trick with XOR number storage to improve performance or iterate over bonds (BondIter) giving like-for-like speeds.



Conclusions


The proposed AtomRef and BondRef provide a convenience API to use the CDK in a natural way with efficient connectivity access. The conversion to an AtomRef is efficient and provides a speedup even for linear algorithms. The encapsulation facilities the passing as a public API parameter, users will be able to compute it ahead of time and pass it along to multiple algorithms.

I'm somewhat tempted to provide an equivalent AtomContainerRef allowing a drop-in replacement for methods that take the IAtomContainer interface. It is technically possible to implement writes (e.g. delete bond) efficiently in which case it would no longer be a 'Ref'. Maybe I'll omit that functionality or use a better name?

Footnotes


  • ** My colleague Daniel Lowe notes that OPSIN allows atoms to be in multiple molecules and know about their neighbours but it's a bit of a fudge. It's certainly possible with some extra book keeping but prevents some other optimisations from being applied.

Sunday, 10 July 2016

Generic Structure Depiction

Last week I attended the Seventh Joint Sheffield Conference on Chemoinformatics. It was a great meeting with some cool science and attendees. I had the pleasure of chatting briefly with John Barnard who's contributed a lot to the representation, storage, and retrieval of generic (aka Markush) structures (see Torus, Digital Chemistry - now owned by Lhasa).

At NextMove we've been doing a bit on processing sketches from patents (see Sketchy Sketches). I learnt a few things about how generic structures are typically depicted I thought be interesting to share.

Substituent Variation (R groups)


The most common type of generic feature is substituent variation, colloquially known as R groups. The variation allows concise representation with an invariant/fixed part of a compound and variable/optional part.


wherein R denotes

That is: anisole, toluene, or ethylbenzene.

Substituent Labels


Multiple substituent labels may be distinguished by a number R1, R2, ... Rn. However in reality, any label can and will be used. This can be particularly confusing when they collide with elements, examples include: Ra (Radium), Rg (Roentgenium) B (Boron), D (Deuterium), Y (Yttrium), W (Tungsten). The distinction between the label Ra and Radium may be semantically captured by a format but lost in depiction.

To distinguish such labels we can style them differently. By using superscripting and italicizing the label the distinction becomes clear and also somewhat improves the aesthetics of numbered R groups. We avoid subscript due to ambiguities with stoichiometry, for example: –NR2.

Attachment Points


For substituents there are different notation options. In writing, radical nomenclature is used, for the above example we'd say: methyl-oxyl (-OMe), ethyl (-Et), or methyl (-Me). However this doesn't translate well to depictions: .

The CTfile actually does stores substituents this way and specifies the attachment point (APO) information separately.
$RGP
  1
$CTAB
  2  1  0  0  0  0            999 V2000
    1.9048   -0.0893    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.6192    0.3232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  APO  1   1   1
M  END
$END CTAB
$CTAB
  1  0  0  0  0  0            999 V2000
    1.9940   -1.2869    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  APO  1   1   1
M  END
$END CTAB
$CTAB
  2  1  0  0  0  0            999 V2000
    1.8750   -2.3286    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5895   -1.9161    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
M  APO  1   1   1
M  END
$END CTAB
$END RGP
Alternatively we may use a virtual or 'null' atom. We can convert to/from CTfile format although it's slightly easier to delete the null atom that add it on, due to coordinate generation. A disadvantage of this is the atom count isn't accurate, however the labelled group is also a type of null atom and already distorts the atom count. There are unfortunately different ways of depicting this null atom.
Don't use a dative bond style! You have to fudge the valences and just doesn't work, how would I show a double bond attachment?

The first time I'd encountered attachment points was in ChEBI where and R group means 'something attaches here' (CHEBI:58314, CHEBI:52569), whilst a 'star' label means 'attaches to something' (CHEBI:37807, CHEBI:32861). This actually a nice way of thinking about it, like two jigsaw pieces the asymmetry allows the substituent to connect to the labelled atom.

The 'star' atom used by ChEBI is tempting to use as there is a star atom in SMILES.
*OC
*C
*CC

However a '*' in SMILES actually means 'unspecified atomic number', some toolkits impose additional semantics. ChemAxon reads a 'star' to mean 'any atom', whilst OEChem, Indigo, and OpenBabel actually read more like an R Group, with [*:1] and [*:2] being R1 and R2 etc. ChemAxon Extended SMILES allows us to explicitly encode attachment points.
*OC |$_AP1$|
*C |$_AP1$|
*CC |$_AP1$|
I opted to implement the wavy line notation in CDK which is preferred by IUPAC graphical representation guidelines.
A major disadvantage of this notation is mis-encoding by users mistaking it for a wavy up/down stereo bond. I talk more about this in the poster (Sketchy Sketches) but the number of times you see the following drawn:
The captured connection table for that sketch does not have null atoms but instead uses carbon:

Saturday, 21 November 2015

Bringing Molfile Sgroups to the CDK - Rendering Tips

In the last but one post I gave a demonstration of S(ubstructure)group rendering in the CDK. Now I want to give some implementation insights.

Abbreviations (Superatoms)

Abbreviations contract part of a structure to a line formula or common abbreviation.


Full-structureAbbreviated-structure

Abbreviating too much or using unfamiliar terms (e.g. maybe using CAR for carbazole) can make a depiction worse. However some structures, like CHEMBL590010, can be dramatically improved.

CHEMBL590010

One way to implement abbreviations would be by modifying the molecule data structure with literal collapse/contract and expand operations. Whilst this approach is perfectly reasonable, deleting atoms/bonds is expensive (in most toolkits) and it somewhat subtracts the "display shortcut" nature of this Sgroup.

For efficiency abbreviations are implemented by hiding parts of the depictions and remapping symbols. Just before rendering we iterator over the Sgroups and set hint flags that these atoms/bonds should not be included in the final image. If there is one attachment (e.g. Phenyl) we remap the attach point symbol text to the abbreviation label ('C'->'Ph'). When there are no attachments (e.g. AlCl3) we add a new symbol to the centroid of the hidden atoms.

Hide atoms and bonds Symbol Remap Abbreviated Result

For two or more attachments (e.g. SO2) you also need coordinate remapping.

Multiple Group

Multiple groups allow, contraction of a discrete number of repeating units. They are handled similarly to the abbreviations except we don't need to remap parts.

CHEBI:1233

All atoms are present in the data structure but are laid out on top of each other (demonstrated below). We have a list of parent atoms that form the repeat unit. Therefore to display multiple groups we hide all atoms and bonds in the Sgroup except for parent atoms and the crossing bonds.

It's worth mentioning that hidden symbols are still generated but simply excluded from the final result. This allows bond back off for hetero atoms to be calculated correctly as is seen in this somewhat tangled example:

Brackets

Polymer and Multiple group Sgroups require rendering of brackets. Encoded in the molfile (and when laid out) brackets are described by two points, a line. It is therefore up to the renderer to decide which side of the line the tick marks should face.

I've seen some implementations use the order of the points to convey bracket direction. Another method would be to point the brackets at each other. As shown for CHEBI:59342 this is not always correct.


Poor bracket direction Preferred bracket direction
CHEBI:59342

I originally thought the solution might involve a graph traversal/flood-fill but it turns out there is a very easy way to get the direction correct. First we consider that brackets may or may not be placed on bonds, if a bracket is on a bond this information is available (crossing bonds).

  • For a bracket on a crossing bond exactly one end atom will be contained in the Sgroup, the bracket should point towards this atom.
  • If a bracket doesn't cross a bond then the direction should point to the centroid of all atoms in the Sgroup.

Saturday, 17 October 2015

Java Serialization: Great power but at what cost?

The default Java serialization framework provides a convenient mechanism for streaming in-memory Objects to another computer or storing them on disk. Beyond the obvious badness of being tied to the internal object layout (i.e. not stable through changes), serialization can be very inefficient. Externalization and libraries like Kyro are popular for improving performance.

SMILES: CO[C@@H]([C@H](OC(C)=O)[C@@H](OC(C)=O)[C@H](OC(C)=O)[C@H](OC(C)=O)COC(C)=O)SC

In the domain of Chemistry we have a rich variety of formats (e.g. SMILES) with which we can store molecules and reactions (in memory these are labelled graphs). Although these formats do not completely fulfil the utility of Object serialization they can be used as building block upon which we build. Not only are these defacto standards but they can be much faster and compact than default serialization of the in-memory connection table (graph) representation.

Recent History

Crafting efficient (de)serialization is beneficial and you can get great speed with simple setup. A few years ago I ran some experiments on writing an externalization stream for the Chemistry Development Kit (CDK) molecules (thread - High Performance Structure IO). Since the objects are huge any improvement over the default would be useful. This partly fed into the needs of CDK-Knime (a workflow tool) where I think CML was being used originally. From testing on ChEBI (~20,000 molecules) we see actually the ObjectInputStream was about as fast as an SDfile and much faster than CML. SDfiles are now much faster but that would be another post.

Read Performance
Method Time Size Throughput
AtomContainerStream 346 ms 11.1 MiB 63739 s-1
SDfile 4159 ms 51.7 MiB 5302 s-1
CML 18605 ms 91.5 MiB 1185 s-1
ObjectInputStream 5552 ms 93.9 MiB 3972 s-1

It was around that time that Andrew Dalke payed a visit to EMBL-EBI. In discussing what I was currently working on he promptly showed me how fast OEChem could read/write SMILES. Needless to say – pretty quick and as fast if not faster than my attempt at 'High Throughput' streaming.

The CDK now also has fast SMILES processing and I wanted to compare this to the serialization to see how much of a performance penalty there is.

Benchmark

For a benchmark I used 100,000 structures for ChEMBL 20.

$ shuf chembl_20.smi | head -n 100,000 > chembl_20_subset.smi

Writing it to a ObjectOutputStream takes 28.78 seconds. The SMILES subset file takes up 6.8 MiB on disk whilst the serialized objects take up 295 MiB. Ouch, that's 42x larger.

Code 1 - Writing to an ObjectOutputStream
IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
SmilesParser smipar = new SmilesParser(bldr);

String srcname = "/data/chembl_20_subset.smi";
String destname = "/data/chembl_20_subset.obj";

try (InputStream in = new FileInputStream(srcname);
     Reader rdr = new InputStreamReader(in, StandardCharsets.UTF_8);
     BufferedReader brdr = new BufferedReader(rdr);
     ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(destname))) {
    String line;
    long t0 = System.nanoTime();
    while ((line = brdr.readLine()) != null) {
        try {
            IAtomContainer mol = smipar.parseSmiles(line);

            // stereochemistry does not implement serializable...
            // so need to remove it
            mol.setStereoElements(new ArrayList(0));

            oos.writeObject(mol);
        } catch (CDKException e) {
            System.err.println(e.getMessage());
        }
    }
    long t1 = System.nanoTime();
    System.err.printf("write time: %.2f s\n", (t1 - t0) / 1e9);
}

In CDK we first read SMILES with Beam and then convert to the CDK objects so we'll also look at that small overhead. Here I compare the time to read the 100,000 SMILES using Beam, CDK, and the objects using an ObjectInputStream. Both CDK and Beam take less than 1 second whilst the ObjectInputStream takes more than 50.

In terms of throughput (mol per sec) here is the kind of speed we hit. I also show the total elapsed time for all 15 repeats.

MethodMinMaxElapsed TimeSize
Deserialization1961 s-12089 s-112 m 16 s295 MiB
Kryo (Auto)42401 s-144557 s-133.9 s186 MiB
Kryo Unsafe (Auto)44854 s-147331 s-131.9 s231 MiB
CDK135286 s-1142126 s-110.7 s6.8 MiB
Beam347534 s-1489545 s-13.2 s6.8 MiB

Auxiliary Data

With a performance difference that huge why would anyone want to use Serialization? Some use-cases might be that a format doesn't store the parts we need. A common argument against SMILES is the lack of coordinates but we can simply store this supplementary to the SMILES if we no what the input order will be (Code 2).

Code 2 - Writing Coordinates with SMILES
IAtomContainer  mol = ...;
// 'Generic' - avoid canon SMILES we are not doing identity check
SmilesGenerator sg  = SmilesGenerator.generic(); 

int   n     = mol.getAtomCount();
int[] order = new int[n];

// the order array is filled up as the SMILES is generated
String smi = sg.create(mol, order);

// load the coordinates array such that they are in the order the atoms
// are read when parsing the SMILES
Point2d[] coords = new Point2d[mol.getAtomCount()];
for (int i = 0; i < coords.length; i++)
  coords[order[i]] = container.getAtom(i).getPoint2d();

// SMILES string suffixed by the coordinates
String smi2d = smi + " " + Arrays.toString(coords);

Using that same technique it's relatively simply to extend this to handle arbitrary data fields and it even forms the basis of ChemAxon's extended SMILES. A more advanced method would be combining the SMILES with a DataOutputStream since we know how many coordinates there are expected to be.

Summary

I'm certainly not against a performant AtomContainerInputStream but the default Java serialization should never be the first choice. Hopefully this post has put some numbers on why and will discourage knee-jerk usage.

Update

Added Kryo performance