All tests were run on a Dell Studio XPS 435MT with an Intel i7-920 at 2.66 GHz, 6 GiB of memory, and a 250 GiB Samsung 840 SSD, running Ubuntu 14.04.
The test case was to read an mmCIF file and, for each atom in the atom_site table, extract its element, its name, its residue name, its chain identifier, its residue number, and its x, y, and z coordinates. This test case isolates the performance of extracting the atom data. A more realistic test case would include building the connectivity, but that would add the same overhead to all of the test programs.
Four programs were compared:
- simple
A stylized PDBx/mmCIF reader that is a step up from grepping the mmCIF file for ATOM and HETATM records. It parses the atom_site table headers to find the columns it is interested in, scans the first row of the atom_site table for the column offsets, and then extracts the atomic data. It stops scanning the file after reading the atom_site table, so it only works for files with a single data block.
- readcif
A fully conformant CIF reader that can switch between the PDBx/mmCIF stylized parsing and traditional parsing that tokenizes the input. It uses a callback for each table that an application wants parsed, and callbacks for individual columns. It knows the order in which tables need to be parsed, so it can skip tables and later jump back to reparse a table if necessary. It works for files with multiple data blocks.
Two variations are benchmarked: a version that takes advantage of PDBx/mmCIF styling, and one that just uses the default tokenizing code.
- cifparse-obj V7-1-05
The PDB’s example mmCIF parser that tokenizes the input and saves it as tables for later processing. No backtracking is needed, since everything is saved. It can also write mmCIF files. Works for files with multiple data blocks.
- ucif svn revision 18662, 2013-11-19
Part of “iotbx.cif: a comprehensive CIF toolbox”. It tokenizes the input file and has a virtual function for CIF loops, and a virtual function for data items that are not in a loop. It uses a lot of memory, but it is unclear whether it saves everything, like cifparse-obj, or makes just a single pass through the file. It works for files with multiple data blocks.
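The fixed-column approach described for the simple reader can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmarked code: the names `WANTED`, `field_starts`, `read_atom_site`, and `SAMPLE` are hypothetical, and the sketch assumes left-justified fixed columns, one row per line, and no quoted values that contain spaces.

```python
# Columns of interest; these are standard mmCIF atom_site item names.
WANTED = ("type_symbol", "label_atom_id", "label_comp_id", "label_asym_id",
          "label_seq_id", "Cartn_x", "Cartn_y", "Cartn_z")


def field_starts(row):
    """Return the character offset where each space-separated field begins."""
    starts = []
    previous = " "
    for offset, char in enumerate(row):
        if char != " " and previous == " ":
            starts.append(offset)
        previous = char
    return starts


def read_atom_site(lines):
    """Extract the WANTED columns from a stylized atom_site table.

    Hypothetical helper: assumes left-justified fixed columns and no
    quoted values that contain spaces.
    """
    prefix = "_atom_site."
    tags, slices, atoms = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            tags.append(line[len(prefix):].strip())
            continue
        if not tags:
            continue  # still before the atom_site table
        if not line.strip() or line.startswith(("#", "_", "loop_", "data_")):
            break  # table finished; stop scanning the file
        if slices is None:
            # Use the first row to learn where every column begins.
            starts = field_starts(line) + [None]
            if len(starts) - 1 != len(tags):
                raise ValueError("first row does not match the table header")
            slices = [slice(starts[i], starts[i + 1])
                      for i in (tags.index(name) for name in WANTED)]
        # Later rows are cut at the same offsets without re-tokenizing.
        atoms.append(tuple(line[s].strip() for s in slices))
    return atoms


# A tiny made-up table with left-justified, fixed-width columns.
SAMPLE = """\
data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM   1  N N   LYS A 1  10.000  2.500  -3.125
ATOM   2  C CA  LYS A 1  11.250  3.000  -2.000
HETATM 10 O O   HOH B .  -1.500  0.000  4.750
#
"""

atoms = read_atom_site(SAMPLE.splitlines())
```

Because every row after the first is sliced at offsets learned from the first row, a file whose columns are not actually fixed would be misread silently, which is why this shortcut is only safe for stylized files.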
And the results were:
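Test files:

=========  ==========  =========
PDB ID     # of atoms  file size
=========  ==========  =========
9rsa.cif         2106    260 KiB
2kzt.cif       203816     25 MiB
2kox.cif       787840     76 MiB
3j3q.cif      2440800    254 MiB
=========  ==========  =========

Code size:

=================  =========
program name       code size
=================  =========
simple                15 KiB
readcif stylized      87 KiB
readcif tokenized     83 KiB
=================  =========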
The reported time is the lowest of 20 consecutive runs. The reported memory use is the peak memory use.
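The measurement rule (lowest time over 20 consecutive runs, plus peak memory) could be reproduced with a harness along the following lines. How the original numbers were actually collected is not stated, so this is an assumption-laden sketch: `measure` is a hypothetical helper, it times a Python callable in-process rather than a separate program, and it relies on the Unix-only `resource` module.

```python
import resource
import time


def measure(func, runs=20):
    """Return (best_seconds, peak_rss) for `runs` consecutive calls of func.

    best_seconds is the lowest wall-clock time of any single call.
    peak_rss is the process's peak resident set size as reported by the
    operating system (kilobytes on Linux, bytes on macOS).
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return best, peak


best_seconds, peak_rss = measure(lambda: sum(range(100000)))
```

Taking the minimum rather than the mean filters out runs that were disturbed by other activity on the machine, such as a cold file cache on the first run.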
cifparse-obj and ucif are fundamentally slower because they convert every data value into a C++ string, a dynamically allocated resource. readcif also tokenizes the input, but avoids this overhead by returning pointers to the start and ending characters of a data value.
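The difference between the two strategies can be illustrated with a toy tokenizer that reports where each value starts and ends instead of copying it. This is a Python analogue of the idea, not readcif's implementation: `token_spans` is a hypothetical name, and the sketch ignores CIF quoting rules and multi-line text fields.

```python
def token_spans(buffer):
    """Yield (start, end) offsets of whitespace-separated tokens in buffer.

    No per-token object holding the characters is created; the caller
    decides which spans are worth converting.
    """
    whitespace = b" \t\r\n"
    position, size = 0, len(buffer)
    while position < size:
        # Skip the whitespace between tokens.
        while position < size and buffer[position] in whitespace:
            position += 1
        if position == size:
            break
        start = position
        # Advance to the end of the current token.
        while position < size and buffer[position] not in whitespace:
            position += 1
        yield start, position


data = b"ATOM 1 N  N LYS\n"
spans = list(token_spans(data))

# Only the value that is actually needed gets copied out of the buffer.
start, end = spans[4]
residue_name = data[start:end]
```

An application that wants only a handful of columns then pays the allocation cost for those columns alone, rather than for every value in the file.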
As expected, the simple code is the fastest for stylized PDBx/mmCIF files. The readcif code for stylized PDBx/mmCIF files is next best at ~1.2 times slower. The fully tokenizing readcif code is ~3.6 times slower than the stylized readcif code and ~4.5 times slower than the simple code. The cifparse-obj code is ~63 times slower than the stylized readcif code and consumes more memory; this is expected because it saves all of the data. ucif is ~110 times slower and consumes far more memory; this was unexpected and deserves a closer look by the ucif developers.
It should be possible to speed up readcif a little bit more by exposing more of the tokenizing internals to the parsing code at the expense of having to write separate code for PDB mmCIF files. But readcif is already close to optimal, and it is unclear if any other improvements would be noticeable once connectivity and other derived information is computed.
Benefits of PDBx/mmCIF Styling
It is currently not possible to robustly detect whether an mmCIF file is stylized.
A file is likely to be stylized if its filename looks like a PDB identifier
and the associated dictionary is mmcif_pdbx.dic version 4 or newer.
But that guess could be wrong, and if it is wrong,
there is no indication that the data read from the input is corrupted.
As of 10 June 2014,
the above heuristic appears to work for the mmCIF files for PDB entries.
However, the PDB’s large structure example files
have the numbers in tables right-justified
instead of left-justified, so the stylized reading might fail.
Luckily, those file names are not 4-character PDB identifiers.
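The filename-plus-dictionary heuristic can be written down as a small predicate. The sketch below is hypothetical and is not taken from readcif: `looks_stylized` and `PDB_ID` are invented names, the dictionary name and version are assumed to have already been read from the file (for example from its audit_conform category), and a PDB identifier is assumed to be a digit followed by three alphanumeric characters.

```python
import os
import re

# Assumed shape of a PDB identifier: one digit, then three alphanumerics.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def looks_stylized(path, dict_name, dict_version):
    """Guess whether an mmCIF file uses PDBx/mmCIF styling.

    This is only a heuristic: a True result can still be wrong, and
    nothing here verifies the actual layout of the tables.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    if not PDB_ID.fullmatch(stem):
        return False
    if dict_name != "mmcif_pdbx.dic":
        return False
    try:
        major = int(dict_version.split(".")[0])
    except ValueError:
        return False
    return major >= 4


guess = looks_stylized("/data/3j3q.cif", "mmcif_pdbx.dic", "4.007")
```

A reader that relies on such a guess might fall back to the tokenizing parser whenever the predicate returns False, accepting the slower path in exchange for correctness.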
Looking at one test case, 3j3q.cif, let’s examine the benefits of various PDBx/mmCIF styling rules:
with tags/keywords at start of line
with fixed columns
+ fixed length rows (trailing spaces)
+ tables terminated with comment