Benchmarking readcif

Author: Greg Couch
Organization: RBVI, University of California at San Francisco
Contact: gregc@cgl.ucsf.edu
Copyright: © Copyright 2014 by the Regents of the University of California. All rights reserved.
Last modified: 2014-6-17

The goal of this benchmark is to compare the performance of readcif with that of two other C++ mmCIF readers, cifparse-obj and ucif, and to quantify how much faster stylized PDBx/mmCIF files can be parsed.

Benchmark Results

All tests were run on a Dell Studio XPS 435MT with an Intel i7-920 at 2.66 GHz, 6 GiB of memory, and a 250 GiB Samsung 840 SSD, running Ubuntu 14.04.

The test case was to read a mmCIF file, and for each atom in the atom_site table extract its element, its name, its residue name, its chain identifier, its residue number, and its x, y, and z coordinates. This test case isolates the performance of extracting the atom data. A more realistic test case would include building the connectivity, but that would have the same overhead for all test programs.

Four programs were compared:

simple
A stylized PDBx/mmCIF reader that is a step up from grepping the mmCIF file for ATOM and HETATM records. It parses the atom_site table headers to find the columns it is interested in, scans the first row of the atom_site table for the column offsets, and then extracts the atomic data. It stops scanning the file after reading the atom_site table, so it only works for files with just one data block.
readcif
A fully conformant CIF reader that can switch between the PDBx/mmCIF stylized parsing and traditional parsing that tokenizes the input. It uses a callback for each table that an application wants parsed, and callbacks for individual columns. It knows the order in which tables need to be parsed, so it can skip tables and later jump back to reparse a table if necessary. Works for files with multiple data blocks.
Two variations are benchmarked: a version that takes advantage of PDBx/mmCIF styling, and one that just uses the default tokenizing code.

cifparse-obj V7-1-05
The PDB’s example mmCIF parser that tokenizes the input and saves it as tables for later processing. No backtracking is needed, since everything is saved. It can also write mmCIF files. Works for files with multiple data blocks.
ucif svn revision 18662, 2013-11-19
Part of “iotbx.cif: a comprehensive CIF toolbox”. Tokenizes the input file and has a virtual function for CIF loops, and a virtual function for table data items that are not in a loop. It uses a lot of memory, but it is unclear whether it saves everything, like cifparse-obj, or just makes a single pass through the file. Works for files with multiple data blocks.
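
The column-offset approach described for the simple reader above can be illustrated with a short sketch. This is not the benchmark's source code: the function names `column_offsets` and `field_at` are hypothetical, and the sketch assumes C++17, that every row places its fields at the same byte offsets as the first row, and that values contain no quotes or embedded whitespace.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Record the byte offset at which each whitespace-separated field
// starts in the first row of a table (hypothetical helper).
std::vector<std::size_t> column_offsets(std::string_view first_row)
{
    std::vector<std::size_t> offsets;
    bool in_field = false;
    for (std::size_t i = 0; i < first_row.size(); ++i) {
        bool space = first_row[i] == ' ' || first_row[i] == '\t';
        if (!space && !in_field)
            offsets.push_back(i);   // first character of a new field
        in_field = !space;
    }
    return offsets;
}

// Extract the field that starts at a known offset, without scanning
// the fields before it. Returns an empty view if the offset is past
// the end of the row.
std::string_view field_at(std::string_view row, std::size_t offset)
{
    if (offset >= row.size())
        return {};
    std::size_t end = row.find_first_of(" \t", offset);
    if (end == std::string_view::npos)
        end = row.size();
    return row.substr(offset, end - offset);
}
```

Once the offsets are known from the first row, each later row only needs one `field_at` call per column of interest, which is where a stylized reader can save time over tokenizing every value.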

And the results were:

program (code size)         PDB ID      9rsa.cif     2kzt.cif     2kox.cif     3j3q.cif
                            # of atoms  2106         203816       787840       2440800
                            file size   260 KiB      25 MiB       76 MiB       254 MiB
----------------------------------------------------------------------------------------
simple (15 KiB)             time        633 usec     0.0353 sec   0.127 sec    0.413 sec
                            memory      12.8 MiB     48.4 MiB     136 MiB      458 MiB
readcif stylized (87 KiB)   time        988 usec     0.0447 sec   0.147 sec    0.497 sec
                            memory      12.8 MiB     48.5 MiB     136 MiB      458 MiB
readcif tokenized (83 KiB)  time        2248 usec    0.160 sec    0.553 sec    1.81 sec
                            memory      12.8 MiB     48.5 MiB     136 MiB      458 MiB
cifparse-obj (514 KiB)      time        36951 usec   3.18 sec     10.1 sec     33.8 sec
                            memory      15.8 MiB     319 MiB      905 MiB      2.98 GiB
ucif (184 KiB)              time        48602 usec   5.46 sec     17.2 sec     out of memory
                            memory      28.2 MiB     1.39 GiB     4.51 GiB

Each time is the lowest of 20 consecutive runs. Memory use is the peak memory use.
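
A measurement of this kind can be sketched as follows. The source does not say how the times and peak memory were collected, so this is only one plausible way to do it: `best_time_seconds` and `peak_rss_kib` are hypothetical names, and the memory query relies on the POSIX `getrusage` call, whose `ru_maxrss` field is reported in kilobytes on Linux.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>
#include <sys/resource.h>

// Run `work` `runs` times and return the shortest wall-clock time in
// seconds. Taking the minimum reduces noise from other processes and
// from a cold file cache on the first run.
double best_time_seconds(const std::function<void()>& work, int runs = 20)
{
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        auto start = clock::now();
        work();
        std::chrono::duration<double> elapsed = clock::now() - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}

// Peak resident set size of the current process, as reported by the
// operating system (kilobytes on Linux). Returns -1 on failure.
long peak_rss_kib()
{
    rusage usage{};
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_maxrss;
}
```

Because peak memory is a per-process high-water mark, each reader would need to run in its own process for the memory figures to be comparable.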

Discussion

cifparse-obj and ucif are fundamentally slower because they convert every data value into a C++ string, a dynamically allocated resource. readcif also tokenizes the input, but avoids this overhead by returning pointers to the start and ending characters of a data value.
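
The difference between the two approaches can be illustrated with a minimal sketch. It is not readcif's tokenizer: the `Token` type and both function names are hypothetical, the end pointer is half-open (one past the last character) as a convenience of this sketch, and CIF quoting, comments, and multi-line text fields are ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A token is a pair of pointers into the input buffer; no characters
// are copied. `end` points one past the last character of the value.
struct Token {
    const char* begin;
    const char* end;
    std::size_t size() const { return static_cast<std::size_t>(end - begin); }
};

inline bool is_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Zero-copy tokenizing: each value is described by two pointers.
std::vector<Token> tokenize_pointers(const char* p, const char* last)
{
    std::vector<Token> tokens;
    while (p != last) {
        while (p != last && is_space(*p))
            ++p;                       // skip separators
        if (p == last)
            break;
        const char* start = p;
        while (p != last && !is_space(*p))
            ++p;                       // advance to the end of the value
        tokens.push_back({start, p});
    }
    return tokens;
}

// Copying variant: every value becomes its own std::string, which may
// require a separate dynamic allocation per value.
std::vector<std::string> tokenize_strings(const char* p, const char* last)
{
    std::vector<std::string> values;
    for (const Token& t : tokenize_pointers(p, last))
        values.emplace_back(t.begin, t.end);
    return values;
}
```

The pointer-based tokens are only valid while the input buffer is alive, which is the trade-off for avoiding the per-value copies.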

As expected, the simple code is the fastest for stylized PDBx/mmCIF files. The readcif code for stylized PDBx/mmCIF files is next best at ~1.2 times slower. The fully tokenizing readcif code is ~3.6 times slower than the stylized code and ~4.5 times slower than the simple code. The cifparse-obj code is ~63 times slower than the stylized readcif code and consumes more memory; this is expected because it saves all of the data. ucif is ~110 times slower and consumes far more memory; this was unexpected and deserves a closer look by the ucif developers.

Further Work

It should be possible to speed up readcif a little more by exposing more of the tokenizing internals to the parsing code, at the expense of having to write separate code for PDB mmCIF files. But readcif is already close to optimal, and it is unclear whether any further improvement would be noticeable once connectivity and other derived information are computed.

Benefits of PDBx/mmCIF Styling

It is currently not possible to robustly detect whether a mmCIF file is stylized. A file is likely stylized if its filename looks like a PDB identifier followed by .cif and the associated dictionary is mmcif_pdbx.dic version 4 or newer. But that guess could be wrong, and if it is wrong, there is no indication that the input is corrupted. As of 10 June 2014, the above heuristic appears to work for the mmCIF files for PDB entries. However, the PDB’s large structure example files have the numbers in tables right-justified instead of left-justified, so the stylized reading might fail. Luckily, those file names are not a 4-character PDB identifier.
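
The filename-and-dictionary heuristic can be written down as a small predicate. This is a sketch, not readcif's code: both function names are hypothetical, the filename is assumed to be a base name without a directory, and the dictionary name and major version are assumed to have been read from the file already (for example from its `_audit_conform` items). The sketch also assumes that a PDB identifier is a digit followed by three alphanumeric characters.

```cpp
#include <cctype>
#include <string>

// True if `filename` is a 4-character PDB identifier followed by ".cif".
// Assumption: an identifier is one digit plus three alphanumerics.
bool looks_like_pdb_id_filename(const std::string& filename)
{
    const std::string suffix = ".cif";
    if (filename.size() != 4 + suffix.size())
        return false;
    if (filename.compare(4, suffix.size(), suffix) != 0)
        return false;
    if (!std::isdigit(static_cast<unsigned char>(filename[0])))
        return false;
    for (int i = 1; i < 4; ++i)
        if (!std::isalnum(static_cast<unsigned char>(filename[i])))
            return false;
    return true;
}

// The guess described in the text: PDB-style filename plus
// mmcif_pdbx.dic version 4 or newer. A true result is only a guess;
// it does not prove the file follows the styling rules.
bool probably_stylized(const std::string& filename,
                       const std::string& dict_name,
                       int dict_major_version)
{
    return looks_like_pdb_id_filename(filename)
        && dict_name == "mmcif_pdbx.dic"
        && dict_major_version >= 4;
}
```

A reader using such a predicate would still need a way to fall back to tokenized parsing, since a false positive silently yields wrong data.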

Looking at one test case, 3j3q.cif, let’s examine the benefits of various PDBx/mmCIF styling rules:

Styling rules applied                      3j3q.cif    Speedup
----------------------------------------------------------------
fully tokenized                            1.81 sec    1x
with tags/keywords at start of line        1.73 sec    1.05x
with fixed columns                         0.603 sec   3.00x
  + fixed length rows (trailing spaces)    0.594 sec   3.05x
  + tables terminated with comment         0.570 sec   3.18x
with everything                            0.485 sec   3.73x
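
To illustrate why fixed-length rows can help, here is a sketch of one way a reader might exploit them; it is not taken from readcif. When every row of a table is padded with trailing spaces to the same length, the start of row i can be computed directly instead of being found by scanning for newlines. The names `row_stride` and `fixed_row` are hypothetical, and the sketch assumes C++17 and single-character `\n` line endings.

```cpp
#include <cstddef>
#include <string_view>

// Length of the first row including its newline; this is the distance
// between the starts of consecutive rows when all rows are padded to
// the same length.
std::size_t row_stride(std::string_view table)
{
    std::size_t nl = table.find('\n');
    return nl == std::string_view::npos ? table.size() : nl + 1;
}

// Return row `i` (without its newline) by direct indexing. Returns an
// empty view if the row is out of range.
std::string_view fixed_row(std::string_view table, std::size_t stride,
                           std::size_t i)
{
    if (stride == 0)
        return {};
    std::size_t start = i * stride;
    if (start >= table.size())
        return {};
    std::string_view row = table.substr(start, stride);
    if (!row.empty() && row.back() == '\n')
        row.remove_suffix(1);
    return row;
}
```

Combined with fixed column offsets, this means a value can be located from its row and column numbers alone, which is consistent with the small additional speedup in the table above.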