mmCIF: A Chimera Developer Perspective

Tom Goddard
October 22, 2013

Two main points. First, a production-quality mmCIF reading and writing library is needed for mmCIF to become widely used. Second, mmCIF files should explicitly list the bonds, that is, which pairs of atoms to connect.

mmCIF is Seldom Used in Chimera

Chimera mmCIF Reader is Slow and a Memory Hog

Comparison of speed and memory use of 4 different file readers on molecular structures of different sizes.

File parsing speed in seconds

                          HIV RT   Proteasome   5 ribosomes   HIV capsid
  Atom count               8,513       70,538       717,805    2,440,800
  Chimera mmCIF             1.57         15.7           187       > 5000
  RCSB CIFPARSE-OBJ         0.11         0.82          8.42           29
  Chimera PDB               0.04         0.35          3.26         12.5
  Chimera Next Gen mmCIF   0.006         0.05          0.62          1.8

File parsing memory use in Mbytes

                          HIV RT   Proteasome   5 ribosomes   HIV capsid
  mmCIF file size              1            8           110          266
  Chimera mmCIF              115          960          9500      > 23000
  RCSB CIFPARSE-OBJ         10.5           76           709         2330
  Chimera PDB                7.6           60           560         1815
  Chimera Next Gen mmCIF     1.9           18           279          423
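
To make the timings easier to compare across structure sizes, they can be converted to parsing throughput in atoms per second. The sketch below uses only the atom counts and times from the speed table. The variable names are illustrative, and the Chimera mmCIF time for the HIV capsid is left out because the table gives only a lower bound (> 5000 sec).

```python
# Atom counts and parse times (seconds) taken from the speed table above.
STRUCTURES = ["HIV RT", "Proteasome", "5 ribosomes", "HIV capsid"]
ATOM_COUNTS = [8513, 70538, 717805, 2440800]

PARSE_SECONDS = {
    "Chimera mmCIF": [1.57, 15.7, 187, None],  # None: only "> 5000" is known
    "RCSB CIFPARSE-OBJ": [0.11, 0.82, 8.42, 29],
    "Chimera PDB": [0.04, 0.35, 3.26, 12.5],
    "Chimera Next Gen mmCIF": [0.006, 0.05, 0.62, 1.8],
}


def atoms_per_second(times):
    """Return throughput for each structure, or None where no time is known."""
    return [
        None if t is None else n / t
        for n, t in zip(ATOM_COUNTS, times)
    ]


rates = {reader: atoms_per_second(times) for reader, times in PARSE_SECONDS.items()}

for reader, values in rates.items():
    cells = ["n/a" if v is None else f"{v:,.0f}" for v in values]
    print(f"{reader:24s}", "  ".join(cells))
```

Roughly constant throughput along a row means the reader scales linearly with atom count. A falling throughput points to per-atom overhead that grows with structure size.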

Code Complexity

Parsing mmCIF Tables to create Atom, Residue, and Chain objects

Creating molecular objects involves matching corresponding names in different mmCIF tables, such as chain, residue name, residue number and atom name.

mmCIF atom table (atom_site):

  ATOM   1    N  N  PRO A 1 4  -62.315  -62.643 -5.519  1.00 100.20
  ATOM   2    C  CA PRO A 1 4  -61.373  -61.942 -4.649  1.00 110.90
  ATOM   3    C  C  PRO A 1 4  -61.730  -60.460 -4.495  1.00 108.42
  ATOM   4    O  O  PRO A 1 4  -60.863  -59.592 -4.628  1.00 100.29
  ATOM   5    C  CB PRO A 1 4  -60.037  -62.112 -5.380  1.00 108.66
  ...

mmCIF bond table (struct_conn):

  C DT 3 N3   D A 27 N1
  C DT 3 O4   D A 27 N6
  C DA 4 N1   D U 26 N3
  C DA 4 N6   D U 26 O4
  C DT 5 N3   D A 25 N1
  C DT 5 O4   D A 25 N6
  C DG 6 N1   D C 24 N3
  C DG 6 N2   D C 24 O2
  ...

If the file reader handles matching all the columns in the mmCIF tables to build molecular objects, application code can be simple. For example, printing the names and positions of the atoms connected to a given atom:

  for a in atom.bondedAtoms:
    print a.name, a.x, a.y, a.z
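
The matching that a reader has to do behind that loop can be sketched as follows, in Python 3 syntax. This is a minimal illustration, not Chimera's reader. The atom rows are the proline atoms from the atom_site excerpt, and the column order is assumed from that excerpt (group, serial number, element, atom name, residue name, chain, entity, residue number, x, y, z, occupancy, B-factor). The bond rows are made up for illustration in the same four-column-per-partner layout as the struct_conn excerpt, and the `Atom` class and both helper functions are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Atom:
    name: str
    x: float
    y: float
    z: float
    bondedAtoms: list = field(default_factory=list)


# atom_site rows copied from the excerpt above.
ATOM_SITE = """\
ATOM 1 N N  PRO A 1 4 -62.315 -62.643 -5.519 1.00 100.20
ATOM 2 C CA PRO A 1 4 -61.373 -61.942 -4.649 1.00 110.90
ATOM 3 C C  PRO A 1 4 -61.730 -60.460 -4.495 1.00 108.42
ATOM 4 O O  PRO A 1 4 -60.863 -59.592 -4.628 1.00 100.29
ATOM 5 C CB PRO A 1 4 -60.037 -62.112 -5.380 1.00 108.66
"""

# Illustrative bond rows (not from a real file): chain, residue name,
# residue number, atom name for each of the two partners.
BONDS = """\
A PRO 4 N  A PRO 4 CA
A PRO 4 CA A PRO 4 C
A PRO 4 C  A PRO 4 O
A PRO 4 CA A PRO 4 CB
"""


def build_atoms(atom_site_text):
    """Index atoms by (chain, residue name, residue number, atom name)."""
    atoms = {}
    for line in atom_site_text.splitlines():
        f = line.split()
        key = (f[5], f[4], f[7], f[3])
        atoms[key] = Atom(f[3], float(f[8]), float(f[9]), float(f[10]))
    return atoms


def connect(atoms, bond_text):
    """Look up both partners of each bond row and link them to each other."""
    for line in bond_text.splitlines():
        f = line.split()
        a1 = atoms[tuple(f[0:4])]
        a2 = atoms[tuple(f[4:8])]
        a1.bondedAtoms.append(a2)
        a2.bondedAtoms.append(a1)


atoms = build_atoms(ATOM_SITE)
connect(atoms, BONDS)

atom = atoms[("A", "PRO", "4", "CA")]
for a in atom.bondedAtoms:
    print(a.name, a.x, a.y, a.z)
```

The four-part key is the point of the sketch: every bond row has to be resolved against the atom table by chain, residue name, residue number, and atom name before any `bondedAtoms` list can exist.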

Where are the Bonds?

How Chimera Figures out which Atoms are Bonded

mmCIF files PDB format files

These methods produce incorrect bonds when the distances between atoms are far from normal bond lengths.
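
The failure mode can be illustrated with a generic distance-cutoff bond guesser. This is not Chimera's algorithm, whose details are not reproduced here. It is a sketch in Python 3 syntax under two assumptions: a single cutoff of 1.9 angstroms for all atom pairs, and a made-up distortion that moves the CB atom 0.5 angstroms along x. The coordinates are the proline atoms from the atom_site excerpt.

```python
from itertools import combinations
from math import dist

# Coordinates of the proline atoms from the atom_site excerpt.
COORDS = {
    "N": (-62.315, -62.643, -5.519),
    "CA": (-61.373, -61.942, -4.649),
    "C": (-61.730, -60.460, -4.495),
    "O": (-60.863, -59.592, -4.628),
    "CB": (-60.037, -62.112, -5.380),
}

CUTOFF = 1.9  # assumed cutoff in angstroms, chosen for illustration only


def guess_bonds(coords, cutoff=CUTOFF):
    """Bond every pair of atoms that lies within the cutoff distance."""
    return [
        (a, b)
        for a, b in combinations(coords, 2)
        if dist(coords[a], coords[b]) <= cutoff
    ]


bonds = guess_bonds(COORDS)
print("normal geometry:   ", bonds)

# Made-up distortion: shift CB by 0.5 angstroms in x, as a poorly refined
# model might, and guess the bonds again.
x, y, z = COORDS["CB"]
distorted = dict(COORDS, CB=(x + 0.5, y, z))
distorted_bonds = guess_bonds(distorted)
print("distorted geometry:", distorted_bonds)
```

With the original coordinates the guesser finds the four expected bonds. After the shift, the CA-CB distance exceeds the cutoff and that bond is silently dropped, which is the kind of error an explicit bond list would avoid.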

Bond Templates are Incomplete

Missing template bonds:

Missing inter-residue templates:

Many residue types with no templates:

Are chemical component bond templates used by other software?

How to Include Bonds Explicitly in mmCIF Files

loop_
_struct_conn.id 
_struct_conn.conn_type_id 
_struct_conn.ptnr1_label_asym_id 
_struct_conn.ptnr1_label_comp_id 
_struct_conn.ptnr1_label_seq_id 
_struct_conn.ptnr1_label_atom_id 
_struct_conn.ptnr2_label_asym_id 
_struct_conn.ptnr2_label_comp_id 
_struct_conn.ptnr2_label_seq_id 
_struct_conn.ptnr2_label_atom_id 
1  c C DT 3  N3 D A 27 N1
2  c C DT 3  O4 D A 27 N6
3  c C DA 4  N1 D U 26 N3
4  c C DA 4  N6 D U 26 O4
...
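
A table in this form is easy for software to consume. The sketch below, in Python 3 syntax, reads the loop above into one dictionary per bond. It is a deliberately simplified reader, not a full mmCIF parser: it assumes one row per line, whitespace-separated values, and no quoted or multi-line values. The function name `parse_loop` is hypothetical.

```python
STRUCT_CONN = """\
loop_
_struct_conn.id
_struct_conn.conn_type_id
_struct_conn.ptnr1_label_asym_id
_struct_conn.ptnr1_label_comp_id
_struct_conn.ptnr1_label_seq_id
_struct_conn.ptnr1_label_atom_id
_struct_conn.ptnr2_label_asym_id
_struct_conn.ptnr2_label_comp_id
_struct_conn.ptnr2_label_seq_id
_struct_conn.ptnr2_label_atom_id
1  c C DT 3  N3 D A 27 N1
2  c C DT 3  O4 D A 27 N6
3  c C DA 4  N1 D U 26 N3
4  c C DA 4  N6 D U 26 O4
"""


def parse_loop(text):
    """Return one dict per data row, keyed by column name without the category."""
    names, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            # "_struct_conn.ptnr1_label_atom_id" -> "ptnr1_label_atom_id"
            names.append(line.split(".", 1)[1])
        else:
            values = line.split()
            if len(values) == len(names):  # skip rows that do not fit the header
                rows.append(dict(zip(names, values)))
    return rows


for row in parse_loop(STRUCT_CONN):
    atom1 = (row["ptnr1_label_asym_id"], row["ptnr1_label_comp_id"],
             row["ptnr1_label_seq_id"], row["ptnr1_label_atom_id"])
    atom2 = (row["ptnr2_label_asym_id"], row["ptnr2_label_comp_id"],
             row["ptnr2_label_seq_id"], row["ptnr2_label_atom_id"])
    print(atom1, "-", atom2)
```

Each printed pair identifies the two bonded atoms by chain, residue name, residue number, and atom name, so no distance heuristics or residue templates are needed to connect them.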

Conclusions