ChimeraX Fast mmCIF Guidelines

Date: April 2017

Introduction

ChimeraX’s mmCIF reader is fast. It is fast because:

  • it uses a C++ library to parse a mmCIF file
  • it takes advantage of the wwPDB’s PDBx/mmCIF styling
  • it only fully parses what it needs
  • it doesn’t create Python objects for each atom, bond, residue, or chain

Being fast is good, but accurately reconstructing atomic structures is better. ChimeraX does the best it can with the data it is given. Some errors are detected and logged, but the ultimate responsibility for the fidelity of the data belongs to the user.

The following sections discuss what ChimeraX needs in order to be fast and what data it understands.

mmCIF Terminology

mmCIF, the macromolecular Crystallographic Information File, is a file format for describing large atomic structures and is based on the CIF format for small molecules. The file content is human-readable and is organized as a series of categories, formatted as tables. A table consists of a series of column names followed by one or more rows of data. Each column name is a category name and a keyword separated by a period, e.g., audit_conform.dict_name. A table with a single row can be written as a series of pairs of column name and data value. The categories, keywords, and the domain and range of the data are defined in a dictionary written in the CIF format. Using the dictionary, an mmCIF file can be validated (just like an XML file with its corresponding XML Schema).
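
As an illustration, the sketch below holds the same single-row table in both forms and reads each into a dictionary. In the file itself, each column name is written with a leading underscore. The helper names (split_tag, read_simple_table) and the version value are made up for this example; the reader only handles unquoted, single-row tables and is not how ChimeraX parses files.

```python
# Illustrative only: one audit_conform row written as a loop and as name/value pairs.
# The version value is a placeholder.
LOOP_FORM = """\
loop_
_audit_conform.dict_name
_audit_conform.dict_version
mmcif_pdbx.dic 4.0
"""

PAIR_FORM = """\
_audit_conform.dict_name     mmcif_pdbx.dic
_audit_conform.dict_version  4.0
"""


def split_tag(tag):
    """Split a data name such as '_atom_site.Cartn_x' into (category, keyword)."""
    if not tag.startswith("_") or "." not in tag:
        raise ValueError(f"not an mmCIF data name: {tag!r}")
    category, keyword = tag[1:].split(".", 1)
    return category, keyword


def read_simple_table(text):
    """Return {keyword: value} for an unquoted, single-row table in either form."""
    tokens = text.split()
    if tokens and tokens[0] == "loop_":
        # Loop form: all column names first, then the row of values.
        tags = [t for t in tokens[1:] if t.startswith("_")]
        values = tokens[1 + len(tags):]
    else:
        # Pair form: name, value, name, value, ...
        tags, values = tokens[0::2], tokens[1::2]
    return {split_tag(t)[1]: v for t, v in zip(tags, values)}
```

Both read_simple_table(LOOP_FORM) and read_simple_table(PAIR_FORM) produce the same dictionary, which is the sense in which the two forms are equivalent.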

Which mmCIF dictionary?

A generic problem with CIF files is that they usually don’t embed which dictionary the file corresponds to. This is exacerbated by the fact that semantically different file types share the same .cif file suffix, and that the dictionaries for the mmCIF and CIF formats overlap but do not describe atomic structures in the same way. Consequently, when given a .cif file, users cannot predict whether the file is for small molecules or macromolecules. Furthermore, applications cannot reliably parse .cif files and may need to query the user for guidance. A simple solution would be to use unique suffixes for different types of CIF files, e.g., .cif for small molecules and .mmcif for macromolecules, but this is not current practice.

The Worldwide PDB (wwPDB) addresses the missing dictionary information by explicitly listing the data definition dictionary name and version in the audit_conform table in mmCIF files it distributes. Unfortunately, PDB Europe’s updated mmCIF files currently do not include this information, which makes these files difficult to validate. ChimeraX expects .cif files to conform to a published PDBx/mmCIF dictionary, whether the data definition dictionary information is present or not.

Performance

Using a data definition dictionary to guide the parsing of a CIF file, while flexible, tends to be slow because interpreting each datum requires looking up its definition in the dictionary. Instead, for speed, ChimeraX hardcodes information for a few select categories, skips over categories it doesn’t need, and treats other categories as generic tables of string data.

The ChimeraX code for reading CIF files consists of two parts:

  • a fast CIF parser that provides the framework for fast parsing (skipping unused categories, PDBx/mmCIF styling), and
  • a Python module (implemented in C++) that converts the parsed data into ChimeraX’s internal data structures.

ChimeraX takes advantage of the cross-referencing of data within a mmCIF file to reconstruct data relationships without reading the tables that explicitly list those relationships. For example, the mapping of author identifiers to normative identifiers is contained in both the pdbx_poly_seq_scheme table and the atom_site table. Since the atom_site table is always parsed, we were able to speed up reading a mmCIF file by about 3% by retaining the information from the atom_site table and not parsing the pdbx_poly_seq_scheme table at all.
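
The idea can be sketched in a few lines of Python. The function below builds the label-to-author identifier mapping from atom_site rows alone; the function name and the dict-per-row input shape are assumptions made for this example, not ChimeraX's internal representation.

```python
def author_id_map(atom_site_rows):
    """Map (label_asym_id, label_seq_id) to (auth_asym_id, auth_seq_id).

    `atom_site_rows` is an iterable of dicts keyed by atom_site keyword
    (a hypothetical input shape chosen for this sketch).
    """
    mapping = {}
    for row in atom_site_rows:
        label = (row["label_asym_id"], row["label_seq_id"])
        author = (row["auth_asym_id"], row["auth_seq_id"])
        # Every atom of a residue repeats the same pair, so keep the first one seen.
        mapping.setdefault(label, author)
    return mapping
```

One consequence of this approach is that residues with no atoms in atom_site never appear in the mapping.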

Stylized PDBx/mmCIF Files

mmCIF files from the Worldwide PDB (wwPDB) are typically formatted for fast parsing. This is known as PDBx/mmCIF styling. If a CIF file is known to use PDBx/mmCIF stylized formatting, then parsing can be almost four times faster. Currently, ChimeraX uses a heuristic to detect that a mmCIF file is stylized: styling is assumed only when a mmCIF file uses the mmcif_pdbx dictionary version 4 or later. However, it is preferable to explicitly enable fast stylized parsing by setting the values of specific annotation flags in the CIF file. ChimeraX has extended the metadata in the audit_conform category with explicit annotations as detailed below. (In the future, ChimeraX’s use of a heuristic may be discontinued once explicit annotations become widespread.)
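
A minimal sketch of that decision, assuming the audit_conform values are available as a dictionary of strings. The function name, the input shape, and the rule that an explicit pdbx_keywords_flag takes precedence over the version heuristic are assumptions of this example, not a description of ChimeraX's code.

```python
def may_assume_styling(audit_conform):
    """Decide whether PDBx/mmCIF styling may be assumed.

    `audit_conform` maps keyword -> string value (hypothetical input shape).
    """
    # Assumption: an explicit annotation, when present, overrides the heuristic.
    flag = audit_conform.get("pdbx_keywords_flag")
    if flag is not None:
        return flag == "Y"

    # Heuristic: the mmcif_pdbx dictionary, version 4 or later.
    if audit_conform.get("dict_name") != "mmcif_pdbx.dic":
        return False
    version = audit_conform.get("dict_version", "")
    try:
        major = int(version.split(".", 1)[0])
    except ValueError:
        return False  # missing or unparsable version
    return major >= 4
```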

The important aspects of styling are: (1) reserved words and tags are always case-sensitive, and (2) categories with fixed column width tables are explicitly so noted.

Case-sensitive words and tags (conformance should be explicitly annotated with audit_conform.pdbx_keywords_flag Y):

  • CIF reserved words must be in lowercase
  • Category names and keywords must match the case given in the associated dictionary, e.g., atom_site.Cartn_x in mmcif_pdbx.dic
  • CIF reserved words and tags must only appear immediately after an ASCII newline

Fixed width column tables (conformance should be explicitly annotated with audit_conform.pdbx_fixed_width_columns followed by a space separated list of categories):

  • All columns must be left-aligned
  • Each row of data must include all columns
  • All rows must be the same length, using trailing spaces as padding
  • The category’s data values must be terminated by a comment line

Performance improvements are especially noticeable when processing large tables such as atom_site and atom_site_anisotrop.
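
To show why the fixed width rules help, the Python sketch below writes a few rows according to those rules and then reads them back by slicing every row at the column offsets found in the first row, with no per-row tokenizing. The function names and the atom rows are made up for illustration, and the sketch assumes no value is empty or contains whitespace.

```python
import re


def format_fixed_width(records):
    """Write rows with left-aligned columns, padded to equal length."""
    widths = [max(len(r[i]) for r in records) for i in range(len(records[0]))]
    return [" ".join(v.ljust(w) for v, w in zip(r, widths)) + " " for r in records]


def parse_fixed_width(lines):
    """Slice every row at the column offsets found in the first row."""
    first = lines[0]
    starts = [m.start() for m in re.finditer(r"\S+", first)]
    ends = starts[1:] + [len(first)]
    table = []
    for line in lines:
        if line.startswith("#"):  # a comment line terminates the data
            break
        if len(line) != len(first):
            raise ValueError("rows differ in length; not a fixed width table")
        table.append([line[s:e].rstrip() for s, e in zip(starts, ends)])
    return table


# Made-up rows: group, id, element, atom name, residue name, chain, sequence number.
RECORDS = [
    ["ATOM", "1", "N", "N", "MET", "A", "1"],
    ["ATOM", "2", "C", "CA", "MET", "A", "1"],
    ["ATOM", "10", "C", "CB", "MET", "A", "1"],
]
LINES = format_fixed_width(RECORDS) + ["#"]
```

Calling parse_fixed_width(LINES) recovers RECORDS. The column offsets are computed once, which is the source of the savings on tables with many rows.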

Reconstructing Connectivity

One of the deficiencies of the mmCIF documentation is the lack of a published protocol for reconstructing the atomic connectivity.

The connectivity between residues is not given for standard amino and nucleic acids. Rather, it is inferred from the polymer sequence data.

The internal connectivity of residues is not given in the wwPDB’s mmCIF files. That information is available separately in the Chemical Component Dictionary (CCD), which “is updated with each weekly PDB release.” ChimeraX uses the Internet to fetch individual residue templates from the RCSB PDB’s Ligand Expo instead of having users update the huge CCD each week. However, there are at least two curation problems with the residue templates: (1) the templates are sometimes incomplete, e.g., missing the H1 and H3 for amino acids at the N-terminus of proteins (the UNL and UNX templates intentionally have neither atoms nor bonds because there is no implied connectivity), and (2) the templates sometimes incorrectly identify metal coordination bonds as covalent bonds (e.g., HEM). In both cases, custom code has to be written to correct the problem. (In the case of (1) above, the wwPDB has alternate templates with protonation variants for the standard amino acids. But the general case requires that bonds be computed using element-based distance cutoffs.)

Another potential problem arises when a residue template is not available, e.g., a mmCIF file of a new structure not yet deposited in the PDB. In this case, a template should be embedded directly in the mmCIF file with the chem_comp and chem_comp_bond tables. As a last resort, if a template is missing or incomplete, ChimeraX will connect the residue using element-dependent bond distances — ideally this should never be necessary.
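
A common way to implement such a last-resort fallback is to bond two atoms when their separation is no more than the sum of their covalent radii plus a tolerance. The Python sketch below does that for a handful of elements. The radii are approximate single-bond values and the 0.4 Å tolerance is a conventional choice; both are assumptions of this example, as is the function name, and none of it is taken from ChimeraX.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative subset, assumed values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.4  # angstroms; an assumed, conventional slack


def guess_bonds(atoms):
    """Return bonded name pairs for atoms given as (name, element, (x, y, z))."""
    bonds = []
    for (name1, elem1, pos1), (name2, elem2, pos2) in combinations(atoms, 2):
        cutoff = COVALENT_RADIUS[elem1] + COVALENT_RADIUS[elem2] + TOLERANCE
        if math.dist(pos1, pos2) <= cutoff:
            bonds.append((name1, name2))
    return bonds
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for a single residue but would need a spatial grid for anything larger.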

Finally, the treatment of waters in wwPDB mmCIF files potentially presents a problem. The atom_site’s label_comp_id, label_asym_id, label_entity_id, and label_seq_id data values are identical, so the waters appear to be all in one residue. (If they were unique, they could be used, along with the other label_ keywords, as a unique key for a database table.) Fortunately, in practice, the optional auth_seq_id keyword’s data values are usually included in the file and can be used to distinguish each water. Any mmCIF files without unique auth_seq_ids must have unique label_seq_ids, that is, the solvent must be uniquely numbered to indicate that the residues are distinct.
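
The following Python sketch shows one way to build a residue key that keeps waters apart: it uses the label_ values and appends auth_seq_id whenever that value is present. The function name and the dict-per-row input are assumptions of this example; the "." and "?" tests reflect the CIF convention for inapplicable and unknown values.

```python
def residue_key(row):
    """Build a key that identifies the residue an atom_site row belongs to.

    `row` maps atom_site keyword -> string value (hypothetical input shape).
    """
    key = (row["label_asym_id"], row["label_comp_id"], row["label_seq_id"])
    auth_seq_id = row.get("auth_seq_id")
    # "." and "?" are the CIF markers for inapplicable and unknown values.
    if auth_seq_id not in (None, ".", "?"):
        key += (auth_seq_id,)
    return key
```

With two water oxygens that share every label_ value but carry auth_seq_id values of 301 and 302, the keys differ and the atoms end up in separate residues.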

With the above considerations, the connectivity protocol becomes, for each CIF data block:

  1. Read audit_conform for metadata needed to speed up parsing
  2. Read chem_comp and chem_comp_bond for embedded residue templates
  3. Read entity_poly_seq for sequence information (and thus polymer connectivity)
  4. Read atom_site for atomic coordinates
  5. Read struct_conn for non-standard connectivity
  6. Assemble the atomic structure while compensating for the above deficiencies.

Multiple CIF data blocks are treated as multiple atomic structures.

Embedded Residue Templates

The PDBe’s updated mmCIF files embed residue templates for connectivity. This means that the chem_comp_bond and chem_comp_atom tables for all residue types in the structure are added to the mmCIF file. A reasonable method for creating the chem_comp_bond and chem_comp_atom tables is to concatenate the corresponding tables from the various CCD residue templates listed in the chem_comp table. Including these two tables makes the mmCIF files self-contained, i.e., no templates need to be fetched via the Internet.

Best Practices

ChimeraX performs a linear scan of a mmCIF file for the data it needs. To avoid the memory cost of saving information before it is needed, ChimeraX will note where a category’s data is in the file and then backtrack to parse that data when it’s needed. Re-reading data takes time, so having the data in the desired order can speed up processing a file considerably.
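
The note-and-backtrack idea can be illustrated with a short Python sketch that records the byte offset of each category during one pass and later seeks back to it. This is only an illustration of the technique, not ChimeraX's C++ implementation; the function name and the sample data are made up, and the sketch ignores multi-line text fields.

```python
import io


def index_categories(stream):
    """Record the byte offset at which each category's table starts.

    `stream` must be a seekable binary file object.
    """
    offsets = {}
    loop_pos = None
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            break
        if line.strip() == b"loop_":
            loop_pos = pos  # the table starts at the loop_ line
        elif line.startswith(b"_"):
            category = line[1:].split(b".", 1)[0].decode("ascii")
            offsets.setdefault(category, pos if loop_pos is None else loop_pos)
            loop_pos = None
        else:
            loop_pos = None
    return offsets


SAMPLE = (
    b"data_demo\n"
    b"_entry.id DEMO\n"
    b"loop_\n"
    b"_atom_site.id\n"
    b"_atom_site.type_symbol\n"
    b"1 C\n"
    b"2 N\n"
    b"#\n"
)
stream = io.BytesIO(SAMPLE)
offsets = index_categories(stream)
stream.seek(offsets["atom_site"])  # backtrack to the table when it is needed
```

Each backtrack costs a seek plus a second read of the table, which is why a file whose tables already appear in the order they are consumed avoids the extra work.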

The best presentation order of the mmCIF data for ChimeraX is as follows:

  1. audit_conform table near the beginning of the file, which should:
       a. explicitly give PDBx/mmCIF styling information (e.g., that the atom_site table uses fixed width columns)
       b. explicitly give the mmCIF dictionary name and version for validating
  2. Connectivity information for non-standard residues, with the chem_comp table preceding the chem_comp_bond table
  3. entity_poly_seq table (sequence information)
  4. atom_site table (coordinate data)
  5. atom_site_anisotrop table
  6. struct_conn table
  7. struct_conf table
  8. struct_sheet_range table
The order in which other tables appear does not currently matter. For future compatibility, be sure to define data before it is referenced. For example, the entity table should come before the entity_poly_seq table.
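
A file writer can check its output against this ordering with a few lines of Python. The sketch below takes the category names in the order they appear in a file and reports each category that shows up after one it should have preceded. The names PREFERRED_ORDER and order_problems are made up for this example.

```python
# The presentation order recommended above.
PREFERRED_ORDER = [
    "audit_conform",
    "chem_comp",
    "chem_comp_bond",
    "entity_poly_seq",
    "atom_site",
    "atom_site_anisotrop",
    "struct_conn",
    "struct_conf",
    "struct_sheet_range",
]


def order_problems(categories_in_file):
    """Return (late_category, category_it_should_precede) pairs."""
    rank = {name: i for i, name in enumerate(PREFERRED_ORDER)}
    problems = []
    latest = None
    for name in categories_in_file:
        if name not in rank:
            continue  # the order of other tables does not currently matter
        if latest is not None and rank[name] < rank[latest]:
            problems.append((name, latest))
        else:
            latest = name
    return problems
```

For a file that lists audit_conform, entity, atom_site, entity_poly_seq, struct_conn, the function reports that entity_poly_seq appears after atom_site.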

Recognized mmCIF Categories and Keywords

For reference, all of the mmCIF categories and keywords that ChimeraX parses are listed below. Some keywords are required to be present in a category for its data to be used. Afterwards, there is a brief description of the categories and why they are important. All of the categories are considered optional, but if one is missing, then ChimeraX might incorrectly infer what could have been explicitly given. For instance, if the tables for the secondary structure categories are missing then ChimeraX needs to compute that information. Also, the atom_site table is effectively required because, without it, there is no resulting atomic structure.

Recognized Data Categories and Keywords

Category                   Keywords († = required)
atom_site                  id, label_entity_id, label_asym_id†, auth_asym_id, pdbx_PDB_ins_code, label_seq_id†, auth_seq_id, label_alt_id, type_symbol†, label_atom_id†, auth_atom_id, label_comp_id†, auth_comp_id, Cartn_x†, Cartn_y†, Cartn_z†, occupancy, B_iso_or_equiv, pdbx_PDB_model_num
atom_site_anisotrop        id†, U[1]_[1]†, U[1]_[2]†, U[1]_[3]†, U[2]_[2]†, U[2]_[3]†, U[3]_[3]†
audit_conform              dict_name, dict_version, pdbx_keywords_flag, pdbx_fixed_width_columns
chem_comp                  id†, type†
chem_comp_bond             comp_id†, atom_id_1†, atom_id_2†
entity_poly_seq            entity_id†, num†, mon_id†, hetero
entity                     id†, pdbx_description
entity_src_gen             entity_id†, pdbx_gene_src_scientific_name†
entity_src_nat             entity_id†, pdbx_organism_scientific†
entry                      id†
pdbx_database_PDB_obs_spr  id†, pdb_id†, replace_pdb_id†
pdbx_struct_assembly       id†, details†
pdbx_struct_assembly_gen   assembly_id†, oper_expression†, asym_id_list†
pdbx_struct_oper_list      id†, matrix[1][1]†, matrix[1][2]†, matrix[1][3]†, matrix[2][1]†, matrix[2][2]†, matrix[2][3]†, matrix[3][1]†, matrix[3][2]†, matrix[3][3]†, vector[1]†, vector[2]†, vector[3]†
struct_conf                id†, conf_type_id†, beg_label_asym_id†, beg_label_comp_id†, beg_label_seq_id†, end_label_asym_id†, end_label_comp_id†, end_label_seq_id†
struct_conn                conn_type_id†, ptnr1_label_asym_id†, pdbx_ptnr1_PDB_ins_code, ptnr1_label_seq_id†, ptnr1_auth_seq_id, pdbx_ptnr1_label_alt_id, ptnr1_label_atom_id†, ptnr1_label_comp_id†, ptnr1_symmetry, ptnr2_label_asym_id†, pdbx_ptnr2_PDB_ins_code, ptnr2_label_seq_id†, ptnr2_auth_seq_id, pdbx_ptnr2_label_alt_id, ptnr2_label_atom_id†, ptnr2_label_comp_id†, ptnr2_symmetry, pdbx_dist_value
struct_sheet_range         sheet_id†, id†, beg_label_asym_id†, beg_label_comp_id†, beg_label_seq_id†, end_label_asym_id†, end_label_comp_id†, end_label_seq_id†
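
Since a category's data is only used when its required keywords are present, a writer may want to check for them before producing a file. The Python sketch below does so for three of the categories listed above; the REQUIRED table is a hand-copied subset of that list, and the function name is made up for this example.

```python
# Required keywords for a subset of the categories listed above.
REQUIRED = {
    "atom_site": {
        "label_asym_id", "label_seq_id", "type_symbol", "label_atom_id",
        "label_comp_id", "Cartn_x", "Cartn_y", "Cartn_z",
    },
    "chem_comp_bond": {"comp_id", "atom_id_1", "atom_id_2"},
    "entity_poly_seq": {"entity_id", "num", "mon_id"},
}


def missing_required(category, keywords):
    """Return the required keywords absent from a table's column list."""
    # Keyword case matters, so the comparison is exact.
    return sorted(REQUIRED.get(category, set()) - set(keywords))
```

An empty result means the table has every required column; categories not in REQUIRED are treated as having no requirements.
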
atom_site
Contains atom coordinates. Typically the largest table in a mmCIF file. wwPDB mmCIF files use fixed width columns for the data.
atom_site_anisotrop
Contains anisotropic displacement data for atoms. While the specification for the atom_site category has provisions to include the anisotropic displacement data, in practice it is not. Consequently, ChimeraX only looks in the atom_site_anisotrop table for the anisotropic displacement data. wwPDB mmCIF files use fixed width columns for the data.
audit_conform
Contains metadata about the CIF file. Can specify the CIF dictionary and version the data conforms to. Extended by ChimeraX to hold the explicit styling annotations with the pdbx_keywords_flag and pdbx_fixed_width_columns keywords.
chem_comp
Contains information about the chemical components in the structure. Used for embedded residue templates.
chem_comp_bond
Contains connectivity of chemical components. Used for embedded residue templates. Currently only present in “updated” PDB files from the PDBe.
entity
Contains details “about the molecular entities that are present in the crystallographic structure.” Used to extract description of chains.
entity_poly_seq
Contains the sequence of residues in a chain. Used to know which residues to connect and where there are structural gaps.
entity_src_gen
Contains “details of the source from which the entity was obtained in cases where the source was genetically manipulated.” Used to extract scientific name of entities.
entity_src_nat
Contains “details of the source from which the entity was obtained in cases where the entity was isolated directly from a natural tissue.” Used to extract scientific name of entities.
entry
Contains the 4-character PDB identifier. Used to tell the user if there is a newer version available.
pdbx_database_PDB_obs_spr
Contains information about obsolete and superseded PDB entries. Used to tell the user if there is a newer version available.
pdbx_struct_assembly
Contains information “about the structural elements that form macromolecular assemblies.”
pdbx_struct_assembly_gen
Contains information “about the generation of each macromolecular assemblies.”
pdbx_struct_oper_list
Contains the transformation matrices and vectors for symmetry operations.
struct_conf
Contains helix and turn residue ranges. Formerly held strand residue ranges but that information is now in the struct_sheet_range data.
struct_conn
Contains non-standard connectivity. Standard amino and nucleic acid connectivity is given by chemical component templates.
struct_sheet_range
Contains strand residue ranges and associated sheets.