Writing mmCIF Files in ChimeraX

Date

April 2018

Introduction

There are several goals for ChimeraX’s mmCIF writer:

  • write out the information that ChimeraX wants to read

  • output inferred parts of mmCIF (so other mmCIF readers don’t have to infer the same things)

  • use stylized output (for readability and fast reading)

  • output should pass the PDB’s online deposition validator

  • output should pass the PDB’s mmCIF validation software

  • full connectivity should optionally be generated (not done)

What ChimeraX Wants

What ChimeraX wants to read from a mmCIF file is documented in ChimeraX Fast mmCIF Guidelines. Saving connectivity is a major issue and is discussed separately.

Inferred mmCIF Tables

Many data relationships in a mmCIF file can be inferred. For example, the chain identifier to entity identifier mapping (the struct_asym table) can be computed from the contents of the atom_site table. Consequently, ChimeraX’s mmCIF reader skips a table when it can infer everything it needs from the other tables. Other applications’ mmCIF readers might not compute the same information, so ChimeraX outputs the tables it infers for completeness. This is also needed for the mmCIF files to validate.
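As an illustration of this kind of inference, here is a minimal sketch that rebuilds the struct_asym mapping from atom_site rows. The function name infer_struct_asym and the list-of-dicts row representation are assumptions made for this example; they are not ChimeraX's internal API.

```python
def infer_struct_asym(atom_site_rows):
    """Return (asym_id, entity_id) pairs in the order chains first appear.

    atom_site_rows is a hypothetical representation: one dict per atom,
    keyed by atom_site keyword.
    """
    mapping = {}  # dicts keep insertion order in Python 3.7+
    for row in atom_site_rows:
        asym_id = row["label_asym_id"]
        entity_id = row["label_entity_id"]
        # Record the first entity seen for this chain, then check consistency.
        previous = mapping.setdefault(asym_id, entity_id)
        if previous != entity_id:
            raise ValueError(
                f"chain {asym_id} maps to entities {previous} and {entity_id}"
            )
    return list(mapping.items())


rows = [
    {"label_asym_id": "A", "label_entity_id": "1"},
    {"label_asym_id": "A", "label_entity_id": "1"},
    {"label_asym_id": "B", "label_entity_id": "1"},
    {"label_asym_id": "C", "label_entity_id": "2"},
]
struct_asym = infer_struct_asym(rows)
print(struct_asym)
```

Each pair in the result corresponds to one row of the struct_asym table (id, entity_id).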

Stylized Output

As shown in Benchmarking readcif, stylized PDBx/mmCIF output can be read faster than unstylized output. It is also easier to visually scan fixed column width tables for interesting values. ChimeraX outputs chimerax_audit_syntax.case_sensitive_flag as Y to indicate that all keywords are lowercase and appear at the beginning of a line. It also outputs chimerax_audit_syntax.fixed_width listing just the atom_site and atom_site_anisotrop tables, since those tables are typically the largest ones in the mmCIF file.
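To make the idea of a fixed column width table concrete, here is a minimal sketch of a writer that pads every value to its column's widest entry. The function format_fixed_width_loop is hypothetical, and it assumes values need no quoting (no embedded whitespace or quote characters); the exact layout rules that readcif relies on are described in its own documentation.

```python
def format_fixed_width_loop(category, keywords, rows):
    """Return mmCIF loop text with left-aligned, fixed width columns.

    Simplification for this sketch: values are written as-is, so they
    must not contain whitespace or quote characters.
    """
    # Width of each column is the length of its longest value.
    widths = [
        max((len(str(row[i])) for row in rows), default=1)
        for i in range(len(keywords))
    ]
    lines = ["loop_"]
    # One lowercase keyword per line, starting at the beginning of the line.
    lines.extend(f"_{category}.{keyword}" for keyword in keywords)
    for row in rows:
        cells = [str(value).ljust(width) for value, width in zip(row, widths)]
        lines.append(" ".join(cells))  # every data line has the same length
    return "\n".join(lines) + "\n"


text = format_fixed_width_loop(
    "atom_site",
    ["id", "label_atom_id", "label_comp_id"],
    [(1, "N", "ALA"), (2, "CA", "ALA"), (10, "C", "ALA")],
)
print(text)
```

Because every data line has the same length and every column starts at the same offset, a reader can locate a value by position instead of tokenizing each line.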

Validation

We tested the mmCIF output in two ways:

  • Using the World Wide PDB’s online validator

  • Validating the mmCIF file against the associated mmCIF dictionary

There are several software packages that will try to validate a mmCIF file using the associated dictionary. We used the mmCIF Dictionary Suite from the wwPDB, since it supports the current mmCIF dictionary (version 5).

Connectivity

TODO: full connectivity should optionally be generated

Problems

ChimeraX does not save enough information to completely regenerate some of the mmCIF tables it uses.

Heterogeneous information is discarded when reading, so it is not present when writing.

ChimeraX is only concerned about strands, so the sheet information in the struct_sheet_range table is lost. On output, the sheet identifier is given as unknown (?).

In other cases, the original mmCIF table, which is copied verbatim into the output, is non-conforming. For example, in mmCIF files provided by the wwPDB, the mandatory item pdbx_src_id is often missing from the entity_src_gen and entity_src_nat tables.
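A check for this kind of non-conformance can be sketched as follows. The helper missing_mandatory_items is hypothetical, and the list of mandatory keywords is an illustrative subset supplied by hand; a real validator would read the mandatory items from the mmCIF dictionary.

```python
def missing_mandatory_items(table_keywords, mandatory_keywords):
    """Return the mandatory keywords that a table does not contain."""
    present = set(table_keywords)
    return [keyword for keyword in mandatory_keywords if keyword not in present]


# Illustrative subset only; the authoritative list is in the mmCIF dictionary.
mandatory_for_entity_src_gen = ["entity_id", "pdbx_src_id"]

# Keywords found in a hypothetical entity_src_gen table copied from a file.
found = ["entity_id", "pdbx_gene_src_scientific_name"]

missing = missing_mandatory_items(found, mandatory_for_entity_src_gen)
print(missing)
```

A writer that copies tables verbatim could use such a check to report, rather than silently propagate, items that the dictionary requires.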

Only the single letter code for a residue in a chain’s sequence is kept. So, if that residue is not present in any entity with the same sequence, then the name of the residue defaults to the standard one for that letter.
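The fallback described above can be sketched like this. The function residue_name and its observed_name parameter are assumptions for illustration, and the table covers only the 20 standard amino acids (nucleic acids are left out).

```python
# One-letter to three-letter codes for the 20 standard amino acids.
STANDARD_AMINO_ACIDS = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def residue_name(letter, observed_name=None):
    """Pick the residue name to write for one sequence position.

    observed_name is the name of a residue actually present at this
    position in some chain with the same sequence, or None if the
    position is unobserved everywhere.
    """
    if observed_name is not None:
        return observed_name
    # Only the letter survives, so fall back to the standard residue.
    return STANDARD_AMINO_ACIDS.get(letter.upper(), "UNK")


print(residue_name("M", "MSE"))  # observed selenomethionine keeps its name
print(residue_name("M"))         # unobserved position falls back to MET
```

For example, an unobserved selenomethionine (MSE, one-letter code M) would be written as MET, because only the letter M is kept.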

Generated mmCIF Categories and Keywords

For reference, all of the mmCIF categories and keywords that ChimeraX outputs are listed below.

Recognized Data Categories and Keywords

Category

Keywords

atom_type

symbol

atom_site

group_PDB, id, label_entity_id, label_asym_id, auth_asym_id, pdbx_PDB_ins_code, label_seq_id, auth_seq_id, label_alt_id, type_symbol, label_atom_id, label_comp_id, Cartn_x, Cartn_y, Cartn_z, occupancy, B_iso_or_equiv, pdbx_PDB_model_num

atom_site_anisotrop

id, U[1]_[1], U[1]_[2], U[1]_[3], U[2]_[2], U[2]_[3], U[3]_[3]

audit_conform

dict_name, dict_version

chimerax_audit_syntax

case_sensitive_flag, fixed_width

cell

copied from original file

chem_comp

id, type, name (extracted from original file)

citation

id, title, journal_abbrev, journal_volume, year, page_first, page_last, journal_issue, pdbx_database_id_PubMed, pdbx_database_id_DOI (merged from original file)

citation_author

citation_id, name, ordinal

entry

id

entity

id, type, pdbx_description

entity_poly

entity_id, type, nstd_monomer, pdbx_seq_one_letter_code_can

entity_poly_seq

entity_id, num, mon_id

entity_src_gen

copied from original file

entity_src_nat

copied from original file

pdbx_poly_seq_scheme

entity_id, asym_id, mon_id, seq_id, pdb_strand_id, pdb_seq_num, pdb_ins_code

pdbx_struct_assembly

copied from original file

pdbx_struct_assembly_gen

copied from original file

pdbx_struct_oper_list

copied from original file

software

name, version, location, classification, os, type, citation_id, pdbx_ordinal

struct_asym

id, entity_id

struct_conf

id, conf_type_id, beg_label_asym_id, beg_label_comp_id, beg_label_seq_id, end_label_asym_id, end_label_comp_id, end_label_seq_id, beg_auth_asym_id, beg_auth_seq_id, pdbx_beg_PDB_ins_code, end_auth_asym_id, end_auth_seq_id, pdbx_end_PDB_ins_code

struct_conf_type

id

struct_conn

id, conn_type_id, ptnr1_label_asym_id, ptnr1_auth_asym_id, pdbx_ptnr1_PDB_ins_code, ptnr1_label_seq_id, ptnr1_auth_seq_id, pdbx_ptnr1_label_alt_id, ptnr1_label_atom_id, ptnr1_label_comp_id, ptnr1_symmetry, ptnr2_label_asym_id, ptnr2_auth_asym_id, pdbx_ptnr2_PDB_ins_code, ptnr2_label_seq_id, ptnr2_auth_seq_id, pdbx_ptnr2_label_alt_id, ptnr2_label_atom_id, ptnr2_label_comp_id, ptnr2_symmetry, pdbx_dist_value

struct_conn_type

id

struct_sheet_range

sheet_id, id, beg_label_asym_id, beg_label_comp_id, beg_label_seq_id, end_label_asym_id, end_label_comp_id, end_label_seq_id, symmetry, beg_auth_asym_id, beg_auth_seq_id, pdbx_beg_PDB_ins_code, end_auth_asym_id, end_auth_seq_id, pdbx_end_PDB_ins_code

symmetry

copied from original file