Writing mmCIF Files in ChimeraX
Date: April 2018
Introduction
There are several goals for ChimeraX’s mmCIF writer:
- write out the information that ChimeraX wants to read
- output inferred parts of mmCIF (so other mmCIF readers don’t have to infer the same things)
- use stylized output (for readability and fast reading)
- output should pass the PDB’s online deposition validator
- output should pass the PDB’s mmCIF validation software
- full connectivity should optionally be generated (not yet implemented)
What ChimeraX Wants
What ChimeraX wants to read from an mmCIF file is documented in ChimeraX Fast mmCIF Guidelines. Saving connectivity is a major issue and is discussed separately.
Inferred mmCIF Tables
Many data relationships in an mmCIF file can be inferred. For example, the mapping from chain identifier to entity identifier (the struct_asym table) can be computed from the contents of the atom_site table. Consequently, ChimeraX’s mmCIF reader skips a table when it can infer everything it needs from that table by other means. Other applications’ mmCIF readers might not compute the same information, so ChimeraX writes out the tables it infers for completeness. These tables are also needed for the mmCIF files to validate.
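The inference described above can be illustrated with a minimal sketch. It assumes the atom_site rows have already been parsed into dictionaries keyed by keyword; the function name `infer_struct_asym` and the row representation are hypothetical and are not ChimeraX's actual reader code.

```python
def infer_struct_asym(atom_site_rows):
    """Return {label_asym_id: label_entity_id} computed from atom_site rows.

    Each row is assumed to be a dict holding at least the
    'label_asym_id' and 'label_entity_id' keywords.
    """
    mapping = {}
    for row in atom_site_rows:
        asym_id = row["label_asym_id"]
        entity_id = row["label_entity_id"]
        # Record the first entity seen for a chain; later rows must agree.
        previous = mapping.setdefault(asym_id, entity_id)
        if previous != entity_id:
            raise ValueError(
                f"chain {asym_id!r} maps to entities {previous!r} and {entity_id!r}"
            )
    return mapping


# Illustrative data: chains A and B share entity 1, chain C is entity 2.
atom_site = [
    {"label_asym_id": "A", "label_entity_id": "1"},
    {"label_asym_id": "A", "label_entity_id": "1"},
    {"label_asym_id": "B", "label_entity_id": "1"},
    {"label_asym_id": "C", "label_entity_id": "2"},
]
struct_asym = infer_struct_asym(atom_site)
```

Each key/value pair of the result corresponds to one row (id, entity_id) of the struct_asym table.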
Stylized Output
As shown in Benchmarking readcif, stylized PDBx/mmCIF output can be read faster than unstylized output. Tables with fixed column widths are also easier to scan visually for interesting values. ChimeraX outputs chimerax_audit_syntax.case_sensitive_flag as Y to indicate that all keywords are lowercase and appear at the beginning of a line. It also outputs chimerax_audit_syntax.fixed_width, listing just the atom_site and atom_site_anisotrop tables, since those are typically the largest tables in the mmCIF file.
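To make the idea of a fixed-column-width table concrete, here is a minimal sketch of one way to pad every column of a loop to a constant width. It is not ChimeraX's writer; `format_fixed_width` is a hypothetical helper, and it assumes at least one row and values that contain no whitespace (so no CIF quoting is needed).

```python
def format_fixed_width(category, keywords, rows):
    """Return mmCIF loop text with every column padded to a fixed width."""
    columns = [[str(value) for value in row] for row in rows]
    # Each column is as wide as its longest value.
    widths = [max(len(row[i]) for row in columns) for i in range(len(keywords))]
    lines = ["loop_"]
    lines.extend(f"_{category}.{keyword}" for keyword in keywords)
    for row in columns:
        # Left-justify each value and separate columns with a single space,
        # so every data line has the same length and column offsets.
        lines.append(" ".join(v.ljust(w) for v, w in zip(row, widths)))
    return "\n".join(lines) + "\n"


keywords = ["group_PDB", "id", "label_atom_id", "label_comp_id",
            "label_asym_id", "label_seq_id"]
rows = [
    ("ATOM", 1, "N", "ALA", "A", 1),
    ("ATOM", 2, "CA", "ALA", "A", 1),
    ("HETATM", 10, "O", "HOH", "B", "."),
]
text = format_fixed_width("atom_site", keywords, rows)
print(text)
```

Because every data line has the same length, a reader that knows the table is fixed width can locate each field by offset instead of tokenizing every line.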
Validation
We tested the mmCIF output in two ways:
- using the Worldwide PDB’s online validator
- validating the mmCIF file against the associated mmCIF dictionary
There are several software packages that will try to validate an mmCIF file using the associated dictionary. We used the mmCIF Dictionary Suite from the wwPDB, since it supports the current mmCIF dictionary (version 5).
Connectivity
TODO: full connectivity should optionally be generated
Problems
ChimeraX does not save enough information to completely regenerate some of the mmCIF tables it uses.
Heterogeneous information is discarded when reading, so it is not present when writing.
ChimeraX is only concerned with strands, so the sheet information in the struct_sheet_range table is lost. On output, the sheet identifier is written as unknown (?).
In other cases, the original mmCIF table, which is copied verbatim into the output, is itself non-conforming. For example, in wwPDB-provided mmCIF files, the mandatory item pdbx_src_id is often missing from the entity_src_gen and entity_src_nat tables.
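A check for this kind of problem can be sketched as follows. The `MANDATORY_ITEMS` table is a hypothetical, hand-written subset containing only the item named above; a real validator would read the full list of mandatory items from the mmCIF dictionary. The function name is likewise an assumption for illustration.

```python
# Hypothetical subset of mandatory items, limited to the example in the text.
# A real check would load these from the mmCIF dictionary instead.
MANDATORY_ITEMS = {
    "entity_src_gen": {"pdbx_src_id"},
    "entity_src_nat": {"pdbx_src_id"},
}


def missing_mandatory_items(category, keywords):
    """Return the sorted mandatory keywords absent from a copied table."""
    required = MANDATORY_ITEMS.get(category, set())
    return sorted(required - set(keywords))


# A table copied verbatim from an input file that lacks pdbx_src_id.
missing = missing_mandatory_items(
    "entity_src_gen", ["entity_id", "pdbx_gene_src_scientific_name"]
)
```

A writer could use such a check to warn about, or repair, a copied table before it is written out.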
Only the single-letter code for each residue in a chain’s sequence is kept. So, if a residue is not present in any entity with the same sequence, its name defaults to the standard one for that letter.
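The effect of keeping only single-letter codes can be shown with a small sketch. It assumes a peptide sequence and a hypothetical mapping from 1-based sequence position to the residue name of each residue that is actually present; neither the function nor the data layout is taken from ChimeraX.

```python
# Standard three-letter names for the 20 common amino acids.
STANDARD_PEPTIDE_NAMES = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def sequence_residue_names(sequence, observed_names):
    """Return one residue name per sequence position.

    sequence       -- string of single-letter codes
    observed_names -- {1-based position: residue name} for residues that
                      are present; all other positions fall back to the
                      standard name for their letter (UNK if none).
    """
    return [
        observed_names.get(i, STANDARD_PEPTIDE_NAMES.get(letter, "UNK"))
        for i, letter in enumerate(sequence, start=1)
    ]


# Positions 1 and 2 are present; position 3 is missing from the structure.
names = sequence_residue_names("MAM", {1: "MSE", 2: "ALA"})
```

Here the first residue keeps its modified name MSE because it is present, while the missing third residue, which shares the letter M, is written as the standard MET even if it was MSE in the original file.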
Generated mmCIF Categories and Keywords
For reference, all of the mmCIF categories and keywords that ChimeraX outputs are listed below.
Recognized Data Categories and Keywords
| Category | Keywords |
|---|---|
| atom_type | symbol |
| atom_site | group_PDB, id, label_entity_id, label_asym_id, auth_asym_id, pdbx_PDB_ins_code, label_seq_id, auth_seq_id, label_alt_id, type_symbol, label_atom_id, label_comp_id, Cartn_x, Cartn_y, Cartn_z, occupancy, B_iso_or_equiv, pdbx_PDB_model_num |
| atom_site_anisotrop | id, U[1][1], U[1][2], U[1][3], U[2][2], U[2][3], U[3][3] |
| audit_conform | dict_name, dict_version |
| chimerax_audit_syntax | case_sensitive_flag, fixed_width |
| cell | copied from original file |
| chem_comp | id, type, name (extracted from original file) |
| citation | merged from original file: id, title, journal_abbrev, journal_volume, year, page_first, page_last, journal_issue, pdbx_database_id_PubMed, pdbx_database_id_DOI |
| citation_author | citation_id, name, ordinal |
| entry | id |
| entity | id, type, pdbx_description |
| entity_poly | entity_id, type, nstd_monomer, pdbx_seq_one_letter_code_can |
| entity_poly_seq | entity_id, num, mon_id |
| entity_src_gen | copied from original file |
| entity_src_nat | copied from original file |
| pdbx_poly_seq_scheme | entity_id, asym_id, mon_id, seq_id, pdb_strand_id, pdb_seq_num, pdb_ins_code |
| pdbx_struct_assembly | copied from original file |
| pdbx_struct_assembly_gen | copied from original file |
| pdbx_struct_oper_list | copied from original file |
| software | name, version, location, classification, os, type, citation_id, pdbx_ordinal |
| struct_asym | id, entity_id |
| struct_conf | id, conf_type_id, beg_label_asym_id, beg_label_comp_id, beg_label_seq_id, end_label_asym_id, end_label_comp_id, end_label_seq_id, beg_auth_asym_id, beg_auth_seq_id, pdbx_beg_PDB_ins_code, end_auth_asym_id, end_auth_seq_id, pdbx_end_PDB_ins_code |
| struct_conf_type | id |
| struct_conn | id, conn_type_id, ptnr1_label_asym_id, ptnr1_auth_asym_id, pdbx_ptnr1_PDB_ins_code, ptnr1_label_seq_id, ptnr1_auth_seq_id, pdbx_ptnr1_label_alt_id, ptnr1_label_atom_id, ptnr1_label_comp_id, ptnr1_symmetry, ptnr2_label_asym_id, ptnr2_auth_asym_id, pdbx_ptnr2_PDB_ins_code, ptnr2_label_seq_id, ptnr2_auth_seq_id, pdbx_ptnr2_label_alt_id, ptnr2_label_atom_id, ptnr2_label_comp_id, ptnr2_symmetry, pdbx_dist_value |
| struct_conn_type | id |
| struct_sheet_range | sheet_id, id, beg_label_asym_id, beg_label_comp_id, beg_label_seq_id, end_label_asym_id, end_label_comp_id, end_label_seq_id, symmetry, beg_auth_asym_id, beg_auth_seq_id, pdbx_beg_PDB_ins_code, end_auth_asym_id, end_auth_seq_id, pdbx_end_PDB_ins_code |
| symmetry | copied from original file |