readcif – a C++11 CIF and mmCIF parser¶
- Author
Greg Couch
- Organization
- Contact
- Copyright
© Copyright 2014-2017 by the Regents of the University of California. All Rights reserved.
- Last modified
2017-3-9
readcif is a C++11 library for quickly extracting data from mmCIF and CIF files. It fully conforms to the CIF 1.1 standard for data files, and can be easily extended to handle CIF dictionaries. In addition, it supports stylized PDBx/mmCIF files for even quicker parsing.
License¶
The readcif library is available with an open source license:
Copyright © 2014 The Regents of the University of California. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or other materials provided with the distribution.
Redistributions must acknowledge that this software was originally developed by the UCSF Resource for Biocomputing, Visualization, and Informatics with support from the National Institutes of Health R01-GM129325.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OF THE UNIVERSITY OF CALIFORNIA BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Usage¶
CIF files are essentially a text version of a database. Each table in the database corresponds to a category, with named columns, and the rows contain the values. CIF tags are a concatenation of the category name and the column name.
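To make the tag naming concrete, here is a minimal sketch of splitting a tag back into its category and column names. It assumes the mmCIF convention in which a tag starts with an underscore and the first period separates the two parts, as in _atom_site.Cartn_x. The helper split_tag is hypothetical and is not part of readcif.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper (not part of readcif): split an mmCIF-style tag
// such as "_atom_site.Cartn_x" into {category, column}.
// Assumes the tag starts with '_' and that the first '.' is the separator.
std::pair<std::string, std::string> split_tag(const std::string& tag)
{
    assert(!tag.empty() && tag[0] == '_');
    std::string::size_type dot = tag.find('.');
    if (dot == std::string::npos)
        return std::make_pair(tag.substr(1), std::string());  // no column part
    // Skip the leading '_' and stop just before the '.'.
    return std::make_pair(tag.substr(1, dot - 1), tag.substr(dot + 1));
}
```

With this sketch, split_tag("_atom_site.Cartn_x") yields the category atom_site and the column Cartn_x.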
readcif provides a base class, CIFFile, that should be subclassed to implement an application’s specific needs. Virtual functions are used to support CIF reserved words; that way the application can choose what to do if there is more than one data block, or how to handle dictionaries. Callback functions are used to extract the data the application wants from a category. Finally, each category callback function needs to provide a set of callback functions to parse the values of the interesting columns.
So, in pseudo-code, an application’s parser would look like:
class ExtractCIF: public readcif::CIFFile {
public:
    ExtractCIF() {
        initialize category callbacks
    }
    in each category callback:
        create parse value callback vector for interesting columns
        while (parse_row(parse_value_callbacks))
            continue;
};
and would be used by:
ExtractCIF extract;
const char* filename = ....
extract.parse_file(filename);
See the associated example code file for a working example that reads a subset of the atom_site data from a PDB mmCIF file.
PDBx/mmCIF Styling¶
PDBx/mmCIF files from the World-Wide PDB are formatted for fast parsing. If a CIF file is known to use PDBx/mmCIF stylized formatting, then parsing can be up to 4 times faster. readcif supports taking advantage of the PDBx/mmCIF styling with an API and code.
PDBx/mmCIF styling constrains the CIF format by:
Outside of a category:
CIF reserved words and tags only appear immediately after an ASCII newline.
CIF reserved words are in lowercase.
Tags are case sensitive (category names and item names are expected to match the case given in the associated dictionary, e.g., mmcif_pdbx.dic).
Support for this is controlled with the CIFFile::set_PDBx_keywords() function.
Inside a category:
All columns are left-aligned.
Each row of data has all of the columns.
All rows have trailing spaces so they are the same length.
The end of the category’s data values is terminated by a comment line.
Support for this is controlled with the CIFFile::set_PDBx_fixed_width_columns() function.
The example code shows how a derived class would turn on stylized parsing. The audit_conform category is examined for explicit references to pdbx_keywords_flag and pdbx_fixed_width_columns. If they are present, they control the options. Otherwise, a heuristic is used: if the dict_name is “mmcif_pdbx.dic” and dict_version is greater than 4, then it is assumed that there is keyword styling and that the atom_site and atom_site_anisotrop categories have fixed width columns.
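The decision described above could be sketched roughly as follows. This is not the example code itself: the Flag and Styling types and the guess_styling function are hypothetical, both audit_conform items are simplified to yes/no/absent values, and the sketch assumes each item falls back to the heuristic independently when it is absent.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical tri-state for an optional audit_conform item.
enum class Flag { Absent, No, Yes };

struct Styling {
    bool keywords;             // PDBx keyword styling
    bool fixed_width_columns;  // fixed width atom_site / atom_site_anisotrop
};

// Sketch of the decision described in the text; all names are illustrative.
Styling guess_styling(const std::string& dict_name,
                      const std::string& dict_version,
                      Flag keywords_flag, Flag fixed_width_flag)
{
    // Fallback heuristic: mmcif_pdbx.dic with a version greater than 4.
    // atof reads the leading number of the version string.
    bool heuristic = dict_name == "mmcif_pdbx.dic"
                     && std::atof(dict_version.c_str()) > 4;
    Styling s;
    // An explicit item wins; otherwise use the heuristic (assumed per item).
    s.keywords = keywords_flag == Flag::Absent
                     ? heuristic : keywords_flag == Flag::Yes;
    s.fixed_width_columns = fixed_width_flag == Flag::Absent
                     ? heuristic : fixed_width_flag == Flag::Yes;
    return s;
}
```

In a real subclass the two results would then be passed to set_PDBx_keywords() and set_PDBx_fixed_width_columns().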
C++ API¶
All of the public symbols are in the readcif namespace.
-
type StringVector¶
A std::vector of std::string values.
-
int is_whitespace(char c)¶
is_whitespace and is_not_whitespace are inline functions to determine if a character is CIF whitespace or not. They are similar to the C/C++ standard library’s isspace function, but only recognize ASCII HT (9), LF (10), CR (13), and SPACE (32) as whitespace characters. They are not inverses because ASCII NUL (0) is both not is_whitespace and not is_not_whitespace.
-
int is_not_whitespace(char c)¶
See is_whitespace().
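The documented behavior of the two predicates can be illustrated with a small stand-alone sketch. The sketch_ functions below are re-implementations written for this illustration, not the library's source; sketch_token_end is a hypothetical helper that shows why it is convenient for NUL to satisfy neither predicate: both scanning loops stop at the terminator without an extra check.

```cpp
#include <cassert>

// Sketch of the documented behavior, not the library's source:
// only HT (9), LF (10), CR (13), and SPACE (32) count as whitespace.
inline int sketch_is_whitespace(char c)
{
    return c == '\t' || c == '\n' || c == '\r' || c == ' ';
}

// NUL is excluded here as well, so the two predicates are not inverses.
inline int sketch_is_not_whitespace(char c)
{
    return c != '\0' && !sketch_is_whitespace(c);
}

// Hypothetical helper: skip leading whitespace, then return a pointer
// to the end of the next token.  Both loops stop at the NUL terminator.
inline const char* sketch_token_end(const char* s)
{
    assert(s != nullptr);
    while (sketch_is_whitespace(*s))
        ++s;
    while (sketch_is_not_whitespace(*s))
        ++s;
    return s;
}
```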
-
double str_to_float(const char *s)¶
Non-error-checking inline function to convert a string to a floating point number. It is similar to the C/C++ standard library’s atof function, but returns NaN if no digits are found. Benchmarked by itself, it is slower than atof, but is empirically much faster when used in shared libraries. This is probably due to CPU cache behavior, but needs further investigation.
-
int str_to_int(const char *s)¶
Non-error-checking inline function to convert a string to an integer. It is similar to the C/C++ standard library’s atoi function. Same rationale for use as str_to_float(). Returns zero if no digits are found.
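The return values for input without digits can be illustrated with standard library calls. The sketch below only mimics the documented results and is not how readcif implements the conversions; note also that strtod accepts forms such as inf and hexadecimal floats, which the library may treat differently. This matters for CIF data because the placeholder values ? and . contain no digits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Mimics the documented result of str_to_float: NaN when no number is found.
inline double sketch_str_to_float(const char* s)
{
    assert(s != nullptr);
    char* end = nullptr;
    double value = std::strtod(s, &end);
    return end == s ? std::nan("") : value;  // nothing converted -> NaN
}

// Mimics the documented result of str_to_int: zero when no digits are found.
inline int sketch_str_to_int(const char* s)
{
    assert(s != nullptr);
    return static_cast<int>(std::strtol(s, nullptr, 10));
}
```

A caller can therefore detect a missing coordinate with std::isnan, but cannot distinguish a missing integer from a literal 0 without inspecting the text.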
-
class CIFFile¶
The CIFFile is designed to be subclassed by an application to extract the data the application is interested in.
Public section:
-
type ParseCategory¶
A typedef for std::function<void (bool in_loop)>.
-
void register_category(const std::string &category, ParseCategory callback, const StringVector &dependencies = StringVector())¶
Register a callback function for a particular category.
- Parameters
category – name of category
callback – function to retrieve data from category
dependencies – a list of categories that must be parsed before this category.
A null callback function removes the category. Dependencies must be registered first. A category callback function can find out which category it is processing with CIFFile::category().
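The registration rules above can be pictured with a toy registry. SketchRegistry is a hypothetical stand-in, not the CIFFile implementation; in particular, throwing std::logic_error for an unregistered dependency is an assumption made for this sketch, since the text only states that dependencies must be registered first.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in that mirrors the documented registration rules.
struct SketchRegistry {
    typedef std::function<void (bool in_loop)> Callback;
    std::map<std::string, Callback> categories;

    void register_category(const std::string& category, Callback callback,
                           const std::vector<std::string>& dependencies = {})
    {
        if (!callback) {             // a null callback removes the category
            categories.erase(category);
            return;
        }
        // Dependencies must already be registered (error type is assumed).
        for (const auto& dep : dependencies)
            if (categories.find(dep) == categories.end())
                throw std::logic_error("dependency not registered: " + dep);
        categories[category] = callback;
    }
};
```

Registering entity first and then atom_site with {"entity"} as its dependency list succeeds in this sketch, while the reverse order does not.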
-
void set_unregistered_callback(ParseCategory callback)¶
Set callback function that will be called for unregistered categories.
-
void parse_file(const char *filename)¶
- Parameters
filename – Name of file to be parsed
If possible, memory-map the given file to get the buffer to hand off to parse(). On POSIX systems, files whose size is a multiple of the system page size have to be read into an allocated buffer instead.
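A likely reason for the page size restriction, inferred from the requirement that parse() receives null-terminated text, is that a memory mapping covers whole pages and the bytes after end-of-file in the last page read as zero. The hypothetical helper below (not part of readcif) captures that condition as a sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: would a read-only mapping of a file of this size
// look null-terminated to parse()?
inline bool mapping_is_null_terminated(std::size_t file_size,
                                       std::size_t page_size)
{
    assert(page_size > 0);
    // When the file fills its last page exactly, there is no zero byte
    // after the data.  A size of zero also lands here, and an empty file
    // cannot be mapped at all.
    return file_size % page_size != 0;
}
```

With a 4096-byte page, a 5000-byte file can be mapped directly, while an 8192-byte file would need to be copied into a buffer with an explicit terminator.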
-
void parse(const char *buffer)¶
Parse the input and invoke registered callback functions
- Parameters
buffer – Null-terminated text of the CIF file
The text must be terminated with a null character. A common technique is to memory map a file and pass in the address of the first character. The whole file is required to simplify backtracking since data tables may appear in any order in a file. Stylized parsing is reset each time parse() is called.
-
void set_PDBx_keywords(bool stylized)¶
Turn on and off PDBx/mmCIF keyword styling as described in PDBx/mmCIF Styling.
- Parameters
stylized – if true, assume PDBx/mmCIF keyword style
This is reset every time CIFFile::parse() or CIFFile::parse_file() is called. It may be switched on and off at any time, e.g., within a particular category callback function.
-
bool PDBx_keywords() const¶
Return whether the PDBx_keywords flag is set. See set_PDBx_keywords().
-
void set_PDBx_fixed_width_columns(const std::string &category)¶
Turn on PDBx/mmCIF fixed width column parsing for a given category as described in PDBx/mmCIF Styling.
- Parameters
category – name of category
This option must be set in each category callback that is needed. This option is ignored if PDBx_keywords() is false. This is not a global option because there is no reliable way to detect if the preconditions are met for each record without losing all of the speed advantages.
-
bool has_PDBx_fixed_width_columns() const¶
Return whether any fixed width column categories were specified. See set_PDBx_fixed_width_columns().
-
bool PDBx_fixed_width_columns() const¶
Return whether the current category has fixed width columns. See set_PDBx_fixed_width_columns().
-
int get_column(const char *name, bool required = false)¶
- Parameters
name – column name to search for
required – true if the column is required
Search the current category’s tags to figure out which column the name corresponds to. If the name is not present, then -1 is returned, unless it is required, in which case an error is thrown.
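The lookup rule can be sketched as a free function over a list of column names. This is an illustration of the documented contract, not the library's source: it assumes an exact, case-sensitive match, and the wording of the error message is invented for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch of the documented lookup: index of the column, -1 if absent,
// or an exception if the column is absent but required.
int sketch_get_column(const std::vector<std::string>& colnames,
                      const char* name, bool required = false)
{
    assert(name != nullptr);
    for (std::size_t i = 0; i < colnames.size(); ++i)
        if (colnames[i] == name)
            return static_cast<int>(i);
    if (required)
        throw std::runtime_error(
            std::string("missing required column: ") + name);
    return -1;  // optional column that is not present
}
```

Returning -1 for an optional column pairs with parse_row(), which skips columns with negative offsets, so a callback for a missing optional column is simply never invoked.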
-
type ParseValue1¶
typedef std::function<void (const char* start)> ParseValue1;
-
type ParseValue2¶
typedef std::function<void (const char* start, const char* end)> ParseValue2;
-
class ParseColumn¶
-
int column_offset¶
The column offset for a given tag, returned by get_column().
-
bool need_end¶
true if the end of the column is needed. It is not needed for numbers, since all columns are terminated by whitespace.
-
ParseValue1 func1¶
The function to call if need_end is false.
-
ParseValue2 func2¶
The function to call if need_end is true.
-
ParseColumn(int c, ParseValue1 f)¶
Set column_offset and func1.
-
ParseColumn(int c, ParseValue2 f)¶
Set column_offset and func2.
-
type ParseValues¶
typedef std::vector<ParseColumn> ParseValues;
-
bool parse_row(ParseValues &pv)¶
Parse a single row of a table
- Parameters
pv – The per-column callback functions
- Returns
true if a row was parsed
The category callback functions should call parse_row() to parse the values for the columns they are interested in. If in a loop, parse_row() should be called until it returns false; to skip the rest of the values, just return from the category callback. The first time parse_row() is called for a category, pv will be sorted in ascending order. Columns with negative offsets are skipped.
-
StringVector &parse_whole_category()¶
Return complete contents of a category as a vector of strings.
- Returns
vector of strings
-
void parse_whole_category(ParseValue2 func)¶
Tokenize complete contents of category and call function for each item in it.
- Parameters
func – callback function
-
const std::string &version()¶
- Returns
the version of the CIF file if it is given
For mmCIF files it is typically empty.
-
const std::string &category()¶
- Returns
the category that is currently being parsed
Only valid within a ParseCategory callback.
-
const std::string &block_code()¶
- Returns
the data block code that is currently being parsed
Only valid within a ParseCategory callback and finished_parse().
-
const StringVector &colnames()¶
- Returns
the set of column names for the current category
Only valid within a ParseCategory callback.
-
bool multiple_rows() const¶
- Returns
true if the current category may have multiple rows
-
size_t line_number() const¶
- Returns
current line number
-
std::runtime_error error(const std::string &text)¶
- Parameters
text – the error message
- Returns
an exception with “ on line #” appended
- Rtype
std::runtime_error
Localize error message with the current line number within the input. # is the current line number.
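The message format can be sketched with a free function. This is an illustration of the documented format rather than the library's source, and it takes the line number as a parameter because it has no parser state to consult.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch of the documented format: the text followed by " on line N".
std::runtime_error sketch_error(const std::string& text,
                                std::size_t line_number)
{
    return std::runtime_error(text + " on line "
                              + std::to_string(line_number));
}
```

Because error() returns the exception instead of throwing it, a category callback would presumably write throw error("...") to report a problem at the current position.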
Protected section:
-
void data_block(const std::string &name)¶
- Parameters
name – name of data block
data_block is a virtual function that is called whenever a new data block is found. Defaults to being ignored. Replace in subclass if needed.
-
void save_frame(const std::string &code)¶
- Parameters
code – the save frame code
save_frame is a virtual function that is called when a save frame header or terminator is found. It defaults to throwing an exception. It should be replaced if the application needs to parse a CIF dictionary.
-
void global_block()¶
global_block is a virtual function that is called whenever the global_ reserved word is found. It defaults to throwing an exception. In CIF files, global_ is unused. However, some CIF-like files, e.g., the CCP4 monomer library, use the global_ keyword.
-
void reset_parse()¶
reset_parse is a virtual function that is called whenever the parse function is called. For example, PDB stylized parsing can be turned on here.
-
void finished_parse()¶
finished_parse is a virtual function that is called whenever the parse function has successfully finished parsing.