Add a new mmCIF-compatible binary format, e.g. for trajectories #31
We can split the file in two: a "static" section and a "dynamic" section. The static section can use a BinaryCIF-like encoding (binary, compact, stuffed into a MessagePack blob). This would cover categories such as Entity and would be expected to be read completely into memory when the file is parsed. The dynamic section would be a set of variable-size vectors of fixed-size records containing fields that are omitted from the static section (the set of fields, such as
To read a file (for example), atoms would first be constructed with dummy coordinates by reading the
If model compositions can vary (e.g. in multistate modeling), multiple such static-dynamic pairs (one per composition) will be needed.
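The static/dynamic split described in this comment can be illustrated with a minimal sketch that uses only the Python standard library. It is not the proposed format. JSON stands in for the MessagePack-encoded static section, and the record layout (three little-endian float32 coordinates per atom), the length-prefixed header, and every function name here are assumptions made for illustration.

```python
import io
import json
import struct

# Assumed fixed-size dynamic record: x, y, z as little-endian float32.
RECORD = struct.Struct("<3f")


def write_file(buf, static, frames):
    """Write a length-prefixed static section followed by fixed-size frames."""
    blob = json.dumps(static).encode("utf-8")  # stand-in for a MessagePack blob
    buf.write(struct.pack("<I", len(blob)))
    buf.write(blob)
    for frame in frames:
        for xyz in frame:
            buf.write(RECORD.pack(*xyz))


def read_static(buf):
    """Read the whole static section; also return where the dynamic part starts."""
    buf.seek(0)
    (size,) = struct.unpack("<I", buf.read(4))
    static = json.loads(buf.read(size).decode("utf-8"))
    return static, 4 + size


def read_frame(buf, dynamic_start, n_atoms, index):
    """Seek straight to one frame without reading any other frame."""
    frame_size = n_atoms * RECORD.size
    buf.seek(dynamic_start + index * frame_size)
    data = buf.read(frame_size)
    return [RECORD.unpack_from(data, i * RECORD.size) for i in range(n_atoms)]


def append_frame(buf, frame):
    """Adding a model only appends bytes; nothing already written changes."""
    buf.seek(0, io.SEEK_END)
    for xyz in frame:
        buf.write(RECORD.pack(*xyz))


# Hypothetical two-atom system with two frames, then a third appended later.
static = {"atom_site": {"label_atom_id": ["N", "CA"], "type_symbol": ["N", "C"]}}
frames = [[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
          [(0.0, 1.0, 0.0), (1.5, 1.0, 0.0)]]

buf = io.BytesIO()
write_file(buf, static, frames)
append_frame(buf, [(0.0, 2.0, 0.0), (1.5, 2.0, 0.0)])

static_read, start = read_static(buf)
n_atoms = len(static_read["atom_site"]["label_atom_id"])
last_frame = read_frame(buf, start, n_atoms, 2)
```

Because every record has the same size, the offset of any frame is simple arithmetic, which is what makes both single-frame reads and appends cheap in this toy layout.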
Here is a proof of concept for storing mmCIF tables in an HDF5 file. This could be used to store an entire entry, or it could be used to store just the trajectory information (
HDF5 groups allow for combining "static" and "dynamic" information so that trajectories can be efficiently stored. For example, the following creates an HDF5 file containing the
To construct the equivalent mmCIF representation, the
Note that this does not fill in
If multiple models of differing composition are provided in the file, they could in turn be placed in separate nested HDF5 groups.
Also, there is no reason why the tables couldn't overlap and cover the same model numbers, e.g. if some subset of the system optimizes just x,y,z coordinates and some other subset also varies the radii. What this current proposal does not (efficiently) handle, though, is models where some subset of the atoms remains fixed (and thus has duplicated x,y,z coordinates in each model).
This could be made more flexible by modifying the
HDF5 supports enumerated types. These could be used to store strings like ATOM/HETATM or residue/atom/element names more efficiently (using 1- or 2-byte integers per data item), and could also cover the mmCIF ?/. special values. Since the intention is to provide only a subset of mmCIF tables in HDF5 (e.g.
@tomgoddard suggests that we also consider Zarr as an HDF5 alternative (although its support for C++ and other non-Python languages is still minimal). It is also probably a good idea to get buy-in from other producers of ensembles (e.g. Rosetta, Haddock) and visualizers (e.g. Mol*).
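The enumerated-type idea from this comment can be sketched without HDF5 at all. The snippet below is only an illustration under stated assumptions: the `GroupPDB` class, its numeric codes, and the `encode`/`decode` helpers are hypothetical, and a standard-library `array` of unsigned bytes stands in for an HDF5 enum column. It shows the ATOM/HETATM strings and the mmCIF `?` and `.` special values sharing one byte per data item.

```python
import enum
from array import array


class GroupPDB(enum.IntEnum):
    """Hypothetical 1-byte codes; the numbering is an assumption."""
    UNKNOWN = 0       # mmCIF '?'
    INAPPLICABLE = 1  # mmCIF '.'
    ATOM = 2
    HETATM = 3


TO_CODE = {"?": GroupPDB.UNKNOWN, ".": GroupPDB.INAPPLICABLE,
           "ATOM": GroupPDB.ATOM, "HETATM": GroupPDB.HETATM}
TO_TEXT = {code: text for text, code in TO_CODE.items()}


def encode(values):
    """Map each mmCIF string to its code and pack as unsigned bytes."""
    return array("B", (TO_CODE[v] for v in values))


def decode(codes):
    """Map stored codes back to the mmCIF strings."""
    return [TO_TEXT[GroupPDB(c)] for c in codes]


column = ["ATOM", "ATOM", "HETATM", "?", "."]
encoded = encode(column)
# Fixed-width codes keep random access cheap: item i is simply encoded[i].
third = TO_TEXT[GroupPDB(encoded[2])]
```

Columns with more distinct values (residue, atom, or element names) would need a wider integer, as the comment notes.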
Add support for read/write of a file format that (ideally)
Such a format would allow IMP folks to ditch the old RMF format as the "working format", and easily convert the resulting models to mmCIF. This may also be a practical solution for those that need to deposit huge files, bigger than is really practical for current mmCIF.
Conventional mmCIF fails points 2-4. We can't easily add an extra model to an mmCIF file without rewriting the entire file (since various IDs would have to be updated, and the data is gathered per-table rather than per-model). We also can't read a single trajectory frame without scanning the whole file.
MMTF (#11) has its own data model, so fails point 1. Both MMTF and BinaryCIF support a number of encoding mechanisms to generate compact files (point 3) but these render the file non-seekable (e.g. run-length encoding of the ATOM/HETATM field in the atom_site table necessitates reading and unpacking the entire thing to determine whether a given atom is ATOM or HETATM).
Fast parsing probably necessitates a binary format.
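The seekability problem with run-length encoding can be seen in a small sketch. This is not the BinaryCIF or MMTF encoder; it is a simplified run-length scheme over strings, and the function names and the example column are assumptions made for illustration. The point is that the position of atom `i` in the encoded data depends on all earlier runs, so a lookup has to walk the runs from the start.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_lookup(runs, index):
    """Return (value, runs_visited) for a non-negative index.

    There is no way to jump to the right run directly: the runs before it
    must be summed first.
    """
    seen = 0
    for visited, (value, length) in enumerate(runs, start=1):
        seen += length
        if index < seen:
            return value, visited
    raise IndexError(index)


# Hypothetical group column: protein atoms, ligands, more protein, more ligands.
column = ["ATOM"] * 1000 + ["HETATM"] * 20 + ["ATOM"] * 500 + ["HETATM"] * 5
runs = rle_encode(column)
value, visited = rle_lookup(runs, 1510)
```

With a fixed-width column, the same lookup is a single indexed read, at the cost of a larger file.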
Proposal: use HDF5 as a binary container format. Each mmCIF category would map to an HDF5 table, which should be trivially seekable and extendable. This won't be as compact as BinaryCIF or MMTF (although HDF5 does support compression). To address this we can
ATOM or HETATM) with a suitably small integer (just a 0 or 1 bool type in this case).