Catalog Indexes #3
This implies that a process persists in memory. What about the situation where a library is used, like sqlite or git, and the data are persisted on disk rather than in memory?
I think it is more important to allow preserving a semantic file structure instead of fixed-size chunks, as in macOS packages and tarballs versus, say, torrents.
Nope, since the process may not be given time to complete the flush.
Good points. The complication is that the database file is encoded and hashed each time so it can also be replicated. Maybe this could use a format like c3 that doesn't use hashes but still uses compression and encryption. Every write would block until all new data had been flushed to disk, perhaps even appended to the file, similar to a write-ahead log. The BTreeMap could then use keys that point to offsets within the file.
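The write-ahead-log idea above can be sketched roughly as follows. This is a minimal illustration, not Carbonado's actual API: the `OffsetIndex` name, the `(offset, length)` value layout, and the omission of compression/encryption are all assumptions made for brevity.

```rust
use std::collections::BTreeMap;
use std::fs::{File, OpenOptions};
use std::io::{Read, Seek, SeekFrom, Write};

/// Hypothetical sketch: an append-only log file where every write blocks
/// until the data is durably on disk, and an in-memory BTreeMap maps keys
/// to (offset, length) pairs within that file.
struct OffsetIndex {
    file: File,
    index: BTreeMap<Vec<u8>, (u64, u64)>, // key -> (offset, length)
}

impl OffsetIndex {
    fn open(path: &str) -> std::io::Result<Self> {
        let file = OpenOptions::new()
            .create(true)
            .read(true)
            .append(true)
            .open(path)?;
        Ok(Self { file, index: BTreeMap::new() })
    }

    /// Append a value and block until it reaches disk, like a WAL entry.
    fn put(&mut self, key: &[u8], value: &[u8]) -> std::io::Result<()> {
        let offset = self.file.seek(SeekFrom::End(0))?;
        self.file.write_all(value)?;
        self.file.sync_data()?; // the write blocks here until flushed
        self.index.insert(key.to_vec(), (offset, value.len() as u64));
        Ok(())
    }

    /// Look up the offset in the in-memory map, then read from the file.
    fn get(&mut self, key: &[u8]) -> std::io::Result<Option<Vec<u8>>> {
        match self.index.get(key) {
            Some(&(offset, len)) => {
                let mut buf = vec![0u8; len as usize];
                self.file.seek(SeekFrom::Start(offset))?;
                self.file.read_exact(&mut buf)?;
                Ok(Some(buf))
            }
            None => Ok(None),
        }
    }
}
```

Because the file is append-only, a crash mid-write can at worst leave a torn tail that the index never references, which is the same recovery property a write-ahead log relies on.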
This happens at a higher level than this library. Carbonado files aren't really the same as tarballs; they're for individual files, and if a file is very large, it should be chunked into separate segment files so it can be processed in parallel. Multiple files are tracked using a catalog that holds metadata such as filenames, which segments are needed, and in what order. This is similar to the function of IPLD. A torrent file can also include multiple files, so that comparison doesn't apply.
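The per-file metadata described above might look something like the sketch below. The type name, field names, and 32-byte hash size are illustrative assumptions, not Carbonado's actual definitions; the 16MB segment size comes from the issue text.

```rust
/// Hypothetical catalog entry: tracks one file's metadata plus the
/// ordered segment hashes needed to reassemble it, similar in spirit
/// to how IPLD links content-addressed blocks.
const SEGMENT_LEN: u64 = 16 * 1024 * 1024; // 16MB segments, per the issue

#[derive(Debug, Clone, PartialEq)]
struct CatalogEntry {
    filename: String,
    file_len: u64,
    /// 32-byte segment hashes, in reassembly order.
    segments: Vec<[u8; 32]>,
}

impl CatalogEntry {
    /// Number of segments a file of `file_len` bytes splits into
    /// (ceiling division by the segment size).
    fn expected_segments(&self) -> u64 {
        (self.file_len + SEGMENT_LEN - 1) / SEGMENT_LEN
    }
}
```

Keeping the segment list ordered inside the entry is what lets a reader reassemble the original file, and it also gives parallel workers an obvious unit of work.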
Catalogs are a flat-file snapshot of a database state. They are loaded into memory as a BTreeMap and flushed to disk only after a flush method is explicitly called, thus allowing a form of batching, as well as implementation flexibility over managed behavior, which in some ways is preferable to an embedded database.

Catalogs will facilitate chunking of large files into, say, 16MB chunks; chunks also help with parallelizing codec tasks over `num_cpus`.

Catalogs should have a struct that contains a BTreeMap, expose all its methods, and include `new` and `flush` utility methods.

Maybe in a later issue, a flush could be performed on drop (but only after checking the explicitly-flushed state), though this is a lot of work and introduces some magical behavior.