
Catalog Indexes #3

Open
cryptoquick opened this issue Sep 23, 2022 · 2 comments

@cryptoquick (Member) commented Sep 23, 2022:

Catalogs are a flat-file snapshot of a database state. They are loaded into memory as a BTreeMap and flushed to disk only after a flush method is explicitly called. This allows a form of batching and gives implementation flexibility over managed behavior, which in some ways is preferable to an embedded database, for example.

Catalogs will facilitate chunking of large files into, say, 16 MB chunks; chunks also help parallelize codec tasks across num_cpus.
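As a rough sketch of the chunking idea (constants and names here are illustrative, not the Carbonado API; this uses `std::thread::scope` rather than the num_cpus crate for self-containment):

```rust
const CHUNK_SIZE: usize = 16 * 1024 * 1024; // 16 MB, as suggested above

/// Number of fixed-size chunks needed to cover `len` bytes.
fn chunk_count(len: usize) -> usize {
    (len + CHUNK_SIZE - 1) / CHUNK_SIZE
}

/// Run an example codec task (here just a checksum) over each chunk
/// in parallel, one scoped thread per chunk.
fn process_parallel(data: &[u8]) -> Vec<u64> {
    std::thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(CHUNK_SIZE)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&b| b as u64).sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

In a real pipeline the per-chunk closure would be the encode/decode step, and the pool would be sized to `num_cpus` rather than spawning one thread per chunk.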

Catalogs should be a struct that wraps a BTreeMap, exposes its methods, and adds new and flush utility methods.
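A minimal sketch of that struct, assuming a byte-value map and a simple length-prefixed on-disk encoding (all names and the serialization format are illustrative, not the actual Carbonado design):

```rust
use std::collections::BTreeMap;
use std::io;
use std::path::PathBuf;

/// In-memory catalog that only touches disk when `flush` is called.
pub struct Catalog {
    path: PathBuf,
    map: BTreeMap<String, Vec<u8>>,
    dirty: bool,
}

impl Catalog {
    pub fn new(path: PathBuf) -> Self {
        Catalog { path, map: BTreeMap::new(), dirty: false }
    }

    pub fn insert(&mut self, key: String, value: Vec<u8>) {
        self.map.insert(key, value);
        self.dirty = true; // batched: nothing written until flush
    }

    pub fn get(&self, key: &str) -> Option<&Vec<u8>> {
        self.map.get(key)
    }

    /// Serialize the whole map and write the snapshot to disk.
    pub fn flush(&mut self) -> io::Result<()> {
        if !self.dirty {
            return Ok(());
        }
        let mut buf = Vec::new();
        for (k, v) in &self.map {
            buf.extend_from_slice(k.as_bytes());
            buf.push(0); // NUL-terminated key (illustrative encoding)
            buf.extend_from_slice(&(v.len() as u64).to_le_bytes());
            buf.extend_from_slice(v);
        }
        std::fs::write(&self.path, buf)?;
        self.dirty = false;
        Ok(())
    }
}
```

The `dirty` flag is what a later drop-based flush would check, per the note below.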

Maybe, in a later issue, a flush could be performed on drop (but only after checking an explicitly-flushed state flag); however, this is a lot of work and introduces some magical behavior.

@dr-orlovsky commented:
> They are loaded into memory as a BTreeMap and flushed to disk only after a flush method is explicitly called

This implies the process persists in memory.

What about the situation where it is used as a library, like sqlite or git, and the data are persisted on disk rather than in memory?

> Catalogs will facilitate chunking of large files into, say, 16MB chunks,

I think it is more important to preserve a semantic file structure than to use fixed-size chunks; compare macOS packages and tarballs versus, say, torrents.

> Maybe in a later issue, a flush can be performed on drop

Nope, since the process may not be given time to complete the flush.

@cryptoquick (Member, Author) commented:

Good points. The complication is that the database file is encoded and hashed each time so it can also be replicated. Maybe this could use a format like c3 that doesn't use hashes but still uses compression and encryption. Every write would block until all new data had been flushed to disk, perhaps even appended to the file. This would be similar to a write-ahead log. The BTreeMap could then use keys that point to offsets within the file.
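A minimal sketch of that append-only, offset-indexed design (type and method names are hypothetical, and compression/encryption are omitted):

```rust
use std::collections::BTreeMap;
use std::fs::OpenOptions;
use std::io::{Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Append-only log with an in-memory BTreeMap index of
/// key -> (offset, length) into the file, WAL-style.
struct OffsetIndex {
    file: std::fs::File,
    index: BTreeMap<String, (u64, u64)>,
}

impl OffsetIndex {
    fn open(path: &Path) -> std::io::Result<Self> {
        let file = OpenOptions::new()
            .create(true)
            .read(true)
            .append(true)
            .open(path)?;
        Ok(OffsetIndex { file, index: BTreeMap::new() })
    }

    /// Append a record, blocking until it is synced to disk,
    /// then record its offset in the in-memory index.
    fn put(&mut self, key: &str, value: &[u8]) -> std::io::Result<()> {
        let offset = self.file.seek(SeekFrom::End(0))?;
        self.file.write_all(value)?;
        self.file.sync_data()?; // every write blocks until durable
        self.index.insert(key.to_string(), (offset, value.len() as u64));
        Ok(())
    }

    /// Look up the offset in the map and read the record back.
    fn get(&mut self, key: &str) -> std::io::Result<Option<Vec<u8>>> {
        match self.index.get(key) {
            None => Ok(None),
            Some(&(offset, len)) => {
                let mut buf = vec![0u8; len as usize];
                self.file.seek(SeekFrom::Start(offset))?;
                self.file.read_exact(&mut buf)?;
                Ok(Some(buf))
            }
        }
    }
}
```

Because the file is append-only, replication can ship just the new tail, and the index can be rebuilt by scanning the log on open.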

> I think it is more important to preserve a semantic file structure than to use fixed-size chunks; compare macOS packages and tarballs versus, say, torrents.

This happens at a higher level than this library. Carbonado files aren't really the same as tarballs; they're for individual files, and if a file is very large, it should be chunked into separate segment files so it can be processed in parallel. Multiple files are tracked using a catalog that holds metadata such as filenames and which segments are needed, in what order. This is similar to the function of IPLD. A torrent file can also include multiple files, so that comparison doesn't apply.
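The catalog-entry shape described above might look roughly like this (field names and the segment-id type are illustrative, not the Carbonado format):

```rust
/// One file's metadata in the catalog: its name plus the ordered
/// list of segment identifiers needed to reassemble it.
struct CatalogEntry {
    filename: String,
    segments: Vec<String>, // e.g. segment hashes, in order
}

/// Reassemble a file by fetching each segment in catalog order
/// and concatenating the results.
fn reassemble(entry: &CatalogEntry, fetch: impl Fn(&str) -> Vec<u8>) -> Vec<u8> {
    entry.segments.iter().flat_map(|id| fetch(id)).collect()
}
```

The ordering in `segments` is what makes parallel processing safe: segments can be decoded concurrently and still concatenated deterministically.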

@cryptoquick cryptoquick self-assigned this Feb 28, 2023
@cryptoquick cryptoquick changed the title Catalogs Catalog Indexes Feb 28, 2023