
Catalog Indexes #3

Open
cryptoquick opened this issue Sep 23, 2022 · 2 comments

@cryptoquick (Member) commented Sep 23, 2022:

Catalogs are a flat-file snapshot of a database state. They are loaded into memory as a BTreeMap and flushed to disk only after a flush method is explicitly called. This allows a form of batching and gives implementation flexibility over managed behavior, which in some ways is preferable to an embedded database, for example.

Catalogs will facilitate chunking of large files into, say, 16 MB chunks; chunks also help parallelize codec tasks across num_cpus.
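As a rough sketch of the chunking idea (constants and names here are illustrative, not the Carbonado API; this uses `std::thread::scope` rather than the num_cpus crate for self-containment):

```rust
const CHUNK_SIZE: usize = 16 * 1024 * 1024; // 16 MB, as suggested above

/// Number of fixed-size chunks needed to cover `len` bytes.
fn chunk_count(len: usize) -> usize {
    (len + CHUNK_SIZE - 1) / CHUNK_SIZE
}

/// Run an example codec task (here just a checksum) over each chunk
/// in parallel, one scoped thread per chunk.
fn process_parallel(data: &[u8]) -> Vec<u64> {
    std::thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(CHUNK_SIZE)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&b| b as u64).sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

In a real pipeline the per-chunk closure would be the encode/decode step, and the pool would be sized to `num_cpus` rather than spawning one thread per chunk.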

Catalogs should be a struct that wraps a BTreeMap, exposes its methods, and adds new and flush utility methods.
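A minimal sketch of that struct, assuming a byte-value map and a simple length-prefixed on-disk encoding (all names and the serialization format are illustrative, not the actual Carbonado design):

```rust
use std::collections::BTreeMap;
use std::io;
use std::path::PathBuf;

/// In-memory catalog that only touches disk when `flush` is called.
pub struct Catalog {
    path: PathBuf,
    map: BTreeMap<String, Vec<u8>>,
    dirty: bool,
}

impl Catalog {
    pub fn new(path: PathBuf) -> Self {
        Catalog { path, map: BTreeMap::new(), dirty: false }
    }

    pub fn insert(&mut self, key: String, value: Vec<u8>) {
        self.map.insert(key, value);
        self.dirty = true; // batched: nothing written until flush
    }

    pub fn get(&self, key: &str) -> Option<&Vec<u8>> {
        self.map.get(key)
    }

    /// Serialize the whole map and write the snapshot to disk.
    pub fn flush(&mut self) -> io::Result<()> {
        if !self.dirty {
            return Ok(());
        }
        let mut buf = Vec::new();
        for (k, v) in &self.map {
            buf.extend_from_slice(k.as_bytes());
            buf.push(0); // NUL-terminated key (illustrative encoding)
            buf.extend_from_slice(&(v.len() as u64).to_le_bytes());
            buf.extend_from_slice(v);
        }
        std::fs::write(&self.path, buf)?;
        self.dirty = false;
        Ok(())
    }
}
```

The `dirty` flag is what a later drop-based flush would check, per the note below.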

Maybe, in a later issue, a flush could be performed on drop (but only after checking an explicitly-flushed state flag); however, this is a lot of work and introduces some magical behavior.

@dr-orlovsky commented:
> They are loaded into memory as a BTreeMap and flushed to disk only after a flush method is explicitly called

This implies the process persists in memory.

What about the situation where it is used as a library, like sqlite or git, and the data are persisted on disk rather than in memory?

> Catalogs will facilitate chunking of large files into, say, 16MB chunks,

I think it is more important to preserve a semantic file structure than to use fixed-size chunks; compare macOS packages and tarballs versus, say, torrents.

> Maybe in a later issue, a flush can be performed on drop

Nope, since the process may not be given time to complete the flush.

@cryptoquick (Member, Author) commented:

Good points. The complication is that the database file is encoded and hashed each time so it can also be replicated. Maybe this could use a format like c3 that doesn't use hashes but still uses compression and encryption. Every write would block until all new data had been flushed to disk, perhaps even appended to the file. This would be similar to a write-ahead log. The BTreeMap could then use keys that point to offsets within the file.
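A minimal sketch of that append-only, offset-indexed design (type and method names are hypothetical, and compression/encryption are omitted):

```rust
use std::collections::BTreeMap;
use std::fs::OpenOptions;
use std::io::{Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Append-only log with an in-memory BTreeMap index of
/// key -> (offset, length) into the file, WAL-style.
struct OffsetIndex {
    file: std::fs::File,
    index: BTreeMap<String, (u64, u64)>,
}

impl OffsetIndex {
    fn open(path: &Path) -> std::io::Result<Self> {
        let file = OpenOptions::new()
            .create(true)
            .read(true)
            .append(true)
            .open(path)?;
        Ok(OffsetIndex { file, index: BTreeMap::new() })
    }

    /// Append a record, blocking until it is synced to disk,
    /// then record its offset in the in-memory index.
    fn put(&mut self, key: &str, value: &[u8]) -> std::io::Result<()> {
        let offset = self.file.seek(SeekFrom::End(0))?;
        self.file.write_all(value)?;
        self.file.sync_data()?; // every write blocks until durable
        self.index.insert(key.to_string(), (offset, value.len() as u64));
        Ok(())
    }

    /// Look up the offset in the map and read the record back.
    fn get(&mut self, key: &str) -> std::io::Result<Option<Vec<u8>>> {
        match self.index.get(key) {
            None => Ok(None),
            Some(&(offset, len)) => {
                let mut buf = vec![0u8; len as usize];
                self.file.seek(SeekFrom::Start(offset))?;
                self.file.read_exact(&mut buf)?;
                Ok(Some(buf))
            }
        }
    }
}
```

Because the file is append-only, replication can ship just the new tail, and the index can be rebuilt by scanning the log on open.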

> I think it is more important to preserve a semantic file structure than to use fixed-size chunks; compare macOS packages and tarballs versus, say, torrents.

This happens at a higher level than this library. Carbonado files aren't really the same as tarballs; they're for individual files, and if a file is very large, it should be chunked into separate segment files so it can be processed in parallel. Multiple files are tracked using a catalog that holds metadata such as filenames and which segments are needed, in what order. This is similar to the function of IPLD. A torrent file can also include multiple files, so that comparison doesn't apply.
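The catalog-entry shape described above might look roughly like this (field names and the segment-id type are illustrative, not the Carbonado format):

```rust
/// One file's metadata in the catalog: its name plus the ordered
/// list of segment identifiers needed to reassemble it.
struct CatalogEntry {
    filename: String,
    segments: Vec<String>, // e.g. segment hashes, in order
}

/// Reassemble a file by fetching each segment in catalog order
/// and concatenating the results.
fn reassemble(entry: &CatalogEntry, fetch: impl Fn(&str) -> Vec<u8>) -> Vec<u8> {
    entry.segments.iter().flat_map(|id| fetch(id)).collect()
}
```

The ordering in `segments` is what makes parallel processing safe: segments can be decoded concurrently and still concatenated deterministically.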

@cryptoquick cryptoquick self-assigned this Feb 28, 2023
@cryptoquick cryptoquick changed the title Catalogs Catalog Indexes Feb 28, 2023