Blockexchange uses merkle root and index to fetch blocks #566

tbekas · 2023-10-12T14:55:24Z

Resolves #511
Resolves #498

dryajov

Just a quick preliminary review.

codex/stores/treereader.nim

codex/stores/repostore.nim

codex/codex.nim

Co-authored-by: Dmitriy Ryajov <dryajov@gmail.com> Signed-off-by: Tomasz Bekas <tomasz.bekas@gmail.com>

codex/node.nim

codex/streams/seekablestorestream.nim

codex/streams/storestream.nim

codex/utils/asynciter.nim

codex/stores/treereader.nim

codex/blockexchange/engine/pendingblocks.nim

codex/blockexchange/engine/engine.nim

codex/stores/blockstore.nim

codex/blocktype.nim

dryajov

Adding some more comments as I move along.

codex/blockexchange/engine/engine.nim

dryajov · 2023-10-18T22:49:10Z

One thing I'm seeing right now is that we have both a treeCid and a merkleRoot. We're storing the tree as a block and deriving a hash and cid from it. This creates indirection and duplication. I think there is an easy way of overcoming this: simply name the on-disk merkle tree block with the merkle root hash. This doesn't violate content addressability or tamper resistance at all, since we can easily verify that the contents of the file are indeed correct by rebuilding the root hash from the stored hashes.
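A rough sketch of the check described above, not the code in this PR: rebuild the root from the stored leaf hashes and compare it with the root the tree block is named after. `Digest`, `combine`, `merkleRoot` and `verifyStoredTree` are hypothetical names, the hash from `std/hashes` is a non-cryptographic stand-in for a real MultiHash, and pairing an odd trailing node with itself is just one possible padding rule.

```nim
import std/hashes

type Digest = Hash  # toy stand-in for a MultiHash; NOT cryptographic

proc combine(left, right: Digest): Digest =
  # Mix two child digests, in order, into a parent digest.
  var h: Hash = 0
  h = h !& left
  h = h !& right
  !$h

proc merkleRoot(leaves: seq[Digest]): Digest =
  # Fold the leaves pairwise, level by level, until a single root remains.
  doAssert leaves.len > 0, "a tree needs at least one leaf"
  var level = leaves
  while level.len > 1:
    var next: seq[Digest]
    var i = 0
    while i < level.len:
      # Assumption: an odd trailing node is paired with itself.
      let right = if i + 1 < level.len: level[i + 1] else: level[i]
      next.add combine(level[i], right)
      i += 2
    level = next
  level[0]

proc verifyStoredTree(expectedRoot: Digest, storedLeaves: seq[Digest]): bool =
  # The stored tree is accepted only if its leaves rebuild the expected root.
  merkleRoot(storedLeaves) == expectedRoot
```

With a check like this, the name of the stored tree (the root) is enough to detect tampering with the stored hashes, which is the property the comment relies on.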

dryajov · 2023-10-18T22:51:08Z

codex/blocktype.nim

@@ -37,91 +39,44 @@ type
    cid*: Cid


I think this can simplify things a lot if we make the cid a BlockAddress?

A Block can have different addresses, but if you're looking for an .address property that's based on cid, there's already a helper proc:

proc address*(b: Block): BlockAddress = BlockAddress(leaf: false, cid: b.cid)

tbekas · 2023-10-19T12:19:10Z

@dryajov I agree with your comment. treeCid: Cid and treeRoot: MultiHash are redundant. Fixing it has a consequence, though: an additional flow just for trees. Whenever we read or transmit a block, we verify the block data using a hash function and check that the hash matches the expected Cid. For trees we wouldn't use this verification flow, so we would have to handle this special case somehow - we would have to recognize that we're reading a tree and use a different verification flow.

In order to not introduce this additional flow, I would do the following:
We no longer store trees as blocks - we would continue to store them in the key/value datastore however, but under the trees subdirectory (until we decide to store proofs only). There's no longer treeCid: Cid in BlockAddress, there's just treeRoot: MultiHash instead. What do you think?

dryajov · 2023-10-19T13:14:18Z

> Whenever we read or transmit a block, we verify block data using a hash function and check if the hash matches the expected Cid.

I think this is a consequence/limitation of the current multihash interfaces. There is no reason why we could not have a digest for merkelized structures which, instead of performing block-oriented hashing, uses a merkle tree/trie as its underlying digest method.

I think the cleanest way to approach it would be to add a digest for our merkelized structures to multihash in libp2p. This isn't hard, but it requires moving some of our code around, including extracting the merkle tree implementation. I'll need to think about whether there is an intermediate way that allows us to handle this more gracefully, without having to modify too many dependencies. One way would be to forgo the pure multihash interfaces and put them behind yet another abstraction that uses either multihash or a custom function; this might be useful for the poseidon hash as well. It would be a temporary solution, as eventually we want to consolidate these things under multiformats, but it might be too many dependencies to modify right now.

So, a good intermediate solution could be to:

- Create a higher-level abstraction that allows us to choose which digest/hash mechanism we want to use - something behind multihash or an external one
- Put multihash and other digest methods behind that
- Use the abstraction instead of the pure multihash everywhere
- We do need to add a multicodec for it, to tie everything together and be able to continue to interop with the underlying libp2p primitives, but that is a minor change and we can do it in our own fork for now

I'm not settled on it yet, but it might be one pragmatic approach. Another one would be to simply bite the bullet, extract the merkle tree into its own package and add everything under libp2p; we already kinda have to work with a fork either way, at least for the next couple of months.

> We no longer store trees as blocks - we would continue to store them in the key/value datastore however, but under the trees subdirectory (until we decide to store proofs only). There's no longer treeCid: Cid in BlockAddress, there's just treeRoot: MultiHash instead. What do you think?

I'm not sure this is required given what I described above. I really like the idea of storing merkle trees as blocks, as it keeps the underlying flow pretty consistent; I'd like to avoid special flows as much as possible and consolidate everything under the same underlying primitives. I think what we lack at the repo/blockstore level right now is the ability to stream a single block from disk/network in case it is too large, say several megabytes in size, but this is easily fixed by extending the interface. It does require replacing protobuf with something that a) offers binary consistency (protobuf does not; different implementations will produce different byte-by-byte outputs) and b) allows streaming the contents, which protobuf doesn't support at all. There is a wealth of serde formats out there that we eventually want to look into more closely, but for now protobuf is OK, limitations aside.

All that said, I'm not yet settled on any specific approach, but one thing is clear, we need to get rid of the duplicated identifiers.

dryajov · 2023-10-23T15:03:10Z

I merged #593, which simply removes the storageproofs directory and tests for now. If you can't get it to merge cleanly, just delete the directory and tests...

dryajov · 2023-10-23T15:08:37Z

I rebased locally, and it does so cleanly, so it should work without issues.

dryajov

A few comments as I go through.

dryajov · 2023-11-07T22:03:52Z

codex/stores/blockstore.nim

-iterator items*(self: BlocksIter): Future[?Cid] =
-  while not self.finished:
-    yield self.next()
+method getBlock*(self: BlockStore, treeCid: Cid, index: Natural): Future[?!Block] {.base.} =


I'm thinking that Natural isn't a great choice for index, since it wouldn't support negatives and potentially BackwardsIndex.

Also, is there any reason not to use BlockAddress and unify the interface?

Ad1. But backwards index is a different access pattern and hence a different parameter type. This is what we have in AsyncHeapQueue

proc `[]`*[T](heap: AsyncHeapQueue[T], i: Natural) : T {.inline.} =
  ## Access the i-th element of ``heap`` by order from first to last.
  ## ``heap[0]`` is the first element, ``heap[^1]`` is the last element.
  heap.queue[i]

proc `[]`*[T](heap: AsyncHeapQueue[T], i: BackwardsIndex) : T {.inline.} =
  ## Access the i-th element of ``heap`` by order from first to last.
  ## ``heap[0]`` is the first element, ``heap[^1]`` is the last element.
  heap.queue[len(heap.queue) - int(i)]

Ad2. There's a method to get a block by address (line 46), but in some cases we don't have an address object in scope; instead we have separate treeCid and index variables. We could allocate the memory and create the BlockAddress object, but: 1. that seems a bit wasteful and 2. it's an additional inconvenience to the client.

> Ad1. But backwards index is a different access pattern and hence a different parameter type. This is what we have in AsyncHeapQueue

OK, not a big deal - we can't use the [] operator with async and we're unlikely to use a BackwardsIndex either way.

> Ad2. There's a method to get a block by address (line 46), but in some cases we don't have an address object in scope; instead we have separate treeCid and index variables. We could allocate the memory and create the BlockAddress object, but: 1. that seems a bit wasteful and 2. it's an additional inconvenience to the client.

Hmm, that seems like a problem of not using BlockAddress consistently rather than anything else?

> Hmm, that seems like a problem of not using BlockAddress consistently rather than anything else?

So in a couple of places we have the following pattern:

for i in 0..<manifest.blocksCount:
  await getBlock(manifest.treeCid, i)

Would you like it to be replaced by the following code?

for i in 0..<manifest.blocksCount:
  let address = BlockAddress.init(manifest.treeCid, i)
  await getBlock(address)

If so, I could get rid of the method getBlock(treeCid, index), but in my opinion the second version is simply worse because it's:

- less efficient (a new object is created on each iteration)
- more verbose than necessary

Yeah, I understand the reasoning, but IMO having two slightly distinct interfaces will just make the codebase less consistent - IMO, consistency trumps everything else.

We could add an initializer from a tuple for example

init(_: type (Cid, Natural)): BlockAddress = ...

The point is that we stick with one convention and follow that, otherwise we'll end up with (slightly) distinct looking flows, which will make the codebase harder to read and maintain.
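To make the "one convention" option concrete, here is a minimal sketch in which every lookup goes through a single `getBlock(address)` and callers build the address with `BlockAddress.init`. It is only an illustration: `Cid` is a plain string stand-in, `ToyStore` and its fields are hypothetical, and the variant layout of `BlockAddress` is assumed from the `leaf`/`cid`/`treeCid`/`index` names used in this thread.

```nim
import std/[hashes, options, tables]

type
  Cid = string  # stand-in for the real Cid type

  BlockAddress = object
    case leaf: bool
    of true:
      treeCid: Cid      # root of the tree the block belongs to
      index: Natural    # position of the block within that tree
    of false:
      cid: Cid          # plain content address

  ToyStore = object     # hypothetical in-memory store, for illustration only
    byCid: Table[Cid, string]
    byLeaf: Table[(Cid, int), string]

proc init(_: type BlockAddress, cid: Cid): BlockAddress =
  BlockAddress(leaf: false, cid: cid)

proc init(_: type BlockAddress, treeCid: Cid, index: Natural): BlockAddress =
  BlockAddress(leaf: true, treeCid: treeCid, index: index)

proc getBlock(store: ToyStore, address: BlockAddress): Option[string] =
  # Single entry point: the address decides which lookup is performed.
  if address.leaf:
    let key = (address.treeCid, int(address.index))
    if key in store.byLeaf:
      return some(store.byLeaf[key])
  elif address.cid in store.byCid:
    return some(store.byCid[address.cid])
  none(string)
```

If `BlockAddress` is a plain object rather than a `ref object`, as assumed here, building one per loop iteration is a value construction and not a separate heap allocation, which may soften the efficiency concern raised above.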

Alright. I understand the reasoning and I agree that unification has certain benefits (if at some point we remove fetching individual blocks by just cid, we will have less to refactor).

Also there's more to unify if we go this path:

method ensureExpiry*(self: BlockStore, cid: Cid, expiry: SecondsSince1970)
method delBlock*(self: BlockStore, cid: Cid)
method delBlock*(self: BlockStore, treeCid: Cid, index: Natural)
method hasBlock*(self: BlockStore, cid: Cid)
method hasBlock*(self: BlockStore, tree: Cid, index: Natural)
proc contains*(self: BlockStore, blk: Cid)
proc contains*(self: BlockStore, address: BlockAddress)

I could do this as a part of this PR or in a separate one. Wdyt?

codex/stores/repostore.nim

codex/blockexchange/engine/engine.nim

dryajov · 2023-11-08T23:53:22Z

codex/blockexchange/engine/pendingblocks.nim

+    p.blocks.withValue(bd.address, blockReq):
+      trace "Resolving block", address = bd.address
+
+      if not blockReq.handle.finished:


You want to use the deref operator, [], for ref/pointer types. Nim will try to infer the type, but that doesn't always work...

But it works in this case, should I add [] anyway?

Yeah, it doesn't always work, and it's also easy to miss that we're really handling a pointer when reading the code.
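For readers unfamiliar with the pattern, a small sketch using `withValue` from `std/tables`, where the injected name is a pointer to the stored value and `[]` spells out the dereference. `BlockReq`, its fields and `resolve` are hypothetical stand-ins, not the types in pendingblocks.nim.

```nim
import std/tables

type
  BlockReq = object      # hypothetical stand-in for the pending request type
    finished: bool
    resolutions: int

proc resolve(pending: var Table[string, BlockReq], address: string): bool =
  ## Marks the request for `address` as finished; returns true if it did so.
  pending.withValue(address, req):
    # `req` is a `ptr BlockReq`; `req[]` makes the dereference explicit.
    if not req[].finished:
      req[].finished = true
      inc req[].resolutions
      return true
  false
```

Writing `req.finished` also compiles here because Nim dereferences automatically for field access, but `req[]` signals to the reader that the table entry is being mutated in place.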

codex/erasure/erasure.nim

tests/codex/blockexchange/protobuf/testpresence.nim

tests/codex/testerasure.nim

tests/codex/testnode.nim

dryajov

Looking much better. I'll try to go through it once more, but so far I think this is mergeable.

dryajov · 2023-11-09T05:39:02Z

@CodiumAI-Agent /review

CodiumAI-Agent · 2023-11-09T05:40:07Z

PR Analysis

🎯 Main theme: This PR focuses on modifying the block exchange mechanism to use merkle root and index to fetch blocks. It also includes changes to various related components such as block handlers, block presence handlers, and block request functions.
📝 PR summary: The PR modifies the block exchange mechanism to use merkle root and index to fetch blocks instead of the previous method. It also includes changes to various related components such as block handlers, block presence handlers, and block request functions. The PR also includes updates to tests to accommodate these changes.
📌 Type of PR: Enhancement
🧪 Relevant tests added: Yes
⏱️ Estimated effort to review [1-5]: 4, because the PR includes changes to multiple components and requires understanding of the block exchange mechanism and merkle trees. The changes are also not trivial and require careful review to ensure correctness.
🔒 Security concerns: No

PR Feedback

💡 General suggestions: The PR seems to be well-structured and the changes are logically grouped. The use of merkle root and index for block fetching could potentially improve the efficiency of the block exchange mechanism. However, it's important to ensure that these changes do not introduce any new vulnerabilities or bugs. It would also be beneficial to include more detailed comments explaining the logic behind the changes, especially for complex parts of the code.
🤖 Code feedback:
- relevant file: codex/blockexchange/engine/engine.nim
  suggestion: Consider refactoring the 'requestBlock' function to reduce its complexity and improve readability. The function seems to be doing too many things, which can make it harder to understand and maintain. [important]
  relevant line: proc requestBlock*(b: BlockExcEngine, address: BlockAddress, timeout = DefaultBlockTimeout): Future[Block] {.async.} =
- relevant file: codex/blockexchange/engine/engine.nim
  suggestion: It might be beneficial to add error handling or logging in the 'monitorBlockHandle' function to catch and log any potential errors during the execution of the function. This can help in debugging and identifying issues in the future. [medium]
  relevant line: proc monitorBlockHandle(b: BlockExcEngine, handle: Future[Block], address: BlockAddress, peerId: PeerId) {.async.} =
- relevant file: codex/blockexchange/engine/engine.nim
  suggestion: The 'validateBlockDelivery' function could be refactored to improve readability. Consider breaking down the function into smaller, more manageable functions. [medium]
  relevant line: proc validateBlockDelivery(b: BlockExcEngine, bd: BlockDelivery): ?!void =
- relevant file: codex/blockexchange/engine/engine.nim
  suggestion: Consider adding more detailed logging in the 'blocksDeliveryHandler' function. This can help in understanding the flow of the function and can be useful for debugging purposes. [medium]
  relevant line: proc blocksDeliveryHandler*(b: BlockExcEngine, peer: PeerId, blocksDelivery: seq[BlockDelivery]) {.async.} =

dryajov · 2023-11-09T06:16:33Z

@CodiumAI-Agent /improve --extended

tbekas · 2023-11-10T13:40:33Z

I see that integration tests are not successful, I will try to fix it on Monday.

dryajov · 2023-11-10T15:14:42Z

> I see that integration tests are not successful, I will try to fix it on Monday.

It seems to be failing on marketplace related stuff, which has been very flaky lately.

dryajov · 2023-11-14T00:37:49Z

codex/node.nim

+  let iter = Iter.fromSlice(0..<manifest.blocksCount)
+    .map((i: int) => node.blockStore.getBlock(BlockAddress.init(manifest.treeCid, i)))
+
+  for batchNum in 0..<batchCount:


This can potentially throw; we probably want this inside the try.
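A generic illustration of the point, not the node.nim code: when building the sequence of fetches can itself raise, the construction belongs inside the `try`, not just the loop that consumes it. The thread does not say which call raises; `fetchBlock`, `fetchAll` and the use of `IOError` are assumptions made for the example.

```nim
import std/[options, sequtils]

proc fetchBlock(index, available: int): string =
  # Hypothetical fetch that fails for indices past `available`.
  if index >= available:
    raise newException(IOError, "block " & $index & " is not available")
  "block-" & $index

proc fetchAll(count, available: int): Option[seq[string]] =
  try:
    # The mapping is inside the `try`, so a failure while building the
    # sequence is handled the same way as a failure while consuming it.
    result = some(toSeq(0 ..< count).mapIt(fetchBlock(it, available)))
  except IOError:
    result = none(seq[string])
```

Had the `mapIt` line been placed above the `try`, the `IOError` would escape `fetchAll` instead of being turned into `none`.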

dryajov

Let's get this merged; we can address outstanding issues in subsequent PRs.

tbekas requested a review from dryajov October 12, 2023 14:55

tbekas mentioned this pull request Oct 12, 2023

Storing and retrieving data using merkle trees #541

Closed

dryajov requested changes Oct 17, 2023

codex/stores/treereader.nim (outdated)

codex/stores/repostore.nim (outdated)

codex/codex.nim (outdated)

tbekas marked this pull request as ready for review October 17, 2023 21:32

tbekas requested review from gmega and benbierens October 17, 2023 21:32

tbekas and others added 4 commits October 17, 2023 23:43

Blockexchange uses merkle root and index to fetch blocks

8c1d97d

Links the network store getTree to the local store.

10d6456

Update codex/stores/repostore.nim

85cef0e

Co-authored-by: Dmitriy Ryajov <dryajov@gmail.com> Signed-off-by: Tomasz Bekas <tomasz.bekas@gmail.com>

Rework erasure.nim to include recent cleanup

78a4d79

tbekas force-pushed the blockexchange-uses-merkle-tree branch from f365139 to 78a4d79 Compare October 17, 2023 21:45

Revert accidential changes to lib versions

ac2fc71

jessiebroke added the client label Oct 18, 2023

jessiebroke assigned tbekas Oct 18, 2023

tbekas commented Oct 18, 2023

codex/node.nim (outdated)

tbekas commented Oct 18, 2023

codex/streams/seekablestorestream.nim (outdated)

dryajov requested changes Oct 18, 2023

dryajov reviewed Oct 18, 2023

codex/blockexchange/engine/engine.nim (outdated)

dryajov reviewed Oct 18, 2023

markspanbroek mentioned this pull request Oct 19, 2023

Update Request: remove PoR, add merkle root #590

Merged

Addressing review comments

252b445

tbekas added 3 commits November 3, 2023 21:17

Storing proofs instead of trees

68f76c1

Merge branch 'master' into blockexchange-uses-merkle-tree

67a0fb4

Fix a comment

5277a12

tbekas added 2 commits November 6, 2023 12:40

Fix broken tests

b542cce

Merge branch 'master' into blockexchange-uses-merkle-tree

8e40caf

dryajov reviewed Nov 8, 2023

tbekas commented Nov 8, 2023

codex/blockexchange/engine/engine.nim

dryajov reviewed Nov 8, 2023
