Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Blockexchange uses merkle root and index to fetch blocks #566

Merged
merged 16 commits into from
Nov 14, 2023

Conversation

tbekas
Copy link
Contributor

@tbekas tbekas commented Oct 12, 2023

Resolves #511
Resolves #498

Copy link
Contributor

@dryajov dryajov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick preliminary review.

codex/stores/treereader.nim Outdated Show resolved Hide resolved
codex/stores/repostore.nim Outdated Show resolved Hide resolved
codex/codex.nim Outdated Show resolved Hide resolved
@tbekas tbekas marked this pull request as ready for review October 17, 2023 21:32
@tbekas tbekas force-pushed the blockexchange-uses-merkle-tree branch from f365139 to 78a4d79 Compare October 17, 2023 21:45
codex/node.nim Outdated Show resolved Hide resolved
codex/streams/storestream.nim Outdated Show resolved Hide resolved
codex/utils/asynciter.nim Show resolved Hide resolved
codex/stores/treereader.nim Outdated Show resolved Hide resolved
codex/blockexchange/engine/pendingblocks.nim Outdated Show resolved Hide resolved
codex/blockexchange/engine/pendingblocks.nim Outdated Show resolved Hide resolved
codex/blockexchange/engine/pendingblocks.nim Outdated Show resolved Hide resolved
codex/blockexchange/engine/engine.nim Outdated Show resolved Hide resolved
codex/blockexchange/engine/engine.nim Outdated Show resolved Hide resolved
codex/stores/blockstore.nim Outdated Show resolved Hide resolved
codex/blocktype.nim Outdated Show resolved Hide resolved
Copy link
Contributor

@dryajov dryajov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding some more comments as I move along.

@dryajov
Copy link
Contributor

dryajov commented Oct 18, 2023

One thing I'm seeing right now, is that we're have a treeCid and a merkleRoot. We're storing the tree as a block and derive a hash and cid from it. This create indirection and duplication. I think there is an easy way of overcoming this by simply naming the on disc merkle tree block with the merkle root hash, this doesn't violate content addressability and tamper resistance at all, since we can easily verify that the contents of the file are indeed correct by rebuilding the root hash from the stored hashes.

@@ -37,91 +39,44 @@ type
cid*: Cid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can simplify things a lot, if we make the cid and BlockAddress?

Copy link
Contributor Author

@tbekas tbekas Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block can have different addresses, but if you're looking for .address property that's based on cid, there's already a helper proc:

proc address*(b: Block): BlockAddress =
  BlockAddress(leaf: false, cid: b.cid)

@tbekas
Copy link
Contributor Author

tbekas commented Oct 19, 2023

@dryajov I agree with your comment. treeCid: Cid and treeRoot: MultiHash are redundant. Fixing it has a consequence, which is having additional flow just for trees. Whenever we read or transmit a block, we verify bock data using hash function and check if hash matches expected Cid. For trees we wouldn't use this verification flow, so we would have to handle this special case somehow - we would have to recognize that we're reading a tree and use different verification flow.

In order to not introduce this additional flow, I would do the following:
We no longer store trees as blocks - we would continue to store them in the key/value datastore however, but under the trees subdirectory (until we decide to store proofs only). There's no longer treeCid: Cid in BlockAddress, there's just treeRoot: MultiHash instead. What do you think?

@dryajov
Copy link
Contributor

dryajov commented Oct 19, 2023

Whenever we read or transmit a block, we verify bock data using hash function and check if hash matches expected Cid.

I think this is consequence/limitation of the current multihash interfaces, there is no reason why we could not have a digest for merkelized structures, which instead of performing block oriented hashing, uses a merkel tree/trie as it's underlying digest method.

I think, the cleanest way to approach it would be to add a digest for our merkelized structures to multihash in libp2p. This isn't hard, but it requires to move some of our code around, including extracting the merkle tree implementation. I'll need to think if there is an intermediate way that allows us to handle this more gracefully, without having to modify too many dependencies. One way, would be to forego the pure multihash interfaces and put it behind yet another abstraction that uses either multihash or a custom function, this might be useful for the poseidon hash as well. It would be a temporary solution as eventually we want to consolidate these things under multiformats, but it might be too many dependencies to modify right now.

So, a good intermediate solution could be to:

  • Create a higher level abstraction that allows us to choose which digest/hash mechanism we want to use - something behind multihash or an external one
  • Put multihash and other digest methods behind that
  • Use the abstraction instead of the pure multihash everywhere
  • We do need to add a multicodec for it, to tie everything together and be able to continue to interop with the underlying libp2p primitives, but that is a minor change and we can do it in our own fork for now

I'm not settled on it yet, but it might be one pragmatic approach. Another one would be to simply bite the bullet, extract the merkle tree into it's own package and add everything under libp2p, we already kinda have to work with a fork either way, at least for the next couple of months.

We no longer store trees as blocks - we would continue to store them in the key/value datastore however, but under the trees subdirectory (until we decide to store proofs only). There's no longer treeCid: Cid in BlockAddress, there's just treeRoot: MultiHash instead. What do you think?

I'm not sure if this is required with what I described above, I really like the idea of storing merkle trees as blocks as it keeps the underlying flow pretty consistent, I'd like to avoid special flows as much as possible and consolidate everything under the same underlying primitives. I think what we lack at the repo/blockstore right now is the ability to stream a single block from disk/network, in case it is too large, say several megabytes in size, but this is easily fixed by extending the interface. It does require getting rid of protobuf, to something that a) offers binary consistency (protobuf does not, different implementations will produce different byte by byte outputs) b) allows streaming the contents, which again protobuf doesn't support at all. There are a wealth of serde formats out there, that we eventually want to look into more closely, but for now, protobuf is OK, sans the limitations.

All that said, I'm not yet settled on any specific approach, but one thing is clear, we need to get rid of the duplicated identifiers.

@dryajov
Copy link
Contributor

dryajov commented Oct 23, 2023

I merged #593, which simply removes the storageproofs directory and tests for now. If you can't get it to merge cleanly, just delete the directory and tests...

@dryajov
Copy link
Contributor

dryajov commented Oct 23, 2023

I rebased locally, and it does so cleanly, so it should work without issues.

Copy link
Contributor

@dryajov dryajov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments as I go through.

iterator items*(self: BlocksIter): Future[?Cid] =
while not self.finished:
yield self.next()
method getBlock*(self: BlockStore, treeCid: Cid, index: Natural): Future[?!Block] {.base.} =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that Natural isn't a great choise for index, since it wouldn't support negatives and potentially BackwardsIndex.

Also, is there any reason not to use BlockAddress and unify the interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ad1. But backwards index is a different access pattern and hence a different parameter type. This is what we have in AsyncHeapQueue

proc `[]`*[T](heap: AsyncHeapQueue[T], i: Natural) : T {.inline.} =
  ## Access the i-th element of ``heap`` by order from first to last.
  ## ``heap[0]`` is the first element, ``heap[^1]`` is the last element.
  heap.queue[i]

proc `[]`*[T](heap: AsyncHeapQueue[T], i: BackwardsIndex) : T {.inline.} =
  ## Access the i-th element of ``heap`` by order from first to last.
  ## ``heap[0]`` is the first element, ``heap[^1]`` is the last element.
  heap.queue[len(heap.queue) - int(i)]

Ad2. There's a method to get block by address (line 46), but in some cases we don't have an address object in the scope, instead we have separately treeCid and index variables. We could allocate the memory and create the BlockAddress object, but: 1. that seem a bit wasteful and 2. it's an additional inconvenience to the client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ad1. But backwards index is a different access pattern and hence a different parameter type. This is what we have in AsyncHeapQueue

OK, not a big deal - we can't use [] operator with async and we'll unlikely to use a BackwardsIndex either way.

Ad2. There's a method to get block by address (line 46), but in some cases we don't have an address object in the scope, instead we have separately treeCid and index variables. We could allocate the memory and create the BlockAddress object, but: 1. that seem a bit wasteful and 2. it's an additional inconvenience to the client.

Hmm, that seems like a problem of not using BlockAddress consistently rather than anything else?

Copy link
Contributor Author

@tbekas tbekas Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that seems like a problem of not using BlockAddress consistently rather than anything else?

So in couple of places we have a following pattern:

for i in 0..<manifest.blocksCount:
   await getBlock(manifest.treeCid, i)

Would you like it to be replaced by following code?

for i in 0..<manifest.blocksCount:
   let address = BlockAddress.init(manifest.treeCid, i)
   await getBlock(address)

If so I could get rid of the method getBlock(treeCid, index), but in my opinion the 2nd code is simply worse because it's:

  • less efficient (new object created on each iteration)
  • more verbose than necessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I understand the reasoning, but IMO having two slightly distinct interfaces will just make the codebase less consistent - IMO, consistency trumps everything else.

We could add an initializer from a tuple for example

init(_: type (Cid, Natural)): BlockAddress = ...

The point is that we stick with one convention and follow that, otherwise we'll end up with (slightly) distinct looking flows, which will make the codebase harder to read and maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. I understand the reasoning and I agree that unification has certain benefits (if at some point we will remove fetching individual blocks by just cid, we will have less to refactor).

Also there's more to unify if we go this path:

method ensureExpiry*(self: BlockStore, cid: Cid, expiry: SecondsSince1970)

method delBlock*(self: BlockStore, cid: Cid)
method delBlock*(self: BlockStore, treeCid: Cid, index: Natural)

method hasBlock*(self: BlockStore, cid: Cid)
method hasBlock*(self: BlockStore, tree: Cid, index: Natural)

proc contains*(self: BlockStore, blk: Cid)
proc contains*(self: BlockStore, address: BlockAddress)

I could do this as a part of this PR or in a separate one. Wdyt?

codex/stores/repostore.nim Outdated Show resolved Hide resolved
codex/stores/repostore.nim Show resolved Hide resolved
codex/stores/repostore.nim Outdated Show resolved Hide resolved
codex/blockexchange/engine/engine.nim Outdated Show resolved Hide resolved
p.blocks.withValue(bd.address, blockReq):
trace "Resolving block", address = bd.address

if not blockReq.handle.finished:
Copy link
Contributor

@dryajov dryajov Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you want to use the deref operator, [] for ref/pointer types, nim will try to infere the type, but it doesn't work always...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it works in this case, should I add [] anyway?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it doesn't work always and it's also easy to miss that we're really handling a pointer when reading the code.

codex/erasure/erasure.nim Outdated Show resolved Hide resolved
tests/codex/testerasure.nim Outdated Show resolved Hide resolved
tests/codex/testnode.nim Outdated Show resolved Hide resolved
Copy link
Contributor

@dryajov dryajov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking much better. I'll try to go through it once more, but so far I think this is meargeable.

@dryajov
Copy link
Contributor

dryajov commented Nov 9, 2023

@CodiumAI-Agent /review

@CodiumAI-Agent
Copy link

PR Analysis

  • 🎯 Main theme: This PR focuses on modifying the block exchange mechanism to use merkle root and index to fetch blocks. It also includes changes to various related components such as block handlers, block presence handlers, and block request functions.
  • 📝 PR summary: The PR modifies the block exchange mechanism to use merkle root and index to fetch blocks instead of the previous method. It also includes changes to various related components such as block handlers, block presence handlers, and block request functions. The PR also includes updates to tests to accommodate these changes.
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: Yes
  • ⏱️ Estimated effort to review [1-5]: 4, because the PR includes changes to multiple components and requires understanding of the block exchange mechanism and merkle trees. The changes are also not trivial and require careful review to ensure correctness.
  • 🔒 Security concerns: No

PR Feedback

  • 💡 General suggestions: The PR seems to be well-structured and the changes are logically grouped. The use of merkle root and index for block fetching could potentially improve the efficiency of the block exchange mechanism. However, it's important to ensure that these changes do not introduce any new vulnerabilities or bugs. It would also be beneficial to include more detailed comments explaining the logic behind the changes, especially for complex parts of the code.

  • 🤖 Code feedback:

    • relevant file: codex/blockexchange/engine/engine.nim
      suggestion: Consider refactoring the 'requestBlock' function to reduce its complexity and improve readability. The function seems to be doing too many things, which can make it harder to understand and maintain. [important]
      relevant line: proc requestBlock*(b: BlockExcEngine, address: BlockAddress, timeout = DefaultBlockTimeout): Future[Block] {.async.} =

    • relevant file: codex/blockexchange/engine/engine.nim
      suggestion: It might be beneficial to add error handling or logging in the 'monitorBlockHandle' function to catch and log any potential errors during the execution of the function. This can help in debugging and identifying issues in the future. [medium]
      relevant line: proc monitorBlockHandle(b: BlockExcEngine, handle: Future[Block], address: BlockAddress, peerId: PeerId) {.async.} =

    • relevant file: codex/blockexchange/engine/engine.nim
      suggestion: The 'validateBlockDelivery' function could be refactored to improve readability. Consider breaking down the function into smaller, more manageable functions. [medium]
      relevant line: proc validateBlockDelivery(b: BlockExcEngine, bd: BlockDelivery): ?!void =

    • relevant file: codex/blockexchange/engine/engine.nim
      suggestion: Consider adding more detailed logging in the 'blocksDeliveryHandler' function. This can help in understanding the flow of the function and can be useful for debugging purposes. [medium]
      relevant line: proc blocksDeliveryHandler*(b: BlockExcEngine, peer: PeerId, blocksDelivery: seq[BlockDelivery]) {.async.} =

How to use

Tag me in a comment '@CodiumAI-Agent' and add one of the following commands:
/review [-i]: Request a review of your Pull Request. For an incremental review, which only considers changes since the last review, include the '-i' option.
/describe: Modify the PR title and description based on the contents of the PR.
/improve [--extended]: Suggest improvements to the code in the PR. Extended mode employs several calls, and provides a more thorough feedback.
/ask <QUESTION>: Pose a question about the PR.
/update_changelog: Update the changelog based on the PR's contents.

To edit any configuration parameter from configuration.toml, add --config_path=new_value
For example: /review --pr_reviewer.extra_instructions="focus on the file: ..."
To list the possible configuration parameters, use the /config command.

@dryajov
Copy link
Contributor

dryajov commented Nov 9, 2023

@CodiumAI-Agent /improve --extended

@tbekas
Copy link
Contributor Author

tbekas commented Nov 10, 2023

I see that integration tests are not successful, I will try to fix it on Monday.

@dryajov
Copy link
Contributor

dryajov commented Nov 10, 2023

I see that integration tests are not successful, I will try to fix it on Monday.

It seems to be failing on marketplace related stuff, which has been very flaky lately.

let iter = Iter.fromSlice(0..<manifest.blocksCount)
.map((i: int) => node.blockStore.getBlock(BlockAddress.init(manifest.treeCid, i)))

for batchNum in 0..<batchCount:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can potentialy throw, we probably want this to in the try.

Copy link
Contributor

@dryajov dryajov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets get this merged, we can address outstanding issues in subsequent PRs

@dryajov dryajov mentioned this pull request Nov 14, 2023
@tbekas tbekas merged commit 2396c4d into master Nov 14, 2023
8 checks passed
@tbekas tbekas deleted the blockexchange-uses-merkle-tree branch November 14, 2023 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upgrade Merkle Tree to avoid second-preimage attacks Merkelizing Block Data
6 participants