Some ideas on the problem of "old" changes #35

HerbCaudill (Maintainer) · 2021-12-18T12:39:04Z

The problem

I’ve been thinking about the general problem of merging in old changes to a CRDT, like where a device goes offline for a year and then shows up with changes that are “concurrent” with all the activity that's happened since then.

Two examples, one without malicious intent and the other a deliberate exploit:

1. Alice makes a change on her phone while disconnected. Before she ever connects, she gets a new phone and puts the old phone in a drawer. A year later, she pulls out the old phone and it reconnects. Now her old change is concurrent with a year's worth of changes, and could override any number of those changes.
2. Eve has authorized a "sleeper" device to be used at an opportune moment. Bob removes Eve from the admin role. On her device, Eve removes Bob from the group altogether. The change shows up as concurrent with Bob's action, and Eve wins the conflict and stays in the group.

(Both examples assume well-behaved client applications. A member using a modified client would have even more ways to cause mischief.)

This is a security concern in the context of localfirst/auth, but in general it’s also a big usability problem for distributed apps. If a long-dormant device can come online and introduce a single operation that overturns months' worth of activity, people will perceive the app as unstable — even if there's no malice and no security issues involved.

And of course it's the possibility of a change being introduced anywhere in the DAG that forces us to lug around our entire history of changes until the end of time, rather than taking a snapshot at a point in time and discarding older history.

Martin @ept tells me that this is an active area of research under the label of "causal stability". As in: Application state up to and including a given action A0 is "causally stable" if we know that we'll never see a new change that's earlier than or concurrent to A0. That could be because we know every device has seen change A0, or it could be because we know that any new changes based on anything earlier than A0 will be rejected.

Collecting and sharing sync status

To know when we have causal stability, we'll need to keep track of each device's sync status and share that information as part of the sync process.

In lf/auth, every sync message that Alice sends already includes Alice's current head. We can add to that a timestamp (more about that later), and sign those two things. So something like

{
  head: ['ddd'],
  timestamp: 1639764818900,
  signature: 'xyz',
  // ... rest of sync message
}

... where signature is Alice's signature of the object { head, timestamp }.
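A minimal TypeScript sketch of how the signed part of a sync message could be assembled. The names `SignedSyncHeader`, `syncHeaderPayload`, `makeSyncHeader`, and the injected `sign` function are hypothetical and are not the lf/auth API. Serializing with `JSON.stringify` is also an assumption; a real implementation would need a canonical encoding so that signer and verifier see exactly the same bytes.

```typescript
// Hypothetical shape of the signed portion of a sync message.
interface SignedSyncHeader {
  head: string[] // hashes of the sender's current head(s)
  timestamp: number // sender's clock, in ms since the Unix epoch
  signature: string // signature over { head, timestamp }
}

// `sign` stands in for whatever signing primitive the app uses (assumption).
type SignFn = (payload: string) => string

// Builds the string that gets signed. The head is sorted on a copy so that
// the same set of hashes always serializes identically.
function syncHeaderPayload(head: string[], timestamp: number): string {
  return JSON.stringify({ head: head.slice().sort(), timestamp })
}

function makeSyncHeader(head: string[], timestamp: number, sign: SignFn): SignedSyncHeader {
  return { head, timestamp, signature: sign(syncHeaderPayload(head, timestamp)) }
}
```

The receiver would rebuild the same payload from `head` and `timestamp` and check the signature against the sender's public key before trusting either value.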

When Alice and Bob finish syncing — which happens when they both have the same head, and they both know they have the same head — Alice records Bob's head, timestamp, and signature under Bob's name in a dictionary that she stores alongside the chain:*

// sync status
{
  alice_laptop: { head: ['ddd'], timestamp: 1639764818900, signature: 'xyz' },
  bob_laptop: { head: ['ddd'], timestamp: 1639764989900, signature: 'qrs' },
  bob_phone: { head: ['ccc'], timestamp: 1639763847200, signature: 'abc' },
  charlie_tablet: { head: ['aaa', 'bbb'], timestamp: 1639762123000, signature: 'mno' }
}

*I don't think we can store this information on the chain itself, because then each sync status update would require us to sync again and we'd never finish. So I think it'd have to be kept in a separate structure, which is fine because it's more compact and we only care about the most recent value for each device.

Every time we sync, we will share our entire sync status table once, perhaps as part of the initial sync message. When Alice receives Bob's sync status table, she goes through and updates each of her entries as needed: If the head Bob has listed for Charlie is more recent than what Alice has in her table, she updates hers, otherwise she ignores it.

We can trust this information 100%, even if we receive it second-hand: Bob's self-reported head has to be accurate or he'd never be able to complete the sync process. And if we learn about Charlie's status through Bob, we know that Bob hasn't modified it because it comes with Charlie's signature.
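The table update described above could be sketched like this in TypeScript. Everything here is illustrative: `SyncEntry`, `SyncTable`, `mergeSyncTables`, and the `verify` callback are hypothetical names, and deciding "more recent" by comparing the signed timestamps is an assumption (comparing the heads' positions in the graph would be another option).

```typescript
interface SyncEntry {
  head: string[]
  timestamp: number
  signature: string
}

// One entry per device, keyed by a device identifier.
type SyncTable = Record<string, SyncEntry>

// Hypothetical callback: checks that `entry` carries a valid signature
// from the device it claims to describe.
type VerifyEntry = (deviceId: string, entry: SyncEntry) => boolean

// Returns a new table; neither input is modified.
function mergeSyncTables(mine: SyncTable, theirs: SyncTable, verify: VerifyEntry): SyncTable {
  const merged: SyncTable = { ...mine }
  for (const id of Object.keys(theirs)) {
    const incoming = theirs[id]
    // Second-hand entries are only trusted if the device's own signature checks out.
    if (!verify(id, incoming)) continue
    const current: SyncEntry | undefined = merged[id]
    // Assumption: the entry with the later signed timestamp is the more recent one.
    if (current === undefined || incoming.timestamp > current.timestamp) {
      merged[id] = incoming
    }
  }
  return merged
}
```

Because each entry is signed by the device it describes, the merge can take entries from any peer's table without caring who relayed them.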

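If our sync status table includes every device used by every member of the team, then we can define A1 as the earliest head listed there, and A0 as the first predecessor of A1 that has no concurrent actions. Every device has already seen A0, so the state up to and including A0 is causally stable.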

We can now snapshot the application state as of A0 and discard everything in the chain up to and including A0. And we don't need to worry about a backdating attack prior to that point.

Limiting old updates by calendar time

That's all well and good, but it wouldn't prevent either of the scenarios described above. We might notice that all but one device is relatively up-to-date, but we can't do anything about it.

It would be great for an app to have the option of limiting how far out of date a device can be before we require it to catch up before submitting changes. That limit L could be a day, or a month, or a year — whatever it was, the idea would be that you couldn't base a change on a head that's older than that. Instead you'd have to catch up with the latest information, and then rebase your change onto the current head.

In general we know that we can't trust people's self-reported timestamps for the purposes of determining the correct order of actions.

However, when two devices are syncing in real time, it would be reasonable to require them both to be showing approximately the same time. So when Alice receives a timestamped sync message from Bob, if his self-reported time differs from hers by more than some margin (say 5 minutes, or even an hour; it doesn't need to be very precise for this purpose), she refuses to connect.
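As a rough TypeScript sketch, the check could look like this. The function name and the 5-minute default are placeholders taken from the example figure above, not settled values.

```typescript
// Hypothetical default margin: 5 minutes, in milliseconds.
const MAX_CLOCK_SKEW_MS = 5 * 60 * 1000

// Returns true if the peer's self-reported time is close enough to our own clock
// (in either direction) for us to go ahead and sync.
function isClockSkewAcceptable(
  remoteTimestamp: number,
  localNow: number,
  maxSkewMs: number = MAX_CLOCK_SKEW_MS
): boolean {
  return Math.abs(remoteTimestamp - localNow) <= maxSkewMs
}
```

In practice `localNow` would be `Date.now()` at the moment the sync message arrives, so network latency also counts against the margin.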

Now, we also have a timestamp attached to each action, which we haven't used for anything thus far. We still can't trust that timestamp for purposes of conflict resolution.

However, if we make sure when syncing that we never accept an action timestamped in the future, we can now place an upper bound on an action's time: It can't have happened after it was first seen by someone other than the author, or after a dependent action by a different author. And a lower bound: It can't have happened before a preceding action by a different author.

This allows us to create a straightforward way to limit updates by calendar time. Suppose we set a limit of 1 month (L = 2629800000 ms). We identify A1, the latest action that has no concurrent actions and has a timestamp older than L. We then identify A0, the first action preceding A1 by a different author than A1. We will now refuse to accept new actions based on actions that precede A0. If you're syncing, and you have new changes, and you don't know about A0, you will need to sync first, and then rebase your changes on A0 or a later change.

Limiting old updates by logical time

An alternative, simpler way to define A0 would be to ignore timestamps and calendar time altogether, and set a limit based on the absolute number of changes elapsed: say N = 100 changes, or 1000 changes, or what have you. Here the algorithm would be something like: Working backwards from the head, count the number of actions. When that count is greater than or equal to N and we're on an action with no concurrent actions, we've found A0.

One objection to this is that Eve could make a change and then stuff the chain with N inconsequential actions in order to force others to rebase on her change. To prevent that we could decide to count runs of actions by the same author as a single increment; so N would be redefined as the number of actions whose parents include an action by a different author.
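Here is a small TypeScript sketch of the counting rule with the same-author adjustment. To keep it short it assumes a linear history (an array ordered oldest to newest, where each action's only parent is the one before it), so every action trivially has no concurrent actions; a real chain is a DAG and would need that extra check. `LinkAction` and `findLogicalCutoff` are hypothetical names.

```typescript
interface LinkAction {
  hash: string
  author: string
}

// Walks backwards from the head (last element). The count only goes up when an
// action's parent was written by a different author, so a run of actions by one
// author counts as a single increment. Returns the action at which the count
// reaches n (the candidate A0), or undefined if the history is too short.
function findLogicalCutoff(chain: LinkAction[], n: number): LinkAction | undefined {
  let count = 0
  for (let i = chain.length - 1; i > 0; i--) {
    if (chain[i].author !== chain[i - 1].author) count++
    if (count >= n) return chain[i]
  }
  return undefined
}
```

With a history of alice, bob, eve, eve, eve, charlie and N = 2, the three consecutive actions by Eve contribute one increment between them, so stuffing the chain doesn't move the cutoff any faster than a single action would.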

At any rate, my guess is that a calendar-based limit will make more intuitive sense to most users.

HerbCaudill (Maintainer, Author) · 2021-12-18T14:54:59Z

Another twist is that we have to define what it means to say that someone's data is "out of date" when there's no central authority, and we have to decide who needs to catch up with who. Consider a couple of scenarios, with a 1-week calendar time cutoff.

1. We have a team with 10 members. Alice makes a single change but doesn't connect with anyone, and then goes on vacation for two weeks and leaves her device at home. During that time the rest of the group is active and stays in sync.

   When Alice reconnects, we're all agreed that the device will need to rebase its one pending change on the group's state, and not the other way around.

2. As before, but Alice goes on a field trip in a remote area and is collecting data while offline. Alice makes a bunch of changes on her tablet over the course of two weeks. The rest of the team stays in sync during that period.

   Here it's not quite as clear, but still it seems reasonable for the one person who has been working independently to rebase, rather than the 9 who have stayed in sync.

3. Our group consists of Alice, Bob, and Charlie. Bob last connected with Alice and Charlie two weeks ago. Alice and Charlie are synced up. Everyone has made lots of changes during that time.

   Following the same logic, it's 2 vs. 1: Bob is the one who needs to catch up with Alice & Charlie.

4. Alice and Bob are the only members of the group. They last connected two weeks ago and haven't synced up since then, but during that time they've both been busily making changes on their own devices.

   When they connect, we don't have an obvious way of deciding who should be required to rebase their changes, but we do need a deterministic way of choosing one.

5. We have Alice, Bob, Charlie, and Dwight. Alice and Bob are synced with each other, and Charlie and Dwight are synced with each other, but neither Alice nor Bob has connected with Charlie or Dwight for two weeks. Say Alice now connects with Charlie.

   We're now in a similar situation to (4): we can't really say that one is "behind" the other, so we need a way of choosing one.

I think there would be enough information in the sync status tables to figure out who needs to rebase just by comparing two peers' tables, since these include unfakeable evidence that a peer has synced, directly or indirectly, with other peers at specific times. So perhaps we just count the number of peers each one has synced with within the time limit (in this case within the last week), and whoever has been in sync with more people wins. Tiebreaker would just be something arbitrary + deterministic, like a hash of the whole sync table.
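A possible TypeScript sketch of that comparison, under several assumptions: both peers evaluate it with the same argument order and the same `referenceTime` (for instance a time agreed during the sync handshake), each table includes the peer's own device, and a plain string comparison of a serialized table stands in for comparing real hashes. All names are hypothetical.

```typescript
interface PeerSyncEntry {
  head: string[]
  timestamp: number
  signature: string
}
type PeerSyncTable = Record<string, PeerSyncEntry>

// Number of devices this table shows as synced within the time limit.
function countRecentPeers(table: PeerSyncTable, referenceTime: number, limitMs: number): number {
  return Object.keys(table).filter((id) => referenceTime - table[id].timestamp <= limitMs).length
}

// Deterministic serialization used only for the tiebreak (stand-in for a hash).
function tableFingerprint(table: PeerSyncTable): string {
  return Object.keys(table)
    .sort()
    .map((id) => id + ":" + table[id].head.join(",") + "@" + table[id].timestamp)
    .join("|")
}

// Returns which side has to rebase: the one that has been in sync with fewer peers.
function whoRebases(a: PeerSyncTable, b: PeerSyncTable, referenceTime: number, limitMs: number): "a" | "b" {
  const countA = countRecentPeers(a, referenceTime, limitMs)
  const countB = countRecentPeers(b, referenceTime, limitMs)
  if (countA !== countB) return countA < countB ? "a" : "b"
  // Arbitrary but deterministic tiebreak.
  return tableFingerprint(a) < tableFingerprint(b) ? "a" : "b"
}
```

In scenario 3, Alice's table shows two devices synced within the last week (hers and Charlie's) while Bob's shows only his own, so Bob is the one asked to rebase.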

1 reply

nikgraf Dec 21, 2021

Really interesting insights! I think what you propose could work, but haven't thought it through completely.
My biggest concern right now is the UX impact. It probably could be solved with a status per change "in progress" vs "confirmed" depending on how many people the change has been synced with.

Another thought: This breaks if someone could add an arbitrary number of members. Then it comes down to: Did the majority of "admins" that can add members actually agree? And in the end it comes down to: Who can add admins?

Who can add admins could be solved by, e.g., requiring a majority of admins to agree before someone is promoted to admin. I think with these restrictions this could result in a working model. What do you think?

HerbCaudill (Maintainer, Author) · 2021-12-21T18:49:41Z

> Another thought: This breaks if someone could add an arbitrary number of members. Then it comes down to: Did the majority of "admins" that can add members actually agree? And in the end it comes down to: Who can add admins?

The admin role has unlimited powers. It's all or nothing. Perhaps "owner" or "superadmin" would be a better term. The idea is that when the founder of a group grants admin authority to another member, she is giving them complete control over the group, up to and including the right to remove any other admins, even the founder herself.*

One of the basic assumptions I've made about these teams is that all authority flows from the founder. They're not democracies: They're ruled unilaterally. If the founder delegates her power to another admin, she's 100% trusting them to act on her behalf. This assumption allows us to avoid the complexity of voting or consensus mechanisms.

We still need to come up with sensible mechanics for resolving situations where different admins do certain things concurrently, but that shouldn't lead us to conclude that admins are somehow untrusted. Anything an admin chooses to do is acceptable by definition. And while the logic for determining "who needs to catch up with whom" might involve something that looks like a vote, it's not that — it's not so much a security mechanism as it is a way of choosing the least surprising outcome in (hopefully) rare edge cases.

*I'd eventually like to be able to make this less of an all-or-nothing thing, allowing you to create roles that have some admin powers but not all.

0 replies

ept · 2022-12-30T15:41:03Z

I've been thinking about this some more lately, in particular the idea of limiting old updates by calendar time, which seems like it would make the most sense from a user's point of view. In that context I noticed a problem that I don't know how to solve; maybe you have thought about it?

Let's say we want to reject updates that are more than one month out of date. Even if we use a clock sync protocol to limit the clock skew between devices, we could end up in the following situation:

Alice has been offline for slightly less than a month, and wants to sync her update with Bob and Charlie.
Bob's clock is running slightly slow, and so he thinks that Alice's update is just less than a month old, and so Bob accepts Alice's update.
Charlie's clock is running slightly fast, and so he thinks that Alice's update is slightly more than a month old, and so Charlie rejects Alice's update.

Now Bob and Charlie are in inconsistent states because they have accepted different sets of updates. If we can tightly bound the clock skew between devices we can reduce the probability of this situation occurring, but we cannot rule it out entirely when updates have an age that is close to the cut-off.
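To make the boundary case concrete, here is a small TypeScript sketch with made-up numbers: an update that is one minute short of a month old, Bob's clock five minutes slow, and Charlie's five minutes fast. The function name and the specific offsets are illustrative only; the one-month value is the same 2629800000 ms used earlier in the thread.

```typescript
const ONE_MONTH_MS = 2629800000

// Each device judges the age of an update against its own local clock.
function acceptsByLocalClock(updateTimestamp: number, localClock: number, limitMs: number): boolean {
  return localClock - updateTimestamp <= limitMs
}

// Hypothetical scenario, all times in ms.
const trueNow = 1672414863000
const aliceUpdate = trueNow - ONE_MONTH_MS + 60 * 1000 // one minute short of a month old
const bobClock = trueNow - 5 * 60 * 1000 // running slow
const charlieClock = trueNow + 5 * 60 * 1000 // running fast

const bobAccepts = acceptsByLocalClock(aliceUpdate, bobClock, ONE_MONTH_MS)
const charlieAccepts = acceptsByLocalClock(aliceUpdate, charlieClock, ONE_MONTH_MS)
```

Here `bobAccepts` is `true` and `charlieAccepts` is `false`, even though both devices are within five minutes of the true time. Tightening the skew bound narrows the window in which this can happen but doesn't close it.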

Maybe it's okay as long as this inconsistency is only temporary, but then there needs to be some sort of process by which Bob and Charlie eventually come to agreement on which updates to accept and which to reject. None of the algorithms I can think of are really convincing, though:

When Bob and Charlie sync and realise they have made inconsistent accept/reject decisions, they could all default to accepting the update in question. However, in the example with Eve's sleeper device, Eve could then ensure her update gets accepted by having a second sleeper device (one that generates the update, and the second one accepts it, resulting in all devices accepting her update).
Similarly to the above, but default to rejecting the update when there is disagreement. Then Eve's sleeper device could prevent any legitimate user from performing updates by telling the other devices that those updates should be rejected. In fact, Eve could stop herself from being removed by rejecting her own removal.
Some kind of majority vote would prevent one rogue device from messing things up. However, this would no longer be a CRDT since two devices that sync may remain inconsistent until they have received a response from a majority of devices, which might take a long time if devices are frequently offline. There is also the risk of a Sybil attack, whereby Eve adds lots of devices that she controls and hence commands a majority; the voting algorithm would have to protect against Sybil attacks.

Do you have any ideas on how to resolve this?

5 replies

HerbCaudill Dec 30, 2022
Maintainer Author

Let's think about this graph. The letters represent hashes and the numbers represent timestamps. Suppose our limit L is 5 time units. We want an algorithm in which Alice, Bob, and Charlie will consistently reject Eve's update K.

We don't want to rely on each actor's clock at the time of syncing, since different actors will have different clocks. Instead we can just pay attention to the timestamps on the nodes, which will be the same for everyone.

One way of expressing the problem with Eve's update is that the last common ancestor of Eve's head and our head was a long time ago, relative to one or both heads. (Not relative to Alice or Bob or Charlie's current clock.)

A bit more formally:

Let L be a maximum timestamp delta
Let N1 and N2 be two nodes in the graph, and N0 their most recent common ancestor
Let T1 and T2 be the nodes' timestamps, and T0 the common ancestor's timestamp
Let ∆1 = T1 - T0 and ∆2 = T2 - T0 (the deltas between the nodes' timestamps and the ancestor's timestamp)
Neither delta can be larger than L. If ∆1>L or ∆2>L, the graph is invalid

Now when Eve tries to sync up with Alice or anyone else, they'll calculate 9 (H) - 2 (B) = 7 which is more than 5; and 8 (K) - 2 (B) = 6 which is also more than 5; so they should each refuse to sync.
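The rule can be sketched in TypeScript as follows. This is only an illustration: the type and function names are hypothetical, "most recent common ancestor" is taken to mean the common ancestor with the highest timestamp (an assumption), and the example graph used in the note below is a simplified stand-in that keeps only the three timestamps quoted above (B = 2, H = 9, K = 8), with H and K hanging directly off B.

```typescript
interface GraphNode {
  hash: string
  timestamp: number
  parents: string[]
}
type HashGraph = Record<string, GraphNode>

// All hashes reachable from `hash` by following parent links, including `hash` itself.
function ancestorsOf(graph: HashGraph, hash: string): string[] {
  const seen: Record<string, boolean> = {}
  const stack: string[] = [hash]
  while (stack.length > 0) {
    const current = stack.pop() as string
    if (seen[current]) continue
    seen[current] = true
    for (const parent of graph[current].parents) stack.push(parent)
  }
  return Object.keys(seen)
}

// Assumption: the "most recent" common ancestor is the one with the highest timestamp.
function mostRecentCommonAncestor(graph: HashGraph, a: string, b: string): GraphNode | undefined {
  const ancestorsOfA = ancestorsOf(graph, a)
  let best: GraphNode | undefined = undefined
  for (const hash of ancestorsOf(graph, b)) {
    if (ancestorsOfA.indexOf(hash) === -1) continue
    const candidate = graph[hash]
    if (best === undefined || candidate.timestamp > best.timestamp) best = candidate
  }
  return best
}

// The merge is allowed only if neither head is more than `limit` past the common ancestor.
function isMergeAllowed(graph: HashGraph, head1: string, head2: string, limit: number): boolean {
  const ancestor = mostRecentCommonAncestor(graph, head1, head2)
  if (ancestor === undefined) return false // no shared history at all
  const delta1 = graph[head1].timestamp - ancestor.timestamp
  const delta2 = graph[head2].timestamp - ancestor.timestamp
  return delta1 <= limit && delta2 <= limit
}
```

With the stand-in graph, `isMergeAllowed(graph, 'H', 'K', 5)` is `false`, because the deltas are 7 and 6. Since the check only reads timestamps stored on the nodes, every peer that evaluates it on the same graph gets the same answer.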

At this point I think it'd be up to the application to decide what happens next. If Eve isn't being malicious, she can stash K, resetting her graph to B, at which point she can sync with Alice, and then make a new node K' that depends on H (or any node more recent than C). Then everyone can happily sync up.

HerbCaudill Dec 30, 2022
Maintainer Author

Here's another way to frame the same idea. This discussion started out being about how to know when we have causal stability by keeping track of who's synced up to what point. If we determine that everyone knows about a given node, we can snapshot our state prior to that node and discard all its predecessors.

But we also want to put some kind of limit on how long the "eventual" in "eventual consistency" can take. So we say, after our time limit L has elapsed, we don't care if you've not synced up, we're going to throw away our history anyway. So in the above example, Alice's client might decide that C is our new root and everything prior is ancient history. (Again, we decide by comparing that node's timestamp to the head's timestamp, not to our current clock.)

Now Alice clearly can't sync with Eve, because their graphs have different roots. If Eve wants to sync up with Alice, the only way would be for her to get Alice's starting state + graph, and then base her change(s) on the current head, or at least on a node that's still around.

ept Dec 31, 2022

Thanks for your reply. This seems like a good solution. Being able to throw away the parts of the history older than L is also a nice optimisation.

If there haven't been any updates for some time, we probably need to add a no-op update to the graph occasionally (on average once every L or so), so that the timestamp on the latest update keeps moving forward. Do you agree? Which device adds this update doesn't really matter; we could randomise the timeout to reduce the probability of several devices generating no-ops at the same time (though this wouldn't do any harm aside from a small compute and storage overhead).
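One way the randomised timeout could look, as a TypeScript sketch. The uniform distribution between 0.5 L and 1.5 L is an arbitrary choice made here so that the mean comes out at L; the function names are hypothetical.

```typescript
// Picks how long this device waits, after the latest update it knows of, before
// adding a no-op. Uniform in [0.5 * L, 1.5 * L), so the average delay is L.
// `random` is injectable so the behaviour can be tested deterministically.
function nextNoOpDelay(limitMs: number, random: () => number = Math.random): number {
  return limitMs * (0.5 + random())
}

// True once the latest update is older than this device's chosen delay.
function shouldEmitNoOp(latestUpdateTimestamp: number, now: number, delayMs: number): boolean {
  return now - latestUpdateTimestamp >= delayMs
}
```

Each device would draw a fresh delay whenever it sees a new update. If another device's no-op arrives first, the timer simply resets, which keeps duplicate no-ops rare without any coordination.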

It could happen that there are two subgroups that are able to sync within each subgroup, but not across subgroups (for example, two different teams going for fieldwork in different locations, syncing via local network in each location but with no long-range networking/internet access). If they are disconnected from each other for longer than L, then when they reconnect, they will refuse to sync with each other. Effectively, the group will have forked into two separate groups. I think that's probably the best outcome in this situation.

If such a fork is non-malicious and the subgroups want to merge again, one of them will have to rebase their updates on top of the other's, but it's not clear how we would choose which one to rebase and which one to leave as-is. If one subgroup consists of only a single device, then it makes sense for that one device to rebase its operations, but if the two subgroups are of a similar size, the choice is more arbitrary. (No-op updates could be discarded during rebasing.)

A device can fairly easily rebase its own updates, but rebasing an update from another device is harder because the signature can only be computed by the device that originated the update. Either several devices would need to perform the rebase in a coordinated way to produce new valid signatures on the rebased updates, or we need some kind of special handling of rebases that allows the signature on the original update to be reused. This seems like it would be possible, but also fairly complex; given that forked subgroups are probably a fairly obscure edge case for most users, maybe it's also fine to simply not support rebasing.

Then there is a question of user interface: how do we explain to the user that their device is refusing to sync with another device because of such a fork, and what options do we give the user to resolve the situation? The simplest solution would be to allow the user to discard their local updates and accept whatever the other device wants to send them (similar in effect to deleting and reinstalling the app). A more sophisticated solution would compute the rebase of the local updates and offer to apply them.

ept Jan 1, 2023

I just realised that there is still a problem with using the timestamps on updates. Consider the graph in your example, with all users (Alice, Bob, Eve) initially having updates A to G. Eve then generates update H and sends it to Alice:

Alice accepts update H since it is a well-formed part of the graph.

Concurrently (possibly on another device), Eve generates updates I and J and sends them to Bob:

Bob accepts the updates he receives from Eve, even though the updates are a bit old. Bob has not yet seen update H, and the timestamp differences between G and B (4) and between J and B (3) are less than 5, so this graph is still acceptable.

However, if Alice and Bob now try to sync, they will refuse each other, because they now have branches that are too far apart. Even though Alice and Bob were never disconnected for a long time, Eve was able to trick them into becoming diverged. It seems difficult to stop Eve from being able to do this kind of thing, because all of Eve's operations are legitimate. And if Eve can prevent other users' devices from syncing with each other, that's quite a big potential for disruption – for example, if one user tries to remove Eve, Eve might be able to stop that removal from propagating to the rest of the group.

HerbCaudill Jan 3, 2023
Maintainer Author

Hmm right. And the same thing could happen innocently (Charlie gives Alice the older change while Dwight gives Bob the newer one).

Maybe the periodic no-op nodes would prevent the older changes from being accepted? Have to think about this more.
