Skip to content

Lease based Lock Protocol

Kristoffer Just Andersen edited this page Jan 29, 2018 · 5 revisions

High-Level Problem Description

We want to have a 'resource' (say, a memory cell with read and write) take part in a distributed system. We want to control races on this resource by means of mutual exclusion. A traditional shared memory approach is a locking mechanism, but a traditional lock doesn't quite carry into the distributed paradigm, cf. "How to do distributed locking" by Martin Kleppmann.

Kleppmann argues that there are essentially two reasons for wanting locking in a distributed system:

  1. efficiency - e.g. avoiding retries, recomputing expensive functions etc. his example for worst case scenario for improper locking are decreases in latency, increases in server costs and e.g. a user receiving a mail notification twice.
  2. correctness - Here improper locking entails everything from data loss, inconsistent system states resulting in improper dosages of medicine etc.

He then outlines a scenario where two clients of the locking mechanism can come to believe they are holding the same lock:

Bug

// THIS CODE IS BROKEN
function writeData(filename, data) {
 var lock = lockService.acquireLock(filename);
 if (!lock) {
     throw 'Failed to acquire lock';
 }

 try {
     var file = storage.readFile(filename);
     var updated = updateContents(file, data);
     storage.writeFile(filename, updated);
 } finally {
     lock.release();
 }
}

The faulty scenario arises as follows:

Faulty Scenario 1

Here he mentions that a distributed lock typically has a timeout, to prevent a crashed node from blocking the entire system - non-standard from plain shared-memory concurrent locks.

The scenario cannot be avoided by checking of valid expiry etc. Latency of instructions can vary at any point in the program due to Garbage Collection or scheduling, meaning any checks of timings are the wrong approach to the problem.

The system cannot make any assumptions on time or timings. I guess the only information we are allowed to work with is the observed order of effects. The only kind of time one may rely on is a timeout.

What is the distinction between a client deciding a timeout has not expired and a server determining that a timeout has expired? In this particular instance it's clear that the server can err on the side of caution by letting timeouts run longer than promised, but is there something general to be said?

Solution

... Note, this requires the storage serve to take an active part in the locking protocol by checking tokens

Interesting.

Summary

In a setting where defensive behaviour and fault tolerance is desired, the proposed solution is thus to introduce a lease-based locking mechanism. It differs from traditional shared-memory concurrency locks as follows:

  1. has a time-out
  2. issues a fencing token for proof of lease-hold

The challenge, from a DiSeL perspective is to see if such a scenario can be built, and to what extent it could be built by composing existing, separate protocols for a resource and a locking mechanism.

Resource

A Server provides access to a memory cell containing a primitive value which can be read or written.

Messages

  • ClientRead
  • ClientWrite(n)
  • ServerReadAck(n)
  • ServerWriteAck

Protocol

ServerState(n). --ClientRead--> ServerState(n), ServerReadAck(n)
                --ClientWrite(m)--> ServerState(m), ServerWriteAck

Lock

A Server provides clients with lease tokens, that can be used to end a lease. While their lease is active, they have exclusive access to some other resource.

Messages

  • ClientLockReq
  • ServerLockAck(t)
  • ClientReleaseReq(t)
  • ServerReleaseOK
  • ServerReleaseFail

Protocol

The server must maintain the next available lease token, as well as the currently highest served lease.

ServerState(t,n) --ClientLockReq--> ServerState(t+1,n), ServerLockAck(t)
                 --ClientReleaseReq(m), m > n--> ServerState(t, m), ServerReleaseOK
                 --ClientReleaseReq(m), m <= n--> ServerState(t, n), ServerReleaseFail

Properties

  • The state of the server is monotonically increasing
  • At any point in time, ServerState(t, n) satisfies t > n.