Lost Updates and Write Skew

1. Preventing Lost Updates

  • Besides dirty writes, there are several other kinds of race conditions that can occur between concurrently writing transactions. The best known of these is the lost update problem. It’s a common problem, which can occur if an application reads some value from the database, modifies it, and writes back the modified value (a read-modify-write cycle). If two transactions do this concurrently, one of the modifications can be lost, because the second write does not include the first modification
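  • For example, here is a minimal sketch of the unsafe cycle (using the counters table from the atomic-increment example below; the concrete values are illustrative):

-- Two transactions each run this read-modify-write cycle concurrently:
SELECT value FROM counters WHERE key = 'foo';      -- both read value = 42
-- ... application code computes 42 + 1 = 43 ...
UPDATE counters SET value = 43 WHERE key = 'foo';  -- both write 43
-- The counter ends up at 43 instead of 44: one increment has been lost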

1.1 Atomic write operations

UPDATE counters SET value = value + 1 WHERE key = 'foo';
  • Many databases provide atomic update operations, which remove the need to implement read-modify-write cycles in application code. They are usually the best solution if your code can be expressed in terms of those operations
  • Similarly, document databases such as MongoDB provide atomic operations for making local modifications to a part of a JSON document, and Redis provides atomic operations for modifying data structures such as priority queues. Not all writes can easily be expressed in terms of atomic operations—for example, updates to a wiki page involve arbitrary text editing
  • Atomic operations are usually implemented by taking an exclusive lock on the object when it is read so that no other transaction can read it until the update has been applied. This technique is sometimes known as cursor stability. Another option is to simply force all atomic operations to be executed on a single thread
  • Unfortunately, object-relational mapping frameworks make it easy to accidentally write code that performs unsafe read-modify-write cycles instead of using atomic operations provided by the database. That’s not a problem if you know what you are doing, but it is potentially a source of subtle bugs that are difficult to find by testing

1.2 Explicit locking

BEGIN TRANSACTION;
-- FOR UPDATE clause indicates that the database should take a lock on all rows returned by this query
SELECT * FROM figures WHERE name = 'robot' AND game_id = 222 FOR UPDATE;
-- Check whether move is valid, then update the position
UPDATE figures SET position = 'c4' WHERE id = 1234;
COMMIT;
  • Another option for preventing lost updates is for the application to explicitly lock objects that are going to be updated. The application can then perform a read-modify-write cycle; if any other transaction tries to concurrently read the same object, it is forced to wait until the first read-modify-write cycle has completed

1.3 Automatically detecting lost updates

  • Atomic operations and locks are ways of preventing lost updates by forcing the read-modify-write cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the transaction manager detects a lost update, abort the transaction and force it to retry its read-modify-write cycle
    • An advantage of this approach is that databases can perform this check efficiently in conjunction with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates. Some authors argue that a database must prevent lost updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot isolation under this definition
    • Lost update detection is a great feature, because it doesn’t require application code to use any special database features—you may forget to use a lock or an atomic operation and thus introduce a bug, but lost update detection happens automatically and is thus less error-prone
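    • For example, a minimal two-session sketch of this detection under PostgreSQL’s repeatable read (reusing the counters table from earlier; the error message is PostgreSQL’s):

-- Session A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT value FROM counters WHERE key = 'foo';      -- reads 42
-- Meanwhile, session B updates the same row and commits:
--   UPDATE counters SET value = 43 WHERE key = 'foo';
-- Session A now writes back the value computed from its stale read:
UPDATE counters SET value = 43 WHERE key = 'foo';
-- ERROR: could not serialize access due to concurrent update
-- Session A is aborted and must retry its read-modify-write cycle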

1.4 Compare-and-set

-- This may or may not be safe, depending on the database implementation
UPDATE wiki_pages SET content = 'new content' WHERE id = 1234 AND content = 'old content';
  • In databases that don’t provide transactions, you sometimes find an atomic compare-and-set operation. The purpose of this operation is to avoid lost updates by allowing an update to happen only if the value has not changed since you last read it. If the current value does not match what you previously read, the update has no effect, and the read-modify-write cycle must be retried
  • However, if the database allows the WHERE clause to read from an old snapshot, this statement may not prevent lost updates, because the condition may be true even though another concurrent write is occurring. Check whether your database’s compare-and-set operation is safe before relying on it
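  • A common variant is to compare an explicit version number instead of the full content. A minimal sketch, assuming the wiki_pages table has a version column (not part of the example above); the application checks the number of affected rows and retries if it is zero:

-- Read the current content together with its version number
SELECT content, version FROM wiki_pages WHERE id = 1234;   -- e.g., version = 3
-- Write back only if the version is still the one we read;
-- if this reports 0 rows updated, another write won: re-read and retry
UPDATE wiki_pages SET content = 'new content', version = version + 1
WHERE id = 1234 AND version = 3;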

1.5 Conflict resolution and replication

  • In replicated databases, preventing lost updates takes on another dimension: since such databases have copies of the data on multiple nodes, and the data can potentially be modified concurrently on different nodes, additional steps need to be taken to prevent lost updates
  • Locks and compare-and-set operations assume that there is a single up-to-date copy of the data. However, databases with multi-leader or leaderless replication usually allow several writes to happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a single up-to-date copy of the data. Thus, techniques based on locks or compare-and-set do not apply in this context
  • A common approach in such replicated databases is to allow concurrent writes to create several conflicting versions of a value (also known as siblings), and to use application code or special data structures to resolve and merge these versions after the fact
  • Atomic operations can work well in a replicated context, especially if they are commutative (i.e., you can apply them in a different order on different replicas, and still get the same result). For example, incrementing a counter or adding an element to a set are commutative operations. That is the idea behind Riak 2.0 datatypes, which prevent lost updates across replicas. When a value is concurrently updated by different clients, Riak automatically merges together the updates in such a way that no updates are lost
  • On the other hand, the last write wins (LWW) conflict resolution method is prone to lost updates. Unfortunately, LWW is the default in many replicated databases

2. Write Skew and Phantoms

  • There are some subtler race conditions that can occur between concurrent writes, such as write skew

2.1 Characterizing write skew

  • Write skew is neither a dirty write nor a lost update, because the two transactions are updating two different objects. You can think of write skew as a generalization of the lost update problem. Write skew can occur if two transactions read the same objects, and then update some of those objects (different transactions may update different objects). In the special case where different transactions update the same object, you get a dirty write or lost update anomaly (depending on the timing)
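  • The classic illustration is an on-call rota: a hospital requires at least one doctor to be on call, and Alice and Bob both try to go off call at the same moment. A minimal sketch under snapshot isolation (using the doctors table from the locking snippet below):

-- Alice's transaction and Bob's transaction both run this concurrently:
BEGIN TRANSACTION;
SELECT COUNT(*) FROM doctors WHERE on_call = true AND shift_id = 1234;
-- Both transactions see count = 2, so the application lets both proceed
UPDATE doctors SET on_call = false WHERE name = 'Alice' AND shift_id = 1234;
-- (Bob's transaction updates the row for 'Bob' instead)
COMMIT;
-- Both commit, and now zero doctors are on call for shift 1234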
  • There are various ways of preventing lost updates. With write skew, our options are more restricted:
    • Atomic single-object operations don’t help, as multiple objects are involved
    • The automatic detection of lost updates that you find in some implementations of snapshot isolation unfortunately doesn’t help either: write skew is not automatically detected in PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL Server’s snapshot isolation level. Automatically preventing write skew requires true serializable isolation (a sketch follows the locking example below)
    • Some databases allow you to configure constraints, which are then enforced by the database (e.g., uniqueness, foreign key constraints, or restrictions on a particular value). However, most databases do not have built-in support for constraints that involve multiple objects; depending on the database, you may be able to implement them with triggers or materialized views
    • If you can’t use a serializable isolation level, the second-best option is probably to explicitly lock the rows that the transaction depends on, as in the following example:
BEGIN TRANSACTION;
-- lock all rows returned by this query
SELECT * FROM doctors WHERE on_call = true AND shift_id = 1234 FOR UPDATE;
UPDATE doctors SET on_call = false WHERE name = 'Alice' AND shift_id = 1234;
COMMIT;
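  • For comparison, a minimal sketch that relies on true serializable isolation instead of explicit locking (assuming PostgreSQL, whose serializable level detects write skew via serializable snapshot isolation):

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT COUNT(*) FROM doctors WHERE on_call = true AND shift_id = 1234;
UPDATE doctors SET on_call = false WHERE name = 'Alice' AND shift_id = 1234;
COMMIT;
-- If Bob's transaction does the same concurrently, one of the two fails with
-- a serialization error and must be retried; no explicit locks are needed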

2.2 Phantoms causing write skew

  • There are some situations in which write skew can occur:
    • Meeting room booking system: Two users concurrently book the same meeting room for overlapping time periods
    • Multiplayer game: Two players move different figures to the same position on the board
    • Claiming a username: Two users try to create accounts with the same username at the same time
    • Preventing double-spending: Two spending items are inserted concurrently that together cause the balance to go negative
  • All of these examples follow a similar pattern:
    1. A SELECT query checks whether some requirement is satisfied by searching for rows that match some search condition
    2. Depending on the result of the first query, the application code decides how to continue (perhaps to go ahead with the operation, or perhaps to report an error to the user and abort)
    3. If the application decides to go ahead, it makes a write (INSERT, UPDATE, or DELETE) to the database and commits the transaction. The effect of this write changes the precondition of the decision of step 2
    • The steps may occur in a different order. For example, you could first make the write, then the SELECT query, and finally decide whether to abort or commit based on the result of the query
  • In some cases, we could make the transaction safe and avoid write skew by locking the rows in step 1 (SELECT FOR UPDATE). However, other cases check for the absence of rows matching some search condition, and the write adds a row matching the same condition. If the query in step 1 doesn’t return any rows, SELECT FOR UPDATE can’t attach locks to anything
  • This effect, where a write in one transaction changes the result of a search query in another transaction, is called a phantom. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the examples, phantoms can lead to particularly tricky cases of write skew
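  • For example, a minimal sketch of the meeting room booking case (the bookings schema and values are assumed for illustration). The SELECT checks for the absence of overlapping bookings, so there is no row for FOR UPDATE to lock:

BEGIN TRANSACTION;
-- Check for any existing bookings that overlap with the desired slot
SELECT COUNT(*) FROM bookings
WHERE room_id = 123
  AND end_time > '2024-01-01 12:00' AND start_time < '2024-01-01 13:00';
-- If count = 0, the application proceeds; two concurrent transactions can
-- both see count = 0 and both insert, creating a double booking
INSERT INTO bookings (room_id, start_time, end_time)
VALUES (123, '2024-01-01 12:00', '2024-01-01 13:00');
COMMIT;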

2.3 Materializing conflicts

  • If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we can artificially introduce a lock object into the database. For example, in the meeting room booking case we could create a table of time slots and rooms. Now a transaction that wants to create a booking can lock (SELECT FOR UPDATE) the rows in the table that correspond to the desired room and time period
  • This approach is called materializing conflicts, because it takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database. Unfortunately, it can be hard and error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control mechanism leak into the application data model. For those reasons, materializing conflicts should be considered a last resort if no alternative is possible. A serializable isolation level is much preferable in most cases
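  • A minimal sketch for the booking example (assuming a room_time_slots table pre-populated with all slots for, say, the next six months):

BEGIN TRANSACTION;
-- Lock the pre-created slot rows for the desired room and time period;
-- a concurrent booking for the same slots must wait for this lock
SELECT * FROM room_time_slots
WHERE room_id = 123
  AND slot_start >= '2024-01-01 12:00' AND slot_start < '2024-01-01 13:00'
FOR UPDATE;
-- The slot rows are used only for locking; the booking itself goes elsewhere
INSERT INTO bookings (room_id, start_time, end_time)
VALUES (123, '2024-01-01 12:00', '2024-01-01 13:00');
COMMIT;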

References

  • Designing Data-Intensive Applications (Martin Kleppmann, O’Reilly, 2017)