Authors: John MacCormick, Chris Bishop
So, the to-do list trick gives us atomic transactions, which in turn guarantee consistency. This is a key ingredient in our canonical example: an efficient and completely reliable database for online banking. We are not there yet, however. Consistency does not, by itself, yield adequate efficiency or reliability. When combined with the locking techniques to be described shortly, the to-do list trick maintains consistency even when thousands of customers are simultaneously accessing the database. This does yield tremendous efficiency, because many customers can be served at once. And the to-do list trick also provides a good measure of reliability, since it prevents inconsistencies. Specifically, the to-do list trick precludes data corruption, but does not eliminate data loss.
Our next database trick—the prepare-then-commit trick—will produce significant progress toward the goal of preventing any loss of data.
THE PREPARE-THEN-COMMIT TRICK FOR REPLICATED DATABASES
Our journey through ingenious database techniques continues with an algorithm we'll call the “prepare-then-commit trick.” To motivate this trick, we need to understand two more facts about databases: first, they are often replicated, which means that multiple copies of the database are stored in different places; and second, database transactions must sometimes be canceled, which is also called “rolling back” or “aborting” a transaction. We'll briefly cover these two concepts before moving on to the prepare-then-commit trick.
Replicated Databases
The to-do list trick allows databases to recover from certain types of crashes, by completing or rolling back any transactions that were in progress at the time of the crash. But this assumes all the data that was saved before the crash is still there. What if the computer's hard drive is permanently broken and some or all of the data is lost? This is just one of many ways that a computer can suffer from permanent data loss. Other causes include software bugs (in the database program itself or in the operating system) and hardware failures. Any of these problems can cause a computer to overwrite data that you thought was safely stored on the hard drive, wiping it out and replacing it with garbage. Clearly, the to-do list trick can't help us here.
However, data loss is simply not an option in some circumstances. If your bank loses your account information, you will be extremely upset, and the bank could face serious legal and financial penalties. The same goes for a stockbroking firm that executes an order you've placed, but then loses the details of the sale. Indeed, any company with substantial online sales (eBay and Amazon being prime examples) simply cannot afford to lose or corrupt any customer information. But in a data center with thousands of computers, many components (especially hard drives) fail every day. The data on these components is lost every single day. How can your bank keep your data safe in the face of this onslaught?
The obvious, and widely used, solution is to maintain two or more copies of the database. Each copy of the database is called a replica, and the set of all copies taken together is called a replicated database.
Often, the replicas are geographically separated (perhaps in different data centers that are hundreds of miles apart), so that if one of them is wiped out by a natural disaster, another replica is still available.
I once heard a computer company executive describe the experiences of the company's customers after the September 11, 2001 terrorist attacks on the twin towers of New York's World Trade Center. The computer company had five major customers in the twin towers, and all were running geographically replicated databases. Four of the five customers were able to continue their operations essentially uninterrupted on surviving database replicas. The fifth customer, unfortunately, had one replica in each tower and lost both! This customer could only resume operations after restoring its database from off-site archival backups.
Note that a replicated database behaves quite differently to the familiar concept of keeping a “backup” of some data. A backup is a snapshot of some data at a particular time—for manual backups, the snapshot is taken at the time you run your backup program, whereas automated backups often take a snapshot of a system at a particular time on a weekly or daily basis, such as every morning at 2 a.m. In other words, a backup is a complete duplicate of some files, or a database, or anything else for which you need a spare copy.
But a backup is, by definition, not necessarily up-to-date: if some changes are made after a backup is performed, those changes are not saved anywhere else. In contrast, a replicated database keeps all copies of the database in sync at all times. Every time the slightest change is made to any entry in the database, all of the replicas must make that change immediately.
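For readers who like to see ideas spelled out in code, here is a minimal sketch, in Python, of the contrast between a periodic backup and a replicated database that keeps every copy in sync. The class names and the in-memory dictionaries are inventions for this illustration, not how any real database is built.

```python
import copy

class ReplicatedDatabase:
    """Toy model: every change is applied to all replicas immediately."""

    def __init__(self, num_replicas=3):
        # Each replica is just a dictionary mapping account -> balance.
        self.replicas = [{} for _ in range(num_replicas)]

    def write(self, account, balance):
        # A replicated database updates every copy as part of the same change.
        for replica in self.replicas:
            replica[account] = balance


class BackedUpDatabase:
    """Toy model: a backup is a snapshot taken at a particular moment."""

    def __init__(self):
        self.data = {}
        self.backup = {}

    def write(self, account, balance):
        # Changes go only to the live copy...
        self.data[account] = balance

    def take_backup(self):
        # ...and the backup catches up only when a snapshot is taken,
        # say every morning at 2 a.m.
        self.backup = copy.deepcopy(self.data)


db = ReplicatedDatabase()
db.write("Zadie checking", 800)
print(db.replicas)  # every replica already holds the new balance
```

Any change made to the backed-up database after its last snapshot exists in only one place, which is exactly the gap that replication closes.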
Clearly, replication is an excellent way to guard against lost data. But replication has dangers, too: it introduces yet another type of possible inconsistency. What are we going to do if a replica somehow ends up with data that differs from another replica? Such replicas are inconsistent with each other, and it may be difficult or impossible to determine which replica has the correct version of the data. We will return to this issue after investigating how to roll back transactions.
Rolling Back Transactions
At the risk of being a little repetitive, let's try to recall exactly what a transaction is: it's a set of changes to a database that must all take place to guarantee the database remains consistent. In the earlier discussion of transactions, we were mostly concerned with making sure that a transaction would complete even if the database crashed in the middle of the transaction.
But it turns out that sometimes it is impossible to complete a transaction for some reason. For example, perhaps the transaction involves adding a large amount of data to the database, and the computer runs out of disk space halfway through the transaction. This is a very rare, but nevertheless important, scenario.
A much more common reason for failing to complete a transaction relates to another database concept called locking.
In a busy database, there are usually many transactions executing at the same time. (Imagine what would happen if your bank permitted only one customer to transfer money at any one time—the performance of this online banking system would be truly appalling.) But it is often important that some part of the database remains frozen during a transaction. For example, if transaction A is updating an entry to record that Rosina is now friends with Jingyi, it would be disastrous if a simultaneously running transaction B deleted Jingyi from the database altogether. Therefore, transaction A will “lock” the part of the database containing Jingyi's information. This means the data is frozen, and no other transaction can change it. In most databases, transactions can lock individual rows or columns, or entire tables. Obviously, only one transaction can lock a particular part of the database at any one time. Once the transaction completes successfully, it “unlocks” all of the data it has locked, and after this point other transactions are free to alter the previously frozen data.
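As a rough illustration, here is a minimal sketch, in Python, of per-row locking. Real databases use far more sophisticated lock managers; the Row class, the row contents, and the update_friendship function are all made up for this example.

```python
import threading

class Row:
    """A single database row protected by its own lock."""

    def __init__(self, data):
        self.data = data
        self.lock = threading.Lock()  # only one transaction may hold this at a time

def update_friendship(rosina_row, jingyi_row):
    # Transaction A locks Jingyi's row before touching it, so that no
    # concurrent transaction (such as B deleting Jingyi) can change it.
    with jingyi_row.lock:
        rosina_row.data.setdefault("friends", []).append("Jingyi")
        jingyi_row.data.setdefault("friends", []).append("Rosina")
    # Leaving the "with" block unlocks the row, and other transactions
    # are again free to alter the previously frozen data.

rosina = Row({"name": "Rosina"})
jingyi = Row({"name": "Jingyi"})
update_friendship(rosina, jingyi)
print(jingyi.data)  # {'name': 'Jingyi', 'friends': ['Rosina']}
```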
Deadlock: When two transactions, A and B, both try to lock the same rows—but in the opposite order—they become deadlocked, and neither can proceed.
At first this seems like an excellent solution, but it can lead to a very nasty situation that computer scientists call a deadlock, as demonstrated in the figure above. Let's suppose that two long transactions, A and B, are running simultaneously. Initially, as in the top panel of the figure, none of the rows in the database are locked. Later, as shown in the middle panel, A locks the row containing Marie's information, and B locks the row containing Pedro's information. Some time after this, A discovers that it needs to lock Pedro's row, and B discovers that it needs to lock Marie's row—this situation is represented in the bottom panel of the figure. Note that A now needs to lock Pedro's row, but only one transaction can lock any row at one time, and B has already locked Pedro's row! So A will need to wait until B finishes. But B can't finish until it locks Marie's row, which is currently locked by A. So B will need to wait until A finishes. A and B are deadlocked, because each must wait for the other to proceed. They will be stuck forever, and these transactions will never complete.
Computer scientists have studied deadlocks in great detail, and many databases periodically run a special procedure for deadlock detection. When a deadlock is found, one of the deadlocked transactions is simply canceled, so that the other can proceed. But note that, just as when we run out of disk space in the middle of a transaction, this requires the ability to abort or “roll back” the transaction that has already been partially completed. So we now know of at least two reasons why a transaction might need to be rolled back. There are many others, but there's no need for us to go into details. The fundamental point is that transactions frequently fail to complete for unpredictable reasons.
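One simple way to detect such a deadlock is to keep track of which transaction is waiting for which other transaction and look for a cycle. The sketch below, in Python, is only an illustration of that idea under the simplifying assumption that each transaction waits for at most one other; it is not the detector used by any particular database.

```python
def find_deadlock(waits_for):
    """Return a cycle of transactions that are waiting on each other, if any.

    waits_for maps each transaction to the transaction it is waiting for,
    e.g. {"A": "B", "B": "A"} for the situation in the figure.
    """
    for start in waits_for:
        seen = []
        current = start
        while current in waits_for:
            if current in seen:
                return seen[seen.index(current):]  # the cycle itself
            seen.append(current)
            current = waits_for[current]
    return None

# A waits for B (Pedro's row) and B waits for A (Marie's row): deadlock.
print(find_deadlock({"A": "B", "B": "A"}))  # ['A', 'B']
# If B were not waiting for anything, there would be no deadlock.
print(find_deadlock({"A": "B"}))            # None
```

When a cycle like ['A', 'B'] is found, the database cancels one of the transactions in it so that the other can proceed.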
Rollback can be achieved using a slight tweak to the to-do list trick: the write-ahead log must contain enough additional information to undo each operation if necessary. (This contrasts with the earlier description, in which we emphasized that each log entry contains enough information to redo the operation after a crash.) This is easy to achieve in practice. In fact, in the simple examples we examined, the undo information and redo information are identical. An entry like “Change Zadie checking from $800 to $600” can easily be “undone”—by simply changing Zadie's checking balance from $600 to $800. To summarize: if a transaction needs to be rolled back, the database program can just work backward through the write-ahead log (i.e., the to-do list), reversing each operation in that transaction.
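Here is a minimal sketch, in Python, of rolling back a transaction by working backward through its write-ahead log entries. The log format and the accounts dictionary are assumptions made for this example, not the format any real database uses.

```python
# A toy write-ahead log: each entry records the old and new value,
# which is enough information both to redo and to undo the change.
accounts = {"Zadie checking": 800, "Zadie savings": 300}

log = [
    {"account": "Zadie checking", "old": 800, "new": 600},
    {"account": "Zadie savings",  "old": 300, "new": 500},
]

def apply_transaction(accounts, log):
    # Redo: carry out each operation in order.
    for entry in log:
        accounts[entry["account"]] = entry["new"]

def roll_back(accounts, log):
    # Undo: work backward through the to-do list, reversing each operation.
    for entry in reversed(log):
        accounts[entry["account"]] = entry["old"]

apply_transaction(accounts, log)
roll_back(accounts, log)
print(accounts)  # back to {'Zadie checking': 800, 'Zadie savings': 300}
```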
The Prepare-Then-Commit Trick
Now let's think about the problem of rolling back transactions in a replicated database. The big issue here is that one of the replicas might encounter a problem that requires rollback, while the others do not. For example, it's easy to imagine that one replica runs out of disk space while the others still have space available.
A simple analogy will help here. Suppose that you and three friends would all like to see a recently released movie together. To make things interesting, let's set this story in the 1980s, before the days of e-mail, so the movie trip is going to have to be organized by telephone. How do you go about it? One possible approach is as follows. Decide on a day and time for the movie that work for you, and—as far as you know—are likely to be suitable for your friends too. Let's suppose you choose Tuesday at 8 p.m. The next step is to call one of your friends and ask if he or she is free on Tuesday at 8. If the answer is yes, you'll say something like “great, please pencil that in, and I'll call you back later to confirm.” Then you'll call the next friend and do the same thing. Finally, you call the third and final friend with the same offer. If everyone is available on Tuesday at 8, you make the final decision to confirm the event and call back your friends to let them know.
That was the easy case. What happens if one of the friends is not available on Tuesday at 8? In this case, you will need to “roll back” all of the work done so far and start again. In reality, you would probably call each friend and immediately propose a new day and time, but to keep things as simple as possible here, let's instead assume that you call each friend and say “Sorry, Tuesday at 8 is no good, please erase that from your calendar, and I'll get back to you soon with a new proposal.” Once this is done, you can start the whole procedure all over again.
Notice that there are two distinct phases in your strategy for organizing the movie outing. In the first phase, the date and time have been proposed but are not yet 100% certain. Once you find out that the proposal is feasible for everyone, you know that the date and time are now 100% certain, but everyone else does not. Therefore, there is a second phase in which you call back all of your friends to confirm. Alternatively, if one or more friends were unable to make it, the second phase consists of calling back everyone to cancel. Computer scientists call this the two-phase commit protocol; we'll call it the “prepare-then-commit trick.” The first phase is called the “prepare” phase. The second phase is either a “commit” phase or an “abort” phase, depending on whether the initial proposal has been accepted by everyone or not.
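To see the two phases in code, here is a minimal sketch, in Python, of a coordinator running prepare-then-commit over a set of replicas. The Replica class and its methods are invented for illustration and do not correspond to any particular database's interface.

```python
class Replica:
    """A toy replica that can prepare, then commit or abort, a transaction."""

    def __init__(self, name, has_disk_space=True):
        self.name = name
        self.has_disk_space = has_disk_space
        self.pending = None

    def prepare(self, transaction):
        # "Pencil it in": lock the affected data and promise to commit if asked.
        self.pending = transaction
        return self.has_disk_space   # False means "I can't complete this one"

    def commit(self):
        print(f"{self.name}: committed {self.pending}")
        self.pending = None

    def abort(self):
        print(f"{self.name}: rolled back {self.pending}")
        self.pending = None


def two_phase_commit(replicas, transaction):
    # Phase 1 (prepare): ask every replica whether it can carry out the change.
    votes = [replica.prepare(transaction) for replica in replicas]
    # Phase 2 (commit or abort): if everyone said yes, confirm with everyone;
    # otherwise cancel with everyone, and the caller can try again later.
    if all(votes):
        for replica in replicas:
            replica.commit()
        return True
    for replica in replicas:
        replica.abort()
    return False


replicas = [Replica("replica-1"), Replica("replica-2"),
            Replica("replica-3", has_disk_space=False)]
two_phase_commit(replicas, "transfer $200")  # replica-3 says no, so all abort
```

Because one replica cannot complete the transaction, every replica rolls back, and the copies of the database remain consistent with one another.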
Interestingly, there is a notion of database locking in this analogy. Although we didn't explicitly discuss it, each of your friends makes an implicit promise when they pencil in the movie outing: they are promising not to schedule something else for Tuesday at 8. Until they hear back from you with a confirmation or cancellation, that slot on the calendar is “locked” and cannot be altered by any other “transaction.” For example, what should happen if someone else calls your friend, sometime after the first phase but before the second phase, to propose watching a basketball game on Tuesday at 8? Your friend should say something like “Sorry, but I might have another appointment at that time. Until that appointment is finalized, I can't give you a firm answer about the basketball game.”