I have recently read several papers on services that provide coordination for distributed systems. This turns out to be a fascinating area, and as I read each paper, I found myself getting sucked into more. I group these papers into two different categories: one type involves services that provide mechanisms for distributed coordination (including ZooKeeper, Chubby, and Sinfonia); the other involves protocols that guarantee certain properties despite various types of failures (such as Paxos, PBFT, Aardvark, and Zab).
The high-level papers describe specific services for distributed coordination. A group of servers (five seems to be a popular number) communicate with each other and agree on state. Clients communicate with one or more of these servers to read or modify this global state. The main property is that if one or two of the five servers fail, the others can keep the service running, even if the "leader" fails. Various guarantees about consistency may be provided--not only is consistency difficult to achieve in the event of different types of failures, but being able to tolerate failures usually requires sacrificing performance. I was very impressed by the work that has been done in this area.
The low-level papers were also fun. Protocols are designed to withstand Byzantine faults, which seem to encompass just about anything that can go wrong in a distributed system, including crashed servers, lost or repeated messages, corrupted data, and inconsistencies. The Practical Byzantine Fault Tolerance (PBFT) algorithm, introduced in 1999, seems to have launched a whole range of fascinating research. It reminds me of security in that cynical thinking is critical.