SeE It ThE WaY YoU ShOUld...



An eye of a kid; lost in the mayhem here on this planet..
Yet desires and desires strong..
To swim away to calm equilibrium..
Through a divine diffusion..

Saturday, February 25, 2012

Some time for time - Of Distributed Systems

"You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.” This comment by Leslie Lamport pretty much sums up the very idea of a distributed system, its challenges in design and the typical emergent properties of any distributed system. 

One such important discussion that happens in the context of a distributed system is about time. Time as such, has never been defined to totality at any point in time. The simplest way to put it however is that time is a measure employed to 'order events'. When events can be sequenced, it becomes possible to compare their durations, measure intervals between them and quantify rates of change. Hence, time is anything that helps us achieve all of the above.

In a system that is not distributed, the view of time is less complicated simply because the 
components of the system are too closely coupled (in terms of physical time) to view different times at a given point in time. The challenge in a distributed system hence lies in the fact that it consists of distinct processes which communicate with each other by exchanging messages, the message transmission delay not being negligible. The consequence this emergent property has on a distributed system is that physical theories of time can never be sufficient to say that an event a happened before another event b just because a happened at an earlier time than b!
 
Leslie Lamport's most cited paper

Leslie Lamport in his paper describes a partial ordering defined by the 'happened before' relation. He then uses arbitration in the form of process identifiers to arrive at a method to order events totally. Lamport also discusses the problem of mutual exclusion in the context of a distributed system and suggests the use of total ordering to solve it.

The most important point about the importance of total ordering in a distributed system Lamport makes in his paper is that a distributed system can be described as a particular sequential state machine, that is built using multiple processes (or 'processors' that perform processes). As a result, if you can totally order input requests, you obtain an algorithm to implement a state machine that is equivalent to a distributed system.

The rest of this essay, which I believe could go on for a few more blog posts, is about my understanding/misunderstanding of a few of Lamport's ideas in the wonderful paper here
 
Partial Ordering using the happened-before relation

As defined earlier, a distributed system comprises of a collection of processes that communicate via messages. Each process consists of a sequence of events. An event is any indivisible/atomic operation that is part of a process. What can be termed an event entirely depends on the process it belongs to. Within a distributed system, ordering events becomes a non-trivial problem for reasons stated in the previous sections. It is not unreasonable to assume that a process has an apriori sequence; with its events ordered totally. In the case of events between 2 different processes however, ordering needs an efficient methodology. The first step to arriving at such a methodology is the little but significant observation of the fact that a 'send' event always precedes a corresponding 'receive' event. In other words, a send event always happens before its receiving event. This happened before relation is denoted by '->;'.

It is easy to observe the conditions needed to satisfy the '->;' relation.
  • a and b belong to the same process: 'a comes before b' implies a ->; b
  • a is the send event of one process and b is the corresponding receive event of another, implies that a ->; b
  • The '->;' relation is transitive.
Concurrent events are hence nothing but events where the ->; relation does not hold. In other words, concurrent events are those which have no impacts on one another i.e. If an event a does not causally affect event b, a and b are said to be concurrent. Otherwise, they are said be 'causally related'. This notion of causal relations among events becomes the basis for defining a system of logical clocks in a distributed system.

Lamport defines a system of logical clocks that can look extremely trivial at times, yet is a landmark in itself in the area of distributed systems. Its significance and implications are what makes one realize that nothing about this piece of Lamport's work is indeed trivial! This and a little more in my next post.

No comments:

Post a Comment