Distributed Systems
Introduction
Of the many possible definitions of a distributed system, we use the following, which is perhaps the most general (and in particular supersedes Lamport's humorous remark that "a distributed system is the one that prevents you from working because of the failure of a machine that you had never heard of"):
- Distributed System
A distributed system consists of a set of distributed (i.e., not located in the same spot) components (hardware and/or software) working together to provide one service.
That is, a distributed system has two characteristics: (1) the service it provides emerges at the system level and (2) components must communicate over links (which are components themselves). The first implies that parts of the system, although they might be usable stand-alone, are not what we mean by the term -- a cell phone's games or calendar functions are not distributed, but the service "communicate by voice independently of location" is. The second narrows down the range: programs running on the same host and communicating over IPC are not considered distributed. Both points affect the security of such a system. Consistent with (1), new security problems arise at the system level in a distributed setting, and these are not limited to the fact, implied by (2), that communication is potentially dangerous.
There are at least two main reasons for distribution: the service may require it (e.g., credit card systems have to be available in many locations), or availability considerations may make it necessary to remove single points of failure by replicating components. In combination with the characteristics mentioned above, these result in several topical areas for security. First, distributing a system's components increases the importance of trust relations between components, and of trust in the system as a whole regarding its interactions with the environment. Consequently, we will focus on trust in the first section of the following text. Service requirements are the second field we will look at; here, our main aim will be to outline the implications of concurrency. Third, we will briefly consider the interplay of fault tolerance and security and explore whether, and if so how, the security of distributed systems can benefit from fault tolerance research, which will lead us to the concept of intrusion tolerance.
Trust
Concurrency
Fault Tolerance
As outlined in the introduction, one main reason for deploying a system in a distributed manner is a particular, task-specific demand on the service it provides. Obviously, such a demand also leads to certain requirements regarding the service's availability; e.g., the mobile-phone system is of little use if it is unavailable when needed.
Insights from Dependability
Dependability research has gained a few insights into the complexities of these problems as well as into the effectiveness of possible solutions. Its most basic model is the famous "Fault-Error-Failure" sequence, which describes how computer systems (of any kind) end up in a failure state, i.e., not conforming to specified system or service behaviour:
File:DS Fault-Error-Failure-1.png
A fault -- some minor inconsistency with the specification, introduced by the operator or the designer -- in a component is triggered and leads to an error, i.e., a state in which part of the component does not conform to its specified behaviour. If untreated, the error eventually results in the failure of the component. For example, a single bit in a memory module may permanently stay at value 1 under certain environmental conditions. When a program uses the module's value as part of an address and this address happens to be outside the program's memory range, the program could be killed by the OS and would thus fail to provide its service.
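The memory example can be turned into a small simulation that separates the three stages. This is only an illustrative sketch: the stuck bit position, the size of the program's address range and the function names are assumptions made up for the example, not part of the model itself.

```python
# Hypothetical parameters, chosen only for illustration.
STUCK_BIT = 1 << 12      # the fault: bit 12 of the cell always reads as 1
ADDRESS_LIMIT = 6000     # the program may use addresses 0 .. 5999


def read_cell(stored_value: int) -> int:
    """Return what the faulty memory cell delivers for a stored value."""
    return stored_value | STUCK_BIT


def classify(stored_value: int) -> str:
    """Follow one stored address through the fault-error-failure chain."""
    observed = read_cell(stored_value)
    if observed == stored_value:
        # The bit was meant to be 1 anyway: the fault is not triggered.
        return "fault dormant"
    if observed >= ADDRESS_LIMIT:
        # The wrong address leaves the program's range; the OS kills the
        # program, which therefore stops providing its service.
        return "failure"
    # The state is wrong (an error), but the program has not failed yet.
    return "error"


for value in (5000, 100, 2000):
    print(value, "->", classify(value))
```

The same fault thus has three different outcomes depending on the data it meets, which is why a fault can remain unnoticed for a long time.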
The chain just described can be broken at several points. First, fault and error prevention measures, applied both at design time and during operation, can stop faults from being introduced and from turning into errors. Second, one can seek to remove faults after identifying them. However, common experience says that neither can be fully effective, and thus there will always be faults. Taking this into account, one has to tolerate said faults -- i.e., find ways to prevent them from causing failures:
File:DS Fault-Error-Failure-2.png
Fault tolerance approaches generally rely on introducing redundancy into the system: redundancy in space (e.g., having backup components -- "spares" -- or replicated components voting on the correct results), in time (e.g., repeating calculations or employing restarts), and combinations thereof. Based on assumptions regarding fault classes, fault distributions and failure models, a system's availability (and thus the effectiveness of these methods) can be quantified, e.g., as "five nines" (99.999% availability).
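To make such figures concrete, the following sketch computes availability under deliberately simple assumptions: a component's steady-state availability is taken as MTTF / (MTTF + MTTR), replicas are assumed to fail independently, and the service is assumed to be up as long as at least one replica is up. Real systems often violate the independence assumption, so the numbers are best read as optimistic bounds. All function names are chosen for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # a 365-day year


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Fraction of time a component is up, given its mean time to
    failure (MTTF) and mean time to repair (MTTR) in the same unit."""
    return mttf / (mttf + mttr)


def replicated_availability(a: float, n: int) -> float:
    """Availability of n replicas with individual availability a,
    assuming independent failures and that one working replica suffices."""
    return 1.0 - (1.0 - a) ** n


def downtime_per_year(availability: float) -> float:
    """Expected downtime in seconds per year for a given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


# "Five nines" leave roughly 315 seconds (a bit over five minutes) per year.
print(round(downtime_per_year(0.99999), 2))

# Two replicas at 99% each reach about 99.99% under the assumptions above.
print(replicated_availability(0.99, 2))
```

The second figure shows why redundancy in space is attractive: each additional independent replica multiplies the remaining unavailability by the unavailability of a single component.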
Interplay between Fault Tolerance measures and System Security
Behold, the fool saith, "Put not all thine eggs in the one basket" -- which is but a manner of saying, "Scatter your money and your attention;" but the wise man saith, "Put all your eggs in the one basket and -- WATCH THAT BASKET."
- -- Mark Twain, "Pudd'nhead Wilson's Calendar"
Since the fault tolerance techniques just mentioned constitute a significant change in system design, their effects on system security must be taken into account. One example is replication: Obviously, it quickly becomes impossible to guarantee the physical security of replicated components when these are distributed geographically in order to ensure service availability (e.g., another ATM could be within walking distance, so the service is still available). Conversely, one also has to look at possible effects of security measures on fault tolerance. It may be much easier to secure a single server than a whole network -- but doing so creates a single point of failure: this component's failure alone constitutes a system failure.
Intrusion Tolerance
Fault tolerance and security can evidently interfere with each other. But can the security field learn from fault tolerance, and if so, are there limits to applying its results?
The first point that comes up in answering the question is the similarity of the basic models. One can identify the concept of a fault with that of a vulnerability, which, when triggered by an attack -- i.e., an action by the environment -- leads to an intrusion, which in turn may lead to a failure in the security domain, that is, the system failing to meet some or all of its stated security properties such as confidentiality:
This has been dubbed the AVI (Attack-Vulnerability-Intrusion) model.
The picture above differs from the classical model in that faults may also
be introduced by the environment. E.g., an attacker, having already gained
partial access, may attempt to introduce additional vulnerabilities.
Nonetheless, the models are almost identical, and unsurprisingly the classes
of methods are as well:
Here, prevention and removal of vulnerabilities are achieved through hardening
systems by e.g. employing secure design methods and patching
security holes, respectively, which is the classical security paradigm.
One conclusion from dependability research is that neither prevention nor
removal can be considered perfect, so one has to doubt whether assuming them to be
sufficient in the security area is reasonable.
The more a system grows, the less convincing it is to state that all parts can
be made perfectly secure.
Fault tolerance arose from the very same observation, so intrusion tolerance, that
is, accepting the occurrence of intrusions and working around them, should be considered.
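One way to picture intrusion tolerance is to carry the replication-with-voting idea from the fault tolerance section over to security. The sketch below rests on strong assumptions that the text does not state: the service is deterministic, at most f of the replicas are compromised, at least 2f + 1 replies arrive, and the voter itself is trusted. It is not a full Byzantine agreement protocol, and all names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def vote(replies: Sequence[Hashable], f: int) -> Hashable:
    """Return the reply backed by at least f + 1 replicas.

    With at least 2f + 1 replies and at most f compromised replicas,
    the correct replicas alone supply f + 1 matching replies, while
    the compromised ones cannot reach that count on their own.
    """
    if len(replies) < 2 * f + 1:
        raise ValueError("need at least 2f + 1 replies to tolerate f intrusions")
    value, count = Counter(replies).most_common(1)[0]
    if count < f + 1:
        raise RuntimeError("no reply is backed by f + 1 replicas")
    return value


def honest_replica(x: int) -> int:
    return x * x


def compromised_replica(x: int) -> int:
    return -1  # an intruder returns whatever it likes


replies = [honest_replica(7), compromised_replica(7), honest_replica(7)]
print(vote(replies, f=1))
```

The intrusion into one replica still happens, but it does not turn into a failure of the service, which mirrors how fault tolerance breaks the fault-error-failure chain.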
Heap of unused text
Distributed systems allow parts of the system to be located on separate computers and in different locations, so business logic and data can be reached from any remote computer (location).
Distributed objects are the most recent development in distributed computing. Distributed object technologies such as Java RMI, CORBA, and DCOM allow objects running on one machine to be used by applications on different computers.
There was a question in the seminar regarding the splitting of secret keys. Read about one possibility at Secret Sharing.
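As an illustration of how a key can be split, here is a minimal sketch of one well-known threshold scheme, Shamir's secret sharing; whether the page referred to above describes this particular scheme is not stated here. The key is treated as an integer smaller than a fixed prime, and any `threshold` of the shares suffice to recover it. The prime, the function names and the parameter names are choices made for this example, and the code needs Python 3.9 or later.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split_secret(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split secret into points on a random polynomial of degree threshold - 1."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # The constant term is the secret; the other coefficients are random.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for c in reversed(coeffs):  # Horner's rule, modulo PRIME
            result = (result * x + c) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse of den.
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


parts = split_secret(123456789, threshold=3, shares=5)
print(recover_secret(parts[:3]) == 123456789)
```

Fewer than `threshold` shares reveal nothing about the key, so no single component holding a share becomes the one basket that has to be watched.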