Software Distribution Platform: Difference between revisions
Revision as of 16:18, 19 December 2005
This page is about the conception and development of a software distribution platform (SDP) for a distributed network of nodes (usually WRT54GS or WGT634U).
For information on how to build and use SDP, see the SDP user's guide.
SDP
Requirements
The identified requirements are as follows:
- dynamic network of homogeneous nodes (no privileged nodes)
- nodes may be turned off or disconnected for any period of time from the remaining network
- network may be split for any period of time
- nodes may be added at any time
- no centralized datastore
- avoids single point of failure
- less administration
- robust and reliable updates
- allow updates even with broken routing
- cross-compatible beginning with version 1.0.0 (e.g. allow update of a node running 1.0.0 with version 1.5.99)
- broken, untrustworthy and malicious nodes must not be able to install unwanted software (optional: not interfere with the updating process)
- trustworthy developer team and CA
- all nodes should have the same software version running most of the time (ideally every second)
- implementation must fit into limited hardware (e.g. 32MB RAM + 8MB flash (optional: WRT54G))
Design
The ideas to meet those requirements:
- discovery mechanism for newer versions on neighbours
- push and/or pull
- uses broadcast(push) and can query(pull)
- if newer version is available, fetch it (e.g. over existing TFTP)
- infection based distribution mechanism
- obtain code from neighbour
- new version can be injected on any connected node
- distributed step by step to each connected node
- notification and update protocol never changes after 1.0.0 is released
- sign each software update package with a CA key
- CA public key is preinstalled with the base system on every node to check authenticity of new files
- synchronize system clocks over NTP (using broadcast UDP port 123) or a simpler non-standard protocol with neighbours
- guaranteeing similar system times for software switches and logs
- include software starting time (and end time for experimental software) in each update package
- coordinated switch over to new version if system_time > start_time
- coordinated switch-back to non-experimental version if system_time > end_time
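The time-based switch-over in the last three items can be sketched as a small shell function. This is only an illustration of the comparison logic under stated assumptions: the function name select_version, its argument order, and the returned keywords are hypothetical and not taken from SDP, and all times are assumed to be Unix timestamps in seconds.

```shell
#!/bin/sh
# Hypothetical helper: decide which software version should be active.
# $1: current system time, $2: start time of the update package,
# $3: end time of the update package (empty for non-experimental software).
select_version() {
    now=$1
    start=$2
    end=$3
    if [ -n "$end" ] && [ "$now" -gt "$end" ]; then
        echo stable     # experimental period is over, switch back
    elif [ "$now" -gt "$start" ]; then
        echo new        # start time has passed, switch over
    else
        echo current    # start time not reached yet, keep running as is
    fi
}
```

A node could call it as select_version "$(date +%s)" "$start_time" "$end_time". Because every node evaluates the same comparison against a synchronized clock, all nodes switch at roughly the same moment.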
Implementation
SDP is now implemented within BerlinRoofNet.
For information on how to build and use SDP, see the SDP user's guide.
Fallbacks
To keep SDP operational under all circumstances, several fallback mechanisms have been included in SDP as embedded in OpenWGT.
Since SDP is kept simple, it only needs a few things to do its job:
- It needs to be run
- It needs to receive packets
- It needs to send packets
- It needs to save files for software updates
Fallbacks for each of these requirements are discussed here in order:
1. On system startup, a wrapper script (/etc/init.d/S90sdp) is launched, which checks for the user-space click+SDP process in an endless loop. If the click process is missing, it is restarted. If a restart is already in progress, a missing click process is ignored. If click is found not running $MAX_RCCLICK_FAILED times in succession, the system is rebooted. Additionally, the /usr/sbin/watchdog script uses a hardware watchdog to handle a completely crashed kernel by rebooting the node after a timeout of at most 2 seconds.
2. With every SDP beacon received from a neighbor, a click timer is reloaded. If click cannot receive beacons for a certain time, the timer fires and calls the script /var/update/fallback_stage1. Because this script may sometimes be called in error, it records the number of calls and the time of the last call. When the last call was long ago (governed by $LAST_FALLBACK), the counter is reset to 0. When the counter reaches $MAX_COUNT, the node is rebooted.
3. Sending packets is assumed to always work, so there is no fallback for it. No software condition causing failed sending has been observed yet, and hardware conditions cannot be handled by SDP anyway.
4. The loop used for 1. also checks whether the free ramdisk space on /tmp is above the $MIN_RAMDISK_SPACE threshold. To allow space to be used temporarily (e.g. for transferring and uncompressing a new software version), it re-checks after $POLL_TIME, up to $MAX_RAMDISK_FAILED times.
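One pass of the supervision loop (the process check and the ramdisk check) could look roughly like the following sketch. It is not the actual /etc/init.d/S90sdp script: the function poll_once, its arguments, the counter variables and the ACTION result variable are hypothetical, and the caller is assumed to determine whether click is running and how much space is free on /tmp.

```shell
#!/bin/sh
# Defaults mirror /etc/sdp.conf; all other names are hypothetical.
MAX_RCCLICK_FAILED=3
MAX_RAMDISK_FAILED=3
MIN_RAMDISK_SPACE=5000

click_failed=0      # consecutive polls that found no click process
ramdisk_failed=0    # consecutive polls with too little free space on /tmp

# One poll of the supervision loop.
# $1: 1 if the click process is running, else 0
# $2: 1 if a restart is currently in progress, else 0
# $3: free ramdisk space on /tmp (same unit as MIN_RAMDISK_SPACE)
# Sets ACTION to "none", "restart" or "reboot".
poll_once() {
    ACTION=none
    if [ "$1" -eq 1 ]; then
        click_failed=0
    elif [ "$2" -eq 0 ]; then
        # click is missing and no restart is pending: count it and restart
        click_failed=$((click_failed + 1))
        ACTION=restart
    fi
    if [ "$3" -gt "$MIN_RAMDISK_SPACE" ]; then
        ramdisk_failed=0
    else
        ramdisk_failed=$((ramdisk_failed + 1))
    fi
    # too many failures in a row: fall back to a reboot
    if [ "$click_failed" -ge "$MAX_RCCLICK_FAILED" ] ||
       [ "$ramdisk_failed" -ge "$MAX_RAMDISK_FAILED" ]; then
        ACTION=reboot
    fi
}
```

The surrounding loop would call poll_once, act on $ACTION, and then sleep for $POLL_TIME seconds before the next pass. Keeping the decision in a function without side effects makes it easy to try out the thresholds without rebooting a node.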
As a last resort, all fallback methods reboot the node to get back to SDP version 0, which is stored on flash and known to work. Version 0 makes it possible to transfer and run a bug-fixed software version. Even if no bug-fixed software is available, this reduces the impact of rarely occurring conditions, e.g. one automatic reboot per day would hardly be noticed.
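The call counting used by the missing-beacon fallback before it resorts to a reboot can be illustrated with the following sketch. It is not the actual /var/update/fallback_stage1 script: the function name fallback_called, the file format "count time" and the printed keywords are assumptions made for this example, and the current time is passed in as an argument so the logic can be tried without waiting.

```shell
#!/bin/sh
# Defaults mirror /etc/sdp.conf.
LAST_FALLBACK=300
MAX_COUNT=3
COUNTER_FILE=${COUNTER_FILE:-/tmp/fallback_counter}

# Record one fallback call at time $1 (Unix seconds).
# The counter file holds "<count> <time of last call>" (assumed format).
# Prints "reboot" once MAX_COUNT calls happened in close succession,
# otherwise "wait".
fallback_called() {
    now=$1
    count=0
    last=0
    if [ -r "$COUNTER_FILE" ]; then
        read -r count last < "$COUNTER_FILE" || true
    fi
    count=${count:-0}
    last=${last:-0}
    # forget earlier calls when the previous one is long ago
    if [ $((now - last)) -gt "$LAST_FALLBACK" ]; then
        count=0
    fi
    count=$((count + 1))
    echo "$count $now" > "$COUNTER_FILE"
    if [ "$count" -ge "$MAX_COUNT" ]; then
        echo reboot
    else
        echo wait
    fi
}
```

With the default values, three calls that each follow the previous one within 300 seconds yield "reboot", while an isolated spurious call is forgotten once 300 seconds have passed.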
Default SDP configuration
All configuration variables mentioned above are taken from /etc/sdp.conf:
 MIN_RAMDISK_SPACE=5000 # minimum ramdisk space to ensure correct behaviour of SDP
 MAX_RAMDISK_FAILED=3 # grace period for SDP unpacking process (in $POLL_TIME seconds)
 MAX_RCCLICK_FAILED=3 # maximum number of SDP process restarts
 POLL_TIME=60 # in seconds
 # Missing-beacon fallback stuff
 LAST_FALLBACK=300 # Ignore fallback request, if last request was n seconds ago
 COUNTER_FILE=/tmp/fallback_counter
 MAX_COUNT=3
These values must satisfy a few conditions for SDP to run properly: POLL_TIME should be longer than the time needed for an rcclick restart, LAST_FALLBACK must be more than twice the value of BRNSDPFALLBACKTIMEOUT in sdp/src/brnsdp.hh, and none of the MAX_* counters should be below 2.
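These conditions could be checked mechanically before a configuration is deployed. The following is a minimal sketch, not part of SDP: the function check_sdp_conf is hypothetical, the rcclick restart time has to be measured and supplied by the caller, and BRNSDPFALLBACKTIMEOUT is assumed to have been converted to seconds already (its unit in brnsdp.hh is not stated here).

```shell
#!/bin/sh
# Defaults mirror /etc/sdp.conf.
MAX_RAMDISK_FAILED=3
MAX_RCCLICK_FAILED=3
POLL_TIME=60
LAST_FALLBACK=300
MAX_COUNT=3

# $1: time needed for an rcclick restart, in seconds (measured by the caller)
# $2: BRNSDPFALLBACKTIMEOUT in seconds (assumed unit)
# Prints one line per violated condition; returns 0 if all conditions hold.
check_sdp_conf() {
    status=0
    if [ "$POLL_TIME" -le "$1" ]; then
        echo "POLL_TIME must be longer than the rcclick restart time"
        status=1
    fi
    if [ "$LAST_FALLBACK" -le $((2 * $2)) ]; then
        echo "LAST_FALLBACK must be more than twice BRNSDPFALLBACKTIMEOUT"
        status=1
    fi
    for name in MAX_RAMDISK_FAILED MAX_RCCLICK_FAILED MAX_COUNT; do
        eval "value=\$$name"    # look up the variable named in $name
        if [ "$value" -lt 2 ]; then
            echo "$name must not be below 2"
            status=1
        fi
    done
    return "$status"
}
```

For example, check_sdp_conf 10 100 prints nothing for the default values, whereas a fallback timeout of 150 seconds would be reported because 300 is not more than twice 150.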