Software Distribution Platform

Latest revision as of 18:39, 9 November 2006

This page is about the conception and development of a software distribution platform (SDP) for a distributed network of nodes (usually WRT54GS or WGT634U).

For information on how to build and use SDP, see the SDP user's guide.

SDP

Abstract

The BRN Software Distribution Platform (SDP) automatically updates all mesh nodes when new software becomes available:

  • Each node periodically announces its software version.
  • An out-of-date node receives the new software via Trivial File Transfer Protocol (TFTP) from its neighbor.
  • The time of the switch-over to the new software version is coordinated among all nodes.
  • Fallback strategies exist in case the new software does not work.

[Figure: Virus.png]

Requirements

The identified requirements are as follows:

  • dynamic network of homogeneous nodes (no privileged nodes)
    • nodes may be turned off or disconnected for any period of time from the remaining network
    • network may be split for any period of time
    • nodes may be added at any time
  • no centralized datastore
    • avoids single point of failure
    • less administration
  • robust and reliable updates
    • allow updates even with broken routing
    • cross-compatible beginning with version 1.0.0 (e.g. allow update of a node running 1.0.0 with version 1.5.99)
  • broken, untrustworthy and malicious nodes must not be able to install unwanted software (optional: not interfere with the updating process)
  • trustworthy developer team and CA
  • all nodes should have the same software version running most of the time (ideally every second)
  • implementation must fit into limited hardware (e.g. 32MB RAM + 8MB flash (optional: WRT54G))

Design

The ideas to meet those requirements:

  • discovery mechanism for newer versions on neighbours
    • push and/or pull
    • uses broadcast (push) and can query (pull)
  • if a newer version is available, fetch it (e.g. over existing TFTP)
  • infection-based distribution mechanism
    • obtain code from neighbour
    • new version can be injected on any connected node
    • distributed step by step to each connected node
  • notification and update protocol never changes after 1.0.0 is released
  • sign each software update package with a CA key
  • CA public key is preinstalled with the base system on every node to check authenticity of new files
  • synchronize system clocks over NTP (using broadcast UDP port 123) or a simpler non-standard protocol with neighbours
    • guaranteeing similar system times for software switches and logs
  • include software starting time (and end time for experimental software) in each update package
  • coordinated switch-over to new version if system_time > start_time
  • coordinated switch-back to non-experimental version if system_time > end_time
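
The two time comparisons above can be sketched as a small shell function. This is only an illustration of the rule, not SDP's actual code: the function name select_version and the labels "current", "new" and "stable" are hypothetical, and all times are assumed to be Unix timestamps in seconds.

```shell
#!/bin/sh
# Hypothetical helper: decide which software version should be active.
#   $1 = current system time
#   $2 = start_time from the update package
#   $3 = end_time from the update package (empty for non-experimental software)
select_version() {
    now=$1
    start=$2
    end=$3
    if [ -n "$end" ] && [ "$now" -gt "$end" ]; then
        echo "stable"     # system_time > end_time: switch back
    elif [ "$now" -gt "$start" ]; then
        echo "new"        # system_time > start_time: switch over
    else
        echo "current"    # start time not reached yet: keep running as is
    fi
}

# Example with made-up timestamps: start time passed, end time not yet reached.
select_version 1500 1000 2000
```

Because every node evaluates the same rule against a synchronized clock, all nodes that already hold the package would be expected to switch at about the same moment.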


Implementation

SDP is now implemented within BerlinRoofNet.

For information on how to build and use SDP see the SDP user's guide

Fallbacks

To keep SDP operational, several different fallback mechanisms have been included in SDP as embedded in OpenWGT.

Since SDP is kept simple, it needs only a few things to do its job:

  1. It needs to be run
  2. It needs to receive packets
  3. It needs to send packets
  4. It needs to save files for software updates

The fallback for each of these requirements is discussed here in order:

  1. On system startup, a wrapper script (/etc/init.d/S90sdp) is launched, which checks for the user-space click+SDP process in an endless loop. If the click process is missing, it is restarted. If a restart is already in progress, a missing click process is ignored. If click is found not running $MAX_RCCLICK_FAILED times in succession, the system is rebooted. Additionally, the /usr/sbin/watchdog script uses a hardware watchdog to handle a completely crashed kernel by rebooting after a timeout of at most 2 seconds.
  2. With every SDP beacon received from a neighbor, a click timer is reloaded. If click cannot receive beacons for a certain time, the timer fires and calls the script /var/update/fallback_stage1. This script may sometimes be called in error, so it records the number of calls and the time of the last call. If the last call was long ago, the counter is reset to 0. When the counter reaches $MAX_COUNT, the node is rebooted.
  3. Sending packets is assumed to always work, so there is no fallback for it: no software conditions causing failed sending have been observed yet, and hardware conditions cannot be handled by SDP anyway.
  4. The loop used for 1. also checks whether the free RAM disk space on /tmp is above the $MIN_RAMDISK_SPACE threshold. To allow space to be used temporarily (e.g. for transferring and uncompressing a new software version), it re-checks every $POLL_TIME seconds, up to $MAX_RAMDISK_FAILED times.
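
One iteration of the loop described in 1. and 4. might look like the following sketch. It is not the real /etc/init.d/S90sdp script: the function check_once and the variables click_failed, ramdisk_failed and action are hypothetical, the probes for the click process and the free space are passed in as arguments so the counting logic stays visible, and the "restart in progress" case is left out.

```shell
#!/bin/sh
# Thresholds as named in the text; the values are the defaults from /etc/sdp.conf.
MAX_RCCLICK_FAILED=3
MAX_RAMDISK_FAILED=3
MIN_RAMDISK_SPACE=5000

click_failed=0      # consecutive checks with click missing
ramdisk_failed=0    # consecutive checks with too little free space

# check_once CLICK_RUNNING FREE_SPACE
#   CLICK_RUNNING: 1 if the click process was found, 0 otherwise
#   FREE_SPACE:    free space on /tmp, in the same unit as MIN_RAMDISK_SPACE
# Sets $action to "ok", "restart" or "reboot".
check_once() {
    action=ok
    if [ "$1" -eq 1 ]; then
        click_failed=0
    else
        click_failed=$((click_failed + 1))
        action=restart
    fi
    if [ "$2" -gt "$MIN_RAMDISK_SPACE" ]; then
        ramdisk_failed=0
    else
        ramdisk_failed=$((ramdisk_failed + 1))
    fi
    # Too many failures in succession: give up and reboot.
    if [ "$click_failed" -ge "$MAX_RCCLICK_FAILED" ] ||
       [ "$ramdisk_failed" -ge "$MAX_RAMDISK_FAILED" ]; then
        action=reboot
    fi
}

# Example: click is running and plenty of space is free.
check_once 1 9000
echo "$action"
```

A wrapper would call check_once every $POLL_TIME seconds and act on $action. Resetting each counter on a successful check is what makes the failures count "in succession", so a temporarily full RAM disk during an update does not trigger a reboot.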

As a last resort, all fallback methods reboot the node to get back to SDP version 0, which is stored on flash and known to work. Version 0 allows bug-fixed software to be transferred and run. Even if no bug-fixed software is available, this reduces the impact of rarely occurring conditions; e.g. one automatic reboot per day would hardly be noticed.

Default SDP Configuration

All aforementioned configuration variables are taken from /etc/sdp.conf:

MIN_RAMDISK_SPACE=5000  # minimum ramdisk space to ensure correct behaviour of SDP
MAX_RAMDISK_FAILED=3    # grace period for SDP unpacking process (in $POLL_TIME seconds)
MAX_RCCLICK_FAILED=3    # maximum number of SDP process restarts
POLL_TIME=60            # in seconds

# Missing-beacon fallback stuff
LAST_FALLBACK=300              # Ignore fallback request, if last request was n seconds ago
COUNTER_FILE=/tmp/fallback_counter
MAX_COUNT=3
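
As an illustration of how the missing-beacon variables could work together, here is a sketch of the counting described for /var/update/fallback_stage1. It is not the actual script: the function name fallback_request, the counter file layout ("count" and "time of last call" on one line) and the reading of LAST_FALLBACK as the gap after which the counter starts over are all assumptions.

```shell
#!/bin/sh
# Defaults from /etc/sdp.conf.
LAST_FALLBACK=300
COUNTER_FILE=/tmp/fallback_counter
MAX_COUNT=3

# fallback_request NOW
#   NOW: current time in seconds (passed in so the logic can be tried out)
# Prints "reboot" when the counter reaches MAX_COUNT, otherwise "wait".
fallback_request() {
    now=$1
    count=0
    last=$now
    if [ -r "$COUNTER_FILE" ]; then
        read -r count last < "$COUNTER_FILE"
    fi
    # The last call is long ago: treat it as a false alarm and start over.
    if [ $((now - last)) -gt "$LAST_FALLBACK" ]; then
        count=0
    fi
    count=$((count + 1))
    echo "$count $now" > "$COUNTER_FILE"
    if [ "$count" -ge "$MAX_COUNT" ]; then
        echo reboot
    else
        echo wait
    fi
}

# Possible use from the script called by the click timer:
#   [ "$(fallback_request "$(date +%s)")" = reboot ] && reboot
```

Keeping the state in a file on /tmp means it survives between separate invocations of the script but is cleared by the reboot itself.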

These values must satisfy a few conditions for SDP to run properly, e.g. POLL_TIME should be longer than the time needed for an rcclick restart, and LAST_FALLBACK must be more than twice the value of BRNSDPFALLBACKTIMEOUT in sdp/src/brnsdp.hh. Also, no MAX_* counter should be below 2.
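
Two of these conditions can be checked mechanically. The following sketch does so for the default values; the function check_config is hypothetical, and BRNSDPFALLBACKTIMEOUT is passed in as an argument under the assumption that it is given in seconds (the source does not state its unit). The POLL_TIME condition is not checked because the duration of an rcclick restart is not known here.

```shell
#!/bin/sh
# Defaults from /etc/sdp.conf.
MAX_RAMDISK_FAILED=3
MAX_RCCLICK_FAILED=3
MAX_COUNT=3
LAST_FALLBACK=300

# check_config FALLBACK_TIMEOUT
#   FALLBACK_TIMEOUT: value of BRNSDPFALLBACKTIMEOUT, assumed to be in seconds
# Prints one line per violated condition; returns 0 if none is violated.
check_config() {
    timeout=$1
    status=0
    for name in MAX_RAMDISK_FAILED MAX_RCCLICK_FAILED MAX_COUNT; do
        eval "value=\$$name"
        if [ "$value" -lt 2 ]; then
            echo "$name should not be below 2"
            status=1
        fi
    done
    if [ "$LAST_FALLBACK" -le $((2 * timeout)) ]; then
        echo "LAST_FALLBACK must be more than twice the beacon timeout"
        status=1
    fi
    return $status
}

# Example with a made-up timeout of 100 seconds.
if check_config 100; then
    echo "configuration looks consistent"
fi
```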

Testing Fallbacks

1a) To simulate a crashing faulty click process

while sleep 5 ; do killall click ; done

1b) To simulate a total system / kernel failure

killall watchdog

2+3) To simulate network problems

ifconfig wlan0 down
rmmod ath_pci

4) To simulate full disk condition

dd if=/dev/zero of=/tmp/zero bs=7M count=1

TODO

These things should be done before releasing SDP to a less benevolent environment:

  • implement file size/transfer limit checks in TFTP
  • try using adjtime to avoid confusing click timers (or fix or work around click timers)
  • more tests

Possible:

  • split "filelist" from meta-info and just include hash+size of filelist in meta-info
  • merge brnsdp with brnsdpgen Element
  • allow specifying the magic in the click configuration instead of as a C define