Software Distribution Platform

Revision as of 18:37, 9 November 2006 by Zubow (talk | contribs) (→‎SDP)

This page is about the conception and development of a software distribution platform (SDP) for a distributed network of nodes (usually WRT54GS or WGT634U).

For information on how to build and use SDP see the SDP user's guide

SDP


Requirements

The identified requirements are as follows:

  • dynamic network of homogeneous nodes (no privileged nodes)
    • nodes may be turned off or disconnected for any period of time from the remaining network
    • network may be split for any period of time
    • nodes may be added at any time
  • no centralized datastore
    • avoids single point of failure
    • less administration
  • robust and reliable updates
    • allow updates even with broken routing
    • cross-compatible beginning with version 1.0.0 (e.g. allow update of a node running 1.0.0 with version 1.5.99)
  • broken, untrustworthy and malicious nodes must not be able to install unwanted software (optional: not interfere with the updating process)
  • trustworthy developer team and CA
  • all nodes should have the same software version running most of the time (ideally every second)
  • implementation must fit into limited hardware (e.g. 32MB RAM + 8MB flash (optional: WRT54G))

Design

The ideas to meet those requirements:

  • discovery mechanism for newer versions on neighbours
    • push and/or pull
    • uses broadcast (push) and can query (pull)
  • if newer version is available, fetch it (e.g. over existing TFTP)
  • infection based distribution mechanism
    • obtain code from neighbour
    • new version can be injected on any connected node
    • distributed step by step to each connected node
  • notification and update protocol never changes after 1.0.0 is released
  • sign each software update package with a CA key
  • CA public key is preinstalled with the base system on every node to check authenticity of new files
  • synchronize system clocks over NTP (using broadcast UDP port 123) or a simpler non-standard protocol with neighbours
    • guaranteeing similar system times for software switches and logs
  • include software starting time (and end time for experimental software) in each update package
  • coordinated switch over to new version if system_time > start_time
  • coordinated switch-back to non-experimental version if system_time > end_time
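
The two time-based rules above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name select_version and its arguments are hypothetical, all times are assumed to be Unix timestamps in seconds, and an end time of 0 is assumed to mean "no end time" (non-experimental software).

```shell
#!/bin/sh
# Hypothetical helper, not part of SDP: decide what a node should do,
# given the current time and the times carried in an update package.
# Usage: select_version NOW START_TIME END_TIME
#   all values in Unix seconds; END_TIME 0 means "no end time" (assumption)
select_version() {
    now=$1
    start=$2
    end=$3
    if [ "$end" -gt 0 ] && [ "$now" -gt "$end" ]; then
        echo "switch-back"    # experimental period is over
    elif [ "$now" -gt "$start" ]; then
        echo "switch-over"    # start time has passed
    else
        echo "keep"           # too early, keep running the current version
    fi
}
```

On a node this could be called as select_version "$(date +%s)" "$start_time" "$end_time". Because every node evaluates the same comparison against a synchronized clock, all nodes reach the same decision at roughly the same moment.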


Implementation

SDP is now implemented within BerlinRoofNet.

For information on how to build and use SDP see the SDP user's guide

Fallbacks

To keep SDP operational, several fallback mechanisms have been included in SDP as embedded in OpenWGT.

Since SDP is kept simple, it only needs a few things to do its job:

  1. It needs to be run
  2. It needs to receive packets
  3. It needs to send packets
  4. It needs to save files for software updates

Fallbacks for each of these requirements are discussed here in order:

  1. On system startup, a wrapper script (/etc/init.d/S90sdp) is launched, which checks for the user-space click+SDP process in an endless loop. If the click process is missing, it is restarted. If a restart is already in progress, a missing click process is ignored. If click is found not running several ($MAX_RCCLICK_FAILED) times in succession, the system is rebooted. Additionally, a hardware watchdog is used by the /usr/sbin/watchdog script to handle a completely crashed kernel by rebooting after a timeout of at most 2 seconds.
  2. With every SDP beacon received from a neighbor, a click timer is reloaded. If click cannot receive beacons for a certain time, the timer fires and calls the script /var/update/fallback_stage1. This script may sometimes be called in error, so it counts its calls and records the time of the last call. When the last call is long ago, the counter is reset to 0. When the counter reaches $MAX_COUNT, the node is rebooted.
  3. Sending packets is assumed to always work, so there is no fallback for it. This is because no software conditions causing failed sending have been observed yet. Hardware conditions cannot be handled by SDP anyway.
  4. The loop used for 1. also checks whether the RAM disk space on /tmp is above the $MIN_RAMDISK_SPACE threshold. To allow space to be used temporarily (e.g. for transferring and uncompressing a new software version), it re-checks after $POLL_TIME for $MAX_RAMDISK_FAILED times.
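
The call counting described in item 2 can be sketched as follows. This is not the real /var/update/fallback_stage1: the function name fallback_called and the layout of the counter file (call count and time of the last call on one line) are assumptions made for this example, and the sketch prints "reboot" where a real script would actually reboot the node.

```shell
#!/bin/sh
# Sketch of the missing-beacon call counter; names and file layout are
# assumptions, only the three settings come from /etc/sdp.conf.
LAST_FALLBACK=${LAST_FALLBACK:-300}
COUNTER_FILE=${COUNTER_FILE:-/tmp/fallback_counter}
MAX_COUNT=${MAX_COUNT:-3}

# Usage: fallback_called NOW   (NOW in Unix seconds)
# Prints "reboot" once MAX_COUNT calls happened close together, else "wait".
fallback_called() {
    now=$1
    count=0
    last=$now
    if [ -r "$COUNTER_FILE" ]; then
        read -r count last < "$COUNTER_FILE" || true
    fi
    # Fall back to defaults if the file was empty or incomplete.
    count=${count:-0}
    last=${last:-$now}
    # A long quiet period means earlier calls were probably spurious.
    if [ $((now - last)) -gt "$LAST_FALLBACK" ]; then
        count=0
    fi
    count=$((count + 1))
    echo "$count $now" > "$COUNTER_FILE"
    if [ "$count" -ge "$MAX_COUNT" ]; then
        echo "reboot"
    else
        echo "wait"
    fi
}
```

Keeping the state in a file on /tmp means it survives between separate invocations of the script but is cleared by the reboot itself, so the count starts fresh afterwards.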

As a last resort, all fallback methods reboot the node to get back to SDP version 0, which is stored on flash and known to work. This version 0 allows bug-fixed software to be transferred and run. Even if no bug-fixed software is available, this reduces the impact of rarely occurring conditions, e.g. one auto-reboot per day would hardly be noticed.
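
One pass of the supervision loop from items 1 and 4, ending in the last-resort reboot, might look like the sketch below. It is not the real /etc/init.d/S90sdp: the function name supervise_step, the counter variables and the ACTION result variable are assumptions for this example, the "restart in progress" case is left out, and rebooting after repeated low RAM disk space is assumed from the last-resort rule rather than stated for item 4.

```shell
#!/bin/sh
# Sketch of one loop pass; only the three settings come from /etc/sdp.conf.
MIN_RAMDISK_SPACE=5000
MAX_RAMDISK_FAILED=3
MAX_RCCLICK_FAILED=3

click_failed=0      # successive passes without a click process
ramdisk_failed=0    # successive passes with too little space on /tmp

# Usage: supervise_step CLICK_RUNNING FREE_SPACE
#   CLICK_RUNNING: 1 if the click process was found, 0 otherwise
#   FREE_SPACE:    free space on /tmp, same unit as MIN_RAMDISK_SPACE
# Sets ACTION to "ok", "restart" or "reboot".
supervise_step() {
    ACTION=ok
    if [ "$1" -eq 1 ]; then
        click_failed=0
    else
        click_failed=$((click_failed + 1))
        ACTION=restart
    fi
    if [ "$2" -gt "$MIN_RAMDISK_SPACE" ]; then
        ramdisk_failed=0
    else
        # Low space is tolerated for a few passes (grace period).
        ramdisk_failed=$((ramdisk_failed + 1))
    fi
    if [ "$click_failed" -ge "$MAX_RCCLICK_FAILED" ] ||
       [ "$ramdisk_failed" -ge "$MAX_RAMDISK_FAILED" ]; then
        ACTION=reboot
    fi
}
```

A real loop would determine the two inputs itself (for example with pidof and df), act on ACTION, and sleep for $POLL_TIME between passes.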

Default SDP Configuration

All aforementioned configuration variables are taken from /etc/sdp.conf:

MIN_RAMDISK_SPACE=5000  # minimum ramdisk space to ensure correct behaviour of SDP
MAX_RAMDISK_FAILED=3    # grace period for SDP unpacking process (in $POLL_TIME seconds)
MAX_RCCLICK_FAILED=3    # maximum number of SDP process restarts
POLL_TIME=60            # in seconds

# Missing-beacon fallback stuff
LAST_FALLBACK=300              # Ignore fallback request, if last request was n seconds ago
COUNTER_FILE=/tmp/fallback_counter
MAX_COUNT=3

These values must satisfy a few conditions for SDP to run properly, e.g. POLL_TIME should be longer than the time needed for an rcclick restart, and LAST_FALLBACK must be more than twice the value of BRNSDPFALLBACKTIMEOUT in sdp/src/brnsdp.hh. Also, none of the MAX_* counters should be below 2.
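
Two of these conditions can be checked mechanically. The sketch below is an illustration only: the function name check_sdp_conf is hypothetical, the value of BRNSDPFALLBACKTIMEOUT has to be copied by hand from sdp/src/brnsdp.hh and is assumed here to be in seconds like LAST_FALLBACK, and the POLL_TIME condition is not checked because the duration of an rcclick restart is not known to the script.

```shell
#!/bin/sh
# Hypothetical sanity check for sdp.conf values; not part of SDP.
# Usage: check_sdp_conf BRNSDPFALLBACKTIMEOUT
# Expects the sdp.conf variables to be set already, e.g. after
# sourcing /etc/sdp.conf. Returns 0 if all checked conditions hold.
check_sdp_conf() {
    timeout=$1
    status=0
    if [ "$LAST_FALLBACK" -le $((2 * timeout)) ]; then
        echo "LAST_FALLBACK must be more than twice BRNSDPFALLBACKTIMEOUT" >&2
        status=1
    fi
    for name in MAX_RAMDISK_FAILED MAX_RCCLICK_FAILED MAX_COUNT; do
        eval "value=\$$name"    # look up the variable called $name
        if [ "$value" -lt 2 ]; then
            echo "$name should not be below 2" >&2
            status=1
        fi
    done
    return $status
}
```

The function reports every violated condition on standard error instead of stopping at the first one, so a broken configuration can be fixed in a single pass.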

Testing Fallbacks

1a) To simulate a crashing faulty click process

while sleep 5 ; do killall click ; done

1b) To simulate a total system / kernel failure

killall watchdog

2+3) To simulate network problems

ifconfig wlan0 down
rmmod ath_pci

4) To simulate a full disk condition

dd if=/dev/zero of=/tmp/zero bs=7M count=1

TODO

These things should be done before releasing SDP to a less benevolent environment:

  • implement file size/transfer limit checks in TFTP
  • try using adjtime to avoid confusing click timers (or fix or work around click timers)
  • more tests

possible:

  • split "filelist" from meta-info and just include hash+size of filelist in meta-info
  • merge brnsdp with brnsdpgen Element
  • allow specifying magic in click configuration instead of as a C define