
                         Linux-HA Phase I Requirements

   This document describes a set of general Linux-HA requirements which
   were presented to the list and drew no objections.  Of course, it
   would have been nice to have had a good discussion, but I take
   silence to mean assent :-)

Linux-HA Phase I General Goals

     * Simple
     * Reliable
     * Easy to configure
     * Easy to test
     * Easy to monitor
     * Redundant hardware and software are verified to be in working
       condition

Linux-HA Phase I Requirements

   The short-term goal of Phase I is to provide a more realistic
   demonstration of Linux-HA, in a form that will actually be usable by
   a certain set of customers (users) in a production sense.

   This  demonstration  is  focused  on  providing  High-Availability web
   service.   The  rationale  for providing web service is simple:  It is
   well-understood,  and  Linux  has  a  significant  presence in the web
   server  market.  This will provide more initial users and testers than
   most other applications.

   The  following  minimal requirements on the web service are considered
   sufficient for this demonstration:
     * An  active/standby  methodology  is acceptable.  Load sharing need
       not be explicitly supported
     * Data  on  standby  nodes must be continually replicated from their
       paired   active   nodes  over  dedicated  LANs.   I  am  referring
       specifically to application data, not cluster config data.

     * Comment:  It is expected that we will use "poor man's
       replication" between the active and standby nodes (a replication
       sketch follows this list)

     * IP address takeover between active and standby hosts, including
       the ability to start and stop applications as IP addresses move
       around the cluster (a takeover sketch follows this list)
     * Basic cluster monitoring capabilities via a /proc-like interface
       (a monitoring sketch follows this list)
     * Simple configuration and installation documentation
     * Basic support for either resource groups or dependencies (a
       resource-group sketch follows this list)
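
   As an illustration of the "poor man's replication" comment above,
   here is a minimal sketch assuming rsync run periodically (say, from
   cron) on the active node.  The host name "standby-repl" (the
   standby's address on the dedicated replication LAN) and the data
   path are placeholders, not decisions:

      #!/bin/sh
      # Mirror the web data to the standby over the replication LAN.
      # Run periodically (e.g. every few minutes from cron).
      rsync -a --delete /home/httpd/ standby-repl:/home/httpd/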
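
   For the IP address takeover requirement, here is a minimal sketch of
   what the standby node might do.  The interface name, address, and
   service are assumptions, and arping is just one way of getting LAN
   peers to refresh their ARP caches:

      #!/bin/sh
      # Bring up the service address on an alias interface, tell the
      # LAN about it, then start the application that follows it.
      SERVICE_IP=10.0.0.50
      ifconfig eth0:0 $SERVICE_IP netmask 255.255.255.0 up
      arping -U -c 3 -I eth0 $SERVICE_IP     # unsolicited ARP replies
      /etc/rc.d/init.d/httpd start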
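
   For the monitoring requirement, a sketch of how a /proc-like
   interface might be polled.  The path /proc/ha/status and its output
   format are purely hypothetical:

      #!/bin/sh
      # Poll the (hypothetical) cluster status file every few seconds.
      while :; do
          cat /proc/ha/status      # e.g. "node1 active, node2 standby"
          sleep 5
      done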
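
   For resource groups and dependencies, a sketch of a group whose
   resources start in dependency order and stop in reverse.  The
   address and script names are examples only:

      #!/bin/sh
      # A "resource group": the service IP comes up before the
      # application, and goes down after it.
      case "$1" in
        start)
          ifconfig eth0:0 10.0.0.50 netmask 255.255.255.0 up
          /etc/rc.d/init.d/httpd start
          ;;
        stop)
          /etc/rc.d/init.d/httpd stop
          ifconfig eth0:0 down
          ;;
      esac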

Restrictions Allowed for Demonstration

   The   following   restrictions   are  considered  acceptable  for  the
   demonstration.
     * It is not necessary to provide load sharing between members of
       the cluster (an active/standby methodology is acceptable)
     * A single active/standby pair is sufficient at the beginning
     * No application-level notification of cluster transitions need be
       provided (though see the stop/start requirement above)
     * No hardware diagnostics need be provided

Post-Demonstration Requirement Candidates

   After  these  demonstration requirements have been met, it is expected
   that  the following capabilities will be added (not listed in priority
   order):
     * Integration   of   hardware  and  software  diagnostics  into  the
       architecture
     * Support for in-node IP interface failover (failing over between
       NICs within a single host; a NIC-failover sketch follows this
       list)
     * Application  notification  of  cluster  transitions  (support  for
       arbitrary application failover)
     * Plug-in module interface available for cluster management
       {allowing support for: active/standby, n+1, load sharing, etc.}

     * Cluster management uses diagnostic information in its failover
       strategy
     * Arbitrary number of nodes in the cluster
     * Multiple pairs of active/standby servers in the cluster
     * Easily configured support for common servers like these: ftp,
       smtp, pop3, imap, DNS, others (?)
       This is intended to be something more sophisticated than
       changing run levels, since changing run levels only supports the
       active/standby model.  Note that these kinds of services may be
       started and stopped with /etc/rc.d/init.d scripts, but will not
       likely be tied to run levels (a service start/stop sketch
       follows this list)
     * Load sharing between the active/replicator servers via NFS (?)
     * Support for other replication configurations.  For example:
          + Shared SCSI
          + GFS
          + User-defined replication methods (a replication plug-in
            sketch follows this list)
     * Sophisticated, cool GUI monitoring capabilities
     * Cool GUI configuration tools
     * Other cool and feasible things, as people are moved to do them
       :-)
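
   For in-node IP interface failover, a minimal sketch of moving the
   service address from a failed NIC to a healthy one in the same host.
   The interface names and address are assumptions:

      #!/bin/sh
      # Move the service IP from eth0 (failed) to eth1 (healthy).
      SERVICE_IP=10.0.0.50
      ifconfig eth0:0 down
      ifconfig eth1:0 $SERVICE_IP netmask 255.255.255.0 up
      arping -U -c 3 -I eth1 $SERVICE_IP     # refresh ARP caches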
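
   For the common-server support, a sketch of starting and stopping
   services directly through their /etc/rc.d/init.d scripts rather
   than with telinit and run levels.  The script names are examples and
   vary by distribution:

      #!/bin/sh
      # Start or stop a fixed set of failover services directly,
      # without changing the run level.
      action="$1"                          # "start" or "stop"
      for svc in httpd named; do
          /etc/rc.d/init.d/$svc "$action"
      done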
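
   For user-defined replication methods, a sketch of one possible
   plug-in interface: each method is a script honoring a small, fixed
   set of verbs which the cluster manager would invoke.  Both the verb
   set and the rsync example are assumptions:

      #!/bin/sh
      # A user-defined replication method as a start/stop/status script.
      case "$1" in
        start)  rsync -a --delete /home/httpd/ standby-repl:/home/httpd/ ;;
        stop)   : ;;                  # nothing to tear down for rsync
        status) echo "rsync replication: idle between runs" ;;
        *)      echo "Usage: $0 {start|stop|status}"; exit 1 ;;
      esac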

   I have a bias against making the customer write shell scripts to
   move resources around for "normal" cases.  This is in harmony with
   "easy to configure" and with cool GUI configuration tools.
