
                   Linux-HA Hardware Installation Guideline

   This document (c) 1999 Volker Wiegand [1]<Volker.Wiegand@suse.de>

   This  document  serves  as  the  starting  point to plan, execute, and
   verify your hardware setup for a High Availability (HA) environment.

Contents

    1. [2]Introduction
    2. [3]Hardware Requirements
         1. [4]Minimum Installation
         2. [5]More Advanced Installation
         3. [6]Fully Redundant Installation
    3. [7]Hardware Setup and Test
         1. [8]Serial Ports
         2. [9]LAN Interfaces
         3. [10]Other Devices
    4. [11]Troubleshooting
    5. [12]References

Introduction

   With  the  high  stability Linux has reached, this Operating System is
   well  suited  to  be used for HA purposes. The Linux-HA project, based
   upon  Harald  Milz's  [13]HOWTO  and  Alan Robertson's Heartbeat code,
   provides the building blocks for a professional solution.

   This  document  provides  some  advice  on  the  initial planning, the
   installation and cabling, and the test and verification of the overall
   setup.  We  use  the  word  takeover to mean transferring some kind of
   server  functionality  from  a  broken  entity  to a sane one. In this
   context,  entities  can  be  network adapters, computers, or something
   else. Our current focus is to provide HA capabilities among PCs, which
   we will call "nodes" from now on.

Hardware Requirements

   Since we want to be able to provide failover capabilities at the
   machine level, we need at least two computers. Obvious, isn't it? In
   our current setup, all we require is that they are running Linux. No
   particular distribution is preferred (although most tests have been
   carried out on RedHat and SuSE systems). The minimum kernel version is
   [TODO: which one is it ???], although the software makes fairly
   minimal demands on the OS.

   These  two  nodes  have to be connected in some way to exchange status
   information  and  to  monitor  each other. The more channels our nodes
   have  to  talk  to  each other, the better it is. We will use the term
   "medium" for such a communication channel.

   In general we work from the assumption that we use standard hardware
   wherever possible. This means that we do not modify our PCs other
   than to expand them with off-the-shelf components. And we use only
   cabling that can be bought without "special orders" such as split
   serial cables or the like. After all, we want solutions that can be
   installed and used by everyone, not just by a few experts.

  Minimum Installation

   In  order  for  the  takeover  to work, we need at least one medium to
   exchange  messages.  Given  that  we  use  TCP/IP as the basis for our
   service,  some  kind  of  LAN is certainly available. Of course having
   only  the  LAN provides poor monitoring capabilities, but on the other
   hand this is the minimum chapter anyway :-)

   So how will the hardware be laid out? Well, quite straightforwardly.
    +-------------------+  +-------------------+
    |                   |  |                   |
    |      Node A       |  |      Node B       |
    |                   |  |                   |
    |       eth0        |  |       eth0        |
    +---------+---------+  +---------+---------+
              |                      |
              |                      |
    |---------+----------------------+---------| LAN (Ethernet, etc.)

   As mentioned before, this design does not provide sufficiently
   reliable monitoring. A LAN has many different points of failure, and
   because it is a public medium, failures can occur at several levels.
   So we would be well-advised to look for more robust options to use in
   addition to the LAN for heartbeats.

  More Advanced Installation

   So let's see what we can do to provide sound monitoring and good
   takeover capabilities without having to purchase excessive hardware
   or software add-ons.

   The  main  idea  is  to  have a simple private medium like one or more
   serial cables. We can use the standard serial ports, provided they are
   not  already  occupied by modems, mice, or other vermin. If you have a
   server with a PS/2 mouse, it probably has two such ports available.

   So here's what this configuration looks like.
              +----------------------+           (Nullmodem Cable)
              |                      |
    +---------+---------+  +---------+---------+
    |       ttyS0       |  |       ttyS0       |
    |                   |  |                   |
    |      Node A       |  |      Node B       |
    |                   |  |                   |
    |       eth0        |  |       eth0        |
    +---------+---------+  +---------+---------+
              |                      |
              |                      |
    |---------+----------------------+---------| LAN (Ethernet, etc.)

   What do we gain? We now have two media to exchange the heartbeat. This
   provides greater reliability in the case of failure. Of course the
   restriction with the LAN still holds true, but now Node A could use
   the serial line to initiate a takeback of the service. And if just the
   serial connection should fail, we still have the LAN. Reliable
   intracluster communication is very important, and this design is a
   low-cost improvement over the previous one.

  Fully Redundant Installation

   The  point of the addition of the serial links to the system is that a
   single  failure  cannot  cause  the nodes to become confused about the
   overall  system  configuration.  This is vitally important for many HA
   systems,  because  the  cost of this confusion can be scrambled disks,
   and  other  problems which are often worse than the cost of an outage.
   With more resources, the following provides a general guideline to set
   up  things.  To  illustrate  the  principle,  a  third  node  has been
   included,  but  we  can install any number of nodes in this way. Well,
   almost  any.  Note:  The  takeover code which is part of the heartbeat
   package  will  not  yet  correctly  manage takeovers for more than two
   nodes.

   The serial lines are now arranged in a ring structure. As you will
   have noticed, this occupies two serial ports on each node, as per our
   discussion in the previous chapter. On the other hand, we now have a
   general setup that can easily be extended and also provides a good
   level of redundancy. We can send our heartbeat in both directions
   over the ring, thus reaching every other node even in case of a
   (single) cable defect (or down system) anywhere on the ring.

   Another facet of our high-end design covers the LAN access. Having two
   adapters connected to the wire allows us to provide intra-node
   failover capabilities in case of an interface or LAN cable breakdown.
   It also gives us the chance to take over the IP address of Node A,
   eth0 onto Node B, eth1 while keeping Node B, eth0 as it is. In fact,
   this is the primary operation mode of several professional systems,
   including IBM's HACMP and HP's MC/ServiceGuard. Which doesn't imply
   that we are not professional, of course :-)

   So, here is the block diagram for this third design.
                                                  (Nullmodem Cables)
          +-----------------------------------------------------+
          |       +--------------+        +-------------+       |
          |       |              |        |             |       |
    +-----+-------+-----+  +-----+--------+----+  +-----+-------+-----+
    |   ttyS0   ttyS1   |  |   ttyS0   ttyS1   |  |   ttyS0   ttyS1   |
    |                   |  |                   |  |                   |
    |      Node A       |  |      Node B       |  |      Node C       |
    |                   |  |                   |  |                   |
    |   eth0     eth1   |  |   eth0     eth1   |  |   eth0     eth1   |
    +-----+--------+----+  +-----+--------+----+  +-----+--------+----+
          |        |             |        |             |        |
          |        |             |        |             |        |
    |-----+--------+-------------+--------+-------------+--------+----|
                                                  LAN (Ethernet, etc.)

   Future releases of the heartbeat software will support such a
   configuration, but current takeover code restricts the configuration
   to a single interface and two nodes in the network. Of course we could
   also use other media for the heartbeat exchange. Recent suggestions
   include SCSI buses in target mode and IrDA ports "connected" with a
   mirror. Another candidate that comes to mind is the USB found in many
   modern PCs. As I said before, the more media (and the more diverse),
   the better.

Hardware Setup and Test

   The  following chapter deals with the installation and verification of
   the various components within the nodes.

  Serial Ports

   First of all, let's recap how a Nullmodem Cable is wired. The pain is
   that you have certainly seen the pin assignment a thousand times, but
   you never have it handy when you need it. So here it is ...
    25-pin        9-pin                          9-pin        25-pin

      2     TxD     3  --------------------------  2     RxD     3
      3     RxD     2  --------------------------  3     TxD     2
      4     RTS     7  --------------------------  8     CTS     5
      5     CTS     8  --------------------------  7     RTS     4
      7     GND     5  --------------------------  5     GND     7
      6     DSR     6  ---+----------------------  4     DTR    20
      8     DCD     1  ---+                  +---  1     DCD     8
     20     DTR     4  ----------------------+---  6     DSR     6

   Once you have these cable(s) in place you will want to test them. This
   is  fairly  easy  since  the  serial ports are usually configured with
   decent  default values. On a freshly booted Linux system we can assume
   the ports to be in a "sane" state, with the speed set to 9600 baud. If
   not,  you can do a "stty sane 9600 </dev/ttyS0" with ttyS0 replaced by
   the actual device. Please note the input redirection which selects the
   device.

   Then  you  can set up one node as receiver ("cat </dev/ttyS0") and the
   other  one  as transmitter ("echo hello >/dev/ttyS0"). Voila! What you
   expect  is  that  the "hello" is printed out at the receiver. Pressing
   Ctrl-C on the receiver's keyboard will return you to the prompt. Then
   repeat the test with the roles exchanged.
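
   The test above can also be wrapped in a few small shell helpers. This
   is only a sketch under assumptions: the function names are made up for
   illustration, /dev/ttyS0 is merely the default taken from the text,
   and the receiver reads a single line with "read" instead of "cat", so
   it returns by itself once a line has arrived.

```shell
#!/bin/sh
# Hypothetical helpers for the serial cable test (names are illustrative).

# Put the port into a known state. The stty call is skipped when the
# path is not a character device, e.g. when trying the helpers on a FIFO.
serial_prepare() {
    sp_dev=${1:-/dev/ttyS0}
    if [ -c "$sp_dev" ]; then
        stty sane 9600 <"$sp_dev"
    fi
}

# Transmitter: write one line (default "hello") to the device.
serial_send() {
    ss_dev=${1:-/dev/ttyS0}
    ss_msg=${2:-hello}
    echo "$ss_msg" >"$ss_dev"
}

# Receiver: read a single line from the device and print it.
serial_receive() {
    sr_dev=${1:-/dev/ttyS0}
    IFS= read -r sr_line <"$sr_dev" && printf '%s\n' "$sr_line"
}
```

   A possible sequence: on Node B run "serial_prepare; serial_receive",
   then on Node A run "serial_prepare; serial_send". Afterwards swap the
   two nodes and repeat.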

  LAN Interfaces

   Rumor has it that there is work in progress to provide some level of
   diagnostic capabilities for Ethernet adapters and wiring. I don't know
   the actual status, and can only suggest using a plain "ping", provided
   that the interfaces are set up correctly with "ifconfig". For more
   information on Linux Ethernet, please check the [14]Ethernet
   HOWTO.
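
   The "ping" check can be scripted along the following lines. This is a
   minimal sketch, not part of the heartbeat package: the function name,
   the PING override variable, and the peer address in the usage note are
   all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical reachability check for the peer node.
# PING may be set to another command for dry runs; it defaults to ping.
check_peer() {
    cp_peer=$1
    # Send three echo requests; only the exit status matters here.
    if ${PING:-ping} -c 3 "$cp_peer" >/dev/null 2>&1; then
        echo "$cp_peer: reachable"
    else
        echo "$cp_peer: NOT reachable"
        return 1
    fi
}
```

   For example, "check_peer 192.168.1.2" run on Node A, where the address
   stands in for whatever Node B's eth0 has been given with "ifconfig".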

   If  you  are  planning  to use more than one adapter per node (usually
   called  "Standby  Adapters"),  please make sure to connect them to the
   same physical medium as the primary adapters. Otherwise you will of
   course not be able to take over the IP address. Having them in a
   different subnet is perfectly okay. More than that: it's preferred.
   [TODO:  this  is what I learned with HACMP. Can anyone please give the
   *correct* rationale --- or rephrase the whole paragraph?]

   Note:  The  heartbeat  software  does  not  yet  support  this kind of
   configuration.

  Other Devices

   [TODO: well, to do]

Troubleshooting

   If  things  don't work in the first place -- don't panic! Usually it's
   just a trifle. Things to check include:
     * Check  the  startup messages of the kernel, e.g. using "dmesg". Is
       the  serial driver (either the standard one or the special one for
       your hardware) compiled in or available as a module?
     * Check  the  serial  port(s)  and cable(s). Do your modem and mouse
       still  work? Using a battery, a light bulb or buzzer and some wire
       you can easily verify that all pins are connected and there are no
       short  circuits.  Inexpensive  breakout  boxes  are  available for
       diagnosing  such conditions as well. They contain the light bulbs,
       the connectors and the wire in one handy little unit.
     * For serial ports, the file /proc/tty/driver/serial can be very
       helpful for diagnosing problems in Linux. It contains lines of
       this form:
1: uart:16550A port:2F8 irq:3 baud:19200 tx:24423 rx:4680 RTS|CTS|DTR|DSR|CD
       This  particular  line corresponds to a working "raw" serial port,
       /dev/ttyS1  with  both  sides  cabled  up correctly, and heartbeat
       active  on  both  sides.  The first number on the line is the port
       number.  The  built-in  serial  ports on PCs are numbered 0 and 1.
       With  heartbeat  only  active  on  the local side (and not the far
       side), it looks like this instead:
1: uart:16550A port:2F8 irq:3 baud:19200 tx:43558 rx:12277 RTS|DTR
       Note  the  lack  of the CTS (Clear To Send), DSR (Data Set Ready),
       and  CD  (Carrier Detect) bits on the interface. When heartbeat is
       only running on the far side interface, it looks like this:
1: uart:16550A port:2F8 irq:3 baud:19200 tx:55039 rx:12277 CTS|DSR|CD
       Note  that  when  the local port isn't active, the RTS (Request To
       Send),  and  DTR  (Data  Terminal  Ready) bits aren't active. When
       heartbeat  isn't  running on either interface, the line looks like
       this:
1: uart:16550A port:2F8 irq:3 baud:19200 tx:55039 rx:12277
       This is essentially a software breakout box.
     * Check that the cables are properly plugged into their sockets. For
       a  production  High-Availability system, it is a very good idea to
       fasten the screws in order to avoid loose contacts.
     * For  more information on diagnosing ethernet problems, consult the
       [15]Ethernet HOWTO.
     * For  more  information on diagnosing serial port problems, consult
       the [16]Serial Port HOWTO.
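
   The four flag combinations shown above for /proc/tty/driver/serial can
   be told apart with a short shell function. This is a sketch under the
   assumption that the flags behave as described in this section: DTR
   (together with RTS) indicates the local side has the port open, and
   DSR (together with CTS and CD) indicates the far side does. The
   function name is made up for illustration.

```shell
#!/bin/sh
# Classify one line of /proc/tty/driver/serial by its modem-control flags.
serial_state() {
    st_line=$1
    st_local=no
    st_remote=no
    # DTR is raised by the local side when it opens the port.
    case "$st_line" in *DTR*) st_local=yes ;; esac
    # DSR is seen when the far side raises its DTR over the null modem cable.
    case "$st_line" in *DSR*) st_remote=yes ;; esac
    case "$st_local/$st_remote" in
        yes/yes) echo "both sides active" ;;
        yes/no)  echo "local side only" ;;
        no/yes)  echo "far side only" ;;
        *)       echo "neither side active" ;;
    esac
}

# Example on a live system, inspecting port 1 (/dev/ttyS1):
# serial_state "$(grep '^1:' /proc/tty/driver/serial)"
```

   Reading the file may require root privileges on some systems.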

References

   The Linux-HA homepage on the internet is: [17]http://linux-ha.org/

   Harald Milz's Linux-HA HOWTO that started the whole thing can be found
   at:
   [18]http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html

   A   comprehensive   survey  on  professional  HA  solutions  is  here:
   [19]http://www.sun.com/clusters/dh.brown.pdf
   [TODO: should we include links to HACMP, Veritas, Wizard, ... ???]
     _________________________________________________________________

References

   1. mailto:Volker.Wiegand@suse.de
   2. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#Introduction
   3. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#Hardware Requirements
   4. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#Minimum Installation
   5. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#More Advanced Installation
   6. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#Fully Redundant Installation
   7. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#Hardware Setup and Test
   8. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#Serial Ports
   9. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#LAN Interfaces
  10. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#Other Devices
  11. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#Troubleshooting
  12. file://localhost/tmp/heartbeat-2.1.4-1/doc/HardwareGuide.html#References
  13. http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
  14. http://metalab.unc.edu/LDP/HOWTO/Ethernet-HOWTO.html
  15. http://metalab.unc.edu/LDP/HOWTO/Ethernet-HOWTO.html
  16. http://metalab.unc.edu/LDP/HOWTO/Serial-HOWTO.html
  17. http://linux-ha.org/
  18. http://metalab.unc.edu/pub/Linux/ALPHA/linux-ha/High-Availability-HOWTO.html
  19. http://www.sun.com/clusters/dh.brown.pdf
