
                                 Faq'n Tips

    1. [1]Hey!  This doesn't look like a FAQ!  What gives?
    2. [2]Are there mailing lists for Linux-HA?
    3. [3]What is a cluster?
    4. [4]What is a resource script?
    5. [5]How do I monitor various resources?  If one of my resources
       stops working, heartbeat doesn't do anything unless the server
       crashes.  How do I monitor resources with heartbeat?
    6. [6]If  one  of  my  ethernet  connections  goes  away  (cable
       severance,  NIC  failure,  locusts),  but  my current primary node
       (the  one  with the services) is otherwise fine, no one can get to
       my services and I want to fail them over to my other cluster node.
        Is there a way to do this?
    7. [7]Every  time my machine releases an IP alias, it loses the whole
       interface (i.e. eth0)!  How do I fix this?
    8. [8]I  want  a  lot  of  IP  addresses  as resources (more than 8).
       What's the best way?
    9. [9]The  documentation indicates that a serial line is a good idea;
       is there really a drawback to using two ethernet connections?
   10. [10]How do I use heartbeat with an ipchains firewall?
   11. [11]I got the message "ERROR: No local heartbeat. Forcing
       shutdown" and then heartbeat shut itself down for no reason at
       all!
   12. [12]How  do I tune heartbeat on a heavily loaded system to avoid
       split-brain?
   13. [13]When heartbeat starts up I get this error message in my logs:
            WARN:    process_clustermsg:    node    [<hostname>]   failed
       authentication
       [14]What does this mean?
   14. [15]When I try to start heartbeat, I receive the message:
       [16]"Starting High-Availability services: Heartbeat failure
       [rc=1]. Failed."
       [17]and  there is nothing in any of the log files and no messages.
       What is wrong?
   15. [18]How do I run multiple clusters on the same network segment?
   16. [19]How do I get the latest CVS version of heartbeat?
   17. [20]Heartbeat on other OSs.
   18. [21]When  I  try  to install the linux-ha.org heartbeat RPMs, they
       complain  of  dependencies from packages I already have installed!
       Now what?
   19. [22]I don't want heartbeat to fail over the cluster automatically.
        How can I require human confirmation before failing over?
   20. [23]What is STONITH?  And why might I need it?
   21. [24]How  do  I  figure out what STONITH devices are available, and
       how do I configure them?
   22. [25]I  want to use a shared disk, but I don't want to use STONITH.
        Any recommendations?
   23. [26]Can heartbeat be configured in an active/active configuration?
       If so, how do I do this, given that the haresources file is
       supposed to be the same on each box?
   24. [27]Why  are  my  interface  names  getting truncated when they're
       brought up and down?
   25. [28]What is this auto_failback parameter? What happened to the old
       nice_failback parameter?
   26. [29]I  am  upgrading  from  a  version of Linux-HA which supported
       nice_failback  to one that supports auto_failback. How do I avoid
       a flash cut in this upgrade?
   27. [30]If nothing helps, what should I do?
   28. [31]I want to submit a patch, how do I do that?
   ______________________________________________________________________


    1. Quit your bellyachin'!  We needed a "catch-all" document to supply
       useful  information  in a way that was easily referenced and would
       grow  without  a  lot of work.  It's closer to a FAQ than anything
       else.
    2. Yes!   There  are  two public mailing lists for Linux-HA.  You can
       find out about them by visiting [32]http://linux-ha.org/contact/.
    3. HA  (High-Availability) cluster - A cluster that allows a host (or
       hosts)  to  become  highly  available. This means that if one node
       goes  down (or a service on that node goes down), another node can
       pick up the service and take over from the failed machine.
       [33]http://linux-ha.org
       Computing  cluster  - This is what a Beowulf cluster is. It allows
       distributed computing over off-the-shelf components, usually
       inexpensive IA32 machines. [34]http://www.beowulf.org/
       Load-balancing  cluster  - This is what the Linux Virtual Server
       project  does.  In this scenario one machine load-balances
       requests for a certain service (apache, for example) across a
       farm of servers. [35]www.linuxvirtualserver.org
       All  of  these  sites  have  HOWTOs  etc.  on  them. For a general
       overview of clustering under Linux, look at the Clustering HOWTO.
    4. Resource  scripts  are basically (extended) System V init scripts.
       They  must  support  stop,  start, and status operations.  In the
       future  we  will  also  add  support for a "monitor" operation for
       monitoring services. The IPaddr script implements this "monitor"
       operation now (but heartbeat doesn't use that function of it).
       For more info see the Resource HOWTO.
    5. Heartbeat   itself   was   not  designed  for  monitoring  various
       resources.  If  you  need  to monitor some resources (for example,
       availability  of  WWW  server) you need some third party software.
       Mon is a reasonable solution.
         A. Get Mon from [36]http://kernel.org/software/mon/.
          B. Get  all  the required modules listed. You can find them at
             the nearest mirror or at the CPAN archive (www.cpan.org).
             Download them as .tar.gz packages and install them in the
             usual way (perl Makefile.PL && make && make test && make
             install).
          C. Mon  is  software for monitoring different network resources.
             It can ping computers, connect to various ports, monitor WWW
             servers, MySQL, etc. When a resource malfunctions, it
             triggers alert scripts.
          D. Unpack  mon  in some directory. The best starting point is
             the README file. Complete documentation is in <dir>/doc,
             where <dir> is the place you unpacked the mon package.
          E. For a fast start, do the following steps:
               a. copy all subdirs found in <dir> to /usr/lib/mon
               b. create the dir /etc/mon
               c. copy auth.cf from <dir>/etc to /etc/mon
             Now  mon  is  prepared  to work. You need to create your own
             mon.cf  file,  where you specify the resources mon should
             watch, the actions mon should take when a resource fails,
             and what to do when it becomes available again.  All
             monitoring scripts are in /usr/lib/mon/mon.d/. At the
             beginning of every script you can find an explanation of
             how to use it.
             All  alert scripts are placed in /usr/lib/mon/alert.d/. Those
             are  the scripts triggered when something goes wrong. If you
             are using ipvs, you can find scripts on its homepage
             (www.linuxvirtualserver.org) for adding and removing
             servers from an ipvs list.
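        As a rough sketch, a minimal mon.cf might look like the
        following. The hostname and alert address are hypothetical, and
        the exact directive names may differ between mon versions, so
        check the documentation shipped in <dir>/doc:

```
# /etc/mon/mon.cf - minimal sketch (hypothetical host and address)
alertdir    = /usr/lib/mon/alert.d
mondir      = /usr/lib/mon/mon.d

hostgroup webservers www.example.com

watch webservers
    service http
        interval 1m
        monitor http.monitor
        period wd {Sun-Sat}
            alert mail.alert admin@example.com
            alertevery 1h
```

        This watches one host's WWW service every minute and mails an
        alert (at most hourly) when it fails.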
    6. Yes!   Use  the  ipfail  plug-in.   For each interface you wish to
       monitor, specify one or more "ping" nodes or "ping groups" in your
       configuration.   Each node in your cluster will monitor these ping
       nodes or groups.  Should one node detect a failure in one of these
       ping  nodes,  it will contact the other node in order to determine
       whether  it or the ping node has the problem.  If the cluster node
       has  the problem, it will try to fail over its resources (if it
       has any).
       To  use  ipfail,  you  will  need  to  add  the  following to your
       /etc/ha.d/ha.cf files:
               respawn hacluster /usr/lib/heartbeat/ipfail
               ping <IPaddr1> <IPaddr2> ... <IPaddrN>
       See [37]Kevin's documentation for more details on the concepts.
       IPaddr1..N  are  your  ping  nodes.   NOTE:   ipfail  requires the
       auto_failback option to be set to on or off (not legacy).
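       Putting it together, a sketch of the relevant ha.cf lines (the
       addresses are hypothetical, and ping_group may not exist in
       older releases -- check your version's documentation):

```
# Directives for ipfail in /etc/ha.d/ha.cf (same on both nodes)
auto_failback on
respawn hacluster /usr/lib/heartbeat/ipfail
# A plain ping node: each address is monitored individually.
ping 10.10.10.254
# A ping group: the whole group counts as one ping node, which is
# considered up as long as any member answers.
ping_group routers 10.10.10.253 10.10.10.252
```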
    7. This  isn't  a  problem  with  heartbeat,  but rather is caused by
       various versions of net-tools.  Upgrade to the most recent version
       of  net-tools  and it will go away.  You can test it with ifconfig
       manually.
    8. Instead  of  failing  over  many  IP addresses, just fail over one
       router  address.   On your router, do the equivalent of "route add
       -net  x.x.x.0/24  gw  x.x.x.2",  where  x.x.x.2  is the cluster IP
       address  controlled by heartbeat.  Then, make every address within
       x.x.x.0/24  that  you wish to fail over a permanent alias of lo0
       on  BOTH  cluster  nodes.   This is done via "ifconfig lo:2
       x.x.x.3 netmask 255.255.255.255 -arp" etc...
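       The commands above, collected into one place (keeping the
       answer's x.x.x placeholders; run as root, and note that the
       aliases go on BOTH nodes while the route goes on the router):

```shell
# On BOTH cluster nodes: non-ARPing loopback aliases for each
# address you want to serve.
ifconfig lo:2 x.x.x.3 netmask 255.255.255.255 -arp
ifconfig lo:3 x.x.x.4 netmask 255.255.255.255 -arp

# On the router: route the whole subnet via the single cluster IP
# that heartbeat fails over.
route add -net x.x.x.0/24 gw x.x.x.2
```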
    9. If anything makes your ethernet / IP stack fail, you may lose both
       connections; a serial link fails independently of the IP stack.
       In any case, you should route the two cables differently,
       depending on how important your data is...
   10. To make heartbeat work with ipchains, you must accept incoming and
       outgoing traffic on UDP port 694. Add something like
       /sbin/ipchains  -A  output  -i  ethN  -p  udp  -s  <source_IP>  -d
       <dest_IP>  -j ACCEPT
       /sbin/ipchains   -A  input  -i  ethN  -p  udp  -s  <source_IP>  -d
       <dest_IP>  -j ACCEPT
   11. This can be caused by one of two things:
          + System under heavy I/O load, or
          + Kernel bug.
       For  how  to  deal  with the first cause (heavy load), please
       read the answer to the [38]next FAQ item.
       If  your  system  was not under moderate to heavy load when it got
       this  message,  you probably have the kernel bug. The 2.4.18 Linux
       kernel  had  a  bug  in  it  which  would cause it to not schedule
       heartbeat  for very long periods of time when the system was idle,
       or  nearly  so. If this is the case, you need to get a kernel that
       isn't broken.
   12. "No  local  heartbeat" or "Cluster node returning after partition"
       under  heavy  load  is  typically  caused  by too small a deadtime
       interval. Here is a suggestion for how to tune deadtime:
          + Set deadtime to 60 seconds or higher
          + Set warntime to whatever you *want* your deadtime to be.
          + Run your system under heavy load for a few weeks.
          + Look  at  your  logs  for the longest time either system went
            without hearing a heartbeat.
          + Set your deadtime to 1.5-2 times that amount.
          + Set warntime to a little less than that amount.
          + Continue  to  monitor  logs for warnings about long heartbeat
            times.  If  you  don't do this, you may get "Cluster node ...
            returning  after  partition"  which  will  cause heartbeat to
            restart  on  all  machines  in  the cluster. This will almost
            certainly annoy you.
       Adding memory to the machine generally helps. Limiting workload on
       the  machine  generally  helps. Newer versions of heartbeat are
       better about this than pre-1.0 versions.
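       In ha.cf terms, the starting point of the procedure above might
       look like this (the keepalive and initdead values are
       illustrative additions, not part of the recipe):

```
# /etc/ha.d/ha.cf timing parameters while gathering data under load
keepalive 2      # send a heartbeat every 2 seconds (illustrative)
warntime 10      # the deadtime you *want*; warns when it is exceeded
deadtime 60      # generous while you measure; tighten later
initdead 120     # extra allowance at boot time (illustrative)
```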
   13. It's  common  to  get  a  single  mangled  packet  on  your serial
       interface when heartbeat starts up.  This message is an indication
       that  we  received  a  mangled  packet.   It's  harmless  in  this
       scenario.  If  it happens continually, there is probably something
       else going on.
   14. It's  probably  a permissions problem on authkeys.  Heartbeat
       requires the file to be accessible only by its owner (mode 400,
       600 or 700).  Depending on where and when it discovers the
       problem, the message will wind up in different places, but it
       tends to be in:
         1. stdout/stderr
         2. wherever you specified in your setup
         3. /var/log/messages
       Newer  releases are better about also putting startup messages
       out to stderr in addition to wherever you have configured them
       to go.
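       Assuming the usual file location, the fix is:

```shell
# authkeys must be readable by its owner only
chown root /etc/ha.d/authkeys
chmod 600 /etc/ha.d/authkeys
```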
   15. Use  multicast  and give each cluster its own multicast group. If
       you need to or want to use broadcast, then run each cluster on a
       different port number.  An example of a configuration using
       multicast would be to have the following line in your ha.cf file:
            mcast eth0 224.1.2.3 694 1 0
       This  sets eth0 as the interface over which to send the multicast,
       224.1.2.3 as the multicast group (the same on each node in the
       same cluster), UDP port 694 (the heartbeat default), a time to
       live of 1 (limit multicast to the local network segment and do
       not propagate through routers), and multicast loopback disabled
       (typical).
   16. There  is  a  CVS  repository  for  Linux-HA.  You  can find it at
       cvs.linux-ha.org.   Read-only  access is via login guest, password
       guest,  module  name linux-ha. More details are to be found in the
       [39]announcement  email.   It  is  also  available through the web
       using viewcvs at
       [40]http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/
   17. Heartbeat now uses automake and is generally quite portable at
       this point. Join the Linux-HA-dev mailing list if you want to help
       port it to your favorite platform.
   18. Due  to  differences in RPM package names among distributions,
       these  dependency complaints were unavoidable.  If you're not
       using  STONITH, use the "--nodeps" option with rpm.  Otherwise,
       use the heartbeat source to build
       your  own RPMs.  You'll have the added dependencies of autoconf >=
       2.53 and libnet (get it from
       [41]http://www.packetfactory.net/libnet).    Use   the   heartbeat
       source RPM (preferred) or unpack the heartbeat source and from the
       top  directory, run "./ConfigureMe rpm".  This will build RPMS and
       place  them  where  it's customary for your particular distro.  It
       may even tell you if you are missing some other required packages!
   19. You  configure  a  "meatware"  STONITH device into the ha.cf file.
       The  meatware  STONITH  device asks the operator to go power reset
       the  machine which has gone down.  When the operator has reset the
       machine  he  or  she  then  issues  a command to tell the meatware
       STONITH  plug-in  that  the reset has taken place.  Heartbeat will
       wait  indefinitely  until  the operator acknowledges the reset has
       occurred.  During this time, the resources will not be taken over,
       and nothing will happen.
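       A sketch with hypothetical node names node1 and node2 (check
       "stonith -h" for the exact parameters your version expects):

```
# In /etc/ha.d/ha.cf: any node (*) may "shoot" the listed hosts,
# with the operator acting as the STONITH device.
stonith_host * meatware node1 node2
```

       After physically power-cycling the dead node, the operator
       acknowledges the reset on the surviving node with something like
       "meatclient -c node1".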
   20. STONITH is a form of fencing, and is an acronym standing for Shoot
       The  Other Node In The Head.  It allows one node in the cluster to
       reset  the  other.   Fencing  is  essential if you're using shared
       disks,  in  order  to  protect  the  integrity  of  the disk data.
       Heartbeat  supports  STONITH  fencing,  and  resources  which  are
       self-fencing.  You need to configure some kind of fencing whenever
       you  have a cluster resource which might be permanently damaged if
       both  machines  tried to make it active at the same time.  When in
       doubt check with the Linux-HA mailing list.
   21. To get the list of supported STONITH devices, issue this command:
       stonith -L
       To  get  all the gory details on exactly what these STONITH device
       names mean, and how to configure them, issue this command:
       stonith -h
   22. This  is not something which heartbeat supports directly; however,
       there are a few kinds of resources which are "self-fencing".  This
       means  that  activating the resource causes it to fence itself off
       from  the other node naturally.  Since this fencing happens in the
       resource  agent, heartbeat doesn't know (and doesn't have to know)
       about  it.  Two possible hardware candidates are IBM's ServeRAID-4
       RAID  controllers  and  ICP  Vortex RAID controllers - but do your
       homework!!!   When in doubt check with the mailing list.
   23. Yes,  heartbeat  has  supported active/active configurations since
       its  first  release. The key to configuring active/active clusters
       is  to understand that each resource group in the haresources file
       is  preceded  by the name of the server which is normally supposed
       to run that service. In an "auto_failback on (or legacy)" (or
       old-style  "nice_failback off") configuration, when a cluster node
       comes  up,  it will take over any resources for which it is listed
       as  the  "normal  master"  in  the  haresources  file. Below is an
       example of how to do this for an apache/mysql configuration.
server1 10.10.10.1 mysql
server2 10.10.10.2 apache
       In  this  case,  the IP address 10.10.10.1 should be replaced with
       the  IP  address  you  want  to  contact  the mysql server at, and
       10.10.10.2  should be replaced with the IP address you want people
       to  use to contact the web server. Any time server1 is up, it will
       run  the  mysql  service.  Any time server2 is up, it will run the
       apache  service.  If both server1 and server2 are up, both servers
       will  be  active.  Note that this differs from the old
       nice_failback  on  behavior.  With the new release, which supports
       "hb_standby foreign", you can manually fail back into an
       active/active  configuration  even with auto_failback off. This
       gives  administrators  more flexibility to fail back in a
       customized way at safer or more convenient times.
   24. Heartbeat  was  written  to use ifconfig to manage its interfaces.
       That's  nice  for  portability  to other platforms, but for some
       reason  ifconfig  truncates interface names.  If you want to have
       fewer than 10 aliases, then you need to limit your interface names
       to 7 characters, and to 6 characters for fewer than 100 aliases.
   25. The   auto_failback   parameter  is  a  replacement  for  the  old
       nice_failback   parameter.  The  old  value  nice_failback  on  is
       replaced  by auto_failback off. The old value nice_failback off is
       logically  replaced  by the new auto_failback on parameter. Unlike
       the  old  nice_failback  off  behavior,  the  new auto_failback on
       allows the use of the ipfail and hb_standby facilities.
       During   upgrades  from  nice_failback  to  auto_failback,  it  is
       sometimes  necessary  to set auto_failback to legacy, as described
       in the [42]upgrade procedure below.
   26. To  upgrade  from  a pre-auto_failback version of heartbeat to one
       which   supports   auto_failback,  the  following  procedures  are
       recommended to avoid a flash cut on the whole cluster.
         1. Stop heartbeat on one node in the cluster.
          2. Upgrade  this node. If the other node has nice_failback on
             in ha.cf, then set auto_failback off in the new ha.cf file.
             If the other node in the cluster has nice_failback off,
             then set auto_failback legacy in the new ha.cf file.
         3. Start the new version of heartbeat on this node.
         4. Stop heartbeat on the other node in the cluster.
          5. Upgrade  this second node in the cluster with the new version
             of heartbeat. Set auto_failback the same as it was set in the
             previous step.
         6. Start heartbeat on this second node in the cluster.
         7. If  you  set  auto_failback  to on or off, then you are done.
            Congratulations!
         8. If  you  set  auto_failback  legacy  in your ha.cf file, then
            continue as described below...
         9. Schedule  a  time  to  shut down the entire cluster for a few
            seconds.
        10. At  the  scheduled  time, stop both nodes in the cluster, and
            then  change  the  value  of auto_failback to on in the ha.cf
            file on both sides.
         11. Restart both nodes in the cluster at about the same time.
        12. Congratulations, you're done! You can now use ipfail, and can
            also  use  the  hb_standby  command  to cause manual resource
            moves.
   27. Please  be  sure that you have read all the documentation and
       searched the mailing list archives. If you still can't find a
       solution, you can post questions to the mailing list. Please
       include the following:
          + What OS you are running.
          + What version (distro/kernel).
          + How you installed heartbeat (tar.gz, rpm, src.rpm or manual
            installation).
          + Include  your configuration files from BOTH machines. You can
            omit authkeys.
          + Include  the  parts  of  your logs which describe the errors.
            Send them as text/plain attachments.
            Please don't send "cleaned up" logs.  The real logs have more
            information in them than cleaned up versions.  Always include
            at least a little irrelevant data before and after the events
            in  question  so that we know nothing was missed.  Don't edit
            the   logs   unless   you   really   have  some  super-secret
            high-security reason for doing so.
            This means you need to attach 6 or 8 files. Include 6 if your
            debug  output  goes  into the same file as your normal output
            and 8 otherwise. For each machine you need to send:
               o ha.cf
               o haresources
               o normal logs
               o debug logs (perhaps)
   28. We love to get good patches.  Here's the preferred way:
          + If  you have any questions about the patch, please check with
            the linux-ha-dev mailing list for answers before starting.
          + Make your changes against the current CVS source
          + Test them, and make sure they work ;-)
          + Produce the patch this way:
                   cvs -q diff -u >patchname.txt
          + Send an email to the linux-ha-dev mailing list with the patch
            as  a [text/plain] attachment. If your mailer wants to zip it
            up for you, please disable that.
   ______________________________________________________________________

   Rev 0.0.8
   (c) 2000 Rudy Pawul [43]rpawul@iso-ne.com
   (c) 2001 Dusan Djordjevic [44]dj.dule@linux.org.yu
   (c) 2003 IBM (Author Alan Robertson [45]alanr@unix.sh)

References

   1. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#FAQ
   2. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#mailinglists
   3. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#what_is_it
   4. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#res_scr
   5. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#mon
   6. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#ipfail
   7. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#nettools
   8. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#manyIPs
   9. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#serial
  10. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#firewall
  11. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#nolocalheartbeat
  12. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#heavy_load
  13. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#serialerr
  14. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#serialerr
  15. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#authkeys
  16. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#authkeys
  17. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#authkeys
  18. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#multiple_clusters
  19. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#CVS
  20. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#other_os
  21. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#RPM
  22. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#meatware
  23. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#STONITH
  24. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#config_stonith
  25. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#self_fence
  26. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#active_active
  27. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#iftrunc
  28. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#why_auto_failback
  29. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#auto_failback_upgrade
  30. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#last_hope
  31. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#patches
  32. http://linux-ha.org/contact/
  33. http://www.linux-ha.org/
  34. http://www.beowulf.org/
  35. http://www.linuxvirtualserver.org/
  36. http://kernel.org/software/mon/
  37. http://pheared.net/devel/c/ipfail/
  38. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#heavy_load
  39. http://lists.community.tummy.com/pipermail/linux-ha-dev/1999-October/000212.html
  40. http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/
  41. http://www.packetfactory.net/libnet
  42. file://localhost/tmp/heartbeat-2.1.4-1/doc/faqntips.html#auto_failback_upgrade
  43. mailto:rpawul@iso-ne.com
  44. mailto:dj.dule@linux.org.yu
  45. mailto:alanr@unix.sh
