
Heartbeat reporting

   Dejan Muhamedagic
   <[1]dmuhamedagic@suse.de>
   v1.0

   hb_report  is  a  utility  to  collect  all  information  relevant  to
   Heartbeat over the given period of time.

Quick start

   Run  hb_report  on  one  of the nodes or on the host which serves as a
   central log server. Run hb_report without parameters to see usage.

   A few examples:
    1. Last   night   during  the  backup  there  were  several  warnings
       encountered (logserver is the log host):
logserver# hb_report -f 3:00 -t 4:00 /tmp/report
       collects everything from all nodes from 3am to 4am last night. The
       files  are  stored  in  /tmp/report  and  compressed  to a tarball
       /tmp/report.tar.gz.
    2. Just found a problem during testing:
node1# date : note the current time
node1# /etc/init.d/heartbeat start
node1# nasty_command_that_breaks_things
node1# sleep 120 : wait for the cluster to settle
node1# hb_report -f time /tmp/hb1
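
   In the second example, time stands for the moment noted with date.
   A minimal sketch of keeping that moment in a shell variable instead
   of reading it off the screen. The variable name start and the
   timestamp format are assumptions, the format being modelled on the
   2007/9/1 2pm form shown under Times:

```shell
# Record the current time in a form like "2007/09/01 14:05".
# The format is an assumption; adjust it if hb_report rejects it.
start=$(date '+%Y/%m/%d %H:%M')

# ... start heartbeat, reproduce the problem, wait for the cluster ...

# Print the command to run; the inner quotes matter because the
# timestamp contains a space.
echo "hb_report -f \"$start\" /tmp/hb1"
```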

Introduction

   Managing  clusters  is  cumbersome.  Heartbeat  v2  with  its numerous
   configuration   files   and  multi-node  clusters  just  adds  to  the
   complexity.  No  wonder  then that most problem reports were less than
   optimal.  This  is  an attempt to rectify that situation and make life
   easier for both the users and the developers.

On security

   hb_report  is  a  fairly  complex program. As some of you are probably
   going to run it as root let us state a few important things you should
   keep in mind:
    1. Don't run hb_report as root! It is fairly simple to set things up
       in such a way that root access is not needed. I won't go into
       details, just stress that all information collected should be
       readable by accounts belonging to the haclient group.
    2. If you still have to run this as root, then at least don't use
       the -C option.
    3. Of course, every possible precaution has been taken not to disturb
       processes,  or  touch or remove files out of the given destination
       directory.  If  you  (by  mistake)  specify an existing directory,
       hb_report  will  bail out soon. If you specify a relative path, it
       won't work either.

   The final product of hb_report is a tarball. However, the destination
   directory is not removed on any node, unless the user specifies -C. If
   you're too lazy to clean up after the previous run, do yourself a
   favour and just supply a new destination directory. You've been
   warned. If you worry about the space used, just put all your
   directories under /tmp and set up a cronjob to remove those
   directories once a week:
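        for d in /tmp/*; do
                # skip anything that is not a directory
                test -d "$d" ||
                        continue
                # skip directories without an hb_report marker file
                test -f "$d/description.txt" || test -f "$d/.env" ||
                        continue
                # remove only if a marker confirms hb_report created it
                grep -qs 'By: hb_report' "$d/description.txt" ||
                        grep -qs '^UNIQUE_MSG=Mark' "$d/.env" ||
                        continue
                rm -r "$d"
        done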

Mode of operation

   Cluster   data  collection  is  straightforward:  just  run  the  same
   procedure  on  all nodes and collect the reports. There is, apart from
   many  small  ones, one large complication: central syslog destination.
   So,  in order to allow this to be fully automated, we should sometimes
   run  the  procedure  on  the log host too. Actually, if there is a log
   host, then the best way is to run hb_report there.

   We  use  ssh  for  the  remote  program  invocation. Even though it is
   possible  to run hb_report without ssh by doing a more menial job, the
   overall  user experience is much better if ssh works. Anyway, how else
   do you manage your cluster?

   Another  ssh  related  point:  In case your security policy proscribes
   loghost-to-cluster-over-ssh  communications,  then you'll have to copy
   the log file to one of the nodes and point hb_report to it.

Prerequisites

    1. ssh
       This is not strictly required, but you won't regret having a
       password-less ssh. It is not too difficult to set up and will save
       you a lot of time. If you can't have it, for example because your
       security policy does not allow such a thing, or you just prefer
       menial work, then you will have to resort to the semi-manual,
       semi-automated report generation. See below for instructions.
       If you need to supply a password or passphrase to log in, then
       please use the -u option.
    2. Times
       In  order  to  find  files and messages in the given period and to
       parse  the  -f  and -t options, hb_report uses perl and one of the
       Date::Parse  or  Date::Manip perl modules. Note that you need only
       one  of  these. Furthermore, on nodes which have no logs and where
       you don't run hb_report directly, no date parsing is necessary. In
       other  words,  if  you  run  this on a loghost then you don't need
       these perl modules on the cluster nodes.
       On   rpm   based   distributions,  you  can  find  Date::Parse  in
       perl-TimeDate    and    on   Debian   and   its   derivatives   in
       libtimedate-perl.
    3. Core dumps
       To backtrace core dumps, gdb is needed along with the Heartbeat
       packages containing the debugging info. The debug info packages
       may be installed at the time the report is created. Let's hope
       that you will need this only rarely.
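
   The prerequisites above can be checked before the first run. This is
   only a sketch: the function name check_prereqs is made up for the
   example, and it tests no more than whether the tools can be found on
   the host where it runs.

```shell
# Report which of the helpers are available on this host.
# Prints one "name: ok" or "name: missing" line per prerequisite.
check_prereqs() {
    # ssh: used to reach the other nodes
    if command -v ssh >/dev/null 2>&1; then
        echo "ssh: ok"
    else
        echo "ssh: missing"
    fi

    # Date parsing: either Date::Parse or Date::Manip is enough
    if perl -MDate::Parse -e 1 2>/dev/null ||
       perl -MDate::Manip -e 1 2>/dev/null; then
        echo "perl date module: ok"
    else
        echo "perl date module: missing"
    fi

    # gdb: only needed to backtrace core dumps
    if command -v gdb >/dev/null 2>&1; then
        echo "gdb: ok"
    else
        echo "gdb: missing"
    fi
}

check_prereqs
```

   Remember that the perl modules are needed only where hb_report is run
   directly or where logs have to be searched.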

What is in the report

    1. Heartbeat related
          + heartbeat version/release information
          + heartbeat configuration (CIB, ha.cf, logd.cf)
          + heartbeat status (output from crm_mon, crm_verify, ccm_tool)
          + pengine transition graphs (if any)
          + backtraces of core dumps (if any)
          + heartbeat logs (if any)
    2. System related
          + general platform information (uname, arch, distribution)
          + system statistics (uptime, top, ps, netstat -i, arp)
    3. User created :)
          + problem description (template to be edited)
    4. Generated
          + problem analysis (generated)

   It is preferred that Heartbeat is running at the time of the report,
   but not absolutely required. hb_report will also do a quick analysis
   of the collected information.

Times

   Specifying times can at times be a nuisance. That is why we have
   chosen to use one of the perl modules: they do allow certain freedom
   when talking dates. You can either read the instructions at the
   [2]Date::Parse examples page or just rely on common sense and try
   stuff like:
3:00          (today at 3am)
15:00         (today at 3pm)
2007/9/1 2pm  (September 1st at 2pm)

   hb_report will (probably) complain if it can't figure out what you
   mean.

   Try to delimit the event as closely as possible in order to reduce
   the size of the report, while still leaving a minute or two on either
   side for good measure.

   Note that -f is mandatory. And don't forget to quote dates when they
   contain spaces.

   It is also possible to extract a CTS test. Just prefix the test number
   with cts: in the -f option.
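
   Why the quoting matters can be seen without running hb_report at all.
   The helper count_args below is made up for this illustration; it just
   prints how many arguments the shell hands to a command.

```shell
# Print the number of arguments received.
count_args() {
    echo $#
}

# Unquoted: the shell splits the date at the space, so the command
# sees "2007/9/1" and "2pm" as two separate arguments.
count_args -f 2007/9/1 2pm /tmp/report      # prints 4

# Quoted: the date reaches the command as a single argument.
count_args -f "2007/9/1 2pm" /tmp/report    # prints 3
```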

Should I send all this to the rest of the Internet?

   We  make  an  effort  to  remove  sensitive  data  from  the Heartbeat
   configuration  (CIB,  ha.cf, and transition graphs). However, you have
   to  tell us what is sensitive! Use the -p option to specify additional
   regular   expressions  to  match  variable  names  which  may  contain
   information you don't want to leak. For example:
# hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report

   We  look  by  default  for  variable  names  matching "pass.*" and the
   stonith_host ha.cf directive.

   Logs  and other files are not filtered. Please filter them yourself if
   necessary.
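
   As for filtering logs yourself, something along these lines may do as
   a starting point. It is a sketch only: the function name mask_secrets
   and the name=value log format are assumptions, so adapt the pattern to
   what your logs really contain.

```shell
# Mask the value of anything that looks like name=value where the
# name starts with "pass". Reads stdin, writes stdout.
mask_secrets() {
    sed -E 's/(pass[A-Za-z_]*=)[^[:space:]]+/\1****/g'
}

# Made-up log line for illustration
echo 'stonith: login=admin password=hunter2 ok' | mask_secrets
```

   Run a copy of the log through the filter and check the result by eye
   before sending it anywhere.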

Logs

   It may be tricky to find syslog logs. The scheme used is to log a
   unique message on all nodes and then look it up in the usual syslog
   locations. This procedure is not foolproof, in particular if the
   syslog files are in a non-standard directory. We look in /var/log,
   /var/logs, /var/syslog, /var/adm, /var/log/ha, and /var/log/cluster.
   In case we can't find the logs, please supply their location:
# hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1
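
   The lookup scheme described above can be tried by hand to see where
   your syslog really writes. This is a rough sketch of the idea and not
   the code hb_report uses; the function name find_marked_logs and the
   marker text are made up.

```shell
# Print the files in the given directories that contain the marker.
# $1 is the marker text, the remaining arguments are directories.
find_marked_logs() {
    marker=$1
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # -l: file names only, -s: quiet about unreadable files,
        # -F: treat the marker as a fixed string
        grep -lsF -- "$marker" "$dir"/* 2>/dev/null
    done
    return 0
}

# Usage on a node (not executed here):
#   logger "Mark:hb_report-test"
#   find_marked_logs "Mark:hb_report-test" /var/log /var/logs \
#       /var/syslog /var/adm /var/log/ha /var/log/cluster
```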

   If  you have different log locations on different nodes, well, perhaps
   you'd like to make them the same and make life easier for everybody.

   The log files are collected from all hosts where found. In case your
   syslog is configured to log to both the log server and local files and
   hb_report is run on the log server, you will end up with multiple logs
   with the same content.

   Files starting with "ha-" are preferred. In case syslog sends messages
   to more than one file, if one of them is named ha-log or ha-debug,
   those will be favoured over syslog or messages.

   If there is no separate log for Heartbeat, possibly unrelated messages
   from  other  programs  are included. We don't filter logs, just pick a
   segment for the period you specified.

   NB: Don't have a central log host? Read the CTS README and set one up.

Manual report collection

   So,  your  ssh  doesn't  work. In that case, you will have to run this
   procedure on all nodes. Use -S so that we don't bother with ssh:
# hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1

   If  you  also have a log host which is not in the cluster, then you'll
   have to copy the log to one of the nodes and tell us where it is:
# hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1

   Furthermore, to prevent hb_report from asking you to edit the report
   to describe the problem on every node, use -D on all but one:
# hb_report -f 5:20pm -t 5:30pm -DS /tmp/report_node1
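
   With more than a couple of nodes it is easy to lose track of which
   one gets which options. The sketch below only prints the commands to
   type on each node; it runs nothing. The function name manual_commands
   and the node names node2 and node3 are made up.

```shell
# Print the hb_report command to run by hand on each node.
# $1 and $2 are the -f and -t times, the rest are node names.
# The first node keeps the description prompt; the others get -D.
manual_commands() {
    from=$1
    to=$2
    shift 2
    first=yes
    for node in "$@"; do
        if [ "$first" = yes ]; then
            opts="-S"
            first=no
        else
            opts="-DS"
        fi
        echo "$node# hb_report -f $from -t $to $opts /tmp/report_$node"
    done
}

manual_commands 5:20pm 5:30pm node1 node2 node3
```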

   If  you  reconsider  and  want  the  ssh setup, take a look at the CTS
   README file for instructions.

Analysis

   The point of analysis is to extract the most important information
   from what is probably several thousand lines of text. Perhaps this
   would be more properly called a report review, as it is rather simple,
   but let's pretend that we are doing something utterly sophisticated.

   The analysis consists of the following:
     * compare files coming from different nodes; if they are equal, make
       one copy in the top level directory, remove duplicates, and create
       soft links instead
     * print errors, warnings, and lines matching -L patterns from logs
     * report if there were coredumps and by whom
     * report crm_verify results

The goods

    1. Common
          + ha-log (if found on the log host)
          + description.txt (template and user report)
          + analysis.txt
    2. Per node
          + ha.cf
          + logd.cf
          + ha-log (if found)
          + cib.xml (cibadmin -Ql or cp if Heartbeat is not running)
          + ccm_tool.txt (ccm_tool -p)
          + crm_mon.txt (crm_mon -1)
          + crm_verify.txt (crm_verify -V)
          + pengine/ (only on DC, directory with pengine transitions)
          + sysinfo.txt (static info)
          + sysstats.txt (dynamic info)
          + backtraces.txt (if coredumps found)
          + DC (well...)
          + RUNNING or STOPPED

   Last updated 29-Nov-2007 16:12:02 CEST

References

   1. mailto:dmuhamedagic@suse.de
   2. http://search.cpan.org/dist/TimeDate/lib/Date/Parse.pm#EXAMPLE_DATES
