:compat-mode: legacy
= Configure STONITH =

== What is STONITH? ==

STONITH (Shoot The Other Node In The Head, also known as fencing) protects your
data from being corrupted by rogue nodes or unintended concurrent access.

Just because a node is unresponsive doesn't mean it has stopped
accessing your data. The only way to be 100% sure that your data is
safe is to use STONITH to ensure that the node is truly
offline before allowing the data to be accessed from another node.

STONITH also has a role to play in the event that a clustered service
cannot be stopped. In this case, the cluster uses STONITH to force the
whole node offline, thereby making it safe to start the service
elsewhere.

== Choose a STONITH Device ==

It is crucial that your STONITH device allows the cluster to
differentiate between a node failure and a network failure.

A common mistake when choosing a STONITH device is to use a remote
power switch (such as many on-board IPMI controllers) that shares power with
the node it controls. If the power fails in such a case, the cluster cannot be
sure whether the node is really offline or is active and suffering from a
network fault, so the cluster will stop all resources to avoid a possible
split-brain situation.

Likewise, any device that relies on the machine being active (such as
SSH-based "devices" sometimes used during testing) is inappropriate.

== Configure the Cluster for STONITH ==

. Install the STONITH agent(s). To see what packages are available, run `yum
  search fence-`. Be sure to install the package(s) on all cluster nodes.

. Configure the STONITH device itself to be able to fence your nodes and accept
  fencing requests. This includes any necessary configuration on the device and
  on the nodes, and any firewall or SELinux changes needed. Test the
  communication between the device and your nodes.

. Find the correct STONITH agent script: `pcs stonith list`

. Find the parameters associated with the device: +pcs stonith describe pass:[<replaceable>agent_name</replaceable>]+

. Create a local copy of the CIB: `pcs cluster cib stonith_cfg`

. Create the fencing resource: +pcs -f stonith_cfg stonith create pass:[<replaceable>stonith_id
  stonith_device_type &#91;stonith_device_options&#93;</replaceable>]+
+
Any flags that do not take arguments, such as +--ssl+, should be passed as +ssl=1+.

. Enable STONITH in the cluster: `pcs -f stonith_cfg property set stonith-enabled=true`

. If the device does not know how to fence nodes based on their uname,
  you may also need to set the special *pcmk_host_map* parameter.  See
  `man pacemaker-fenced` for details.

. If the device does not support the *list* command, you may also need
  to set the special *pcmk_host_list* and/or *pcmk_host_check*
  parameters.  See `man pacemaker-fenced` for details.

. If the device does not expect the victim to be specified with the
  *port* parameter, you may also need to set the special
  *pcmk_host_argument* parameter. See `man pacemaker-fenced` for details.

. Commit the new configuration: `pcs cluster cib-push stonith_cfg`

. Once the STONITH resource is running, test it (you might want to stop
  the cluster on that machine first): +stonith_admin --reboot pass:[<replaceable>nodename</replaceable>]+
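
The CIB-related steps above (copy the CIB, create the resource, enable STONITH,
push the result) can be strung together in a small wrapper. The following is
only a sketch under stated assumptions: the `run` and `configure_stonith`
helpers and the `DRY_RUN` variable are hypothetical names introduced here, and
the resource ID and options in the last line are sample values. With
`DRY_RUN=1` (the default in this sketch) the commands are printed for review
rather than executed.

```shell
#!/bin/sh
# Sketch only: wraps the pcs commands from the procedure above.
# DRY_RUN, run and configure_stonith are hypothetical helper names.
DRY_RUN=${DRY_RUN:-1}      # 1 = print commands, anything else = execute them
CIB_FILE=stonith_cfg       # local CIB copy, named as in the steps above

run() {
    if [ "$DRY_RUN" = 1 ]; then
        # Printed for review only; arguments containing spaces lose their quoting.
        echo "$*"
    else
        "$@"
    fi
}

configure_stonith() {
    # $1 = resource ID, $2 = agent name, remaining arguments = agent options
    id=$1
    agent=$2
    shift 2
    run pcs cluster cib "$CIB_FILE" &&
    run pcs -f "$CIB_FILE" stonith create "$id" "$agent" "$@" &&
    run pcs -f "$CIB_FILE" property set stonith-enabled=true &&
    run pcs cluster cib-push "$CIB_FILE"
}

# Sample values; substitute your own resource ID, agent and options.
configure_stonith ipmi-fencing fence_ipmilan ipaddr=10.0.0.1 login=testuser
```

Working against a local copy of the CIB means nothing changes in the live
cluster until the final `cib-push`, so a failure in an earlier command leaves
the running configuration untouched.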

== Example ==

For this example, assume we have a chassis containing four nodes
and an IPMI device active on 10.0.0.1. Following the steps above
would go something like this:

Step 1: Install the *fence-agents-ipmilan* package on every cluster node.

Step 2: Configure the IP address, authentication credentials, etc. in the IPMI device itself.

Step 3: Choose the *fence_ipmilan* STONITH agent.

Step 4: Obtain the agent's possible parameters:
----
[root@pcmk-1 ~]# pcs stonith describe fence_ipmilan
fence_ipmilan - Fence agent for IPMI

fence_ipmilan is an I/O Fencing agentwhich can be used with machines controlled by IPMI.This agent calls support software ipmitool (http://ipmitool.sf.net/). WARNING! This fence agent might report success before the node is powered off. You should use -m/method onoff if your fence device works correctly with that option.

Stonith options:
  ipport: TCP/UDP port to use for connection with device
  hexadecimal_kg: Hexadecimal-encoded Kg key for IPMIv2 authentication
  port: IP address or hostname of fencing device (together with --port-as-ip)
  inet6_only: Forces agent to use IPv6 addresses only
  ipaddr: IP Address or Hostname
  passwd_script: Script to retrieve password
  method: Method to fence (onoff|cycle)
  inet4_only: Forces agent to use IPv4 addresses only
  passwd: Login password or passphrase
  lanplus: Use Lanplus to improve security of connection
  auth: IPMI Lan Auth type.
  cipher: Ciphersuite to use (same as ipmitool -C parameter)
  target: Bridge IPMI requests to the remote target address
  privlvl: Privilege level on IPMI device
  timeout: Timeout (sec) for IPMI operation
  login: Login Name
  verbose: Verbose mode
  debug: Write debug information to given file
  power_wait: Wait X seconds after issuing ON/OFF
  login_timeout: Wait X seconds for cmd prompt after login
  delay: Wait X seconds before fencing is started
  power_timeout: Test X seconds for status change after ON/OFF
  ipmitool_path: Path to ipmitool binary
  shell_timeout: Wait X seconds for cmd prompt after issuing command
  port_as_ip: Make "port/plug" to be an alias to IP address
  retry_on: Count of attempts to retry power on
  sudo: Use sudo (without password) when calling 3rd party sotfware.
  priority: The priority of the stonith resource. Devices are tried in order of highest priority to lowest.
  pcmk_host_map: A mapping of host names to ports numbers for devices that do not support host names. Eg. node1:1;node2:2,3 would tell the cluster to use port 1 for node1 and ports 2 and
                 3 for node2
  pcmk_host_list: A list of machines controlled by this device (Optional unless pcmk_host_check=static-list).
  pcmk_host_check: How to determine which machines are controlled by the device. Allowed values: dynamic-list (query the device), static-list (check the pcmk_host_list attribute), none
                   (assume every device can fence every machine)
  pcmk_delay_max: Enable a random delay for stonith actions and specify the maximum of random delay. This prevents double fencing when using slow devices such as sbd. Use this to enable a
                  random delay for stonith actions. The overall delay is derived from this random delay value adding a static delay so that the sum is kept below the maximum delay.
  pcmk_delay_base: Enable a base delay for stonith actions and specify base delay value. This prevents double fencing when different delays are configured on the nodes. Use this to enable
                   a static delay for stonith actions. The overall delay is derived from a random delay value adding this static delay so that the sum is kept below the maximum delay.
  pcmk_action_limit: The maximum number of actions can be performed in parallel on this device Pengine property concurrent-fencing=true needs to be configured first. Then use this to
                     specify the maximum number of actions can be performed in parallel on this device. -1 is unlimited.

Default operations:
  monitor: interval=60s
----
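
The output above describes the *pcmk_host_map* format as `node:ports` entries
separated by semicolons. Because a semicolon is also the shell's command
separator, the value has to be quoted when it is passed to `pcs`. The helper
below is a minimal sketch that assembles such a value; `build_host_map` is a
hypothetical name, and the node names and port numbers are the ones used in the
description text, not part of this example cluster.

```shell
#!/bin/sh
# Sketch only: join "node:ports" entries with ";" into a pcmk_host_map value.
# build_host_map is a hypothetical helper name.
build_host_map() {
    map=""
    for pair in "$@"; do
        # Prefix a ";" only when something has already been collected.
        map="${map:+$map;}$pair"
    done
    printf '%s\n' "$map"
}

host_map=$(build_host_map node1:1 node2:2,3)

# Quote the value wherever it is used so the shell does not split at ";".
echo "pcmk_host_map=\"$host_map\""
```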

Step 5: `pcs cluster cib stonith_cfg`

Step 6: Here are example parameters for creating our STONITH resource:
----
[root@pcmk-1 ~]# pcs -f stonith_cfg stonith create ipmi-fencing fence_ipmilan \
      pcmk_host_list="pcmk-1 pcmk-2" ipaddr=10.0.0.1 login=testuser \
      passwd=acd123 op monitor interval=60s
[root@pcmk-1 ~]# pcs -f stonith_cfg stonith
 ipmi-fencing	(stonith:fence_ipmilan):	Stopped 
----
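
The resource-creation step of the procedure states that agent flags without
arguments are passed as `name=1`. The describe output pairs the
`--port-as-ip` flag with the *port_as_ip* option, which suggests that hyphens
in a flag name become underscores. The sketch below applies that conversion;
`flag_to_option` is a hypothetical helper, and the hyphen-to-underscore rule is
an assumption drawn from that one pairing, so check each result against
`pcs stonith describe` for your agent.

```shell
#!/bin/sh
# Sketch only: convert an argument-less agent flag into a pcs name=1 option.
# flag_to_option is a hypothetical helper name.
flag_to_option() {
    name=${1#--}                                  # drop the leading "--"
    name=$(printf '%s' "$name" | tr '-' '_')      # assumed: hyphens map to underscores
    printf '%s=1\n' "$name"
}

flag_to_option --lanplus
flag_to_option --port-as-ip
```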

Steps 7-10: Enable STONITH in the cluster:
----
[root@pcmk-1 ~]# pcs -f stonith_cfg property set stonith-enabled=true
[root@pcmk-1 ~]# pcs -f stonith_cfg property
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: mycluster
 dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
 have-watchdog: false
 stonith-enabled: true
----

Step 11: `pcs cluster cib-push stonith_cfg --config`

Step 12: Test:
----
[root@pcmk-1 ~]# pcs cluster stop pcmk-2
[root@pcmk-1 ~]# stonith_admin --reboot pcmk-2
----

After a successful test, log in to any rebooted nodes, and start the cluster
(with `pcs cluster start`).
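
When the fencing test is scripted, it is easy to reboot the machine you are
working on by mistake. The following sketch wraps the two test commands from
Step 12 and refuses to run when the target matches the local node name. The
`fence_test` and `run` helpers and the `DRY_RUN` variable are hypothetical
names, and comparing against `uname -n` assumes that cluster node names match
that value, which may not hold on every cluster.

```shell
#!/bin/sh
# Sketch only: guarded version of the Step 12 fencing test.
# DRY_RUN, run and fence_test are hypothetical helper names.
DRY_RUN=${DRY_RUN:-1}      # 1 = print commands, anything else = execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

fence_test() {
    target=$1
    if [ -z "$target" ]; then
        echo "usage: fence_test NODE" >&2
        return 2
    fi
    # Assumption: the cluster node name equals the output of "uname -n".
    if [ "$target" = "$(uname -n)" ]; then
        echo "refusing to fence the local node ($target)" >&2
        return 1
    fi
    run pcs cluster stop "$target" &&
    run stonith_admin --reboot "$target"
}

fence_test pcmk-2
```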
