Error Handling in Pacemaker Resource Agents

1. Preface

Pacemaker controls resources by calling the operations (such as start and stop) that each resource agent provides. When one of these operations fails, Pacemaker handles the error differently depending on which operation was being performed and on the type of error.

2. Error types

Pacemaker divides errors into three categories: soft, hard, and fatal. Hard and fatal errors stem from environment or configuration problems and cannot be repaired automatically; they require manual intervention. Ordinary faults, such as a crashed service process or a network failure, are reported with the return value OCF_ERR_GENERIC, which belongs to the soft type.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_how_are_ocf_return_codes_interpreted

B.3. How are OCF Return Codes Interpreted?

The first thing the cluster does is to check the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed and recovery action is initiated. There are three types of failure recovery:

Table B.3. Types of recovery performed by the cluster

Type	Description	Action Taken by the Cluster
soft	A transient error occurred	Restart the resource or move it to a new location
hard	A non-transient error that may be specific to the current node occurred	Move the resource elsewhere and prevent it from being retried on the current node
fatal	A non-transient error that will be common to all cluster nodes (eg. a bad configuration was specified)	Stop the resource and prevent it from being started on any cluster node

Assuming an action is considered to have failed, the following table outlines the different OCF return codes and the type of recovery the cluster will initiate when it is received.

B.4. OCF Return Codes

Table B.4. OCF Return Codes and their Recovery Types

RC	OCF Alias	Description	RT
0	OCF_SUCCESS	Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands.	soft
1	OCF_ERR_GENERIC	Generic "there was a problem" error code.	soft
2	OCF_ERR_ARGS	The resource’s configuration is not valid on this machine. Eg. refers to a location/tool not found on the node.	hard
3	OCF_ERR_UNIMPLEMENTED	The requested action is not implemented.	hard
4	OCF_ERR_PERM	The resource agent does not have sufficient privileges to complete the task.	hard
5	OCF_ERR_INSTALLED	The tools required by the resource are not installed on this machine.	hard
6	OCF_ERR_CONFIGURED	The resource’s configuration is invalid. Eg. required parameters are missing.	fatal
7	OCF_NOT_RUNNING	The resource is safely stopped. The cluster will not attempt to stop a resource that returns this for any action.	N/A
8	OCF_RUNNING_MASTER	The resource is running in Master mode.	soft
9	OCF_FAILED_MASTER	The resource is in Master mode but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again.	soft
other	NA	Custom error code.	soft

Although counterintuitive, even actions that return 0 (aka. OCF_SUCCESS) can be considered to have failed.
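The interpretation rule quoted above can be sketched in a few lines of Python. This is only an illustration of Table B.4, not Pacemaker's own code: the names `RECOVERY_TYPE` and `classify` are made up for this example, and the expected return code is passed in by the caller.

```python
# Recovery types copied from Table B.4, keyed by OCF return code.
RECOVERY_TYPE = {
    0: "soft",   # OCF_SUCCESS
    1: "soft",   # OCF_ERR_GENERIC
    2: "hard",   # OCF_ERR_ARGS
    3: "hard",   # OCF_ERR_UNIMPLEMENTED
    4: "hard",   # OCF_ERR_PERM
    5: "hard",   # OCF_ERR_INSTALLED
    6: "fatal",  # OCF_ERR_CONFIGURED
    7: "N/A",    # OCF_NOT_RUNNING
    8: "soft",   # OCF_RUNNING_MASTER
    9: "soft",   # OCF_FAILED_MASTER
}


def classify(rc, expected_rc):
    """Return None if rc matches the expected result, else the recovery type.

    Hypothetical helper: it only mirrors the table, it is not a Pacemaker API.
    """
    if rc == expected_rc:
        return None  # the operation is considered successful
    # Codes not listed in the table ("other") are treated as soft.
    return RECOVERY_TYPE.get(rc, "soft")


# A start that returns OCF_ERR_GENERIC is a soft failure.
print(classify(1, expected_rc=0))
# A return code of 0 still counts as a failure if something else was expected,
# for instance if the expected value was 8 (OCF_RUNNING_MASTER).
print(classify(0, expected_rc=8))
```

The second call shows why a return value of 0 can still be a failure: what matters is the mismatch with the expected value, not the code itself.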

3. Error handling

Each resource operation has an on-fail attribute that controls how a failure of that operation is handled.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_monitoring_resources_for_failure

Table 5.3. Properties of an Operation

Field	Description
id	Your name for the action. Must be unique.
name	The action to perform. Common values: monitor, start, stop
interval	How frequently (in seconds) to perform the operation. Default value: 0, meaning never.
timeout	How long to wait before declaring the action has failed.
on-fail	The action to take if this action ever fails. Allowed values: ignore - Pretend the resource did not fail; block - Don’t perform any further operations on the resource; stop - Stop the resource and do not start it elsewhere; restart - Stop the resource and start it again (possibly on a different node); fence - STONITH the node on which the resource failed; standby - Move all resources away from the node on which the resource failed. The default for the stop operation is fence when STONITH is enabled and block otherwise. All other operations default to stop.
enabled	If false, the operation is treated as if it does not exist. Allowed values: true, false
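To make the fields in the table concrete, here is a minimal Python sketch that assembles one operation definition as an XML element and rejects on-fail values the table does not list. The helper name `make_op`, the element name `op`, and the sample id and values are assumptions for illustration; they are not taken from the tests described in this article.

```python
import xml.etree.ElementTree as ET

# Allowed on-fail values, as listed in Table 5.3.
ALLOWED_ON_FAIL = {"ignore", "block", "stop", "restart", "fence", "standby"}


def make_op(op_id, name, interval="0", timeout=None, on_fail=None, enabled=None):
    """Build an operation element from the fields in Table 5.3 (hypothetical helper)."""
    if on_fail is not None and on_fail not in ALLOWED_ON_FAIL:
        raise ValueError(f"unsupported on-fail value: {on_fail!r}")
    attrs = {"id": op_id, "name": name, "interval": interval}
    # Optional fields are only written when the caller sets them.
    if timeout is not None:
        attrs["timeout"] = timeout
    if on_fail is not None:
        attrs["on-fail"] = on_fail
    if enabled is not None:
        attrs["enabled"] = "true" if enabled else "false"
    return ET.Element("op", attrs)


# Example: a monitor operation every 10 seconds with a 20 second timeout.
op = make_op("dummy-monitor-10", "monitor", interval="10", timeout="20",
             on_fail="restart")
print(ET.tostring(op, encoding="unicode"))
```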

However, actual testing showed that the on-fail setting made no difference: whatever value was configured, the default behavior always applied.

The table below shows how the resource manager reacts when each resource agent operation returns OCF_ERR_GENERIC:

Operation	Error handling	Corresponding on-fail value
start	Set fail-count=1000000; call stop on this node; start the resource on another node.	restart
stop	Set fail-count=1000000; block further operations on the resource, which becomes unmanaged and FAILED, as follows: dummy(ocf::heartbeat:Dummy2):Started srdsdevapp69 (unmanaged) FAILED	block
monitor	Set fail-count=1; call stop, start, and monitor on this node in sequence. If monitor still fails, repeat stop, start, and monitor until fail-count reaches migration-threshold, then keep the resource stopped.	restart
promote	Set fail-count=1; call demote, stop, and start on this node in sequence, then call promote on another node to make the instance there the master.	restart
demote	Set fail-count=1; call stop, start, and demote on this node in sequence. If demote still fails, repeat stop, start, and demote until fail-count reaches migration-threshold, then keep the resource stopped.	restart
notify	The failure is ignored.	ignore

Note 1: A timeout is handled the same way as OCF_ERR_GENERIC.

Note 2: Pacemaker does not call the post-stop notify for resources that have already been stopped.

Note 3: Test environment: Pacemaker 1.1.7-6, CentOS 6.3.
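The monitor row of the table can be replayed with a small single-node simulation. This is a rough model of the observed behavior, not Pacemaker's scheduler: the function name `simulate_monitor_failures` is invented for this sketch, and it assumes that fail-count grows by one for every failed monitor.

```python
def simulate_monitor_failures(migration_threshold, monitor_results):
    """Replay monitor results (True = success) and log the recovery steps.

    Returns the final fail-count and the list of logged steps.
    """
    fail_count = 0
    log = []
    for ok in monitor_results:
        if ok:
            log.append("monitor ok")
            continue
        fail_count += 1  # assumption: one increment per failed monitor
        log.append(f"monitor failed, fail-count={fail_count}")
        if fail_count >= migration_threshold:
            # Threshold reached: the resource is stopped and stays stopped.
            log.append("stop (threshold reached, resource stays stopped)")
            return fail_count, log
        # Below the threshold: restart on the same node and monitor again.
        log.append("stop")
        log.append("start")
    return fail_count, log


count, steps = simulate_monitor_failures(3, [False, False, False])
for step in steps:
    print(step)
```

With a threshold of 3 and a monitor that always fails, the resource is restarted twice before it is left stopped, which is the loop that a small migration-threshold cuts short.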

4. Lessons

The test results above suggest several lessons for resource agent writers:

1. Do not let the stop operation return an error unless absolutely necessary.
2. Keep the success criteria of monitor and start consistent: monitor should never fail immediately after start has succeeded, otherwise a restart loop may result.
3. A demote that runs after a successful restart should not fail, otherwise a restart loop may result.
4. Set migration-threshold to a relatively small value (the default is INFINITY, that is, 1000000), which also limits the impact of items 2 and 3 above.
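Items 1 and 2 can be illustrated with a toy agent in Python. Everything here is a sketch under stated assumptions: `FakeService` is a hypothetical stand-in for a real daemon that needs a few polls to become ready, and `start`, `stop`, and `monitor` are plain functions rather than a real OCF agent. Only the three return code values come from Table B.4.

```python
import time

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


class FakeService:
    """Hypothetical daemon that becomes healthy only after a few health polls."""

    def __init__(self, polls_until_ready=3):
        self.launched = False
        self.polls_until_ready = polls_until_ready
        self._polls = 0

    def launch(self):
        self.launched = True
        self._polls = 0

    def kill(self):
        self.launched = False

    def is_healthy(self):
        if not self.launched:
            return False
        self._polls += 1
        return self._polls >= self.polls_until_ready


def monitor(svc):
    return OCF_SUCCESS if svc.is_healthy() else OCF_NOT_RUNNING


def start(svc, timeout=5.0, poll_interval=0.01):
    """Return success only once monitor itself reports success (item 2)."""
    if monitor(svc) == OCF_SUCCESS:
        return OCF_SUCCESS  # already running
    svc.launch()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if monitor(svc) == OCF_SUCCESS:
            return OCF_SUCCESS
        time.sleep(poll_interval)
    return OCF_ERR_GENERIC


def stop(svc):
    """Stopping an already stopped service is still a success (item 1)."""
    svc.kill()
    return OCF_SUCCESS if not svc.launched else OCF_ERR_GENERIC


svc = FakeService()
print(start(svc), monitor(svc), stop(svc), stop(svc), monitor(svc))
```

Because start uses the same check as monitor, the first monitor after a successful start cannot contradict it, and a repeated stop never reports an error.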