Pacemaker divides errors into three categories: soft, hard, and fatal. Hard and fatal errors stem from environment or configuration problems and cannot be repaired automatically; they require manual intervention. General faults, such as a service process crash or a network failure, are reported with the return value OCF_ERR_GENERIC, which belongs to the soft type.
Table B.3. Types of recovery performed by the cluster
Type | Description | Action Taken by the Cluster |
---|---|---|
soft | A transient error occurred | Restart the resource or move it to a new location |
hard | A non-transient error that may be specific to the current node occurred | Move the resource elsewhere and prevent it from being retried on the current node |
fatal | A non-transient error that will be common to all cluster nodes (eg. a bad configuration was specified) | Stop the resource and prevent it from being started on any cluster node |
Table B.4. OCF Return Codes and their Recovery Types
RC | OCF Alias | Description | RT |
---|---|---|---|
0 | OCF_SUCCESS | Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands. | soft |
1 | OCF_ERR_GENERIC | Generic "there was a problem" error code. | soft |
2 | OCF_ERR_ARGS | The resource’s configuration is not valid on this machine. Eg. refers to a location/tool not found on the node. | hard |
3 | OCF_ERR_UNIMPLEMENTED | The requested action is not implemented. | hard |
4 | OCF_ERR_PERM | The resource agent does not have sufficient privileges to complete the task. | hard |
5 | OCF_ERR_INSTALLED | The tools required by the resource are not installed on this machine. | hard |
6 | OCF_ERR_CONFIGURED | The resource’s configuration is invalid. Eg. required parameters are missing. | fatal |
7 | OCF_NOT_RUNNING | The resource is safely stopped. The cluster will not attempt to stop a resource that returns this for any action. | N/A |
8 | OCF_RUNNING_MASTER | The resource is running in Master mode. | soft |
9 | OCF_FAILED_MASTER | The resource is in Master mode but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again. | soft |
other | NA | Custom error code. | soft |
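The mapping in Table B.4 can be written down as a small lookup, which is sometimes handy when reading logs. The following is only a sketch: the names `RECOVERY_TYPE` and `recovery_type` are hypothetical helpers and are not part of Pacemaker; the values are copied from the table above.

```python
# Recovery type per OCF return code, transcribed from Table B.4.
# RECOVERY_TYPE and recovery_type() are illustrative names, not a Pacemaker API.
RECOVERY_TYPE = {
    0: ("OCF_SUCCESS", "soft"),
    1: ("OCF_ERR_GENERIC", "soft"),
    2: ("OCF_ERR_ARGS", "hard"),
    3: ("OCF_ERR_UNIMPLEMENTED", "hard"),
    4: ("OCF_ERR_PERM", "hard"),
    5: ("OCF_ERR_INSTALLED", "hard"),
    6: ("OCF_ERR_CONFIGURED", "fatal"),
    7: ("OCF_NOT_RUNNING", "N/A"),
    8: ("OCF_RUNNING_MASTER", "soft"),
    9: ("OCF_FAILED_MASTER", "soft"),
}


def recovery_type(rc):
    """Return the recovery type for a return code.

    Codes not listed in the table are custom error codes,
    which the table classifies as soft.
    """
    _alias, rtype = RECOVERY_TYPE.get(rc, ("custom", "soft"))
    return rtype


for rc in (1, 5, 6, 42):
    print(rc, recovery_type(rc))
```

Looking up a code that is not in the table, such as 42 here, falls through to the "other" row and is treated as soft.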
Each resource operation has an on-fail attribute that controls how errors are handled.
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_monitoring_resources_for_failure
Table 5.3. Properties of an Operation
Field | Description |
---|---|
id | Your name for the action. Must be unique. |
name | The action to perform. Common values: monitor, start, stop |
interval | How frequently (in seconds) to perform the operation. Default value: 0, meaning never. |
timeout | How long to wait before declaring the action has failed. |
on-fail | The action to take if this action ever fails. Allowed values: ignore (pretend the resource did not fail); block (don’t perform any further operations on the resource); stop (stop the resource and do not start it elsewhere); restart (stop the resource and start it again, possibly on a different node); fence (STONITH the node on which the resource failed); standby (move all resources away from the node on which the resource failed). The default for the stop operation is fence when STONITH is enabled and block otherwise. All other operations default to stop. |
enabled | If false, the operation is treated as if it does not exist. Allowed values: true, false |
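To make the fields of Table 5.3 concrete, here is a minimal sketch that builds an `op` element of the kind used for operations in the cluster configuration (CIB) XML. The helper `make_op`, the id `dummy-monitor-10`, and the interval and timeout values are assumptions chosen for illustration; only the attribute names and the allowed on-fail values come from the table.

```python
import xml.etree.ElementTree as ET

# Allowed on-fail values as listed in Table 5.3.
ALLOWED_ON_FAIL = {"ignore", "block", "stop", "restart", "fence", "standby"}


def make_op(op_id, name, interval, timeout, on_fail):
    """Build an <op> element; make_op is a hypothetical helper, not a Pacemaker API."""
    if on_fail not in ALLOWED_ON_FAIL:
        raise ValueError("unsupported on-fail value: %r" % on_fail)
    return ET.Element("op", {
        "id": op_id,          # must be unique
        "name": name,         # e.g. monitor, start, stop
        "interval": interval, # how often to run the operation
        "timeout": timeout,   # how long to wait before declaring failure
        "on-fail": on_fail,   # action to take if the operation fails
    })


# Example values are made up for illustration.
op = make_op("dummy-monitor-10", "monitor", "10", "30", "restart")
op_xml = ET.tostring(op, encoding="unicode")
print(op_xml)
```

The sketch only assembles the XML fragment; loading it into a running cluster would be done with the usual cluster administration tools.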
However, actual testing showed that the on-fail setting has no effect: whatever value is configured, the cluster always applies the default behavior.
The following table shows how the resource manager responds when each Resource Agent operation returns OCF_ERR_GENERIC:
Operation | Error handling | Corresponding on-fail value |
---|---|---|
start | Set fail-count=1000000; call stop on this node; start the resource on other nodes | restart |
stop | Set fail-count=1000000; prevent further operations on the resource, and the resource becomes unmanaged with FAILED status, as follows: dummy (ocf::heartbeat:Dummy2): Started srdsdevapp69 (unmanaged) FAILED | block |
monitor | Set fail-count=1; call stop, start, and monitor on this node in sequence. If the monitor still fails, repeat stop, start, and monitor until the fail-count reaches the migration-threshold, then keep the resource in the stopped state. | restart |
promote | Set fail-count=1; call demote, stop, and start on this node in sequence; call promote on other nodes to promote the resources on other nodes to master | restart |
demote | Set fail-count=1; call stop, start, and demote on this node in sequence. If the demote still fails, repeat stop, start, and demote until the fail-count reaches the migration-threshold, then keep the resource in the stopped state. | restart |
notify | ignore | ignore |
Note 1: Timeouts are handled the same way as OCF_ERR_GENERIC.
Note 2: Pacemaker will not call post stop notify for resources that have been stopped.
Note 3: Test environment: Pacemaker 1.1.7-6, CentOS 6.3.
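The monitor row of the table above can be illustrated with a toy model. This is a sketch of the observed behavior only, not of Pacemaker's actual implementation; the function `simulate_monitor_failures` is hypothetical, and it assumes that fail-count grows by one per failed monitor, as the description "until the fail-count reaches the migration-threshold" suggests.

```python
def simulate_monitor_failures(migration_threshold, failures):
    """Toy model of the observed recovery after repeated monitor failures.

    migration_threshold: value of the resource's migration-threshold.
    failures: number of consecutive monitor calls that return OCF_ERR_GENERIC.
    Returns (fail_count, actions, final_state).
    """
    fail_count = 0
    actions = []
    for _ in range(failures):
        fail_count += 1  # assumption: one increment per failed monitor
        if fail_count >= migration_threshold:
            # Threshold reached: the resource is stopped and left stopped.
            actions.append("stop")
            return fail_count, actions, "stopped"
        # Below the threshold: recover in place and check again.
        actions += ["stop", "start", "monitor"]
    return fail_count, actions, "started"


count, actions, state = simulate_monitor_failures(migration_threshold=3, failures=5)
print(count, state)
print(actions)
```

With a threshold of 3, the model recovers the resource in place twice and stops it for good on the third failure, which matches the sequence described for monitor in the table.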
The test results above offer several lessons for Resource Agent writers: