Error handling of Pacemaker Resource Agent

WBOY
Release: 2016-07-12 08:59:48

1. Preface

Pacemaker controls resources by calling the operations (such as start and stop) that each resource agent provides. When one of these operations fails, Pacemaker handles the error differently depending on which operation was executed and what type of error occurred.

2. Error types

Pacemaker divides errors into three categories: soft, hard, and fatal. Hard and fatal errors stem from environment or configuration problems and cannot be repaired automatically; they need manual intervention. Ordinary faults, such as a crashed service process or a network failure, are reported with the return value OCF_ERR_GENERIC, which belongs to the soft category.


http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_how_are_ocf_return_codes_interpreted

B.3. How are OCF Return Codes Interpreted?

The first thing the cluster does is to check the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed and recovery action is initiated. There are three types of failure recovery:

Table B.3. Types of recovery performed by the cluster

| Type | Description | Action Taken by the Cluster |
| --- | --- | --- |
| soft | A transient error occurred | Restart the resource or move it to a new location |
| hard | A non-transient error that may be specific to the current node occurred | Move the resource elsewhere and prevent it from being retried on the current node |
| fatal | A non-transient error that will be common to all cluster nodes (e.g. a bad configuration was specified) | Stop the resource and prevent it from being started on any cluster node |



Assuming an action is considered to have failed, the following table outlines the different OCF return codes and the type of recovery the cluster will initiate when it is received.

B.4. OCF Return Codes

Table B.4. OCF Return Codes and their Recovery Types

| RC | OCF Alias | Description | RT |
| --- | --- | --- | --- |
| 0 | OCF_SUCCESS | Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands. | soft |
| 1 | OCF_ERR_GENERIC | Generic "there was a problem" error code. | soft |
| 2 | OCF_ERR_ARGS | The resource’s configuration is not valid on this machine. E.g. refers to a location/tool not found on the node. | hard |
| 3 | OCF_ERR_UNIMPLEMENTED | The requested action is not implemented. | hard |
| 4 | OCF_ERR_PERM | The resource agent does not have sufficient privileges to complete the task. | hard |
| 5 | OCF_ERR_INSTALLED | The tools required by the resource are not installed on this machine. | hard |
| 6 | OCF_ERR_CONFIGURED | The resource’s configuration is invalid. E.g. required parameters are missing. | fatal |
| 7 | OCF_NOT_RUNNING | The resource is safely stopped. The cluster will not attempt to stop a resource that returns this for any action. | N/A |
| 8 | OCF_RUNNING_MASTER | The resource is running in Master mode. | soft |
| 9 | OCF_FAILED_MASTER | The resource is in Master mode but has failed. The resource will be demoted, stopped and then started (and possibly promoted) again. | soft |
| other | NA | Custom error code. | soft |
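The lookup described by Table B.4 can be sketched in a few lines of Python. This is only an illustration of the table, not Pacemaker's actual implementation; the names `OCF`, `RECOVERY_TYPE` and `recovery_type` are made up for this example.

```python
from enum import IntEnum
from typing import Optional


class OCF(IntEnum):
    """OCF return codes as listed in Table B.4."""
    SUCCESS = 0
    ERR_GENERIC = 1
    ERR_ARGS = 2
    ERR_UNIMPLEMENTED = 3
    ERR_PERM = 4
    ERR_INSTALLED = 5
    ERR_CONFIGURED = 6
    NOT_RUNNING = 7
    RUNNING_MASTER = 8
    FAILED_MASTER = 9


# Recovery type (RT column) for each return code in Table B.4.
RECOVERY_TYPE = {
    OCF.SUCCESS: "soft",
    OCF.ERR_GENERIC: "soft",
    OCF.ERR_ARGS: "hard",
    OCF.ERR_UNIMPLEMENTED: "hard",
    OCF.ERR_PERM: "hard",
    OCF.ERR_INSTALLED: "hard",
    OCF.ERR_CONFIGURED: "fatal",
    OCF.NOT_RUNNING: "N/A",
    OCF.RUNNING_MASTER: "soft",
    OCF.FAILED_MASTER: "soft",
}


def recovery_type(rc: int, expected: int = OCF.SUCCESS) -> Optional[str]:
    """Return None when rc matches the expected result (no failure),
    otherwise the recovery type from the table. Codes that are not in
    the table are custom error codes and are treated as soft."""
    if rc == expected:
        return None
    return RECOVERY_TYPE.get(rc, "soft")
```

For instance, `recovery_type(5)` gives "hard", while `recovery_type(0, expected=OCF.NOT_RUNNING)` gives "soft": a return code of 0 counts as a failure when a different result was expected.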



Although counterintuitive, even actions that return 0 (a.k.a. OCF_SUCCESS) can be considered to have failed.

3. Error handling

Each resource operation has an on-fail attribute that controls how errors are handled.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_monitoring_resources_for_failure


Table 5.3. Properties of an Operation

| Field | Description |
| --- | --- |
| id | Your name for the action. Must be unique. |
| name | The action to perform. Common values: monitor, start, stop |
| interval | How frequently (in seconds) to perform the operation. Default value: 0, meaning never. |
| timeout | How long to wait before declaring the action has failed. |
| on-fail | The action to take if this action ever fails. Allowed values: ignore (pretend the resource did not fail); block (don’t perform any further operations on the resource); stop (stop the resource and do not start it elsewhere); restart (stop the resource and start it again, possibly on a different node); fence (STONITH the node on which the resource failed); standby (move all resources away from the node on which the resource failed). The default for the stop operation is fence when STONITH is enabled and block otherwise. All other operations default to stop. |
| enabled | If false, the operation is treated as if it does not exist. Allowed values: true, false |
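To make the fields in Table 5.3 concrete, here is a small Python sketch that builds an `op` element of the kind stored in the cluster configuration. The helper `make_op`, the id `dummy-monitor-10` and the interval and timeout values are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Allowed on-fail values as listed in Table 5.3.
ALLOWED_ON_FAIL = {"ignore", "block", "stop", "restart", "fence", "standby"}


def make_op(op_id, name, interval="0", timeout=None, on_fail=None, enabled=None):
    """Build an <op> element from the fields described in Table 5.3.
    Optional fields are only written when a value is given."""
    if on_fail is not None and on_fail not in ALLOWED_ON_FAIL:
        raise ValueError(f"invalid on-fail value: {on_fail!r}")
    attrs = {"id": op_id, "name": name, "interval": interval}
    if timeout is not None:
        attrs["timeout"] = timeout
    if on_fail is not None:
        attrs["on-fail"] = on_fail
    if enabled is not None:
        attrs["enabled"] = "true" if enabled else "false"
    return ET.Element("op", attrs)


# Hypothetical monitor operation: every 10 seconds, 20 second timeout.
op = make_op("dummy-monitor-10", "monitor", interval="10", timeout="20",
             on_fail="restart")
print(ET.tostring(op, encoding="unicode"))
```

Validating on-fail up front catches a mistyped value before the configuration is loaded into the cluster.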




However, actual testing showed that the on-fail setting made no difference: whatever value was set, the cluster always applied the default behavior.

The following table shows how the resource manager handled each Resource Agent operation that returned OCF_ERR_GENERIC:

| Operation | Error handling | Corresponding on-fail value |
| --- | --- | --- |
| start | (1) Set fail-count=1000000. (2) Call stop on this node. (3) Start the resource on another node. | restart |
| stop | (1) Set fail-count=1000000. (2) Prevent further operations on the resource; the resource enters the unmanaged FAILED state, shown as follows: dummy (ocf::heartbeat:Dummy2): Started srdsdevapp69 (unmanaged) FAILED | block |
| monitor | (1) Set fail-count=1. (2) Call stop, start, and monitor on this node in sequence. If monitor still fails, repeat stop, start, and monitor until fail-count reaches migration-threshold, then keep the resource in the stopped state. | restart |
| promote | (1) Set fail-count=1. (2) Call demote, stop, and start on this node in sequence. (3) Call promote on another node to promote the resource there to master. | restart |
| demote | (1) Set fail-count=1. (2) Call stop, start, and demote on this node in sequence. If demote still fails, repeat stop, start, and demote until fail-count reaches migration-threshold, then keep the resource in the stopped state. | restart |
| notify | ignore | ignore |

Note 1: A timeout is handled in the same way as OCF_ERR_GENERIC.

Note 2: Pacemaker does not call the post-stop notify for resources that have already been stopped.

Note 3: Test environment: Pacemaker 1.1.7-6 on CentOS 6.3.
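The repeated stop, start, monitor cycle observed for a failing monitor can be modelled with a short Python sketch. It is a simplified model of the behavior reported above, not Pacemaker's own logic, and it assumes that fail-count grows by one for each failed monitor. The function name `simulate_monitor_failures` is made up for this example.

```python
def simulate_monitor_failures(migration_threshold, monitor_results):
    """Replay a sequence of monitor results (True = monitor succeeded).

    Each failure increments fail-count and triggers stop and start on the
    same node. Once fail-count reaches migration_threshold, the resource
    is stopped and left stopped.
    """
    fail_count = 0
    log = []
    for ok in monitor_results:
        if ok:
            log.append("monitor ok")
            continue
        fail_count += 1
        log.append(f"monitor failed, fail-count={fail_count}")
        if fail_count >= migration_threshold:
            log.append("stop (threshold reached, resource stays stopped)")
            break
        log.extend(["stop", "start"])
    return fail_count, log


# A monitor that always fails, with a hypothetical migration-threshold of 3.
count, events = simulate_monitor_failures(3, [False] * 10)
for event in events:
    print(event)
```

With a threshold of 3 the loop ends after two restarts. With a very large threshold and a monitor that never succeeds, the same loop would keep restarting the resource for a very long time.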


4. Lessons learned

The error-handling test results above suggest several lessons for Resource Agent writers:

  1. Do not let the stop operation return an error unless absolutely necessary.
  2. Keep the checks made by monitor and start consistent: monitor should never fail immediately after start has succeeded, otherwise the resource may be restarted in a loop.
  3. A demote executed after a successful restart should not fail, otherwise it may also cause a loop.
  4. Set migration-threshold to a relatively small value (the default is INFINITY, which is 1000000); this also limits the impact of points 2 and 3 above.
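Points 1 and 2 can be illustrated with a minimal agent skeleton in Python. It is a sketch under stated assumptions: the class `DummyAgent` and its state-file approach are hypothetical stand-ins for a real service, and a real resource agent also has to handle meta-data, parameters, and the remaining actions.

```python
import os
import tempfile

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


class DummyAgent:
    """Toy agent: the resource counts as running while a state file exists."""

    def __init__(self, state_file):
        self.state_file = state_file

    def monitor(self):
        return OCF_SUCCESS if os.path.exists(self.state_file) else OCF_NOT_RUNNING

    def start(self):
        if self.monitor() == OCF_SUCCESS:
            return OCF_SUCCESS  # already running
        try:
            with open(self.state_file, "w"):
                pass
        except OSError:
            return OCF_ERR_GENERIC
        # Point 2: report success only if monitor agrees the resource is up.
        return OCF_SUCCESS if self.monitor() == OCF_SUCCESS else OCF_ERR_GENERIC

    def stop(self):
        try:
            os.remove(self.state_file)
        except FileNotFoundError:
            pass  # Point 1: stopping an already stopped resource is a success.
        except OSError:
            return OCF_ERR_GENERIC
        return OCF_SUCCESS


# Hypothetical state file in a fresh temporary directory.
state = os.path.join(tempfile.mkdtemp(), "dummy.state")
agent = DummyAgent(state)
print(agent.stop(), agent.start(), agent.monitor(), agent.stop(), agent.monitor())
# prints: 0 0 0 0 7
```

Having start confirm its result through monitor means both operations share one definition of "running", which is what prevents the restart loop described in point 2.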
