Activity Failure and Recovery

This section describes failures, a class of error conditions distinct from faults, and how the process engine catches and handles them.

A service returns a fault in response to a request it cannot process. A process may also raise a fault internally when it encounters a terminal error condition, such as a faulty expression or a false join condition. In addition, processes may raise faults in order to terminate normal processing. In contrast to a fault, a failure is a non-terminal error condition that does not affect the normal flow of the process. Delegating failure handling to the process engine and the administrator keeps the process definition simple and straightforward.

For example, when the process is unable to perform DNS resolution to determine the service endpoint, it generates a failure. An administrator can fix the DNS server and tell the process engine to retry the activity. Had the DNS error been reported as a fault, the process would either terminate or require complex fault handling and recovery logic to proceed past this point of failure.

In short, failures shield the process from common, non-terminal error conditions while retaining simple and straightforward process definitions that do not need to account for these error conditions.

From Failure to Recovery

Currently, the Invoke activity is the only activity that supports failure handling and recovery. The mechanism would be identical for any other activity that supports failure handling and recovery in the future.

In the case of the Invoke activity, the integration layer triggers a failure condition in lieu of a response or fault message. The Invoke activity then consults its failure handling policy to decide how to respond.

1. Set "faultOnFailure" to "true" if you want the activity to throw a fault on failure. In that case, all other failure handling settings are ignored and the activity throws the "activityFailure" fault. Because "activityFailure" is a standard fault, the "exitOnStandardFault" attribute controls whether the process exits immediately or throws a fault in the enclosing scope.

2. Set "retryFor" to a positive integer to enable the activity to attempt self-recovery and retry up to that number of times.

3. Set "retryDelay" to the delay, in seconds, between retries. For example, if retryFor=2 and retryDelay=30, the activity retries after 30 and 60 seconds, for a total of three attempts, before entering activity recovery mode.

If the activity retries and succeeds, it completes successfully as if no failure had occurred. A retry may also end in a fault, for example when the invoked service returns a fault. Once the activity has exhausted all retry attempts, it enters activity recovery mode. By default, "retryFor" is zero, so the activity enters recovery mode after the first failure.
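
As a sketch of the retry settings above, the fragment below uses the "failureHandling" extensibility element described later in this document, with the same values as the example (retryFor=2, retryDelay=30). The timings in the comments are approximate and ignore the time each attempt itself takes.

<ext:failureHandling xmlns:ext="http://ode.apache.org/activityRecovery">
    <!-- The first attempt runs immediately. If it fails, up to two retries follow. -->
    <ext:retryFor>2</ext:retryFor>
    <!-- Retries are attempted roughly 30 and 60 seconds after the first failure. -->
    <ext:retryDelay>30</ext:retryDelay>
    <!-- If all three attempts fail, the activity enters recovery mode. -->
</ext:failureHandling>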

When an activity is in recovery mode, you can recover it using one of the following three actions.

Retry: Retries the activity manually. This can be repeated any number of times until the activity completes or faults.
Fault: Causes the activity to throw the "activityFailure" fault.
Cancel: Cancels the activity. The activity completes unsuccessfully: variables are left unchanged, the status of all its source links is set to false, and no compensation handler is installed.

Activity recovery is performed individually for each activity instance, and does not affect other activities executing in the same process. While the activity is in the FAILURE state, the process instance remains in the ACTIVE state and may execute other activities from parallel flows and event handlers.

Specifying Failure Behaviour

Use the "failureHandling" extensibility element defined in the namespace http://ode.apache.org/activityRecovery. The structure of the "failureHandling" element is:

<ext:failureHandling xmlns:ext="http://ode.apache.org/activityRecovery">
    <ext:faultOnFailure> _boolean_ </ext:faultOnFailure>
    <ext:retryFor> _integer_ </ext:retryFor>
    <ext:retryDelay> _integer_ </ext:retryDelay>
</ext:failureHandling>

The "faultOnFailure", "retryFor" and "retryDelay" elements are optional. The default value is false for "faultOnFailure" and zero for "retryFor" and "retryDelay". An activity that does not specify failure handling using this extensibility element inherits the failure handling policy of its parent activity, recursively up to the top-level activity of the process. You can use inheritance to specify the failure handling policy for a set of activities, or for all activities in the process, with a single "failureHandling" extensibility element.

Note that inheritance applies to the "failureHandling" element as a whole, not to its individual value elements. For example, if activity S specifies retryFor=2 and retryDelay=60, and its child activity R specifies only retryFor=3, the "retryDelay" value for R is 0, not 60. Any value element omitted from a "failureHandling" element takes its default value.
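
The following sketch illustrates this pitfall. The sequence/invoke nesting and the names S and R are assumptions chosen for illustration, and the invoke attributes are borrowed from the example in this document. To keep the 60-second delay on R, the child has to repeat "retryDelay" explicitly.

<bpel:sequence name="S">
    <ext:failureHandling xmlns:ext="http://ode.apache.org/activityRecovery">
        <ext:retryFor>2</ext:retryFor>
        <ext:retryDelay>60</ext:retryDelay>
    </ext:failureHandling>

    <bpel:invoke name="R" inputVariable="myRequest" operation="foo" outputVariable="aResponse" partnerLink="myPartner" portType="spt:SomePortType">
        <!-- R declares its own policy, so nothing is inherited from S. -->
        <ext:failureHandling xmlns:ext="http://ode.apache.org/activityRecovery">
            <ext:retryFor>3</ext:retryFor>
            <!-- Without this element, retryDelay for R would default to 0. -->
            <ext:retryDelay>60</ext:retryDelay>
        </ext:failureHandling>
    </bpel:invoke>
</bpel:sequence>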

Example

A simple invoke with the ext:failureHandling extension is shown below.

<bpel:invoke inputVariable="myRequest" operation="foo" outputVariable="aResponse" partnerLink="myPartner" portType="spt:SomePortType">
    <ext:failureHandling xmlns:ext="http://ode.apache.org/activityRecovery">
       <ext:faultOnFailure>false</ext:faultOnFailure>
       <ext:retryFor>2</ext:retryFor>
       <ext:retryDelay>60</ext:retryDelay>
    </ext:failureHandling>
</bpel:invoke>

The following sequence activity converts failures into faults:

<bpel:sequence>
   <ext:failureHandling xmlns:ext="http://ode.apache.org/activityRecovery">
      <ext:faultOnFailure>true</ext:faultOnFailure>
   </ext:failureHandling>

   ...

   <bpel:invoke inputVariable="myRequest" operation="foo" outputVariable="aResponse" partnerLink="myPartner" portType="spt:SomePortType">
      <bpel:catchAll>
            ...
      </bpel:catchAll>
   </bpel:invoke>
</bpel:sequence>
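
As a variation, the sketch below catches the "activityFailure" fault by name instead of using catchAll. It rests on two assumptions that should be checked against the engine version in use: that the fault's qualified name lives in the same http://ode.apache.org/activityRecovery namespace as the extension element, and that an enclosing scope sets exitOnStandardFault="no" so the fault reaches the handler rather than exiting the process.

<bpel:scope exitOnStandardFault="no" xmlns:ext="http://ode.apache.org/activityRecovery">
   <bpel:faultHandlers>
      <!-- Assumption: activityFailure is qualified by the ext namespace. -->
      <bpel:catch faultName="ext:activityFailure">
            ...
      </bpel:catch>
   </bpel:faultHandlers>

   <bpel:invoke inputVariable="myRequest" operation="foo" outputVariable="aResponse" partnerLink="myPartner" portType="spt:SomePortType">
      <ext:failureHandling>
         <!-- Convert any failure of this invoke into the activityFailure fault. -->
         <ext:faultOnFailure>true</ext:faultOnFailure>
      </ext:failureHandling>
   </bpel:invoke>
</bpel:scope>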

Process Instance Management

Process instance management provides the following information:

1. Process instance summary includes:

A failure element with a count of the total number of process instances that have one or more activities in recovery mode.
The date/time of the last activity to enter recovery mode.
The element exists if at least one activity is in recovery mode.

2. Process instance information includes:

A failure element with a count of the number of activities in recovery mode.
The date/time of the last activity to enter recovery mode.
The element exists if at least one activity is in recovery mode.

3. Activity instance information includes:

A failure element that specifies the date/time of the failure, the reason for the failure, number of retries, and list of available recovery actions.
The element exists if the activity is in the state FAILURE.

Use the "recoverActivity" operation to perform a recovery action on an activity in recovery mode. The operation requires the process instance ID, the activity instance ID, and the recovery action to perform (one of retry, fault or cancel). The execution log can also be used to determine when a failure or recovery occurred for a given activity instance.
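
A recovery request might look like the sketch below when sent to the instance management service over SOAP. The pmapi namespace, the element names iid, aid and action, and the IDs 1234 and 56 are all assumptions for illustration; the deployed management WSDL is the authority on the actual message shape.

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                  xmlns:pmapi="http://www.apache.org/ode/pmapi">
    <soapenv:Body>
        <pmapi:recoverActivity>
            <!-- Hypothetical process instance ID -->
            <iid>1234</iid>
            <!-- Hypothetical activity instance ID -->
            <aid>56</aid>
            <!-- One of: retry, fault, cancel -->
            <action>retry</action>
        </pmapi:recoverActivity>
    </soapenv:Body>
</soapenv:Envelope>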