With leaf endpoints, a message is lost if an error occurs while it is being transmitted, because the failed message is not retried. Such errors are rare, but they do occur. Some applications can tolerate these losses; if yours cannot accept even rare message failures, use the failover endpoint.
A Failover Group is a list of leaf endpoints grouped together so that an incoming message can be passed from one endpoint to another if a failover occurs. The first endpoint in the failover group is considered the primary endpoint. An incoming message is first directed to the primary endpoint, and all other endpoints in the group serve as back-ups.
If the primary endpoint fails, the failed endpoint is marked as inactive and the next active endpoint is selected as the primary endpoint. Thus, the failover group ensures that a message is delivered as long as there is at least one active endpoint among the listed endpoints. The ESB switches back to the original primary endpoint as soon as it becomes available. This behaviour is known as dynamic failover.
An endpoint failure occurs when an endpoint is unable to invoke a service. An endpoint that responds with an error is not considered a failed endpoint.
XML configuration
Note
You can configure the Failover endpoint using XML. Click the Switch to source view link on the Failover Group page.
<failover> <endpoint .../>+ </failover>
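As a minimal sketch of this syntax, the fragment below groups two address endpoints into one failover group. The endpoint names and URIs (SampleFailoverGroup, Primary_EP, Backup_EP, and the localhost addresses) are hypothetical placeholders, not values from a real deployment.

```xml
<endpoint name="SampleFailoverGroup">
    <failover>
        <!-- First child: treated as the primary endpoint -->
        <endpoint name="Primary_EP">
            <address uri="http://localhost:9001/services/SampleService"/>
        </endpoint>
        <!-- Second child: a back-up, used only if the primary fails -->
        <endpoint name="Backup_EP">
            <address uri="http://localhost:9002/services/SampleService"/>
        </endpoint>
    </failover>
</endpoint>
```

The order of the child endpoints matters, because the first one listed is considered the primary endpoint.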
UI configuration
The following page is opened by clicking Failover Group in the Add Endpoint tab of the Manage Endpoints page.
- Enter a name for the failover group endpoint, and if you want to add any properties, click Add Property and specify the properties.
- To add a child endpoint to the failover endpoint, click Add Child, and then select the required endpoint type from the list.
- Do the following:
- Enter the basic details for the child endpoint, such as the name and address.
- Click Show Advanced Options and specify the advanced options you want for this endpoint.
- To add properties to the child endpoint, click Add Property and specify the properties.
- Click Update.
- Add more child endpoints as needed, and then save the failover group endpoint.
Handling errors in failover endpoints
An endpoint can be in one of the following states: Active, Timeout, or Suspended. You can handle errors and configure the looping between the failover child endpoints via the <timeout>, <suspendOnFailure>, and <markForSuspension> properties based on these states. For more information about these states and properties, see Endpoint Error Handling.
Sample failover with one address endpoint
When message failure is not tolerable even though there is only one service endpoint, you can still configure a failover with a single endpoint, as shown in the configuration below.
<endpoint name="SampleFailover">
    <failover>
        <endpoint name="Sample_First" statistics="enable">
            <address uri="http://localhost/myendpoint" statistics="enable" trace="disable">
                <timeout>
                    <duration>60000</duration>
                </timeout>
                <markForSuspension>
                    <errorCodes>101504, 101505, 101500</errorCodes>
                    <retriesBeforeSuspension>3</retriesBeforeSuspension>
                    <retryDelay>10</retryDelay>
                </markForSuspension>
                <suspendOnFailure>
                    <initialDuration>1000</initialDuration>
                    <progressionFactor>2</progressionFactor>
                    <maximumDuration>64000</maximumDuration>
                </suspendOnFailure>
            </address>
        </endpoint>
    </failover>
</endpoint>
In the above example, the Sample_First endpoint is marked as Timeout if a connection times out, closes, or sends IO errors after retrying for 60000 milliseconds.
When an error with one of the specified codes occurs (i.e., 101504, 101505, or 101500), the failover retries using the first non-suspended endpoint. In this case, it is the same endpoint (Sample_First). It keeps retrying until the retry count (3 in the above example) reaches 0, waiting between attempts for the delay specified by the <retryDelay> property (10 milliseconds in the above example).
For all other errors, the endpoint is marked as Suspended. For more information about these states and properties, see Endpoint Error Handling.
The retry count is per endpoint, not per message, and retries happen in parallel. Because messages reach this endpoint through many threads, the same message may not be retried three times: another message may fail in the meantime and reduce the retry count.
This configuration assumes that these errors are rare and that it is acceptable to retry when they happen once in a while. If they happen frequently and continuously, the endpoint requires immediate attention to return it to its normal state.
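To make the timing in this sample concrete, the fragment below repeats its two error-handling blocks with comments. The comments on <suspendOnFailure> assume the behaviour described in Endpoint Error Handling, where each consecutive suspension lasts the previous duration multiplied by the progression factor, capped at the maximum duration. Treat them as a reading aid rather than a specification.

```xml
<markForSuspension>
    <!-- Errors with these codes trigger retries instead of an immediate suspension -->
    <errorCodes>101504, 101505, 101500</errorCodes>
    <!-- Up to 3 retries, counted per endpoint and shared by all messages -->
    <retriesBeforeSuspension>3</retriesBeforeSuspension>
    <!-- 10 ms pause between retries -->
    <retryDelay>10</retryDelay>
</markForSuspension>
<suspendOnFailure>
    <!-- Assumed progression: 1000 ms, then 2000 ms, 4000 ms, and so on, capped at 64000 ms -->
    <initialDuration>1000</initialDuration>
    <progressionFactor>2</progressionFactor>
    <maximumDuration>64000</maximumDuration>
</suspendOnFailure>
```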
Sample failover with multiple address endpoints
When a message reaches a failover endpoint with multiple address endpoints, the failover endpoint goes through its list of endpoints and picks the first one in the Active or Timeout state (that is, not in the Suspended state). It then sends the message using that endpoint.
If a failure occurs with the first endpoint in the failover group and the error does not put that endpoint into the Suspended state, the retry happens using the same endpoint.
However, if the first endpoint is suspended or if an error occurs while sending the message with the first endpoint, the failover endpoint goes through the endpoint list again from the beginning and tries to send the request using the next endpoint that is in the Active or Timeout state. Nevertheless, when the first endpoint becomes ready to send again, the failover endpoint tries the first endpoint again, even though the second endpoint is still active. For more information about these states and properties, see Endpoint Error Handling.
The following is an example failover endpoint configuration with multiple address endpoints.
<endpoint>
    <failover>
        <endpoint name="fooEP">
            <http uri-template="http://localhost:8080/foo">
                <timeout>
                    <duration>10000</duration>
                    <responseAction>fault</responseAction>
                </timeout>
                <suspendOnFailure>
                    <errorCodes>101503,101504,101505,101507</errorCodes>
                    <initialDuration>100</initialDuration>
                    <progressionFactor>1.0</progressionFactor>
                    <maximumDuration>30000</maximumDuration>
                </suspendOnFailure>
            </http>
        </endpoint>
        <endpoint name="barEP">
            <http uri-template="http://localhost:8080/bar">
                <timeout>
                    <duration>10000</duration>
                    <responseAction>fault</responseAction>
                </timeout>
                <suspendOnFailure>
                    <errorCodes>101503,101504,101505,101507</errorCodes>
                    <initialDuration>100</initialDuration>
                    <progressionFactor>1.0</progressionFactor>
                    <maximumDuration>30000</maximumDuration>
                </suspendOnFailure>
                <retryConfig>
                    <disabledErrorCodes>101507,101504</disabledErrorCodes>
                </retryConfig>
            </http>
        </endpoint>
    </failover>
</endpoint>
The <retryConfig> property configures the last child endpoint to stop retrying and end the loop (i.e., to make the endpoint respond back to the service) after requests have been attempted on all the child endpoints and all the attempts have failed.
<retryConfig> <disabledErrorCodes>101507,101504</disabledErrorCodes> </retryConfig>
Thus, in the above configuration, errors with the codes 101504 and 101507 stop the retrying and put the endpoint into the Timeout state.