Endpoint Error Handling

Note

This article applies to ESB 3.0. Remember to adjust the configurations to match your version.

The last step of message processing inside the WSO2 Enterprise Service Bus is to send the message out to a service provider (see also Message Mediation). As far as the WSO2 Enterprise Service Bus is concerned, it sends the message to a listening service endpoint. The message sent from the WSO2 Enterprise Service Bus to the service can be very different from the incoming message.

Since endpoints send the message out, they can encounter various transport errors. For example, a connection may time out, or a connection may be closed by the actual service.

Endpoint error handling is therefore a key part of any successful Enterprise Service Bus deployment. Messages can fail or be lost for various reasons in a real TCP network. When an error occurs and the WSO2 Enterprise Service Bus is not configured to accept the error, it marks the endpoint as failed, which leads to a message failure. By default, the endpoint stays marked as failed for quite a long time, and subsequent messages may be lost as a result.

Such errors are best discovered by running tests. It is therefore recommended to run a few long-running load tests and fine-tune the endpoint configurations for errors that occur intermittently.

A WSO2 Enterprise Service Bus endpoint has configurations that specify its behavior under error conditions that might occur between the WSO2 Enterprise Service Bus and the actual service endpoint.

Endpoint States

At any given time, the state of the endpoint can be "Active," "Timeout," "Suspended" or "OFF." Endpoint state transitions normally happen on a per-message basis. An endpoint can be put into the OFF state only through JMX, so that state does not appear in the State Transition Diagram.

Endpoint states in detail:

State	Description
Active	Endpoint is up and running.
Timeout	Endpoint encountered an error and is a candidate for suspension. If it continues to encounter errors, it will be suspended. It can still send messages.
Suspended	Endpoint encountered errors and was moved to a state where it cannot send messages. Messages coming to it will result in a fault.
OFF	Endpoint is not active.

Active

When the WSO2 Enterprise Service Bus boots up, endpoints are in the "Active" state and ready to send messages. Unless the user puts the endpoint into the OFF state, it stays in the "Active" state until an error occurs.

When an error occurs, the endpoint can be configured to stay in the "Active" state or to go to the "Timeout" or "Suspended" state. Every error has an error code. The endpoint configuration allows you to define which errors put the endpoint into the "Timeout" and "Suspended" states. If a particular error is not defined for the "Timeout" or "Suspended" state, the error is ignored.

So errors are handled in three ways:

Put the endpoint into the "Suspended" state.
Put the endpoint into the "Timeout" state.
Ignore and stay in the "Active" state.

If no error codes are specified for the "Timeout" state, the connection timed out (101504) and connection closed (101505) errors are treated as "Timeout" errors. All the other errors will put the endpoint into the "Suspended" state.

When an error occurs, the endpoint first checks whether it is an error that puts the endpoint into the "Timeout" state. Then it checks whether it is an error that puts the endpoint into the "Suspended" state.
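The checking order described above can be sketched as a small Python function. This is only an illustration of the rules in this section, not ESB code: the function name `next_state` and its parameters are hypothetical, and passing `None` stands for "no error codes configured".

```python
from typing import Optional, Set

# Default "Timeout" codes, taken from the settings tables in this article.
DEFAULT_TIMEOUT_CODES = {101504, 101505}


def next_state(error_code: int,
               timeout_codes: Optional[Set[int]] = None,
               suspend_codes: Optional[Set[int]] = None) -> str:
    """Return the state an Active endpoint moves to after an error (hypothetical helper)."""
    if timeout_codes is None:
        timeout_codes = DEFAULT_TIMEOUT_CODES
    # "Timeout" codes are checked first.
    if error_code in timeout_codes:
        return "Timeout"
    # With no explicit list, every remaining error suspends the endpoint.
    if suspend_codes is None or error_code in suspend_codes:
        return "Suspended"
    # Listed in neither group: the error is ignored.
    return "Active"
```

For example, with "Timeout" codes 101504 and 101505 and "Suspended" codes 101500 and 101501, error 101503 is in neither list, so the sketch leaves the endpoint in the "Active" state.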

Timeout

In this state the endpoint can still forward messages, up to a maximum number of continuous failures. If it keeps failing and the maximum number is exceeded, the endpoint is marked as "Suspended." If one message succeeds, the endpoint is marked as "Active."

For example, let's assume the number of tries is set to 3. When an error occurs and the endpoint is put into this state, we have three tries. If the next three messages sent using this endpoint all encounter an error, the endpoint is put into the "Suspended" state. If one of the messages succeeds before the endpoint is put into the "Suspended" state, the endpoint is marked as "Active."
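The example above can be modelled with a small counter. The class below is a toy sketch, not ESB code: the name `Endpoint` and its methods are hypothetical, and the exact off-by-one behavior (how many failures are allowed after entering "Timeout") is an assumption that follows the three-tries example.

```python
class Endpoint:
    """Toy model of the Active -> Timeout -> Suspended transitions (not ESB code)."""

    def __init__(self, retries_before_suspension: int):
        self.retries_before_suspension = retries_before_suspension
        self.remaining_retries = retries_before_suspension
        self.state = "Active"

    def on_timeout_error(self) -> None:
        """Record a failure whose error code is configured for the Timeout state."""
        if self.state == "Active":
            # First failure: the endpoint becomes a candidate for suspension.
            self.state = "Timeout"
            self.remaining_retries = self.retries_before_suspension
        elif self.state == "Timeout":
            # The counter belongs to the endpoint, so any message can reduce it.
            self.remaining_retries -= 1
            if self.remaining_retries <= 0:
                self.state = "Suspended"

    def on_success(self) -> None:
        """A success while in Timeout returns the endpoint to Active."""
        if self.state == "Timeout":
            self.state = "Active"
            self.remaining_retries = self.retries_before_suspension
```

With `Endpoint(3)`, one failure moves the endpoint to "Timeout", and three further failures move it to "Suspended". A single call to `on_success()` in between resets it to "Active".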

Suspended

A "Suspended" endpoint cannot be used for sending messages. After an endpoint is put into this state, it can be tried again after a configurable time. After this time period expires, the WSO2 Enterprise Service Bus will try to forward messages through this endpoint. If the message succeeds, the WSO2 Enterprise Service Bus marks the endpoint as "Active." If the message fails, the endpoint is put into the "Suspended" or "Timeout" state depending on the error.

The next period is calculated using the following formula:

Next suspension time period = Min (Initial suspension duration * (progression factor ^ try count), Maximum duration)

All the variables in the above formula are configuration values except the try count. The try count is the number of tries that have occurred since the endpoint was "Suspended." As the try count increases, the next suspension time period also increases. This increase is bounded by the maximum duration.
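The growth of the suspension period can be illustrated with a short calculation. This is a sketch under stated assumptions: the function name `next_suspension_ms` is hypothetical, the try count is assumed to start at 0, the maximum duration is treated as an upper bound as described above, and the values (1000 ms initial duration, progression factor 2, 64000 ms maximum) are illustrative.

```python
def next_suspension_ms(initial_ms: int, progression_factor: float,
                       try_count: int, maximum_ms: int) -> int:
    """Suspension period for a given try count, capped at maximum_ms."""
    grown = initial_ms * (progression_factor ** try_count)
    return int(min(grown, maximum_ms))


# Suspension periods for the first eight tries.
durations = [next_suspension_ms(1000, 2, n, 64000) for n in range(8)]
print(durations)
# prints [1000, 2000, 4000, 8000, 16000, 32000, 64000, 64000]
```

The period doubles on each try until it reaches the maximum duration, after which it stays constant.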

Leaf Endpoint Configurations

This is the configuration for the address endpoint. Since only the error configurations are of interest here, the same applies to the WSDL endpoint as well. The error handling configurations are as follows:

<address uri="endpoint address" [format="soap11|soap12|pox|get"]
    [optimize="mtom|swa"] [encoding="charset encoding"]
    [statistics="enable|disable"] [trace="enable|disable"]>
        <enableRM [policy="key"]/>?
        <enableSec [policy="key"]/>?
        <enableAddressing [version="final|submission"] [separateListener="true|false"]/>?

        <timeout>
                <duration>timeout duration in milliseconds</duration>
                <responseAction>discard|fault</responseAction>
        </timeout>?

        <markForSuspension>
                [<errorCodes>xxx,yyy</errorCodes>]
                <retriesBeforeSuspension>m</retriesBeforeSuspension>
                <retryDelay>d</retryDelay>
        </markForSuspension>

        <suspendOnFailure>
                [<errorCodes>xxx,yyy</errorCodes>]
                <initialDuration>n</initialDuration>
                <progressionFactor>r</progressionFactor>
                <maximumDuration>l</maximumDuration>
        </suspendOnFailure>
</address>

"Timeout" Settings

Name	Values	Default	Description
duration	Milliseconds	60000	Connection timeout interval. If the remote endpoint does not respond within this time, it is treated as a "Timeout."
responseAction	discard, fault, none	none	When a response arrives for a timed out request, whether to discard it or invoke the fault handler.

"MarkForSuspension" Settings

Name	Values	Default	Description
errorCodes	Comma separated list of error codes	101504, 101505	Errors that send the endpoint into the "Timeout" state.
retriesBeforeSuspension	Integer	0	In the "Timeout" state, this number of requests minus one can be tried and can fail before the endpoint is marked as "Suspended." This is a per-endpoint setting, not a per-message setting, so several messages can be tried in parallel, and each failure reduces the remaining retries.
retryDelay

"SuspendOnFailure" Settings

Name	Values	Default	Description
errorCodes	Comma separated list of error codes	All the errors except the errors specified in `markForSuspension`	Errors that send the endpoint into the "Suspended" state.
initialDuration	milliseconds	60 x 60 x 1000	After an endpoint is "Suspended," it waits for this amount of time before trying to send the messages coming to it. All the messages coming during this time period result in fault sequence activation.
progressionFactor	Integer	1	The endpoint will try to send the messages after the `initialDuration`. If it still fails, the next duration is calculated as: next duration = Min(initialDuration x progressionFactor ^ retry count, maximumDuration).
maximumDuration	milliseconds	Long.MAX_VALUE	Upper bound of retry duration.

Sample Configuration:

<endpoint name="Sample_First" statistics="enable" >
    <address uri="http://localhost/myendpoint" statistics="enable" trace="disable">
        <timeout>
            <duration>60000</duration>
        </timeout>

        <markForSuspension>
            <errorCodes>101504, 101505</errorCodes>
            <retriesBeforeSuspension>3</retriesBeforeSuspension>
            <retryDelay>1</retryDelay>
        </markForSuspension>

        <suspendOnFailure>
            <errorCodes>101500, 101501, 101506, 101507, 101508</errorCodes>
            <initialDuration>1000</initialDuration>
            <progressionFactor>2</progressionFactor>
            <maximumDuration>64000</maximumDuration>
        </suspendOnFailure>

    </address>
</endpoint>

Here the endpoint moves to the "Timeout" state for errors 101504 and 101505. After that, 3 requests can fail with one of these errors before the endpoint moves to the "Suspended" state.

The endpoint is suspended for errors 101500, 101501, 101506, 101507 and 101508. Error 101503 is ignored: if it occurs, the endpoint stays in the "Active" state.

For more information about error codes, refer to the Error Codes table below.

Failover Endpoint

With leaf endpoints, if an error occurs while a message is being transmitted, that message is lost. The failed message is not retried. These errors occur very rarely, but message failures can still happen. For some applications these rare message losses are acceptable. Where even rare message failures are not acceptable, the failover endpoint is the ideal solution.

Here is the configuration for failover endpoints. At the configuration level, a failover is a logical grouping of one or more leaf endpoints.

<failover>
       <endpoint .../>+
</failover>

When a message comes to a failover endpoint, it goes through its list of endpoints to pick the first one in the "Active" or "Timeout" state. Then it sends the message using that particular endpoint. If an error occurs while sending the message, the failover goes through the endpoint list again from the beginning and tries to send the message using the first endpoint that is in the "Active" or "Timeout" state.

Some errors put the endpoint into the "Timeout" state and some keep the endpoint in the "Active" state. In these cases, the retry can happen using the same endpoint. If the failure occurs with the first endpoint within the failover group and the error does not put the endpoint into the "Suspended" state, the retry happens using the same endpoint.

Failover gives priority to the first endpoint that is not in the "Suspended" state. So it sends the message through the first endpoint in the failover group as long as that endpoint is not "Suspended." When the first endpoint is "Suspended," it sends the requests using the second endpoint. When the first endpoint becomes ready to send again, the failover tries it again, even though the second endpoint is still active.

If there is only one service endpoint and message failure is not tolerable, a failover can be configured with a single endpoint.
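The selection rule described above can be sketched as follows. This is an illustration only, not ESB code: the function name `pick_endpoint` and the endpoint names `Primary` and `Secondary` are hypothetical.

```python
from typing import List, Optional, Tuple


def pick_endpoint(endpoints: List[Tuple[str, str]]) -> Optional[str]:
    """Pick the first endpoint that is not suspended.

    endpoints holds (name, state) pairs in configuration order.
    Returns None when no endpoint can send.
    """
    for name, state in endpoints:
        if state in ("Active", "Timeout"):
            return name
    return None


# The first endpoint is preferred whenever it can send.
print(pick_endpoint([("Primary", "Active"), ("Secondary", "Active")]))
# While the first endpoint is suspended, the second one is used.
print(pick_endpoint([("Primary", "Suspended"), ("Secondary", "Active")]))
```

Because the list is always scanned from the beginning, the first endpoint takes over again as soon as it leaves the "Suspended" state.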

A sample failover with one address endpoint:

<endpoint name="SampleFailover">
    <failover>
        <endpoint name="Sample_First" statistics="enable" >
            <address uri="http://localhost/myendpoint" statistics="enable" trace="disable">
                <timeout>
                    <duration>60000</duration>
                </timeout>

                <markForSuspension>
                    <errorCodes>101504, 101505, 101500</errorCodes>
                    <retriesBeforeSuspension>3</retriesBeforeSuspension>
                    <retryDelay>1</retryDelay>
                </markForSuspension>

                <suspendOnFailure>
                    <initialDuration>1000</initialDuration>
                    <progressionFactor>2</progressionFactor>
                    <maximumDuration>64000</maximumDuration>
                </suspendOnFailure>

            </address>
        </endpoint>
    </failover>
</endpoint>

Here the Sample_First endpoint is marked as "Timeout" if a connection times out (101504), a connection is closed (101505) or a sender IO error occurs while sending (101500). For all the other errors, it is marked as "Suspended." When one of these errors occurs, the failover retries using the first endpoint that is not "Suspended." In this case, it is the same endpoint (Sample_First). It retries until the retry count becomes 0. The retries happen in parallel. Since messages come to this endpoint on many threads, the same message may not be retried 3 times. Another message may fail and reduce the retry count.

Note

The retry count is not a per-message setting; it is a per-endpoint setting.

In this configuration, we assume that these errors are rare and that, if they happen once in a while, it is acceptable to retry. If they happen frequently and continuously, the endpoint requires immediate attention to bring it back to a normal state.

Error Codes

Error code	Description
101000	Receiver IO error sending
101001	Receiver IO error receiving
101500	Sender IO error sending
101501	Sender IO error receiving
101503	Connection failed
101504	Connection timed out
101505	Connection closed
101506	HTTP protocol violation
101507	Connect cancel
101508	Connect timeout
101509	Send abort