
Clustering Landscape for WSO2 Stack

This page contains general information on clustering WSO2 products. It describes clustering from an architectural perspective and introduces "End Goal," "Choice of the Front End," and "Shared State across nodes" as the three deciders in choosing the right clustering strategy for a particular use case. For details on how to create a cluster of Governance Registry instances, see Creating a Governance Registry Cluster.

Clustering places a group of nodes together and tries to make them behave as a single cohesive unit, often providing the illusion that they are a single computer. The picture below shows an outline of a cluster: it consists of a collection of typically identical nodes and a front end, called the "Front End," which distributes requests among the nodes.

Moreover, nodes either share state (for example, shared database, shared file system) or update each other about state changes.

Clustering Landscape

Clustering differs based on the type of server or application used as the unit of clustering and the use case the cluster is supposed to serve. Consequently, the difficulty and the associated trade-offs also vary. Overall, the following three parameters capture the different clustering requirements.

  • Goal of Clustering - Load balancing, high availability or both.
  • Choice of the Front End - Synapse, Apache, Hardware, DNS.
  • State shared across different nodes in the cluster.

End Goals of Clustering

Users cluster their services to scale up their applications, to achieve high availability, or to achieve both. By scaling up, the application supports a larger number of user requests, and through high availability the application stays available even if a few servers are down. To support load balancing, the front end distributes requests among the nodes in the cluster. There is a wide variety of algorithms based on the policy for distributing the load: random or round-robin distribution of messages are among the simple approaches, while more sophisticated algorithms take runtime properties of the system, such as machine load or the number of pending requests, into consideration.

The distribution can also be controlled by application-specific requirements like sticky sessions. However, it is worth noting that with a reasonably diverse set of users, simple approaches tend to perform on par with complex ones, and therefore they should be considered first.
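
As an illustration of the simple distribution policies mentioned above, the following is a minimal sketch in Java; the list of node addresses and the way requests are handed off are hypothetical and are shown only to make the idea concrete.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of two simple load-distribution policies.
// Nodes are represented here as plain host:port strings.
public class SimpleDistributor {
    private final List<String> nodes;
    private final AtomicInteger next = new AtomicInteger();

    public SimpleDistributor(List<String> nodes) {
        this.nodes = nodes;
    }

    // Round-robin: hand requests to nodes in turn.
    public String pickRoundRobin() {
        return nodes.get(Math.floorMod(next.getAndIncrement(), nodes.size()));
    }

    // Random: pick any node with equal probability.
    public String pickRandom() {
        return nodes.get(ThreadLocalRandom.current().nextInt(nodes.size()));
    }
}
```

A more sophisticated policy would replace these methods with one that consults runtime metrics, such as pending request counts, at the cost of extra bookkeeping.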

A clustering setup for high availability includes multiple copies of the server, and user requests are redirected to an alternative server if the primary server has failed. This setup is also called passive replication (a.k.a. a hot-cold setup), where the primary serves incoming requests and the system falls back to another server only if the primary has failed.

The main difficulty in this scenario is that failure detection in a distributed environment is a very hard problem, and it is often impossible to differentiate between network failures and server failures. However, most deployments run on highly available networks, so network failures are typically ignored and server failures are detected through failed requests. This yields fairly reliable results, yet it is worth noting that it is only an approximation.

When both high availability and scalability are desired, both of the above techniques are applied simultaneously. For instance, a typical deployment includes a front-end server that distributes messages across a group of servers, and if a server fails, the Front End removes it from the list of active servers among which it distributes the load. Such a setup degrades gracefully in the face of failures: after one among N servers fails, the system only degrades by 1/N.
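
The following sketch shows the shape of that behaviour: the caller marks a node as failed when a request to it fails, and the node drops out of rotation. This is an illustrative sketch only, not the implementation of any particular WSO2 front end.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of graceful degradation: when a request to a node fails, the caller
// marks that node as failed and it is removed from the active list, so a
// cluster of N nodes only loses roughly 1/N of its capacity.
public class FailoverDistributor {
    private final List<String> active = new CopyOnWriteArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    public FailoverDistributor(List<String> nodes) {
        active.addAll(nodes);
    }

    // Pick the next node in round-robin order from the currently active set.
    public String pick() {
        List<String> snapshot = List.copyOf(active);
        if (snapshot.isEmpty()) {
            throw new IllegalStateException("no active nodes left");
        }
        return snapshot.get(Math.floorMod(next.getAndIncrement(), snapshot.size()));
    }

    // Called when a request to the node fails; the node is taken out of rotation.
    public void markFailed(String node) {
        active.remove(node);
    }
}
```

A real front end would typically also probe failed nodes periodically and put them back into rotation once they recover.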

Front End

The term "Front End" is used to denote the mechanism that distributes requests across the nodes in the cluster; it is often called the Load Balancer as well, and we use the two names interchangeably. Load Balancers come in wide varieties, among them:

  • Hardware Load Balancers
  • DNS Load Balancers
  • Transport-level (HTTP) Load Balancers - for example, Apache or Tomcat.
  • Application-level Load Balancers - for example, WSO2 ESB or Apache Synapse.

Higher-level Load Balancers, such as application-level Load Balancers, operate with more information about the messages they route and hence provide more flexibility, but they also incur more overhead. The choice of Load Balancer is therefore a trade-off between performance and flexibility. The following are a few of the options for Load Balancers.

  • Enterprise Service Bus - Provides application-level load balancing. For instance, it can make routing decisions based on the content of a SOAP message. This provides great control, but lower performance.
  • Apache mod_proxy - The same as above, but operates at the HTTP level rather than at the SOAP level.
  • DNS Load Balancing - Each DNS lookup returns a different server, so clients end up talking to different servers within the group, thus distributing the load (see the sketch after this list). With this approach, decisions based on message content are not possible.
  • Hardware Load Balancers - The logic is implemented in hardware, and they can be very fast. However, the logic cannot be changed later, so they are inflexible and usually provide only low-level load balancing.
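
To make the DNS option concrete, the sketch below shows how a Java client naturally spreads load when a hostname resolves to several addresses; the hostname cluster.example.com is only an example, not a real endpoint.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of DNS-based load balancing from the client's point of view:
// the name resolves to several A records and the client picks one of them.
public class DnsPick {
    public static void main(String[] args) throws UnknownHostException {
        // Example hostname; in a real deployment this is the service's DNS name.
        InetAddress[] addresses = InetAddress.getAllByName("cluster.example.com");
        InetAddress chosen = addresses[ThreadLocalRandom.current().nextInt(addresses.length)];
        System.out.println("Talking to " + chosen.getHostAddress());
    }
}
```

In practice many DNS servers also rotate the order of the returned records, so even clients that always take the first address end up distributed across the group.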

A common concern with the Front End (Load Balancer) is its reliability; in other words, what happens if the Front End itself fails? Typically, the Front End needs to be protected through an external mechanism, for example a passive Load Balancer managed with Linux High Availability, or a system management framework that monitors the Front End and recovers it if it fails. Moreover, a DNS Load Balancer can handle the problem if the DNS servers are set up to return only active instances.

It is also possible to use two levels of Load Balancers, where one provides simple transport-level load balancing while the next level provides application-level load balancing. With this method, a low-level Load Balancer such as DNS distributes requests across multiple load-balancing nodes, which then make decisions based on message-level information. As an added advantage, this approach avoids a single point of failure for the Load Balancer.

Shared State

The third parameter, the state shared across different cluster nodes, is one of the hardest aspects of clustering, and it usually defines the limits of a clustered system's scalability. As a rule of thumb, the less state is shared and the less consistency is required across nodes, the more scalable the cluster.

The following picture shows four use cases based on how state is shared across cluster nodes.

In the simplest case, the cluster nodes do not share any state with each other, and clustering involves placing a load-balancing node in front of the nodes and redirecting requests to them as necessary. This can be achieved with:

  • WSO2 ESB
  • Apache mod_proxy Load Balancer
  • DNS level
  • Hardware load balancing

In the second scenario, all persistent application state is stored in a database and the services themselves are stateless. For instance, most three-tier architectures fall under this method. These systems are clustered by connecting all the nodes of the cluster to the same database, and the system can scale up until the database is overwhelmed. With enterprise-class databases, this setup is known to scale to thousands of nodes, and it is one of the most common deployment setups. A common optimization is caching read data at the service level for performance, but unless nodes flush their caches immediately following a write, reads may return old data from the cache. Furthermore, all writes to the database must be transactional; otherwise, concurrent writes might leave the database in an inconsistent state.
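
The caching concern can be sketched as follows. The loadFromDatabase and writeToDatabase calls are hypothetical placeholders for whatever data-access layer the services actually use; the point is only that a write keeps the local cache consistent, while other nodes keep serving possibly stale values until their own caches expire or are flushed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a per-node read cache in front of a shared database.
// Other cluster nodes have their own caches, so a write made here is not
// visible to them until their cached entries are flushed or expire.
public class CachedStore {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public String read(String key) {
        // Serve from the local cache, loading from the shared database on a miss.
        return cache.computeIfAbsent(key, this::loadFromDatabase);
    }

    public void write(String key, String value) {
        writeToDatabase(key, value);   // must be transactional in a real system
        cache.put(key, value);         // keep the local cache consistent with the write
    }

    // Hypothetical data-access calls standing in for a real persistence layer.
    private String loadFromDatabase(String key) { return "value-of-" + key; }
    private void writeToDatabase(String key, String value) { /* JDBC/ORM call here */ }
}
```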

In the third scenario, requests are bound to a session, in which case requests within the same session share some state with each other. One of the simplest ways to support this scenario is sticky sessions, where the Load Balancer always redirects messages belonging to the same session to a single node, so that subsequent requests can share state with previous requests (a sketch follows after the next paragraph). Sessions can be implemented at two levels:

  • The HTTP level
  • The SOAP level

One downside of this approach is that if a node fails, the sessions associated with that node are lost and need to be restarted. In practice, it is common to couple the database-based setup described in scenario 2 with sticky sessions, where session data is kept in memory but persistent data is saved to a database.
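
A minimal sketch of sticky routing, assuming the session identifier can be extracted from the request (for example an HTTP cookie), might look like this; a real Load Balancer would typically use consistent hashing or an explicit session-to-node table instead.

```java
import java.util.List;

// Sketch of sticky-session routing: requests carrying the same session id are
// always sent to the same node, so that node can keep the session state in
// memory. If the node fails, the sessions pinned to it are lost.
public class StickyRouter {
    private final List<String> nodes;

    public StickyRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    public String nodeFor(String sessionId) {
        // Hash the session id onto one of the nodes; the same id always maps
        // to the same node as long as the node list does not change.
        return nodes.get(Math.floorMod(sessionId.hashCode(), nodes.size()));
    }
}
```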

Finally, if cluster nodes share state that is not stored in a shared persistent medium like a database, all changes made at each node have to be disseminated to all other nodes in the cluster. This is often implemented using some kind of group communication method.

Group communication keeps track of the members of user-defined groups of nodes and updates the group membership when nodes join or leave. When a message is sent using group communication, it is guaranteed that all nodes in the current group will receive it. In this scenario, the clustering implementation uses group communication to disseminate all changes to all other nodes in the cluster. For example, the Tribes group communication framework is used by both Tomcat and Axis2 clustering.
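
The following sketch shows the general shape of state replication over group communication. It is not the Tribes or Axis2 API; the StateChannel interface here is a hypothetical abstraction whose only job is to deliver every multicast message to every current member of the group.

```java
import java.io.Serializable;
import java.util.function.Consumer;

// Hypothetical group-communication abstraction (not the Tribes API): it tracks
// membership and guarantees that a multicast message reaches every current member.
interface StateChannel {
    void multicast(Serializable message);               // send to all current members
    void setReceiver(Consumer<Serializable> handler);   // callback for delivered messages
}

// Sketch of replicating one piece of state across the cluster: every local
// change is multicast, and updates received from other nodes are applied in
// the order in which the channel delivers them.
class ReplicatedValue {
    private final StateChannel channel;
    private volatile String value = "";

    ReplicatedValue(StateChannel channel) {
        this.channel = channel;
        channel.setReceiver(msg -> value = (String) msg);  // apply a remote update
    }

    String get() {
        return value;
    }

    void set(String newValue) {
        value = newValue;            // apply locally
        channel.multicast(newValue); // disseminate to all other members
    }
}
```

In Axis2, the analogous channel is provided by Tribes, and what is replicated is the runtime context hierarchy rather than a single value, as described later in this article.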

This approach has limited scalability: group communication typically gets overloaded at around 8-15 nodes.

However, that is well within most users' needs and hence sufficient. Again, there are two choices depending on whether updates are synchronous or asynchronous: in the former case, the request does not return until all replicas of the system are updated, whereas in the latter case the replicas are updated in the background, so there may be a delay before values are reflected on the other nodes. Coupled with sticky sessions, this can be acceptable if strict consistency of the system is not a concern. For example, if all write operations only perform appends and do not edit previous values, lazy propagation of changes is sufficient, and sticky sessions ensure that the user who made a change continues to see it. Alternatively, if the read-write ratio is sufficiently high, all writes can be done on one node while the others serve reads.

Another interesting dimension is auto scaling, which is the ability to grow and shrink the size of the cluster based on the load received from clients. This is particularly interesting with the advent of cloud computing, which enables the "only pay for what you use" idea. Conceptually, this is implemented using a special Load Balancer that keeps track of the membership of nodes in the cluster and adds or removes nodes based on their utilization.
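
Conceptually, such an auto-scaling controller is just a loop around a load metric. In the sketch below, the thresholds and the startNode/stopNode operations are hypothetical placeholders for whatever the underlying infrastructure (for example, a cloud provider API) actually exposes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of an auto-scaling controller: grow the cluster when average
// utilization is high and shrink it when utilization is low.
public class AutoScaler {
    private final Deque<String> nodes = new ArrayDeque<>();

    public void reconcile(double averageUtilization) {
        if (averageUtilization > 0.75) {                              // hypothetical upper threshold
            nodes.add(startNode());                                   // provision and register a new node
        } else if (averageUtilization < 0.25 && nodes.size() > 1) {   // hypothetical lower threshold
            stopNode(nodes.removeLast());                             // drain and release an idle node
        }
    }

    // Placeholders for infrastructure-specific operations.
    private String startNode() { return "node-" + (nodes.size() + 1); }
    private void stopNode(String node) { /* release the instance */ }
}
```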

Clustering support in WSO2 Products

Clustering is supported by the following WSO2 products:

  • WSO2 AS (WSO2 Application Server)
  • WSO2 ESB (Enterprise Service Bus)
  • WSO2 Governance Registry
  • WSO2 Business Process Server

For more information on clustering in other products, see here.

WSO2 products support all three goals:

  • To scale up (load balancing)
  • To achieve high availability
  • Both of the above

Any of the Front End options mentioned above can be used as the Load Balancer.

Different products require different levels of state replication based on the exact use case they are utilized for.

For instance, the ESB often requires full state replication across the nodes in the cluster, but nodes in a Governance Registry cluster usually share the same database and hence do not need full state replication. The level of replication required for the Application Server also depends on the particular use case. For instance, most applications that adhere to the three-tier architecture, where the service tier is stateless, do not need full state replication across all nodes in the cluster; however, if the services in the application keep state in application memory, they need full state replication.

WSO2 Clustering Architecture

When cluster nodes are stateless, or when they share data through an external data source like a database or a shared file system, replication of their state is not required, and hence such deployments are comparatively trivial. However, if each node has inherent state, dissemination of updates, which requires full replication, is often needed. Let us look at the full-replication implementation of WSO2 products.

Each WSO2 product is built on Apache Axis2, and to disseminate updates to nodes, WSO2 products use the Apache Axis2 clustering implementation. The Axis2 clustering implementation uses group communication to replicate state across all nodes in the cluster; group communication keeps track of the current membership of the cluster and also guarantees that each message multicast to the group is received by all current members. The use of group communication is the state of the art in implementing state replication. When the state in a cluster node changes, each change is sent to every node in the cluster, and the change is applied at each node in the same order. These changes can be applied in two modes (see the sketch after this list):

  • The change operation returns only after the change has been applied at all nodes
  • The change operation returns immediately and the change is applied in the background
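
The difference between the two modes can be sketched as follows; the sendTo call is a placeholder for the actual group-communication send, and the blocking variant simply waits for acknowledgements from all members before returning.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of the two replication modes: the blocking variant returns only after
// every member has acknowledged the change, while the non-blocking variant
// returns immediately and lets replication complete in the background.
public class ChangeReplicator {
    // Placeholder: send one change to one member, completing when it is applied.
    private CompletableFuture<Void> sendTo(String member, String change) {
        return CompletableFuture.completedFuture(null);
    }

    public void applyBlocking(List<String> members, String change) {
        CompletableFuture.allOf(
                members.stream().map(m -> sendTo(m, change)).toArray(CompletableFuture[]::new)
        ).join();   // returns only after every member has applied the change
    }

    public void applyNonBlocking(List<String> members, String change) {
        members.forEach(m -> sendTo(m, change));   // returns immediately
    }
}
```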

However, it is worth noting that this approach assumes that when a cluster node receives an update, applying that update to the local node's state does not fail. If that assumption is not realistic, then a two-phase commit protocol needs to be used to apply changes. Axis2 does not support this at the time of writing, and hence WSO2 products also do not support this mode.

Another consequence of this limitation is that Axis2 clustering only replicates changes to runtime state, not changes to system configuration. The state held in Axis2 is categorized into runtime state and configuration, and clustering only replicates the context hierarchy, which is the runtime state of the services; it does not replicate the configuration hierarchy. Therefore, if the service configuration is changed at runtime, for instance by adding a service, that change is not replicated to other nodes. The primary challenge in this case is that, unlike runtime state updates, configuration updates access files and complex data structures, and we cannot assume that a configuration update does not fail after it is delivered to a node. Supporting configuration replication runs the risk of leaving some nodes in the cluster out of sync, leading to untold complications. As a result, not replicating the configuration is a conscious decision taken at design time, and therefore updating configurations requires a restart.

Moreover, one obvious concern is that with the aforementioned setup, the Load Balancer becomes a single point of failure, and we recommend using Linux HA here to provide a cold backup.

Note that Linux HA cannot be used to provide high availability for the Web service containers themselves, because it cannot make application-level decisions about failures.

WSO2 Cluster Management Support

Typically, all nodes in a cluster run an identical configuration, and for ease of management, most deployments configure all cluster nodes to load their configurations from the same place. To support this use case, WSO2 products can use a registry or a URL repository to store the configurations. In the former case, all configurations are placed in a registry and multiple WSO2 products are pointed to the same registry. In the latter case, the URL of a remote web location is given as the repository for the WSO2 products installed on the cluster nodes. If the Governance Registry is used to store configurations, it provides a user-friendly interface to manage and update configurations and also provides add-on services like versioning and tags.

However, as explained earlier, changes to configurations are not replicated, and therefore updating configurations in a cluster requires a restart. We recommend the following pattern to update the configurations of a cluster. WSO2 products support a maintenance mode, which enables administrators to stop a node from accepting new requests while it continues to process in-progress requests. In this setting, updating a cluster involves taking half of its nodes into maintenance mode, applying the changes once they have processed all outstanding requests, restarting them, and then doing the same for the other half after bringing the updated nodes back online. This ensures that the configurations are updated while the cluster continues to serve user requests.

Finally, manually monitoring and managing nodes in a cluster can be tedious, and WSO2 products support JMX to solve this problem. Cluster nodes can therefore be monitored through any of the standard JMX clients or management consoles. A cluster management solution, which would provide a comprehensive answer to cluster management, is under development.

Getting Your Feet Wet

If you are interested in setting up a clustered version of WSO2 products, it primarily involves two tasks:

  • Setting up a Front End - to provide high availability, load balancing, or both
  • Setting up state replication between cluster nodes, if required

For details on how to create a cluster of Governance Registry instances, see Creating a Governance Registry Cluster.

The recommended Front End (Load Balancer) for a cluster of WSO2 products is the WSO2 ESB, and to set it up, users should configure a load-balancing endpoint or a failover endpoint. The ESB Sample Index provides examples of how to set up those endpoints. Another alternative is the Apache mod_proxy module; configuring mod_proxy is the same as configuring Tomcat behind Apache with mod_proxy, and there are many resources on the topic (for example, Load Balancing Tomcat with Apache).

In the articles Introduction to WSO2 Carbon Clustering and WSO2 Carbon Cluster Configuration Language, Afkham Azeez describes how to configure full state replication for WSO2 products by configuring Axis2 clustering appropriately. Moreover, the article WSO2 ESB / Apache Synapse Clustering Guide describes clustering as applied to the WSO2 ESB. Further details can be found under the clustering sections of each WSO2 product.

"End Goal", "Choice of Front End" and "Shared States across nodes" as three deciders in choosing right clustering strategy for a particular use-case. The following table summarizes different choices available under each decider.

 

Goal of Clustering:

  1. Load Balancing
  2. High Availability
  3. Both

Choice of the Front End:

  1. Apache Synapse
  2. Apache HTTPD mod_proxy
  3. Hardware
  4. DNS

State shared:

  1. Nothing Shared
  2. Shared Database
  3. Sticky Sessions
  4. Full Replication

The article also provides guidelines on when to use the different choices available under each decider. High availability or load balancing does not always require full state replication, and these guidelines help in choosing the right clustering architecture. The article should be useful for architects who want to choose the right solution from among the WSO2 clustering options.