Handling Split-brain Situations
Introduction
A split-brain situation (also known as a network partition) occurs when a failure in the network causes the nodes to split into two separate networks. This usually occurs in environments where the network is unstable. If this situation occurs in a two-node minimum high availability cluster, the cluster divides into two networks, and each node functions as a leader. This has the following results:
- Each node becomes the active node in its own network. As a result, both nodes publish duplicate output messages from the data you send to the DAS cluster.
- Once the network is stable again, there are two leaders in the cluster.
- Once the network is stable again, the Spark master triggers a shutdown to eliminate the worker node that attempts to re-register.
To address the above issues, you need to introduce a dummy node to the cluster. This involves defining a quorum size, which is 2 in a three-node scenario. Once a member is removed from the cluster, each node (other than the dummy node) evaluates whether it still belongs in the cluster based on the quorum size. If the quorum is satisfied, the node functions as normal. If the quorum is not satisfied, it shuts itself down.
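The quorum rule described above can be sketched as follows. This is an illustrative Python sketch of the decision each non-dummy node makes, not WSO2 DAS source code; the function name and structure are assumptions.

```python
# Illustrative sketch of the static-quorum check described above.
# This is not WSO2 DAS source code; names are hypothetical.

QUORUM_SIZE = 2  # quorum size for the three-node scenario

def should_stay_in_cluster(visible_members: int, quorum_size: int = QUORUM_SIZE) -> bool:
    """A node stays up only if it can still see at least quorum_size
    members (including itself); otherwise it shuts itself down."""
    return visible_members >= quorum_size

# After a partition, a node isolated on its own sees only 1 member:
print(should_stay_in_cluster(1))  # quorum not satisfied, so the node shuts down
# A node still grouped with one other member sees 2 members:
print(should_stay_in_cluster(2))  # quorum satisfied, so the node keeps running
```

The dummy node never runs this check; it exists only so that, after a partition, exactly one side of the split can still see two members and satisfy the quorum.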
- A WKM (Well-Known Member) is a node of a cluster that tracks the heartbeats of other nodes of the cluster based on the Hazelcast WKA (Well-Known Addressing) membership scheme. In a DAS minimum HA two-node cluster, both nodes are WKMs. At a given time, only one node can be the leader of the cluster. This is handled by the Hazelcast-based implementation.
- A dummy node is a WSO2 DAS node in which DAS functionalities such as indexing, the analytics engine, Spark analytics, and event publishing are disabled. It exists to handle the split-brain situation, not to receive, process, or publish events.
Configuring the cluster
The following sections cover how to configure a DAS Minimum HA deployment that can handle the split-brain situation.
Configuring the dummy DAS node
To configure the dummy node, do the following configurations to a Vanilla/WUM-updated WSO2 DAS, WSO2 ESB Analytics, WSO2 IS Analytics, WSO2 API-M Analytics or a WSO2 BPS Analytics distribution (with no manual configurations):
You can configure the dummy node in the same virtual machine where one of the DAS nodes is configured. In such a scenario, you need to do the following:
To do a port offset for the dummy node, open the <DAS_HOME>/repository/conf/carbon.xml file and set the value of the Offset parameter in the Server/Ports section to 1 as shown below.
<Offset>1</Offset>
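The Offset value shifts every default Carbon port by the same amount, which is what lets the dummy node share a virtual machine with an existing DAS node without port conflicts. A rough illustration follows; the base ports listed are the usual Carbon defaults (9443 for the HTTPS servlet transport, 9763 for HTTP), stated here as an assumption.

```python
# Hypothetical illustration of how the Carbon port Offset works.
# Base ports below are typical Carbon defaults, assumed for this sketch.
CARBON_DEFAULT_PORTS = {"https_servlet": 9443, "http_servlet": 9763}

def effective_ports(offset: int) -> dict:
    """Every Carbon port is shifted by the configured Offset."""
    return {name: port + offset for name, port in CARBON_DEFAULT_PORTS.items()}

print(effective_ports(1))  # dummy node configured with <Offset>1</Offset>
```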
To update the local member port, open the <DAS_HOME>/repository/conf/axis2/axis2.xml file and update the value of the localMemberPort parameter to 4100 as shown below.
<parameter name="localMemberPort">4100</parameter>
- Update the <DAS_HOME>/repository/conf/axis2/axis2.xml file as follows to enable Hazelcast clustering for both nodes.
Set the enable attribute of the clustering element (class="org.wso2.carbon.core.clustering.hazelcast.HazelcastClusteringAgent") to true as shown below to enable Hazelcast clustering.
<clustering class="org.wso2.carbon.core.clustering.hazelcast.HazelcastClusteringAgent" enable="true">
Enable the wka mode on both nodes as shown below.
<parameter name="membershipScheme">wka</parameter>
Add both DAS nodes as well-known members in the cluster under the members tag in each node as shown in the example below. The dummy node must not be added as a well-known member.
<members>
   <member>
      <hostName>[node1 IP]</hostName>
      <port>[node1 port]</port>
   </member>
   <member>
      <hostName>[node2 IP]</hostName>
      <port>[node2 port]</port>
   </member>
</members>
For each node, enter the respective server IP address as the value of the localMemberHost property as shown below.
<parameter name="localMemberHost">[Server_IP_Address]</parameter>
You also need to make sure that all 3 nodes have the same domain name specified via the domain parameter (i.e., <parameter name="domain">wso2.carbon.domain</parameter>).
Create a file named hazelcast.properties in the <DAS_HOME>/repository/conf directory. Include the following properties in it.
hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=45
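These two properties control how quickly a silent member is declared dead: a member is removed once no heartbeat has been received for the configured number of seconds. A minimal sketch of that idea follows; it is illustrative only, not Hazelcast's actual failure detector.

```python
import time

# hazelcast.max.no.heartbeat.seconds, as configured above
HEARTBEAT_TIMEOUT_SECONDS = 30.0

def is_member_alive(last_heartbeat: float, now: float = None,
                    timeout: float = HEARTBEAT_TIMEOUT_SECONDS) -> bool:
    """A member is considered failed once no heartbeat has been seen
    for longer than the configured timeout."""
    if now is None:
        now = time.time()
    return (now - last_heartbeat) <= timeout

print(is_member_alive(last_heartbeat=100.0, now=120.0))  # 20s of silence: still alive
print(is_member_alive(last_heartbeat=100.0, now=140.0))  # 40s of silence: removed
```

Lowering these values makes the cluster react to a partition faster, at the cost of more false removals on a slow network.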
Configuring the existing DAS nodes
To configure the existing two nodes of the minimum HA deployment, stop one node at a time and do the following configurations:
To configure the static quorum strategy, open the <DAS_HOME>/repository/conf/analytics/analytics-config.xml file and add the following as a child element of the analytics-dataservice-configuration element.
<static-quorum enabled="true">
   <quorum-size>2</quorum-size>
</static-quorum>
Create a file named hazelcast.properties in the <DAS_HOME>/repository/conf directory. Include the following properties in it. If the file already exists, add the following properties to it.
hazelcast.max.no.heartbeat.seconds=30
hazelcast.max.no.master.confirmation.seconds=45
- In the <DAS_HOME>/repository/conf/analytics/spark/spark-defaults.conf file, change the value of the spark.akka.timeout parameter to 1000s (this is 100s by default).
Starting the cluster
Once you complete the configurations mentioned above, start the DAS nodes as follows:
- First, start the two existing DAS nodes of the cluster. For more information about the expected results, see Configuring a Minimum High Availability Cluster.
Start the dummy DAS node as the last node of the 3-node DAS cluster by issuing the following command from <DAS_HOME>.
./bin/wso2server.sh -DdisableAnalyticsEngine=true -DdisableAnalyticsExecution=true -DdisableIndexing=true -DdisableDataPurging=true -DdisableAnalyticsSparkCtx=true -DdisableAnalyticsStats=true -DdisableMl=true -DdisableEventSink=true start
Once the node has successfully started, a log similar to the following is logged in the CLI of each existing DAS node.
INFO {org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme} - Member joined [<uuid>]: /<ip>:<port>
Expected behaviour
When a DAS minimum HA deployment is configured with a dummy node, there are three possible outcomes that can result when a split-brain scenario takes place:
- The dummy node getting isolated
If the dummy node is isolated, it does not shut itself down because it is not configured to follow the static-quorum strategy. The two-node cluster (without the dummy node) continues to function with the leader it had before the split-brain scenario took place (i.e., DAS 1 in the diagram).
- The existing leader getting isolated
If the existing leader of the DAS cluster (i.e., DAS 1) is isolated after the split-brain scenario, the passive node of the cluster (i.e., DAS 2) evaluates itself based on the static-quorum strategy and identifies that it still belongs to the cluster. It further identifies that the leader has left the cluster and elects itself as the leader. DAS 1 identifies that it no longer belongs to a cluster based on the static-quorum strategy, and shuts itself down.
- The existing passive node getting isolated
If the existing passive node of the cluster is isolated after the split-brain scenario, the active node identifies that it still belongs to the cluster based on the static-quorum strategy. Therefore, it continues to operate as the leader of the cluster. DAS 2 identifies that it no longer belongs to a cluster based on the static-quorum strategy, and shuts itself down.
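The three outcomes above follow mechanically from the quorum rule: only nodes that follow the static-quorum strategy evaluate themselves, and such a node survives only if its side of the partition still holds at least 2 members. A hedged Python sketch of that decision table follows; the node names and structure are illustrative, not WSO2 code.

```python
# Illustrative simulation of the three split-brain outcomes described above.
QUORUM_SIZE = 2
# node name -> whether it follows the static-quorum strategy
NODES = {"DAS1": True, "DAS2": True, "dummy": False}

def surviving_nodes(isolated: str) -> list:
    """Return the nodes that keep running after `isolated` is partitioned off."""
    survivors = []
    for name, follows_quorum in NODES.items():
        # members visible on this node's side of the partition
        side = 1 if name == isolated else len(NODES) - 1
        if not follows_quorum or side >= QUORUM_SIZE:
            survivors.append(name)
    return survivors

print(surviving_nodes("dummy"))  # DAS1 and DAS2 keep running; the dummy never shuts itself down
print(surviving_nodes("DAS1"))   # DAS2 becomes leader and the dummy survives; DAS1 shuts down
print(surviving_nodes("DAS2"))   # DAS1 stays leader and the dummy survives; DAS2 shuts down
```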
When either DAS1 or DAS2 is isolated after a split-brain scenario, a log similar to the following sample log is displayed in its CLI.
TID: [-1] [] [2018-01-16 11:56:32,144] INFO {org.wso2.carbon.analytics.dataservice.core.clustering.AnalyticsClusterManagerImpl} - [Current members]: 1 [Quorum size]: 2 - Quorum is not satisfied, this node will be shutdown now... {org.wso2.carbon.analytics.dataservice.core.clustering.AnalyticsClusterManagerImpl}