
Handling Split-brain Situations

Introduction

A split-brain situation (also known as a network partition) occurs when a failure in the network causes the nodes of a cluster to split into two separate networks. This usually occurs in environments where the network is not stable. If this situation occurs in a two-node minimum high availability cluster, the cluster divides into two networks and each node functions as a leader. This has the following results:

  • Each node becomes the active node in its own network. As a result, both nodes publish duplicate output messages from the data you send to the DAS cluster.
  • Once the network is stable again, there are two leaders in the cluster.
  • Once the network is stable again, the Spark master triggers a shutdown to eliminate the worker node that attempts to re-register.

In order to address the above issues, you need to introduce a dummy node to the cluster. This involves defining a quorum size, which is 2 in a three-node scenario. Once a member is removed from the cluster, each node (other than the dummy node) evaluates whether it still belongs to the cluster based on the quorum size. If the quorum is satisfied, the node functions as normal. If the quorum is not satisfied, the node shuts itself down. For example, if one of the three nodes is isolated by a network partition, the two nodes that can still see each other satisfy the quorum of 2 and continue to operate, while an isolated DAS node only sees itself, fails to satisfy the quorum, and shuts itself down.

  • A WKM (Well-Known Member) is a node of a cluster that tracks the heartbeats of the other nodes of the cluster based on the Hazelcast WKA (Well-Known Addressing) membership scheme. In a DAS minimum HA two-node cluster, both nodes are WKMs. At a given time, only one node can be the leader of the cluster. This is handled by the Hazelcast-based implementation.
  • A dummy node is a WSO2 DAS node in which DAS functionalities such as indexing, the analytics engine, Spark analytics, and event publishing are disabled. It exists to handle the split-brain situation, and not to receive, process, or publish events.

Configuring the cluster

The following sections cover how to configure a DAS Minimum HA deployment that can handle the split-brain situation.

Configuring the dummy DAS node

To configure the dummy node, apply the following configurations to a vanilla/WUM-updated WSO2 DAS, WSO2 ESB Analytics, WSO2 IS Analytics, WSO2 API-M Analytics, or WSO2 BPS Analytics distribution (with no manual configurations):

You can configure the dummy node in the same virtual machine where one of the DAS nodes is configured. In such a scenario, you need to do the following:

  • To apply a port offset for the dummy node, open the <DAS_HOME>/repository/conf/carbon.xml file and set the value of the Offset parameter in the Server/Ports section to 1 as shown below (see also the carbon.xml example after this list).

    <Offset>1</Offset>
  • To update the local member port, open the <DAS_HOME>/repository/conf/axis2/axis2.xml file and update the value of the localMemberPort parameter to 4100 as shown below.

    <parameter name="localMemberPort">4100</parameter>


  1. Update the <DAS_HOME>/repository/conf/axis2/axis2.xml file as follows to enable Hazelcast clustering for both nodes.
    1. Set the enable attribute of the clustering class="org.wso2.carbon.core.clustering.hazelcast.HazelcastClusteringAgent" element to true as shown below to enable Hazelcast clustering.

      <clustering class="org.wso2.carbon.core.clustering.hazelcast.HazelcastClusteringAgent" enable="true">
    2. Enable the wka mode on both nodes as shown below. 

      <parameter name="membershipScheme">wka</parameter>
    3. Add both of the other DAS nodes as well-known members of the cluster under the members tag in each node, as shown in the example below.

      The dummy node must not be added as a well-known member.

      <members>
          <member>
              <hostName>[node1 IP]</hostName>
              <port>[node1 port]</port>
          </member>
          <member>
              <hostName>[node2 IP]</hostName>
              <port>[node2 port]</port>
          </member>
      </members>
    4. For each node, enter the respective server IP address as the value of the localMemberHost property as shown below.

      <parameter name="localMemberHost">[Server_IP_Address]</parameter>

      You also need to make sure that all three nodes have the same domain name specified via the domain parameter (i.e., <parameter name="domain">wso2.carbon.domain</parameter>). A consolidated example of the clustering configuration for the dummy node is given after this list.

  2. Create a file named hazelcast.properties in the <DAS_HOME>/repository/conf directory. Include the following properties in it.

    hazelcast.max.no.heartbeat.seconds=30
    hazelcast.max.no.master.confirmation.seconds=45
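
    For example, assuming the two existing DAS nodes use the IP addresses 192.168.1.10 and 192.168.1.11 with local member port 4000, and the dummy node runs on 192.168.1.12 with local member port 4100, the clustering section of the dummy node's axis2.xml file would look similar to the following sketch (the IP addresses and ports shown here are only illustrative; replace them with the values used in your environment):

      <clustering class="org.wso2.carbon.core.clustering.hazelcast.HazelcastClusteringAgent" enable="true">
          <parameter name="membershipScheme">wka</parameter>
          <parameter name="domain">wso2.carbon.domain</parameter>
          <parameter name="localMemberHost">192.168.1.12</parameter>
          <parameter name="localMemberPort">4100</parameter>
          <members>
              <member>
                  <hostName>192.168.1.10</hostName>
                  <port>4000</port>
              </member>
              <member>
                  <hostName>192.168.1.11</hostName>
                  <port>4000</port>
              </member>
          </members>
          <!-- The other clustering parameters in the default axis2.xml file remain unchanged. -->
      </clustering>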

Configuring the existing DAS nodes

To configure the existing two nodes of the minimum HA deployment, stop one node at a time and do the following configurations:

  1. To configure the static quorum strategy, open the <DAS_HOME>/repository/conf/analytics/analytics-config.xml file and add the following as a child element of the analytics-dataservice-configuration element. (An example of the resulting structure is given after this list.)

    <static-quorum enabled="true">
        <quorum-size>2</quorum-size>
    </static-quorum>
  2. Create a file named hazelcast.properties in the <DAS_HOME>/repository/conf directory. Include the following properties in it.

    If the file already exists, add the following properties to it.

    hazelcast.max.no.heartbeat.seconds=30
    hazelcast.max.no.master.confirmation.seconds=45
  3. In the <DAS_HOME>/repository/conf/analytics/spark/spark-defaults.conf file, change the value of the spark.akka.timeout parameter to 100s (this is 1000s by default).
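
    For example, after step 1 above, the relevant part of the <DAS_HOME>/repository/conf/analytics/analytics-config.xml file on each existing node would look similar to the following sketch; the other child elements of the analytics-dataservice-configuration element remain unchanged:

      <analytics-dataservice-configuration>
          <!-- ... existing configuration ... -->
          <!-- Shut this node down if fewer than 2 members (including this node) are visible in the cluster. -->
          <static-quorum enabled="true">
              <quorum-size>2</quorum-size>
          </static-quorum>
      </analytics-dataservice-configuration>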

Starting the cluster

Once you complete the configurations mentioned above, start the DAS nodes as follows:

  1. First, start the two existing DAS nodes of the cluster. For more information about the expected results, see Configuring a Minimum High Availability Cluster.
  2. Start the dummy DAS node as the last node of the 3-node DAS cluster by issuing the following command from <DAS_HOME>.
    ./bin/wso2server.sh -DdisableAnalyticsEngine=true -DdisableAnalyticsExecution=true -DdisableIndexing=true -DdisableDataPurging=true -DdisableAnalyticsSparkCtx=true -DdisableAnalyticsStats=true -DdisableMl=true  -DdisableEventSink=true start

    Once the node has successfully started, a log similar to the following is logged in the CLI of each existing DAS node.

    INFO {org.wso2.carbon.core.clustering.hazelcast.wka.WKABasedMembershipScheme} -  Member joined [<uuid>]: /<ip>:<port>

Expected behaviour

When a DAS minimum HA deployment is configured with a dummy node, there are three possible outcomes when a split-brain scenario takes place:

  • The dummy node getting isolated
     If the dummy node is isolated, it does not shut itself down because it is not configured to follow the static-quorum strategy. The two-node cluster (without the dummy node) continues to function with the leader it had before the split-brain scenario took place (i.e., DAS1).
  • The existing leader getting isolated
     If the existing leader of the DAS cluster (i.e., DAS1) is isolated after the split-brain scenario, the passive node of the cluster (i.e., DAS2) evaluates itself based on the static-quorum strategy and identifies that it still belongs to the cluster. It further identifies that the leader has left the cluster and elects itself as the leader. DAS1 identifies that it no longer belongs to a cluster based on the static-quorum strategy, and shuts itself down.
  • The existing passive node getting isolated
    If the existing passive node of the cluster is isolated after the split-brain scenario, the active node identifies that it still belongs to the cluster based on the static-quorum strategy. Therefore, it continues to operate as the leader of the cluster. DAS2 identifies that it no longer belongs to a cluster based on the static-quorum strategy, and shuts itself down.

When either DAS1 or DAS2 is isolated after a split-brain scenario, a log similar to the following sample log is displayed in its CLI.

TID: [-1] [] [2018-01-16 11:56:32,144]  INFO {org.wso2.carbon.analytics.dataservice.core.clustering.AnalyticsClusterManagerImpl} -  [Current members]: 1 [Quorum size]: 2 - Quorum is not satisfied, this node will be shutdown now... {org.wso2.carbon.analytics.dataservice.core.clustering.AnalyticsClusterManagerImpl}