WSO2 DAS has a distributed indexing engine that is built on top of Apache Lucene. Data is indexed and saved in the Data Access Layer of DAS in a store referred to as the Analytics File System. This section explains how indexing is carried out in WSO2 DAS.
A main requirement of DAS is to support arbitrary key value pairs. Most of the RDBMS have a fixed schema and do not support storing arbitrary key values. In DAS, records are stored as blobs. This way, the data fields in a record can be converted to a blob and stored together with the record ID and the timestamp of the record. Storing data as blobs has one drawback. That is database level indexing cannot be used for blobs to search by fields. To overcome this issue, records are sent through the indexer before the records are persisted in DAS so that the indexer can keep a searchable copy of record data along with the record ID. This indexer does not store the field values. It only stores the record ID and keeps an index of data fields. When a search query is sent to DAS, the indexer identifies the matching records, gets their record IDs, and sends them to the persistence layer/recordstore. In recordstore level, the record blobs are deserialized and returned as the record objects from the AnalyticsDataService
implementation.
Indexing Architecture
DAS indexer is implemented using Apache Lucene which is a full text search library. Users can index records and search for records later via Lucene queries. Events received by DAS are converted to a list of records and inserted into FileSystem
based queues. These queues are created in <DAS_HOME>/repository/data/index_staging_queues
directory. With a background thread, these queues are consumed and records are indexed. The indexed data is stored in the <DAS_HOME>/repository/data/index_data
directory. The DAS index consists of smaller indexes known as shards. A shard can be accessed nly by one index writer at a given time. (Index writer is the lucene class visible to outside world which is used to write lucene documents to a file system based index).Therefore having multiple shards can increase the write throughput (however, the write throughput can be limited by Disk IO operations). By default, DAS is configured to have six shards and one replica.
In a high availability deployment, at least one replica must be saved.
Indexing related configurations are done in the <DAS_HOME>/repository/conf/analytics/analytics-config.xml
file. Additionally, information relating to shards is maintained in the <DAS_HOME>/repository/conf/analytics/local-shard-allocation-config.conf
file. This file stores the shard number along with its state (that can be INIT
or NORMA
L). The INIT
is the initial state. Usually this state cannot be seen from outside. This is because the INIT
state changes to the NORMAL
state once the server starts. If the indexing node is running, the state of shards should be NORMAL
and not INIT.
The NORMAL
state denotes that the indexing node has started indexing. Therefore, whenever the data is ready to be indexed, the indexer node indexes the incoming data.
To re-index the whole dataset, follow the steps below:
- Shut down the WSO2 DAS server.
- Remove all the index data stored in the
<DAS_HOME>/repository/data
directory. - In the
<DAS_HOME>/repository/conf/analytics/local-shard-allocation-config.conf
file, change the mode for all the shards fromNORMAL
toINIT
. - Restart the WSO2 DAS server.
By default, a single DAS server is configured to have six shards, and one replica for each shard (i.e., a total of twelve shards). Therefore, in a minimum HA setup, each DAS server contains three shards and replicas of the other three shards. Even if one server is shut down, the second server has replicas of the three shards that were there in the node that was shut down. This allows the high availability to be maintained.
The number of replicas and the number of shards are defined in the analytics-config.xml
. By changing these properties we can change the number of shards and replicas, but once they are defined and the server is started, we cannot change them. Events are not needed to be published to both nodes. Replication happens through hazelcast messages. records are sent to other nodes via hazelcast.
The following example further explains how shards are managed in a clustered deployment.
If there are three DAS nodes in the cluster, all twelve shards (i.e., the shards and the replicas) are split among the three nodes. Each node will have four shrds which consists of 2 shards and 2 replicas.
Debugging Indexing in a cluster
If there seems to be an issue with indexing, do the following in the given order.
In the
<DAS_HOME>/repository/conf/analytics/analytics-config.xml
file check whether the shards are properly allocated.<!-- The number of index data replicas the system should keep, for H/A, this should be at least 1, e.g. the value 0 means there aren't any copies of the data --> <indexReplicationFactor>1</indexReplicationFactor> <!-- The number of index shards, should be equal or higher to the number of indexing nodes that is going to be working, ideal count being 'number of indexing nodes * [CPU cores used for indexing per node]' --> <shardCount>6</shardCount>
Check the logs that are printed when the cluster is initilized. The log must be similar to the example shown below.
TID: [-1] [] [2017-05-12 13:17:27,538] INFO {org.wso2.carbon.analytics.dataservice.core.indexing.IndexNodeCoordinator} - Indexing Initialized: CLUSTERED {0={c13d3a23-b15a-4b9c-ac8f-a30df2811c98=Member [10.3.24.70]:4000 this, bc751d36-d345-4b8f-b133-b77793f04805=Member [10.3.24.67]:4000}, 1={c13d3a23-b15a-4b9c-ac8f-a30df2811c98=Member [10.3.24.70]:4000 this, bc751d36-d345-4b8f-b133-b77793f04805=Member [10.3.24.67]:4000}, 2={c13d3a23-b15a-4b9c-ac8f-a30df2811c98=Member [10.3.24.70]:4000 this, bc751d36-d345-4b8f-b133-b77793f04805=Member [10.3.24.67]:4000}, 3={c13d3a23-b15a-4b9c-ac8f-a30df2811c98=Member [10.3.24.70]:4000 this, bc751d36-d345-4b8f-b133-b77793f04805=Member [10.3.24.67]:4000}, 4={c13d3a23-b15a-4b9c-ac8f-a30df2811c98=Member [10.3.24.70]:4000 this, bc751d36-d345-4b8f-b133-b77793f04805=Member [10.3.24.67]:4000}, 5={c13d3a23-b15a-4b9c-ac8f-a30df2811c98=Member [10.3.24.70]:4000 this, bc751d36-d345-4b8f-b133-b77793f04805=Member [10.3.24.67]:4000}} | Current Node Indexing: Yes {org.wso2.carbon.analytics.dataservice.core.indexing.IndexNodeCoordinator}
Here,
{0={c13d3a23-b15a-4b9c-ac8f-a30df2811c98=Member [10.36.241.70]:4000 this, bc751d36-d345-4b8f-b133-b77793f04805=Member [10.36.241.67]:4000}
means that the shard 0 is allocated to two nodes, and their IDs arec13d3a23-b15a-4b9c-ac8f-a30df2811c98
andbc751d36-d345-4b8f-b133-b77793f04805
. The IPs of the two nodes are also mentioned in the log line. For a correctly configured two-node DAS cluster, this log line must contain all six shards (from 0 to 5) and the node IDs to which they are allocated.This log line does not contain both the node IDs when you initially set up the cluster and start the first server because the other node has not joined the cluster. The log line is printed with the complete mapping of shards to node IDs only after the 2nd node joins the cluster.
The two node IDs mentioned in the log line must match the IDs mentioned in the
my-node-id.dat
of both DAS nodes. If there are more than two unique IDs in a 2 node cluster, the shard allocation may have been affectedand it mayneed to be corrected.
One reason for containing more than 2 ids is that, reconfiguring the cluster with new DAS packs without clearing the data sources in analytics-datasources.xml. When a new DAS pack is started pointing to the same analytics data sources, a new id is generated and the new DAS pack will be considered the third node who joined the cluster. When we are re-configuring the cluster with new DAS packs, we need to make sure that we first backup the older ids and restore them in the new packs. If the two node cluster already contain more than 2 unique ids ( due to starting a new DAS pack with new id), we have to remove the unnecessary/additional ids. Steps are as below.
Keep backup of the current my-node-id.dat of both nodes.
Extract the node ids mentioned in the log line mentioned above. This may contain more than 2 ids.
Separate out the ids which do not match with the my-node-id.dat of both nodes.
- Shutdown both nodes.
- Get one of the non-matching/unnecessary/additional node ids and replace the my-node-id.dat of any DAS node. (make sure you keep a backup of the my-node-id.dat before doing this)
Start the DAS node mentioned in step 5 with property “-DdisableIndexing=true”. This command will remove the node id that we put in my-node-id.dat from the indexing cluster.
Repeat the steps 4, 5, and 6 for all the additional node ids.
Finally, restore the two node ids (restore the my-node-id.dat files that you backed up).
Clean the repository/data folder
Go to local-shard-allocation.config.conf and replace the content with following.
0, INIT
1, INIT
2, INIT
3, INIT
4, INIT
5, INITDo the above for both nodes and start. You will see that the local-shard-allocation-config.conf content is changed to
0, NORMAL
1, NORMAL
2, NORMAL
3, NORMAL
4, NORMAL
5, NORMALNote that the above configuration is valid a DAS configured with 6 shards and 1 replicas. Step 10 should be changed according to the number of shards and replicas you have.
Indexing in Spark Environment
Spark runs in a different JVM. Therefore, Spark cannot write directly to Indexing queues. When we instruct the spark to index its summarized data, what Spark actually does is, it sends the summarized data to be indexed, to a storage area called staging area in the primary recordstore. So using the staging Index worker threads, these data is fetched and inserted to indexing queue. From that point onwards the indexing process is the same. Usually indexing can be slow if we frequently run the spark scripts (if those spark scripts contain instructions to index summarized data). Sometimes, there can be situations where the same data set is processed within a small time frame (e.g. spark scripts scheduled to run every 2 mins with indexing). In those cases, the same data is indexed over and over again. Best thing is to use incremental processing, So the reindexing of same data can be minimized.