...

A main requirement of DAS is to support arbitrary key-value pairs. Most RDBMSs have a fixed schema and do not support storing arbitrary key-value pairs, so in DAS, records are stored as blobs: the data fields of a record are serialized into a blob and stored together with the record ID and the timestamp of the record. Storing data as blobs has one drawback: database-level indexing cannot be used to search blobs by field. To overcome this, records are sent through the indexer before they are persisted in DAS, so that the indexer can keep a searchable copy of the record data along with the record ID. The indexer does not store the field values; it only stores the record ID and keeps an index of the data fields. When a search query is sent to DAS, the indexer identifies the matching records, retrieves their record IDs, and passes them to the persistence layer (the record store). At the record store level, the record blobs are deserialized and returned as record objects from the AnalyticsDataService implementation.
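
As a rough, hypothetical sketch of this model (these are not the actual DAS classes; the DataRecord type and field names below are invented for illustration), a record with arbitrary key-value fields can be serialized into an opaque blob that is persisted together with the record ID and timestamp:

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.io.ObjectOutputStream;
  import java.util.HashMap;
  import java.util.Map;

  public class RecordBlobExample {

      // Hypothetical record type for illustration: an ID, a timestamp and an
      // arbitrary map of field values (the "schema-less" part).
      static final class DataRecord {
          final String id;
          final long timestamp;
          final Map<String, Object> values;

          DataRecord(String id, long timestamp, Map<String, Object> values) {
              this.id = id;
              this.timestamp = timestamp;
              this.values = values;
          }
      }

      // The record store persists only this opaque blob (plus the ID and
      // timestamp), which is why database-level indexes cannot search by field.
      static byte[] toBlob(DataRecord record) throws IOException {
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
              out.writeObject(record.values);
          }
          return bytes.toByteArray();
      }

      public static void main(String[] args) throws IOException {
          Map<String, Object> values = new HashMap<>();
          values.put("httpStatus", 500);
          values.put("message", "connection timed out");
          DataRecord record = new DataRecord("rec-001", System.currentTimeMillis(), values);
          byte[] blob = toBlob(record); // stored alongside record.id and record.timestamp
          System.out.println("Blob size: " + blob.length + " bytes");
      }
  }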

Indexing Architecture

The DAS indexer is implemented using Apache Lucene, a full-text search library. Users can index records and later search for them via Lucene queries. Events received by DAS are converted to a list of records and inserted into file-system-based queues, which are created in the <DAS_HOME>/repository/data/index_staging_queues directory. A background thread consumes these queues and indexes the records. The indexed data is stored in the <DAS_HOME>/repository/data/index_data directory. The DAS index consists of smaller indexes known as shards. A shard can be accessed by only one index writer at a given time. (The index writer is the Lucene class, exposed to the outside world, that is used to write Lucene documents to a file-system-based index.) Therefore, having multiple shards can increase the write throughput, although the write throughput can still be limited by disk I/O operations. By default, DAS is configured to have six shards and one replica.
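
As an illustration of the pattern described above (this is plain Lucene usage, not DAS source code; the shard path and field names are assumptions), an index writer for a single shard might store only the record ID and index the data fields without storing their values:

  import java.io.IOException;
  import java.nio.file.Paths;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StringField;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.FSDirectory;

  public class ShardIndexerExample {
      public static void main(String[] args) throws IOException {
          // Only one IndexWriter may hold the write lock on a shard at a time,
          // so each shard directory gets its own writer.
          String shardPath = "repository/data/index_data/shard0"; // illustrative path
          try (FSDirectory dir = FSDirectory.open(Paths.get(shardPath));
               IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

              Document doc = new Document();
              // The record ID is stored so a search can hand matching IDs to the record store.
              doc.add(new StringField("record_id", "rec-001", Field.Store.YES));
              // Data fields are indexed for searching but their values are not stored here.
              doc.add(new TextField("message", "connection timed out", Field.Store.NO));
              writer.addDocument(doc);
          }
      }
  }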

...

Indexing-related configurations are done in the <DAS_HOME>/repository/conf/analytics/analytics-config.xml file. Additionally, information relating to shards is maintained in the <DAS_HOME>/repository/conf/analytics/local-shard-allocation-config.conf file. This file stores each shard number along with its state, which can be either INIT or NORMAL. INIT is the initial state; it is usually not visible from outside because it changes to NORMAL once the server starts. If the indexing node is running, the state of the shards should be NORMAL and not INIT. The NORMAL state denotes that the indexing node has started indexing, so whenever data is ready to be indexed, the indexer node indexes the incoming data. To re-index the whole dataset, follow the steps given in the Re-indexing existing data section below.
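
For example, on a node that owns shards 0 to 5 and has finished initializing, the local-shard-allocation-config.conf file could look similar to the following (the exact layout may vary between DAS versions; this is only an illustration):

  0,NORMAL
  1,NORMAL
  2,NORMAL
  3,NORMAL
  4,NORMAL
  5,NORMAL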

...


...

By default, a single DAS server is configured to have six shards and one replica for each shard (i.e., a total of twelve shards). Therefore, in a minimum HA setup, each DAS server contains three shards and replicas of the other three shards. Even if one server is shut down, the second server has replicas of the three shards that were on that node, which allows high availability to be maintained.
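
One possible allocation in such a two-node setup (the actual assignment is decided by the cluster and may differ) is:

  Node 1: shards 0, 1, 2  +  replicas of shards 3, 4, 5
  Node 2: shards 3, 4, 5  +  replicas of shards 0, 1, 2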

The number of replicas and the number of shards are defined in the analytics-config.xml file. By changing these properties, you can change the number of shards and replicas, but once they are defined and the server is started, they cannot be changed. Events do not need to be published to both nodes; replication happens through Hazelcast messages that send the records to the other nodes.

...

If there are three DAS nodes in the cluster, all twelve shards (i.e., the six shards and their replicas) are split among the three nodes. Each node will have four shards: two primary shards and two replicas.
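
Similarly, one possible (purely illustrative) allocation for a three-node cluster is:

  Node 1: shards 0, 1  +  replicas of shards 2, 4
  Node 2: shards 2, 3  +  replicas of shards 5, 0
  Node 3: shards 4, 5  +  replicas of shards 1, 3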

Re-indexing existing data

To re-index the whole dataset, follow the steps below:

  1. Shut down the WSO2 DAS server.
  2. Remove all the index data stored in the <DAS_HOME>/repository/data directory.
  3. In the <DAS_HOME>/repository/conf/analytics/local-shard-allocation-config.conf file, change the mode for all the shards from NORMAL to INIT (see the example after these steps).
  4. Restart the WSO2 DAS server.
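
For instance, after step 3, the local-shard-allocation-config.conf file on a node that owns shards 0 to 5 would look similar to the following (illustrative layout, as above); on the next startup the server re-indexes the existing data and switches the shards back to NORMAL:

  0,INIT
  1,INIT
  2,INIT
  3,INIT
  4,INIT
  5,INIT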


Debugging Indexing in a cluster

...