A key requirement of DAS is to support arbitrary key-value pairs. Most RDBMSs have a fixed schema and do not support storing arbitrary key-value pairs. DAS therefore stores records as blobs: the data fields of a record are serialized into a blob and stored together with the record ID and the record timestamp. Storing data as blobs has one drawback: database-level indexing cannot be used to search blobs by field. To overcome this, records are sent through the indexer before they are persisted in DAS, so that the indexer can keep a searchable copy of the record data along with the record ID. The indexer does not store the field values. It only stores the record ID and keeps an index of the data fields. When a search query is sent to DAS, the indexer identifies the matching records and passes their record IDs to the persistence layer (the record store). At the record store level, the record blobs are deserialized and returned as record objects from the AnalyticsDataService implementation.
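The search path described above can be illustrated with a minimal in-memory sketch. It is not the DAS implementation: the class `BlobStoreSketch`, its methods, and the `field=value` index keys are hypothetical names chosen for illustration, Java serialization stands in for whatever blob format DAS uses, and the record timestamp is left out for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of "index first, then persist as a blob".
class BlobStoreSketch {
    // Stand-in for the record store: record ID -> serialized fields (the blob).
    private final Map<String, byte[]> recordStore = new HashMap<>();
    // Stand-in for the indexer: search term -> IDs of matching records.
    // A lookup yields record IDs only, never the record itself.
    private final Map<String, Set<String>> index = new HashMap<>();

    public void put(String recordId, Map<String, String> fields) {
        // The record passes through the indexer before it is persisted.
        for (Map.Entry<String, String> e : fields.entrySet()) {
            index.computeIfAbsent(e.getKey() + "=" + e.getValue(), k -> new TreeSet<>())
                 .add(recordId);
        }
        recordStore.put(recordId, serialize(fields));
    }

    public List<Map<String, String>> search(String field, String value) {
        List<Map<String, String>> result = new ArrayList<>();
        // Step 1: the index resolves the query to record IDs.
        for (String id : index.getOrDefault(field + "=" + value, Collections.emptySet())) {
            // Step 2: the record store deserializes the blob for each ID.
            result.add(deserialize(recordStore.get(id)));
        }
        return result;
    }

    private static byte[] serialize(Map<String, String> fields) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new HashMap<>(fields));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    @SuppressWarnings("unchecked")
    private static Map<String, String> deserialize(byte[] blob) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob))) {
            return (Map<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the database side never inspects the blob, so any set of fields can be stored without a schema change. The cost is that every search has to go through the index first.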

Indexing Architecture

The DAS indexer is implemented using Apache Lucene, a full-text search library. Users can index records and search for them later using Lucene queries. Events received by DAS are converted to a list of records and inserted into file system based queues. These queues are created in the <DAS_HOME>/repository/data/index_staging_queues directory. A background thread consumes these queues and indexes the records. The indexed data is stored in the <DAS_HOME>/repository/data/index_data directory. The DAS index consists of smaller indexes known as shards. A shard can be accessed by only one index writer at a given time. (The index writer is the public Lucene class used to write Lucene documents to a file system based index.) Therefore, having multiple shards can increase the write throughput, although the throughput can still be limited by disk I/O. By default, DAS is configured to have six shards and one replica.
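The staging and sharding flow above can be sketched as follows. This is a toy in-memory model rather than DAS or Lucene code: the class `ShardRouterSketch` and its methods are hypothetical, in-memory queues stand in for the file system based staging queues, and routing a record to a shard by hashing its ID is an assumption, since the text does not say how records are assigned to shards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical toy model: one staging queue and one index per shard.
class ShardRouterSketch {
    private final List<Queue<String>> stagingQueues = new ArrayList<>();
    private final List<List<String>> shardIndexes = new ArrayList<>();

    public ShardRouterSketch(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            stagingQueues.add(new ConcurrentLinkedQueue<>());
            shardIndexes.add(new ArrayList<>());
        }
    }

    // Assumption: records are routed by a hash of the record ID.
    public int shardFor(String recordId) {
        return Math.floorMod(recordId.hashCode(), stagingQueues.size());
    }

    // Incoming records are staged first and are not indexed immediately.
    public void stage(String recordId) {
        stagingQueues.get(shardFor(recordId)).add(recordId);
    }

    // Called by a background worker. The lock on the shard's index models
    // "only one index writer per shard at a given time"; workers draining
    // different shards do not block each other.
    public int drain(int shard) {
        List<String> shardIndex = shardIndexes.get(shard);
        Queue<String> queue = stagingQueues.get(shard);
        int written = 0;
        synchronized (shardIndex) {
            String recordId;
            while ((recordId = queue.poll()) != null) {
                shardIndex.add(recordId);
                written++;
            }
        }
        return written;
    }

    public int shardCount() {
        return stagingQueues.size();
    }
}
```

With six shards, up to six workers could each drain a different shard concurrently in this model, which is the reason more shards can raise write throughput until disk I/O becomes the bottleneck.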

...