Indexing in Cassandra

Indexing is essential for the event and activity search functionality. Because the data is stored in Cassandra column families, every column that search operations run against has to be indexed. For this purpose, users can either use the built-in Cassandra secondary indexes or index columns manually with custom indexes. The properties to be indexed are defined in the BAM Toolbox structure.

Secondary indexes support

The BAM Toolbox structure contains a configuration file named 'streams.properties'. For details on the toolbox structure, see Creating a Custom Toolbox.

.tbox
    |-- analytics <folder>
            .
            .
    |-- dashboard <folder>
            .
            .
    |-- streamDefn <folder>
            |-- defn_1 <file>
            |-- defn_2 <file>
            |-- streams.properties <file>

Secondary index properties can be defined by adding the following property to 'streams.properties':

streams.definitions.defn1.secondaryindexes=

For example, assume that you need to add secondary indexes for the 'resource_id' and 'direction' properties of the BAM mediation stats data stream. The streams.properties file should then be as follows:

streams.definitions=defn1
streams.definitions.defn1.filename=mediation_stats_stream_def
streams.definitions.defn1.description=This is the datastream published from mediation statistic data publisher.
streams.definitions.defn1.secondaryindexes=resource_id,direction
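
To make the property format concrete, the following minimal Python sketch reads such a file and extracts the comma-separated index columns. It is only an illustration of the format and not how BAM itself reads the file; the helper names parse_properties and index_columns are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def index_columns(props, definition, kind):
    """Return the column list of one index kind, for example 'secondaryindexes'."""
    raw = props.get(f"streams.definitions.{definition}.{kind}", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


SAMPLE = """\
streams.definitions=defn1
streams.definitions.defn1.filename=mediation_stats_stream_def
streams.definitions.defn1.secondaryindexes=resource_id,direction
"""

props = parse_properties(SAMPLE)
print(index_columns(props, "defn1", "secondaryindexes"))  # ['resource_id', 'direction']
```

An index property that is missing or left empty yields an empty list, which matches leaving the property value blank.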

Custom indexes support

Search operations on native secondary indexes have many limitations and performance issues. Custom index support was introduced to overcome most of them.

The idea is to keep a separate index column family (CF) for each indexed column of the primary column family, using a wide-row model. One row holds all the index data as column keys (a single Cassandra row can hold up to 2 billion columns). A composite column key is created from the value of the index column, the timestamp, and the row key of the primary CF, and is inserted as a column key in the relevant index row of the index CF. Composite columns are ordered by their first component, then by their second component, and so on. The column value is the row key of the primary CF. The index CF also has a special row that keeps all the unique index values, which is needed to support GT/LT (greater-than/less-than) operations on the data.

The indexing workflow has three steps:

Read an event from the primary column family (that is, a row of the primary column family).
If the event contains a value for the index column (host, in this example), create a composite key from the host value, the timestamp, and the row key.
Add the composite key as a column key to the relevant index row in the index CF. Also add the host value as a column key to the special row that holds the unique index values.
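
The steps above can be sketched with a small in-memory model. This is an illustration under stated assumptions and not the BAM implementation: the class IndexCF, the row key "host_row", and the special row name in UNIQUE_VALUES_ROW are all hypothetical, and Python tuples stand in for Cassandra composite column keys because they sort component by component in the same way.

```python
import bisect

UNIQUE_VALUES_ROW = "unique_index_values"  # hypothetical key of the special row


class IndexCF:
    """In-memory stand-in for the index column family of one indexed column."""

    def __init__(self):
        self.columns = {}  # row key -> sorted list of column keys
        self.values = {}   # (row key, column key) -> column value

    def put(self, row_key, column_key, value):
        keys = self.columns.setdefault(row_key, [])
        pos = bisect.bisect_left(keys, column_key)
        if pos == len(keys) or keys[pos] != column_key:
            keys.insert(pos, column_key)  # keep column keys sorted
        self.values[(row_key, column_key)] = value

    def index_event(self, index_row, event, row_key, column="host"):
        if column not in event:
            return  # step 2: only events that carry the index column are indexed
        value = event[column]
        # step 3: composite key (index value, timestamp, primary row key)
        self.put(index_row, (value, event["timestamp"], row_key), row_key)
        self.put(UNIQUE_VALUES_ROW, (value,), "")

    def equal(self, index_row, value):
        keys = self.columns.get(index_row, [])
        pos = bisect.bisect_left(keys, (value,))
        matches = []
        while pos < len(keys) and keys[pos][0] == value:
            matches.append(self.values[(index_row, keys[pos])])
            pos += 1
        return matches

    def greater_than(self, index_row, value):
        # The special row supplies the distinct values above the bound.
        uniques = self.columns.get(UNIQUE_VALUES_ROW, [])
        start = bisect.bisect_right(uniques, (value,))
        matches = []
        for (unique_value,) in uniques[start:]:
            matches.extend(self.equal(index_row, unique_value))
        return matches


# Step 1: rows read from the primary column family (sample data).
primary_cf = [
    ("row1", {"host": "b.example", "timestamp": 100}),
    ("row2", {"host": "a.example", "timestamp": 200}),
    ("row3", {"host": "b.example", "timestamp": 50}),
    ("row4", {"timestamp": 10}),  # no host value, so it is not indexed
]
host_index = IndexCF()
for row_key, event in primary_cf:
    host_index.index_event("host_row", event, row_key)

print(host_index.equal("host_row", "b.example"))         # ['row3', 'row1']
print(host_index.greater_than("host_row", "a.example"))  # ['row3', 'row1']
```

Within one index value the matches come back in timestamp order, because the timestamp is the second component of the composite key.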

Also note the following:

To define custom index properties, add the following property to streams.properties: streams.definitions.defn1.customindexes=
To define custom indexes on arbitrary fields (fields that are not defined in the stream definition), add the following property to streams.properties: streams.definitions.defn1.arbitraryindexes=
For example, assume that you need to create custom indexes for the host and stats_type properties of the BAM mediation stats data stream. The streams.properties file should then be as follows:
```
streams.definitions=defn1
streams.definitions.defn1.filename=mediation_stats_stream_def
streams.definitions.defn1.username=admin
streams.definitions.defn1.password=admin
streams.definitions.defn1.description=This is the datastream published from mediation statistic data publisher.
streams.definitions.defn1.secondaryindexes=
streams.definitions.defn1.customindexes=stats_type,host
streams.definitions.defn1.arbitraryindexes=arbitraryfield1,arbitraryfield2
```
In the above example, two separate column families are created in EVENT_INDEX_KS: one for the stats_type index data and one for the host index data.

Fixed search properties

In some cases, every search query performed on a CF includes an equality comparison on a certain property. The 'Version' property of a stream is a good example, where every search operation can include a condition such as 'version=<version_value>'. Such fixed search properties can be defined by adding the following property to 'streams.properties':

streams.definitions.defn1.fixedsearchproperties=

Separate index CFs are not created for fixed search properties. Instead, their values are included in each of the other index CFs as part of the composite column key. For example, assume that an equality comparison on the 'version' property is included in every search query. The streams.properties file should then be as follows:

streams.definitions=defn1
streams.definitions.defn1.filename=mediation_stats_stream_def
streams.definitions.defn1.username=admin
streams.definitions.defn1.password=admin
streams.definitions.defn1.description=This is the datastream published from mediation statistic data publisher.
streams.definitions.defn1.secondaryindexes=resource_id,direction
streams.definitions.defn1.customindexes=stats_type,host
streams.definitions.defn1.fixedsearchproperties=version
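
As a rough illustration of how a fixed search property can become part of the composite column key, consider the Python sketch below. The position of the fixed property inside the key is an assumption made for this example (it is placed first so that an equality match becomes a key prefix); the actual layout used by BAM is not described here. The function names composite_key and search, and the sample events, are hypothetical.

```python
def composite_key(event, row_key, index_column, fixed_properties):
    """Build (fixed values..., index value, timestamp, primary row key)."""
    fixed = tuple(event[name] for name in fixed_properties)
    return fixed + (event[index_column], event["timestamp"], row_key)


def search(index, version, host):
    """Equality search on version and host, matched as a key prefix."""
    prefix = (version, host)
    return [key[-1] for key in index if key[:2] == prefix]


# Sample events keyed by primary row key.
events = {
    "row1": {"version": "1.0.0", "host": "a.example", "timestamp": 100},
    "row2": {"version": "2.0.0", "host": "a.example", "timestamp": 120},
    "row3": {"version": "1.0.0", "host": "b.example", "timestamp": 90},
}

# One sorted list stands in for the index row of the host index CF.
host_index = sorted(
    composite_key(event, row_key, "host", ["version"])
    for row_key, event in events.items()
)

print(search(host_index, "1.0.0", "a.example"))  # ['row1']
```

No separate structure is kept for version in this sketch: its value travels inside every key of the host index, which mirrors the statement that fixed search properties do not get index CFs of their own.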

Because index creation depends on toolbox deployment, make sure that toolboxes are configured and deployed properly. In addition, toolboxes need to be redeployed after a Cassandra database cleanup.

Global activity index

To support new activity monitoring scenarios, a special global index groups all the correlated events of a single activity. For this to work, the relevant events must be published with a correlated activity ID: a correlated data property named activity_id has to be defined in the stream definition.
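
The grouping idea can be sketched in a few lines of Python. This is a conceptual illustration only: the function build_activity_index and the event layout are hypothetical, and the storage format of the real global index is not described here.

```python
from collections import defaultdict


def build_activity_index(events):
    """Group primary row keys by the activity_id carried in each event."""
    index = defaultdict(list)
    for row_key, event in events:
        activity_id = event.get("activity_id")
        if activity_id is None:
            continue  # events published without activity_id cannot be correlated
        index[activity_id].append(row_key)
    return dict(index)


# Sample events from two streams that share activity IDs.
events = [
    ("row1", {"activity_id": "act-1", "stream": "request"}),
    ("row2", {"activity_id": "act-2", "stream": "request"}),
    ("row3", {"activity_id": "act-1", "stream": "response"}),
    ("row4", {"stream": "request"}),  # no activity_id, so it is left out
]

print(build_activity_index(events))
# {'act-1': ['row1', 'row3'], 'act-2': ['row2']}
```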