WSO2 DAS Performance Analysis

This section summarizes the results of performance tests carried out with the minimum fully distributed DAS deployment setup with RDBMS (MySQL) and HBase event stores separately.

DAS Performance Test Round 1: RDBMS

Infrastructure used

c4.2xlarge Amazon EC2 instances as the DAS nodes
One DAS node as the publisher
A c3.2xlarge Amazon instance as the database node

Receiver node Data Persistence Performance

A reduction in the throughput is observed as shown below after 1200000 events in both DAS receiver nodes. This reduction is caused by limitations of MySQL. The receiver performance variation of the second node of the 2 node receiver cluster is as given below. As the initial buffer filling in receiver queues give very high receiver performance at the beginning of the event publishing, event rate after the first 1200000 events was considered for the following graph.

MySQL Breakpoint

After around 30 million events are published, a sudden drop can be observed in receiver performance that can be considered as the break point of MySQL event store. Another type of event store such as HBase event store should be used when the receiver performance has to be maintained unchanged.

Testing with large events

The following results were obtained by testing the two-node HA (High Availability) DAS cluster with 10 million events published via the Analyzing Wikepedia Data sample. Each event in this sample contains several kilobytes that represent large events.

In the above graph, TPS represents the total number of events published per second. This stabilizes at about 8500 events per second.

The above graph shows the amount of data that is published per second (referred to as the data rate). The data rate published is significantly reduced at the initial stages due to the flow control mechanisms of the receiver. It stabilizes at around 25 MB per second.

With MySQL RDBMS event store

DAS data persistence was measured by publishing to 2 load balanced receiver nodes with MySQL database.

Sample	Number of Events	Mean Event Rate
Smart Home sample	100000000	5741 events per second
Wikipedia sample	15901127	4438 events per second

Analyzer Performance

The following topics describe the analyzer performance of WSO2 DAS.

With MySQL RDBMS event store

Spark analyzing performance (time to complete execution) was measured using a 2 node DAS analyzer cluster with MySQL database.

Time taken for each type of Spark query is given below.

Data set	Event Count	Query Type	Time Taken (seconds)
Smart Home	10000000	`INSERT OVERWRITE TABLE cityUsage SELECT metro_area, avg(power_reading) AS avg_usage, min(power_reading) AS min_usage, max(power_reading) AS max_usage FROM smartHomeData GROUP BY metro_area`	26 sec
Smart Home	10000000	`INSERT OVERWRITE TABLE peakDeviceUsageRange SELECT house_id, (max(power_reading) - min(power_reading)) AS usage_range FROM smartHomeData WHERE is_peak = true AND metro_area = "Seattle" GROUP BY house_id`	22 sec
Smart Home	10000000	`INSERT OVERWRITE TABLE stateAvgUsage SELECT state, avg(power_reading) AS state_avg_usage FROM smartHomeData`	21 sec
Smart Home	10000000	`INSERT OVERWRITE TABLE stateUsageDifference SELECT a2.state, (a2.state_avg_usage-a1.overall_avg) AS avg_usage_difference FROM (select avg(state_avg_usage) as overall_avg from stateAvgUsage) as a1 join stateAvgUsage as a2`	`1 sec`
Wikipedia	10000000	`INSERT INTO TABLE wikiAvgArticleLength SELECT AVG(length) as avg_article_length FROM wiki`	48 min
Wikipedia	10000000	`INSERT INTO TABLE wikiContributorSummary SELECT contributor_username, COUNT(*) as page_count FROM wiki GROUP BY contributor_username`	1 hour 45 min
Wikipedia	10000000	`INSERT INTO TABLE wikiTotalArticleLength SELECT SUM(length) as total_article_chars FROM wiki`	44 min
Wikipedia	10000000	`INSERT INTO TABLE wikiTotalArticlePages SELECT COUNT(*) as total_pages FROM wiki`	1 hour 17 min

DAS Performance Test Round 2: HBase Cluster

This test involved setting up a 10-node HBase cluster with HDFS as the undrelying file system.

Infrastructure used

3 DAS nodes (variable roles: publisher, receiver, analyzer and indexer): c4.2xlarge
1 HBase master + Hadoop Namenode: c3.2xlarge
9 HBase Regionservers + Hadoop Datanodes: c3.2xlarge

Persisting 1 billion events from the Smart Home DAS Sample

This test was designed to test the data layer during sustained event publication. During testing, the TPS was around the 150K mark, and the HBase cluster's memstore flush (which suspends all writes) and minor compaction operations brought it down somewhat in bursts. Overall, a mean of 96K TPS was achieved, but a steady rate of around 100-150K TPS as is achievable, as opposed to the current no-flow-control situation.

The published data took around 950GB on the Hadoop filesystem, taking HDFS-level replication into account.

Events	1000000000
Time (s)	10391.768
Mean TPS	96230.01591

Persisting the entire Wikipedia corpus

This test involved publishing the entirety of the Wikipedia dataset, where a single event comprises of one Wikipedia article (16.8M articles in total). Events vary greatly in size, with the mean being ~3.5KB. Here, a mean throughput of around 9K TPS was observed.

Events	16753779
Time (s)	1862.901
Mean TPS	8993.381291

Running Spark queries on the 1 billion published events

Spark queries from the Smart Home DAS sample were executed against the published data, and the analyzer node count was kept 2 and 3 respectively for 2 separate tests. 6 processor cores and 12GB dedicated memory was given for the Spark JVMs during this test; and a throughput of over 1 million TPS on Spark for 2 analyzers and about 1.3 million TPS for 3 analyzers was observed.

DAS read operations from the HBase cluster also leverage HBase data locality, which would have made the read process more efficient compared to random reads.

Query	2 Analyzers		3 Analyzers
Query	Time(s)	Mean TPS	Time(s)	Mean TPS
`INSERT OVERWRITE TABLE cityUsage SELECT metro_area, avg(power_reading) AS avg_usage,min(power_reading) AS min_usage, max(power_reading) AS max_usage FROM smartHomeData GROUP BY metro_area`	958.80	1042968.20	741.15	1349250.90
`INSERT OVERWRITE TABLE ct SELECT count(*) FROM smartHomeData`	953.46	1048806.20	734.99	1360570.13
`INSERT OVERWRITE TABLE peakDeviceUsageRange SELECT house_id, (max(power_reading) - min(power_reading)) AS usage_range FROM smartHomeData WHERE is_peak = true AND metro_area = "Seattle" GROUP BY house_id`	975.06	1025581.77	751.27	1331073.47
`INSERT OVERWRITE TABLE stateAvgUsage SELECT state, avg(power_reading) AS state_avg_usage FROM smartHomeData GROUP BY state`	991.08	1009003.34	783.54	1276265.545

Running Spark queries on the Wikipedia corpus

Query	2 Analyzers		3 Analyzers
Query	Time(s)	Mean TPS	Time(s)	Mean TPS
`INSERT INTO TABLE wikiAvgArticleLength SELECT AVG(length) as avg_article_length FROM wiki`	222.70	75234.03	167.27	100164.18
`INSERT INTO TABLE wikiTotalArticleLength SELECT SUM(length) as total_article_chars FROM wiki`	221.74	75554.76	166.92	100373.80
`INSERT INTO TABLE wikiTotalArticlePages SELECT COUNT(*) as total_pages FROM wiki`	221.80	75536.05	166.14	100842.18
`INSERT INTO TABLE wikiContributorSummary SELECT contributor_username, COUNT(*) as page_count FROM wiki GROUP BY contributor_username`	236.11	70958.52	181.42	92350.26

Single Node Local Clustered Setup Statistics

A fully distributed setup was tested locally with multiple JVMs, and with the following hardware infrastructure specifications.

Machine type: Laptop

RAM: 8GB

Processor: Intel(R) Core(TM) i7-3520M

Storage: Samsung SSD 850

The setup is a 2 node analyzer cluster with a MySQL database as the event store. Analyzer statistics (time duration for each execution of the query) are given below.

Dataset	Event Count	Query Type	Time Taken
Wikipedia	15901127	`INSERT INTO TABLE wikiContributorSummary SELECT contributor_username, COUNT(*) as page_count FROM wiki GROUP BY contributor_username`	25 min
Wikipedia	15901127	`INSERT INTO TABLE wikiTotalArticleLength SELECT SUM(length) as total_article_chars FROM wiki`	25 min
Wikipedia	15901127	`INSERT INTO TABLE wikiTotalArticlePages SELECT COUNT(*) as total_pages FROM wiki`	25 min
Wikipedia	15901127	`INSERT INTO TABLE wikiAvgArticleLength SELECT AVG(length) as avg_article_length FROM wiki`	25 min

It was observed that the performance here is comparatively higher (taking into account that the setup consists of a single machine). This is mainly due to the DAS server and MySQL existing locally, and having no physical network I/O delays as a result. This allows the queries to be executed in an optimal manner.

Indexing Performance

In the following table, the shardIndexRecordBatchSize indicates the amount of index data (in bytes) to be processed at a time by a shard index worker.

Mode	Dataset	shardIndexRecordBatchSize	Replication Factor	Event Count	Time Taken (seconds)	Average TPS
Standalone	Wikipedia	10MB	NA	15901127	7975	1993.871724
Standalone	Wikipedia	20MB	NA	15901127	6765	2350.499187
Standalone	Smart Home	20MB	NA	20000000	1385	14440.43321
Minimum Fully Distributed	Wikipedia	20MB	1	15901127	6870	2314.574527
Minimum Fully Distributed	Wikipedia	20MB	0	15901127	7280	2184.220742

Events1000000000Time (s)10391.768Mean TPS96230.01591