Streaming Data Summarization (Incremental Aggregation)
Introduction
In the previous tutorial, you looked at Siddhi's real-time data summarization capabilities by calculating the total production in the past minute. Now let's consider a more advanced scenario where you need to calculate the total value for a specific time period.
In this scenario, the foreman of the Sweet Factory needs to know the total production of Sherbet Lemon during each hour in November 2017.
It is costly to do this by recalculating the total for each and every event. What you need is a time-based aggregation of the events in real time, with retrieval on demand. Siddhi supports this functionality through the Incremental Aggregation concept.
Incremental Aggregation calculates the aggregated values continuously and stores them. These values can be retrieved efficiently from the store on demand. Furthermore, Incremental Aggregators support out of order event arrival with in-memory buffers for higher accuracy.
This tutorial covers the following concepts:
Introduction to incremental aggregation
Retrieval from incremental aggregation
Before you begin:
In this scenario, information sent by the Sweet Bots is stored in a MySQL database named SweetFactoryDB. You need to download and install MySQL, and create this database before you carry out the tutorial steps.
Tutorial steps
Let's get started!
User Scenario 1: Defining incremental aggregation
In this scenario, let's define an incremental aggregation to calculate the total production in an incremental manner, and store the results.
Let's define an input stream as follows based on the data received from Sweet Bots. This is the same stream definition used in the previous tutorials to capture the name of the sweet category and the amount produced.
    define stream SweetProductionStream (name string, amount long);

Now, let's define an aggregation for the input data. Here, you can assume that the foreman would like to know the production per hour, day, month and year for each sweet.

    define aggregation SweetProductionAggregation
    from SweetProductionStream
    select name, sum(amount) as totalAmount
    group by name
    aggregate every hour ... year;

This calculates the total amount per hour, day, month and year by the arrival time of each event. Incremental Aggregation can be done for seconds, minutes, hours, days, months and years. However, in this sweet production scenario, aggregating by second or minute holds little informational value. Therefore, the sweet production is aggregated from hour to year.
Now comes the question of when the production occurs. In the above aggregation, the event arrival time is the time used for aggregation. The Sweet Bots send information directly from the factory floor to a server in the same network. Therefore, you can assume that the event arrival time is the production time.
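If the Sweet Bots ever sent events over a slower or less reliable link, arrival time would no longer be a safe stand-in for production time. As a sketch under that assumption, Siddhi's aggregate by clause lets an aggregation use a timestamp carried in the event itself. The stream TimedSweetProductionStream, its productionTime attribute, and the aggregation name below are hypothetical and are not part of this tutorial's application.

```siddhi
-- Hypothetical variant of the input stream that carries its own timestamp.
-- productionTime is assumed to hold the production time in epoch milliseconds.
define stream TimedSweetProductionStream (name string, amount long, productionTime long);

-- Aggregate by the timestamp in the event instead of the event arrival time.
define aggregation TimedSweetProductionAggregation
from TimedSweetProductionStream
select name, sum(amount) as totalAmount
group by name
aggregate by productionTime every hour ... year;
```

The rest of this tutorial keeps the simpler arrival-time aggregation, because the same-network assumption holds for the Sweet Factory.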
The completed Siddhi application looks as follows.

    define stream SweetProductionStream(name string, amount long);

    @store(......)
    define aggregation SweetProductionAggregation
    from SweetProductionStream
    select name, sum(amount) as totalAmount
    group by name
    aggregate every hour ... year;
User Scenario 2: Retrieval of data on demand
In the previous scenario, you defined the aggregation. Now let's see how to retrieve data from it. Siddhi supports this functionality through correlation of data. In this tutorial, you are retrieving data via aggregation joins. For more information on correlating data through joins, see Siddhi Query Guide - Joins.
First, let's define a stream to retrieve data. The foreman needs to see the hourly production of Sherbet Lemon for November 2017. Therefore, the criteria to retrieve values are as follows.
Name: Sherbet Lemon
Duration: November 2017
Interval: hours
Therefore, the input stream needs to be defined as follows:

    define stream GetTotalSweetProductionStream (name string, duration string, interval string);
A possible output of this retrieval is the timestamp (beginning of each hour), the name of the sweet and the total amount. Therefore, let's define an output stream with these values as follows.

    define stream HourlyProductionStream(AGG_TIMESTAMP long, name string, totalAmount long);
Now, let's use the aggregation, retrieval stream, and the output stream to define data correlation from an aggregation.
The aggregation for the selected period contains aggregations for all sweets. Therefore, let's join the aggregation and the retrieval stream based on the sweet name to filter aggregations for Sherbet Lemon. You need to retrieve data relevant only for November 2017. Therefore, let's add it in the retrieval stream as the duration.
Let's add interval for the retrieval to specify for which intervals you want the data to be retrieved. The completed statement including the output stream looks as follows:

    from GetTotalSweetProductionStream as b join SweetProductionAggregation as a
    on a.name == b.name
    within b.duration
    per b.interval
    select a.AGG_TIMESTAMP, a.name, a.totalAmount
    insert into HourlyProductionStream;
In the above definition, a.AGG_TIMESTAMP is the internal data of the aggregation defining the start of the time interval. For instance, for the November 2017 duration, there are 24*30 = 720 hourly production aggregations. The first output event has a timestamp of 1st November 2017 00:00:00.
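The duration and interval do not have to arrive through the retrieval stream. As a minimal sketch, the same November 2017 window could be written directly in the query as a start and an end value. The timestamp string format and the +00:00 offset used here are assumptions based on the Siddhi Query Guide, so verify them against the Siddhi version you are running.

```siddhi
-- Sketch: fixed duration and interval in the query instead of b.duration and b.interval.
-- The two within values are assumed to mark the start and the end of the period.
from GetTotalSweetProductionStream as b join SweetProductionAggregation as a
on a.name == b.name
within "2017-11-01 00:00:00 +00:00", "2017-12-01 00:00:00 +00:00"
per "hours"
select a.AGG_TIMESTAMP, a.name, a.totalAmount
insert into HourlyProductionStream;
```

Passing the duration and interval in the retrieval event, as this tutorial does, keeps the query reusable for other months and granularities without redeploying the application.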
The completed Siddhi application with the possible sink and source configurations is as follows.

    @App:name('TotalProductionHistoryApp')

    @source(type = 'http', @map(type = 'json'))
    define stream SweetProductionStream(name string, amount long);

    @source(type = 'http', @map(type = 'json'))
    define stream GetTotalSweetProductionStream (name string, duration string, interval string);

    @sink(type='log', prefix='Hourly Production Stream')
    define stream HourlyProductionStream(AGG_TIMESTAMP long, name string, totalAmount long);

    @index('name')
    @store(type='rdbms', jdbc.url="jdbc:mysql://localhost:3306/SweetFactoryDB", username="root", password="root" , jdbc.driver.name="com.mysql.jdbc.Driver")
    define aggregation SweetProductionAggregation
    from SweetProductionStream
    select name, sum(amount) as totalAmount
    group by name
    aggregate every hour ... year;

    from GetTotalSweetProductionStream as b join SweetProductionAggregation as a
    on a.name == b.name
    within b.duration
    per b.interval
    select a.AGG_TIMESTAMP, a.name, a.totalAmount
    insert into HourlyProductionStream;