Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents
maxLevel2
minLevel2

Introduction

In the previous tutorial, you looked at the Siddhi real time data summarization capabilities by calculating the total production in the past minute.Now let's consider a more advanced scenario where you need to calculate the total value for a specific time period.

In this scenario, the foreman of the Sweet Factory needs to know the total production of Sherbet Lemon during each hour in November 2017.

It is costly to do this by recalculating the total for each and every event. What you need is a time based aggregation of the events in real time and retrieval on demand. Siddhi supports this functionality through the Incremental Aggregation concept.

Incremental Aggregation calculates the aggregated values continuously and stores them. These values can be retrieved efficiently from the store on demand. Furthermore, Incremental Aggregators support out of order event arrival with in-memory buffers for higher accuracy.

Info
iconfalse
titleThis tutorial covers the following concepts:
  • Introduction to incremental aggregation
  • Retrieval from incremental aggregation
Tip
titleBefore you begin:

In this scenario, information sent by the Sweet Bots are stored in a MySQL table named SweetFactoryDB. You need to download and install MySQL, and create this table before you carry out the tutorial steps.

Expand
titleClick here for instructions to configure the database table.
  1. Download and install MySQL Server.

  2. Download the MySQL JDBC driver.

  3. Unzip the downloaded MySQL driver zipped archive, and copy the MySQL JDBC driver JAR (mysql-connector-java-x.x.xx-bin.jar) into the <SP_HOME>/lib directory.

  4. Enter the following command in a terminal/command window, where username is the username you want to use to access the databases.
    mysql -u username -p 
  5. When prompted, specify the password you are using to access the databases with the username you specified.
  6. Add the following configuration under the Data Sources Configuration section of the <SP_HOME>/conf/editor/deployment.yaml file.

    Info

    You need to change the values for the username and password parameters to the username and password that you are using to access the MySQL database.

    Code Block
    languagexml
      - name: SweetFactoryDB
        description: Datasource used for Sweet Factory Supply Records
        jndiConfig:
          name: jdbc/test
          useJndiReference: true
        definition:
          type: RDBMS
          configuration:
            jdbcUrl: 'jdbc:mysql://localhost:3306/test'
            username: root
            password: root
            driverClassName: com.mysql.jdbc.Driver
            maxPoolSize: 50
            idleTimeout: 60000
            connectionTestQuery: SELECT 1
            validationTimeout: 30000
            isAutoCommit: false
  7. To create a database table named SweetFactoryDB, issue the following commands from the terminal.
    mysql> create database SweetFactoryDB;
    mysql> use SweetFactoryDB;
    mysql> source <SP_HOME>/wso2/editor/dbscripts/metrics/mysql.sql;
    mysql> grant all on SweetFactoryDB.* TO username@localhost identified by "password";
     

Tutorial steps

Lets get started!

User Scenario 1: Defining incremental aggregation

In this scenario, lets define an incremental aggregation to calculate the total production in an incremental manner, and store the results.

  1. Let's define an input stream as follows based on the data received from Sweet Bots. This is the same stream definition used in the previous tutorials to capture the name of the sweet category and the amount produced.

    Code Block
    languagesql
    define stream SweetProductionStream (name string, amount long);
  2. Now, let's define an aggregation for the input data. Here, you can assume that the foreman would like to know the production per hour, month and year for each sweet.

    Code Block
    languagesql
    define aggregation SweetProductionAggregation 
    from SweetProductionStream
    select name, sum(amount) as totalAmount
    group by name
    aggregate every hour...year

    This calculates the total amount per hour, day, month and year by the arrival ime of each event. Incremental Aggregation can also be done for seconds, minutes, hours, days, months and years. However, in this sweet production scenario, aggregating by second holds no information value. Therefore, the sweet production is aggregated from hour to year.

  3. Now, comes the question of when the production occurs. In the above aggregation, event arrival time is the time used in aggregation. The Sweet Bots send information directly from the factory floor to the server in the same network. Therefore, we can assume that the event arrival time is the production time.

    Info

    If you want you can be more accurate by appending the data sent by the SweetBots to include time as shown below.

    1. First define the input stream to include a timestamp:

      Panel

      define stream SweetProductionStream (name string, amount long, timestamp long);

    2. Then use the timestamp for aggregation as shown below.

      Panel

      aggregate by SweetProductionStream.timestamp every hour...year

    For this tutorial, let's continue to use the format mentioned first instead of the format in these substeps because the time differences are very slight in the hourly calculations.

    Info
    Code Block
    languagesql
    from SweetProductionStream
    select name, sum(amount) as totalAmount
    group by name

    This part of the aggregation specifies the following:

    • From where the information to be processed is taken (i.e., SweetProductionStream input stream)
    • The value you are aggregating. In this scenario, sum(amount) as totalAmount aggregates only the summation of values. The aggregation can also be minimum, maximum or average.
    • The group by clause is optional and can be ignored if all production must be aggregated.


    The completed Siddhi application looks as follows.

    Info
    • Here, you have defined a stream to get information and aggregated a value for minutes-year.
    • Aggregation can be stored in any type of store supported in Siddhi. For more information about the supported stores, see /wiki/spaces/TESB/pages/32604927.
    Code Block
    languagesql
    define stream SweetProductionStream(name string, amount long);
    
    @store(......)
    define aggregation SweetProductionAggregation 
    from SweetProductionStream
    select name, sum(amount) as totalAmount
    group by name
    aggregate every min...year

User Scenario 2: Retrieval of data on demand

In the previous scenario, you defined the aggregation. Now let's see how to retrieve from it. Siddhi supports this functionality through correlation of data. In this tutorial, you are retrieving data via aggregation joins. For more information on correlating data through joins see Siddhi Query Guide - Joins.

  1. First, let's define a stream to retrieve data. The foreman needs to see the hourly production of Sherbet Lemon for November 2016. Therefore, the criteria to retrieve values are as follows.

    SweetSherbet Lemon
    IntervalHourly
    DurationNovember 2017

    Therefore, the input stream needs to be defined as follows:

    Panel
    define stream GetTotalSweetProductionStream (name string, start_duration string, end_duration string, interval string);
  2. A possible output of this retrieval is the timestamp (beginning of each hour), the name of the sweet and the total amount. Therefore, let's define an output stream with these values as follows.

    Panel
    define stream HourlyProductionStream(AGG_TIMESTAMP long, name string, totalAmount long);
    Info

    In the above definition, AGG_TIMESTAMP is the internal reference of the aggregation defining the start of the time interval.

  3. Now, let's use the aggregation, retrieval stream, and the output stream to define data correlation from an aggregation.
    Aggregation for the selected period contains aggregation for all sweets. Therefore, let's join the aggregation, and the retrieval stream based on the sweet name to filter aggregations for Sherbet Lemon.

    Panel

    from GetTotalSweetProductionStream as b join SweetProductionAggregation as a
    on a.name == b.name

  4. You need to retrieve data relevant only for November 2017. Therefore, let's add it in the retrieval stream as the duration. 

    Excerpt
    hiddentrue

    In this scenario, the value for the duration attribute of the GetSweetProductionStream stream must be 2017-11-** **:**:**.

    Panel

    from GetTotalSweetProductionStream as b join SweetProductionAggregation as a
    on a.name == b.name
    within b.duration

    Info

    In the output event, the duration for which the data is retrieved must be represented in a specific format. For example, November 2017 can be represented as 2017-11-** **:**:**. The supported date formats are <yyyy>-<MM>-<dd> <HH>:<mm>:<ss> (if time is in GMT) and <yyyy>-<MM>-<dd> <HH>:<mm>:<ss> <Z> (if the time is not in GMT), here the ISO 8601 UTC offset must be provided for <Z> (e.g., +05:30-11:00).

    If the user needs a specific time duration, the query must be changed as follows. Both durations specified must adhere to the data formats required by Siddhi.

    Panel

    from GetSweetProductionStream as b join SweetProductionAggregation as a
    on a.name == b.name
    within b.start_duration, b.end_duration

    Excerpt
    hiddentrue

    In this scenario, the values for the start_duration and the end_duration attributes of the GetSweetOProductionStream are 2017-11-15 00:00:00 +05:30 and 2017-11-16 00:00:00 +05:30 respectively.

  5. Let's add interval for the retrieval to specify for which intervals you want the data to be retrieved.

    Excerpt
    hiddentrue

    In this scenario, the "interval" attribute value of the "GetSweetProductionStream" stream should be "hourly".

    Panel

    from GetTotalSweetProductionStream as b join SweetProductionAggregation as a
    on a.name == b.name
    within b.duration
    per b.interval

    Info

    Interval can be in the format of SECONDS, MINUTES, HOURS, DAYS, MONTHS or YEARS ( these values are not case sensitive).

    The completed statement including the output stream looks as follows:

    Panel

    from GetTotalSweetProductionStream as b join SweetProductionAggregation as a
    on a.name == b.name
    within b.duration
    per b.interval
    select a.AGG_TIMESTAMP, a.name, a.totalAmount
    insert into HourlyProductionStream;

    In the above definition, a.AGG_TIMESTAMP is the internal data of the aggregation defining the start of the time interval. For instance, in the  November 2017 duration, there is a 24*30 hourly production aggregation. The first output event has the timestamp of the date and time of 1st November 2017 00:00:00.

    The completed Siddhi application with the possible sink and source configurations is as follows.

    Code Block
    languagesql
    @App:name('TotalProductionHistoryApp')
    @source(type = 'http', @map(type = 'json'))
    define stream SweetProductionStream(name string, amount long);
    @source(type = 'http', @map(type = 'json'))
    define stream GetTotalSweetProductionStream (name string, duration string, interval string);
    
    
    @sink(type='log', prefix='Hourly Production Stream')
    define stream HourlyProductionStream(AGG_TIMESTAMP long, name string, totalAmount long);
    
    @index('supplier')
    @store(type='rdbms', jdbc.url="jdbc:mysql://localhost:3306/SweetFactoryDB", username="root", password="root" , jdbc.driver.name="com.mysql.jdbc.Driver")
    define aggregation SweetProductionAggregation 
    from SweetProductionStream
    select name, sum(amount) as totalAmount
    group by name
    aggregate every hour ... year;
    from GetTotalSweetProductionStream as b join SweetProductionAggregation as a
      on a.name == b.name 
      within b.duration
      per b.interval 
    select a.AGG_TIMESTAMP, a.name, a.totalAmount 
    insert into HourlyProductionStream;
    Excerpt
    hiddentrue

    Generating the output

    In this section, let's simulate events to the TotalProductionHistoryApp Siddhi application to see how information is stored and retrieved with incremental processing.

    Info

    In this scenario, note that you need to simulate events to two streams as follows:

    • SweetProductionStream: Events aare simulated to this input stream to imitate the events sent by Sweet Bots indicating the name of the sweet category and the amount produced. These events are sent in order to be stored in the MySQL database that you have already configured.
    • GetTotalSweetProductionStream: This is the event stream that captures events generated by the Factory Foreman when he requests for information about the hourly production for a specific period. These events are used as input based on which information is retrieved from the database you configured.


    1. First, let's simulate an event for the 
    2. The format of an event arriving at the GetSweetProductionStream input stream must be as follows.

      Panel

      name: "Sherbet lemon",
      duration : "2017-11-** **:**:**",
      interval: "hours"


      Let's click the following icon in the Stream Processor Studio to open the Event Simulator.

      In the Single Simulation tab, you need to select TotalProductionHistoryApp as the Siddhi app name, and GetTotalSweetProductionStream as the stream name. Then enter details as follows to send an event.

      namedurationinterval
      Sherbet Lemonhourly2017-11-15 **:**:**