...

This sample demonstrates an analysis of data collected on Wikipedia usage.

Prerequisites

Follow the steps below to set up the prerequisites before you start.

  1. Set up the general prerequisites required for WSO2 DAS

...

  2. Download a Wikipedia data dump, and extract the compressed XML articles dump file to a preferred location on your machine (a sample download command is sketched below).
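
    As a rough sketch of this step, assuming the English Wikipedia pages-articles dump from dumps.wikimedia.org (the dump date and URL are placeholders; pick a dump that is currently available), the download and extraction could look like this:

    Code Block
    languagebash
    # Download a bzip2-compressed pages-articles dump (placeholder URL and dump date;
    # browse https://dumps.wikimedia.org/ for an available dump)
    wget https://dumps.wikimedia.org/enwiki/20150805/enwiki-20150805-pages-articles.xml.bz2
    # Extract the XML articles file to a preferred location on your machine
    bunzip2 enwiki-20150805-pages-articles.xml.bz2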

Building the sample

Follow the steps below to build the sample.

...

  1. Log in to the DAS management console using the following URL: https://<DAS_HOST>:<DAS_PORT>/carbon/ (a typical local default is noted after this list).
  2. Click Main, and then click Add in the Carbon Applications menu.
  3. Click Choose File, and upload the <DAS_HOME>/samples/capps/Wikipedia.car file as shown below.
  4. Click Main, then click Carbon Applications, and then click List view to see the uploaded Carbon application as shown below.
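
Info

If you are running DAS on your local machine with the default configuration, the management console URL is typically https://localhost:9443/carbon/ (this assumes the default HTTPS port 9443 with no port offset; adjust the host and port if your setup differs).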

...

Executing the sample

Follow the steps below to execute the sample.

Tuning the server configurations

The Wikipedia dataset is transferred as a single article per event. Therefore, each event is relatively large (~300KB). Hence, you need to tune the server configurations as follows: specifically, the queue sizes in the <DAS_HOME>/repository/conf/data-bridge/data-bridge-config.xml file for data receiving and in the <DAS_HOME>/repository/conf/analytics/analytics-eventsink-config.xml file for data persistence, as well as the Thrift publisher-related configurations in the <DAS_HOME>/repository/conf/data-bridge/data-agent-config.xml file.

  1. Edit the values of the following properties in the <DAS_HOME>/repository/conf/data-bridge/data-bridge-config.xml file as shown below.

    Code Block
    languagexml
    <dataBridgeConfiguration>
    	<maxEventBufferCapacity>5000</maxEventBufferCapacity>
    	<eventBufferSize>2000</eventBufferSize>
    </dataBridgeConfiguration>
  2. Edit the values of the following properties in the <DAS_HOME>/repository/conf/data-bridge/data-agent-config.xml file as shown below. 

    Code Block
    languagexml
    <DataAgentsConfiguration>
        <Agent>
            <Name>Thrift</Name>
            <QueueSize>65536</QueueSize>
            <BatchSize>500</BatchSize>
        </Agent>
    </DataAgentsConfiguration>
  3. Edit the values of the following properties in the <DAS_HOME>/repository/conf/analytics/analytics-eventsink-config.xml file as shown below.
     

    Code Block
    languagexml
    <AnalyticsEventSinkConfiguration>
    	<QueueSize>65536</QueueSize>
    	<maxQueueCapacity>1000</maxQueueCapacity>
    	<maxBatchSize>1000</maxBatchSize>
    </AnalyticsEventSinkConfiguration>
Running the data publisher

Navigate to the <DAS_HOME>/samples/wikipedia/ directory in a new CLI tab, and execute the following command to run the data publisher: ant -Dpath=/home/laf/Downloads/enwiki-20150805-pages-articles.xml -Dcount=1000

Info

Set the values of the -Dpath and -Dcount Java system properties in the above command to point to the location where you stored the Wikipedia article XML dump file that you downloaded in Analysing Wikipedia Data, and to the number of articles you need to publish as events out of the total dataset, respectively. (For example, set -Dcount=-1 to publish all articles, as sketched below.)
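
A sketch of this command for publishing the full dataset could be as follows (the dump path below is only a placeholder; use the location where you extracted the file):

Code Block
languagebash
# Publish all articles in the dump as events (-Dcount=-1 removes the limit).
# The path is a placeholder for your extracted dump location.
ant -Dpath=/path/to/enwiki-20150805-pages-articles.xml -Dcount=-1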

This sends each Wikipedia article as an event to the event stream that is deployed through the above C-App.