Introduction
This sample demonstrates an analysis of data collected on Wikipedia usage.
Prerequisites
Follow the steps below to set up the prerequisites before you start.
- Set up the general prerequisites required for WSO2 DAS.
- Download a Wikipedia data dump, and extract the compressed XML articles dump file to a preferred location on your machine, as shown in the example below.
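For example, the dump used later in this sample can be fetched and extracted as follows. This is a sketch: the URL follows the usual dumps.wikimedia.org layout, and the dump date and file name should be substituted with whichever dump you choose to download.

wget https://dumps.wikimedia.org/enwiki/20150805/enwiki-20150805-pages-articles.xml.bz2
bunzip2 enwiki-20150805-pages-articles.xml.bz2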
Building the sample
Follow the steps below to build the sample.
Uploading the Carbon Application
Follow the steps below to upload the Carbon Application (C-App) file of this sample. For more information, see Carbon Application Deployment for DAS.
- Log in to the DAS management console using the following URL: https://<DAS_HOST>:<DAS_PORT>/carbon/
- Click Main, and then click Add in the Carbon Applications menu.
- Click Choose File, and upload the <DAS_HOME>/samples/capps/Wikipedia.car file.
- Click Main, then click Carbon Applications, and then click List view to see the uploaded Carbon application.
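Alternatively, if you prefer not to use the management console, Carbon applications can usually be hot-deployed by copying the .car file into the server's Carbon applications directory. This is a sketch assuming the standard Carbon deployment layout:

cp <DAS_HOME>/samples/capps/Wikipedia.car <DAS_HOME>/repository/deployment/server/carbonapps/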
Executing the sample
Follow the steps below to execute the sample.
Tuning the server configurations
The Wikipedia dataset is transferred as one article per event, so each event is relatively large (~300KB). Hence, you need to tune the server configurations as follows: specifically, the queue sizes for data receiving in the <DAS_HOME>/repository/conf/data-bridge/data-bridge-config.xml file, the queue sizes for data persistence in the <DAS_HOME>/repository/conf/analytics/analytics-eventsink-config.xml file, and the Thrift publisher related configurations in the <DAS_HOME>/repository/conf/data-bridge/data-agent-config.xml file.
Edit the values of the following properties in the <DAS_HOME>/repository/conf/data-bridge/data-bridge-config.xml file as shown below.

<dataBridgeConfiguration>
    <maxEventBufferCapacity>5000</maxEventBufferCapacity>
    <eventBufferSize>2000</eventBufferSize>
</dataBridgeConfiguration>
Edit the values of the following properties in the <DAS_HOME>/repository/conf/data-bridge/data-agent-config.xml file as shown below.

<DataAgentsConfiguration>
    <Agent>
        <Name>Thrift</Name>
        <QueueSize>65536</QueueSize>
        <BatchSize>500</BatchSize>
    </Agent>
</DataAgentsConfiguration>
Edit the values of the following properties in the <DAS_HOME>/repository/conf/analytics/analytics-eventsink-config.xml file as shown below.

<AnalyticsEventSinkConfiguration>
    <QueueSize>65536</QueueSize>
    <maxQueueCapacity>1000</maxQueueCapacity>
    <maxBatchSize>1000</maxBatchSize>
</AnalyticsEventSinkConfiguration>
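These values are read at startup, so restart the server if it is already running. A minimal sketch, assuming the default Carbon startup script on Linux:

cd <DAS_HOME>/bin
./wso2server.sh restart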
Running the data publisher
Navigate to the <DAS_HOME>/samples/wikipedia/ directory in a new CLI tab, and execute the following command to run the data publisher:

ant -Dpath=/home/laf/Downloads/enwiki-20150805-pages-articles.xml -Dcount=1000
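Here, -Dpath points to the extracted Wikipedia XML dump, and -Dcount appears to cap the number of articles that are published; adjust both for your environment. For example, to publish 50,000 articles from a dump extracted to a different location (illustrative path):

ant -Dpath=/data/dumps/enwiki-20150805-pages-articles.xml -Dcount=50000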