Analyzing Wikipedia Data
Introduction
This sample demonstrates how to analyze data collected on the usage of Wikipedia.
Prerequisites
Follow the steps below to set up the prerequisites before you start.
- Set up the general prerequisites required for WSO2 DAS.
- Download a Wikipedia data dump, and extract the compressed XML articles dump file to a preferred location on your machine.
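The extraction step can also be scripted. The following is a minimal sketch, assuming the dump was downloaded as a `.bz2` archive (the format Wikipedia uses for its article dumps). The function name `extract_dump` and the paths in the usage note are hypothetical and not part of the sample.

```python
import bz2
import shutil
from pathlib import Path


def extract_dump(archive: Path, target_dir: Path) -> Path:
    """Decompress a .bz2 dump into target_dir and return the extracted file path."""
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the archive name ends in ".bz2"; dropping that suffix
    # gives the name of the XML file.
    xml_path = target_dir / archive.with_suffix("").name
    # Copy in 1 MB chunks so the multi-gigabyte dump is never held in memory.
    with bz2.open(archive, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)
    return xml_path
```

For example, `extract_dump(Path("enwiki-20150805-pages-articles.xml.bz2"), Path("dumps"))` would write `dumps/enwiki-20150805-pages-articles.xml`, which can then be passed to the data publisher.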
Building the sample
Follow the steps below to build the sample.
Uploading the Carbon Application
Follow the steps below to upload the Carbon Application (c-App) file of this sample. For more information, see Carbon Application Deployment for DAS.
- Log in to the DAS management console using the following URL: https://<DAS_HOST>:<DAS_PORT>/carbon/
- Click Main, and then click Add in the Carbon Applications menu.
- Click Choose File, and upload the <DAS_HOME>/capps/Wikipedia.car file as shown below.
- Click Main, then click Carbon Applications, and then click List view, to see the uploaded Carbon application as shown below.
Executing the sample
Follow the steps below to execute the sample.
Tuning the server configurations
The Wikipedia dataset is transferred with one article per event. Each event is therefore relatively large (~300KB), so you need to tune the server configurations as follows.
Edit the values of the following properties in the <DAS_HOME>/repository/conf/data-bridge/data-bridge-config.xml file as shown below, to tune the queue sizes available for data receiving.
<dataBridgeConfiguration>
    <maxEventBufferCapacity>5000</maxEventBufferCapacity>
    <eventBufferSize>2000</eventBufferSize>
</dataBridgeConfiguration>
Edit the values of the following properties in the <DAS_HOME>/repository/conf/data-bridge/data-agent-config.xml file as shown below, to tune the queue sizes available for data persistence.
<DataAgentsConfiguration>
    <Agent>
        <Name>Thrift</Name>
        <QueueSize>65536</QueueSize>
        <BatchSize>500</BatchSize>
    </Agent>
</DataAgentsConfiguration>
Edit the values of the following properties in the <DAS_HOME>/repository/conf/analytics/analytics-eventsink-config.xml file as shown below, to change the Thrift publisher related configurations.
<AnalyticsEventSinkConfiguration>
    <QueueSize>65536</QueueSize>
    <maxQueueCapacity>1000</maxQueueCapacity>
    <maxBatchSize>1000</maxBatchSize>
</AnalyticsEventSinkConfiguration>
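After editing the three files, it can help to confirm that the values were saved as intended. The following is a minimal sketch using only the Python standard library. The helper names `local_name` and `read_settings` are hypothetical, and the sketch assumes each property appears once as a simple element; it matches on local names so that it still works if a file declares an XML namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_settings(xml_text: str, names: set[str]) -> dict[str, str]:
    """Return the text of the first element found for each requested name."""
    found: dict[str, str] = {}
    for elem in ET.fromstring(xml_text.strip()).iter():
        name = local_name(elem.tag)
        if name in names and name not in found:
            found[name] = (elem.text or "").strip()
    return found


# Fragment taken from the data-bridge-config.xml step in this section.
sample = """
<dataBridgeConfiguration>
    <maxEventBufferCapacity>5000</maxEventBufferCapacity>
    <eventBufferSize>2000</eventBufferSize>
</dataBridgeConfiguration>
"""

settings = read_settings(sample, {"maxEventBufferCapacity", "eventBufferSize"})
print(settings)
```

To check a real file, read its contents (for example with `pathlib.Path(...).read_text()`) and pass them to `read_settings` together with the property names listed for that file.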
Running the data publisher
Navigate to the <DAS_HOME>/samples/wikipedia/ directory in a new CLI tab, and execute the following command to run the data publisher. Replace the -Dpath value with the location of your extracted dump file.
ant -Dpath=/home/laf/Downloads/enwiki-20150805-pages-articles.xml -Dcount=1000
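Before publishing, you may want to confirm that the extracted dump is readable. The following is a minimal sketch that streams the file and lists the titles of the first few articles. It assumes the standard MediaWiki export layout, where each article is a `<page>` element with a `<title>` child, and it assumes `-Dcount` limits the number of articles published, which this page does not state. The function name `first_titles` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def first_titles(source: BinaryIO, count: int) -> Iterator[str]:
    """Yield the titles of the first `count` <page> elements in a dump."""
    if count <= 0:
        return
    seen = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Dump files declare a MediaWiki namespace, so compare local names only.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        title = next(
            (child.text or "" for child in elem
             if child.tag.rsplit("}", 1)[-1] == "title"),
            "",
        )
        yield title
        elem.clear()  # release the article text once the page has been handled
        seen += 1
        if seen >= count:
            break


# Tiny in-memory stand-in for a dump file.
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title><revision><text>x</text></revision></page>"
    b"<page><title>B</title></page>"
    b"<page><title>C</title></page>"
    b"</mediawiki>"
)
print(list(first_titles(sample, 2)))
```

With a real dump, open the file in binary mode (`open(path, "rb")`) and pass the file object instead of the in-memory sample. Because the parser streams the input, the whole dump is never loaded at once.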
Using the Analytics Dashboard
Follow the steps below to use the Analytics Dashboard to view the output.
- Log in to the DAS management console if you are not already logged in.
- Click Main, and then click Analytics Dashboard in the Dashboard menu.
- Log in to the Analytics Dashboard using admin/admin credentials. Click the following CREATE DASHBOARD button in the top navigational bar to create a new dashboard.
- Enter a Title and a Description for the new dashboard as shown below, and click Next.
Select a layout to place its components as shown below.
- Click the Select button of the Single Column layout. The layout editor opens with the chosen layout blocks marked using dashed lines.
- Click the following CREATE GADGET button in the top menu bar.
- Select the input data source as shown below, and click Next.
- Select the chart type, enter the preferred x-axis, y-axis, and additional parameters based on the selected chart type as shown below, and click Preview.
- Click Add to Gadget Store.
- Click the corresponding Design button of the Wikepedia_Samples_Dashboard to add the Contributor Summary gadget as shown below.
- Click the following gadget browser icon in the side menu bar.
The new gadget is listed in the gadget browser as shown below.
Click on the new gadget, drag it out, and place it in the preferred grid of the selected layout in the dashboard editor as shown below.
- Click the following PREVIEW button in the top menu bar.
A preview of the Wikepedia_Samples_Dashboard appears with the Contributor Summary gadget added to it as shown below.