FAQ

I see the exception java.io.IOException: Cannot run program "null/bin/java" when running BAM. What is going wrong?

This happens when the JAVA_HOME environment variable is not set. Set it explicitly in your environment so that it points to the installed JRE location.
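
The "null" in the path is what Java produces when a missing value is concatenated into a string. The following is a minimal sketch of that effect, assuming the launcher builds the path roughly as JAVA_HOME + "/bin/java" (an assumption, since the BAM startup code is not shown here). The class and method names are hypothetical.

```java
import java.io.File;

// Hypothetical helper, not part of BAM: shows how an unset JAVA_HOME
// turns into the literal path "null/bin/java".
class JavaHomeCheck {

    // Mimics a launcher that concatenates JAVA_HOME with "/bin/java" (assumed behaviour).
    static String javaExecutable(String javaHome) {
        return javaHome + "/bin/java";
    }

    // True only if the value is set and points to an existing directory.
    static boolean isUsable(String javaHome) {
        return javaHome != null && !javaHome.isEmpty() && new File(javaHome).isDirectory();
    }

    public static void main(String[] args) {
        String javaHome = System.getenv("JAVA_HOME");
        System.out.println("Resolved executable: " + javaExecutable(javaHome));
        System.out.println("JAVA_HOME usable: " + isUsable(javaHome));
    }
}
```

Running this in the same shell that starts BAM shows whether the variable is visible to the Java process.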

Do I need a Hadoop cluster to run WSO2 BAM?

No, you do not need one. If you have not set up a Hadoop cluster, BAM performs analyzer operations using Hadoop's "local mode". This means that the operations run, and the results are computed, on the same machine as BAM.

When do I need a Hadoop cluster?

If you have a large amount of data stored in your stream definitions, your analytics will take a long time to run on a single machine. If you set up a Hadoop cluster and configure BAM to use it, BAM delegates the computations to the cluster. The Hadoop cluster can distribute the processing of data among its nodes and compute the results in parallel.

How can I scale up BAM?

If you want to scale up analyzers for a large data volume, you can use a Hadoop cluster. If you want to scale up BAM to receive a large amount of data, you can set up multiple receiver nodes fronted by a load balancer. If you want to scale up the dashboards (presentation layer), you can set up multiple dashboard nodes fronted by a load balancer.

I only see one BAM distribution. How do I create a BAM receiver node, analyzer node, or dashboard node?

The BAM distribution contains all the features you need. You prepare each type of node by uninstalling the features it does not need, using the feature installation/uninstallation capability that comes with WSO2 BAM. For example, to create a receiver node, uninstall the analytics feature and the dashboard feature, and you will have a receiver node.

What is the language used to define analytics in BAM?

The Hive query language is used to define the analytics in BAM.

Can I send custom data/events to BAM?

Yes, you can. There is an SDK provided for this. A detailed article on how to send custom data to BAM using this SDK is available at http://wso2.org/library/articles/2012/07/creating-custom-agents-publish-events-bamcep/.

You can also send data to BAM using the REST API. Please have a look at the REST API sample to see how this is used.

How do I define a custom KPI in BAM?

The model for this is to first publish the custom data. After you send custom data to BAM, you need to define your analytics to match your KPI. To visualize the result of this KPI, you can write a gadget using HTML and JS or use the Gadget generation tool.

Please refer to the custom KPI definition and monitoring sample for all artifacts related to defining a custom KPI.

How can I set up a Hadoop cluster?

A great resource for this exists at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/.

I get an exception stating - ERROR {org.apache.hadoop.hive.ql.exec.ExecDriver} - Job Submission failed with exception 'java.io.IOException (Cannot run program "chmod": CreateProcess error=2, The system cannot find the file specified)' java.io.IOException: Cannot run program "chmod": CreateProcess error=2, The system cannot find the file specified. What is going wrong?

This happens when you try to run BAM on Windows without installing Cygwin. The BAM analyzer engine depends on Apache Hadoop, and Hadoop requires Cygwin in order to run on Windows. If you are working on Windows, make sure that you have installed the basic, net (OpenSSH, tcp_wrapper packages) and security-related Cygwin packages before using BAM. After installing Cygwin, update your PATH variable by appending ";C:\cygwin\bin". This is required because the default installation of Cygwin might not do this.
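
To confirm that the PATH change took effect, you can check whether a "chmod" executable is reachable from the directories listed in PATH. The following is a minimal sketch under that assumption. The class and method names are hypothetical and are not part of BAM or Hadoop.

```java
import java.io.File;

// Hypothetical helper, not part of BAM or Hadoop: looks for an executable on a PATH value.
class PathCheck {

    // Returns true if a file called name (or name + ".exe") exists in any directory of pathValue.
    static boolean isOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return false;
        }
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            if (new File(dir, name).isFile() || new File(dir, name + ".exe").isFile()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("chmod on PATH: " + isOnPath("chmod", System.getenv("PATH")));
    }
}
```

If this prints false on Windows, the Cygwin bin directory is most likely still missing from the PATH seen by the Java process.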

Can BAM do real time analytics?

BAM is built to do batch based analytics on large data volumes. However, you can enable real time analytics by installing the WSO2 CEP feature on top of the BAM server. By design, BAM and CEP use the same components to send and receive data, which makes them compatible when processing data. The WSO2 CEP server is a powerful real time analytics engine capable of defining queries based on temporal windows, pattern matching and much more.

Why does BAM use Cassandra?

BAM is intended to store large amounts of data, and Cassandra is a proven NoSQL store that can hold terabytes and even petabytes of data with ease. It has no single point of failure and is very easy to scale. These reasons led us to choose Cassandra as the primary data store. However, this does not mean that another data store cannot be supported. By extending the necessary interfaces, a different data store can be plugged in as well.

I see that the BAM samples write their results to an RDBMS. Why do they do this?

BAM does this for two reasons. The first is to promote a polyglot data architecture, i.e. BAM initially stores data in Cassandra, but that does not mean everything has to be analyzed and stored back to Cassandra. Results can be stored in an RDBMS, Cassandra or any other data store. The second is that many 3rd party reporting tools, such as Jasper and Pentaho, already have extensive support for RDBMSs. With this sort of support for a polyglot data architecture, any reporting engine or dashboard can be plugged into BAM without any extra customization effort.

I get a read timeout in the analytics UI after executing a query. What is going wrong?

This happens when there is a large amount of data to analyze. The UI times out after 10 minutes, so you see this error if processing the data takes longer than that.

How can I save summarized data into a different database?

Put the JDBC driver in BAM_HOME/repository/components/lib/ and change the values of the following properties accordingly.
'mapred.jdbc.driver.class' // JDBC driver class
'mapred.jdbc.url' // Connection URL
'mapred.jdbc.username' // Username
'mapred.jdbc.password' // Password
'hive.jdbc.table.create.query' // DB-specific query for creating the table
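
As an illustration, the properties above could be collected and rendered as follows. This is only a sketch: it assumes the properties are supplied in the TBLPROPERTIES clause of the Hive CREATE TABLE statement, and every value shown (MySQL driver class, URL, credentials, table definition) is a placeholder rather than something taken from this FAQ. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of BAM: renders JDBC settings as a TBLPROPERTIES clause.
class JdbcTableProperties {

    // Renders the property map as "TBLPROPERTIES ( 'key' = 'value', ... )".
    static String toTblProperties(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("TBLPROPERTIES (\n");
        int i = 0;
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("  '").append(e.getKey()).append("' = '").append(e.getValue()).append("'");
            // Every entry except the last one is followed by a comma.
            sb.append(++i < props.size() ? ",\n" : "\n");
        }
        return sb.append(")").toString();
    }

    // Placeholder values for a MySQL target; replace them with your own settings.
    static Map<String, String> exampleMySqlProperties() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("mapred.jdbc.driver.class", "com.mysql.jdbc.Driver");
        p.put("mapred.jdbc.url", "jdbc:mysql://localhost:3306/summary_db");
        p.put("mapred.jdbc.username", "bam_user");
        p.put("mapred.jdbc.password", "bam_password");
        p.put("hive.jdbc.table.create.query",
                "CREATE TABLE summary (name VARCHAR(100), total INT)");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(toTblProperties(exampleMySqlProperties()));
    }
}
```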

After I run a Hive query that persists data to the H2 database, I cannot reconfigure it to use MySQL. What can I do?

The workaround is to add "DROP TABLE {hive table name}" before every "CREATE TABLE ***" statement. You have to do this whenever you make any changes to the meta tables in a Hive script.
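
If a script contains many tables, the workaround can be applied mechanically. The following is a rough sketch, assuming simple CREATE [EXTERNAL] TABLE [IF NOT EXISTS] statements with unqualified table names. The regular expression is deliberately simplistic, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of BAM: applies the DROP TABLE workaround to a Hive script.
class HiveScriptPatcher {

    // Matches CREATE [EXTERNAL] TABLE [IF NOT EXISTS] <name> and captures the table name.
    private static final Pattern CREATE_TABLE = Pattern.compile(
            "CREATE\\s+(?:EXTERNAL\\s+)?TABLE\\s+(?:IF\\s+NOT\\s+EXISTS\\s+)?(\\w+)",
            Pattern.CASE_INSENSITIVE);

    // Inserts "DROP TABLE <name>;" on its own line before every CREATE TABLE statement.
    static String addDropTable(String script) {
        // $1 is the captured table name, $0 is the whole matched CREATE TABLE prefix.
        return CREATE_TABLE.matcher(script).replaceAll("DROP TABLE $1;\n$0");
    }

    public static void main(String[] args) {
        String script = "CREATE EXTERNAL TABLE IF NOT EXISTS AppSummary (version STRING);";
        System.out.println(addDropTable(script));
    }
}
```

Run the helper only once per script, since applying it again would add a second DROP TABLE line before each statement.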

I am getting the error "Thrift error occurred during processing of message...". Any idea why?

If you get the error shown below while trying to publish data from the BAM mediator data agent, check that you have specified the receiver and authentication ports properly and have not mixed them up. The default values are:

receiver port = 7611

authentication port = 7711

TID: [0] [BAM] [2012-11-28 22:46:40,102] ERROR {org.apache.thrift.server.TThreadPoolServer} - Thrift error occurred during processing of message. {org.apache.thrift.server.TThreadPoolServer}
   org.apache.thrift.protocol.TProtocolException: Bad version in readMessageBegin
       at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:208)
       at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22)
       at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176)
       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
       at java.lang.Thread.run(Thread.java:619)
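
A quick sanity check of the two configured ports against the defaults listed above can be sketched as follows. It only compares numbers and does not open any connection. It is an illustration with hypothetical class and method names, not something shipped with BAM, and it will report "CUSTOM" if your server uses non-default ports.

```java
// Hypothetical helper, not part of BAM: compares configured ports with the FAQ defaults.
class ThriftPortCheck {

    static final int DEFAULT_RECEIVER_PORT = 7611;
    static final int DEFAULT_AUTH_PORT = 7711;

    // Classifies a receiver/authentication port pair relative to the defaults.
    static String diagnose(int receiverPort, int authPort) {
        if (receiverPort == authPort) {
            return "SAME_PORT"; // both settings point at one port, which cannot be right
        }
        if (receiverPort == DEFAULT_AUTH_PORT && authPort == DEFAULT_RECEIVER_PORT) {
            return "SWAPPED"; // the two default values have been mixed up
        }
        if (receiverPort == DEFAULT_RECEIVER_PORT && authPort == DEFAULT_AUTH_PORT) {
            return "DEFAULTS";
        }
        return "CUSTOM"; // non-default ports; verify them against the server configuration
    }

    public static void main(String[] args) {
        System.out.println(diagnose(7711, 7611));
    }
}
```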

Even though the tables have been created in the summary database, they contain no data after the Hive query runs without any exceptions. Why?

Make sure that you have given the Cassandra connection parameters (Cassandra host, port, column family name) correctly in the CREATE TABLE definition that maps the Hive table to the Cassandra column family. In particular, if you give a non-existent column family name, the query succeeds without any errors but returns no results.