HDFS Support

Introduction

WSO2 Machine Learner (WSO2 ML) supports HDFS in the following ways, as depicted in the diagram below.

ML HDFS support

Read datasets from HDFS - you can place a dataset in HDFS and point to it when uploading a new dataset.
Store datasets in HDFS - you can specify HDFS as the target storage type when uploading a dataset. This is useful if the dataset is large. The stored dataset is read by the WSO2 ML server-side operations.
Spark reads datasets from HDFS - if a dataset is stored in HDFS, the Spark jobs that run the machine learning algorithms read the dataset from HDFS.
Persist ML models in HDFS - you can choose to store built models in the HDFS instance that the WSO2 ML server is connected to.
Retrieve models from HDFS - if a model is persisted in HDFS, it is retrieved from HDFS for prediction.

How WSO2 ML supports HDFS in each of these ways is described below.

Upload a dataset from HDFS

Start the Apache Hadoop server by executing the following command: {HADOOP_HOME}/sbin/start-all.sh
Access the Hadoop UI using the following URL: http://localhost:50070
Upload your dataset file to HDFS. You can use the HDFS Writer utility tool for this.
Start the WSO2 ML server. For instructions on starting, see Running the Product.
Log in to the WSO2 ML UI using the following URL: https://<ML_HOST>:<ML_PORT>/ml
Click the Datasets button as shown below.
Click the ADD DATASET button in the top menu.
Enter the following details as shown below.
- Enter a dataset name for Dataset Name.
- Enter the version number for Version.
- Select HDFS for Source Type.
- Enter the HDFS URL of the dataset file for Data Source.
- Select the Data Format.
- Select Yes for Column header available if the dataset file defines column headers.
Click Create Dataset. On successful creation, the new dataset appears in the list of all available datasets.
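
As an alternative to the HDFS Writer utility mentioned in the steps above, the upload can also be scripted against Hadoop's WebHDFS REST API. The following is a minimal sketch, not part of WSO2 ML, and it rests on several assumptions: WebHDFS is enabled on the cluster, the NameNode HTTP port is 50070 (the port of the Hadoop UI above), the HDFS RPC port is 9000, security is in simple (non-Kerberos) mode, and the file paths and user name are hypothetical. The helper names (hdfs_uri, webhdfs_create_url, upload_to_hdfs) are illustrative only. The sketch also shows how the HDFS URL entered for Data Source might be formed.

```python
import http.client
import os
from urllib.parse import quote, urlencode, urlsplit


def hdfs_uri(host, port, path):
    """Build the hdfs:// URL to enter as Data Source in the WSO2 ML UI."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute, for example /ml/data.csv")
    return f"hdfs://{host}:{port}{path}"


def webhdfs_create_url(host, http_port, path, overwrite=False, user=None):
    """Build the WebHDFS URL that asks the NameNode to create a file."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute, for example /ml/data.csv")
    params = {"op": "CREATE", "overwrite": str(overwrite).lower()}
    if user:
        # Only meaningful when Hadoop runs with simple authentication.
        params["user.name"] = user
    return f"http://{host}:{http_port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


def _put(url, body=None, headers=None):
    """Send one HTTP PUT and return (status code, Location header or None)."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=30)
    try:
        conn.request("PUT", f"{parts.path}?{parts.query}",
                     body=body, headers=headers or {})
        response = conn.getresponse()
        response.read()  # drain the body before closing the connection
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def upload_to_hdfs(local_path, host, http_port, hdfs_path, user=None):
    """Upload a local file using the two-step WebHDFS CREATE operation."""
    # Step 1: the NameNode answers with a 307 redirect to a DataNode.
    status, location = _put(webhdfs_create_url(host, http_port, hdfs_path, user=user))
    if status != 307 or not location:
        raise RuntimeError(f"NameNode did not redirect (HTTP {status})")
    # Step 2: send the file content to the DataNode URL from the redirect.
    size = os.path.getsize(local_path)
    with open(local_path, "rb") as data:
        status, _ = _put(location, body=data,
                         headers={"Content-Length": str(size)})
    if status != 201:
        raise RuntimeError(f"DataNode rejected the upload (HTTP {status})")
```

Under these assumptions, calling upload_to_hdfs("iris.csv", "localhost", 50070, "/ml/iris.csv", user="hadoop") would copy a local file into HDFS, and hdfs_uri("localhost", 9000, "/ml/iris.csv") would give the value to enter for Data Source. Check the actual host and ports against your Hadoop configuration before relying on either.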

Store datasets in HDFS

Dataset storage is a WSO2 Machine Learner server configuration. By default, it uses the file system. For instructions on changing the dataset storage to HDFS, see Storage Configurations.

Spark reads datasets from HDFS

The Apache Spark jobs that run the machine learning algorithms read datasets from the dataset storage configured through the Storage Configurations. You do not need to configure HDFS support for Spark separately.

Persist models in HDFS

Model storage is a WSO2 Machine Learner server configuration. By default, it uses the file system. For instructions on changing the model storage to HDFS, see Storage Configurations.

Retrieve models from HDFS

WSO2 ML retrieves models for prediction from the defined model storage. This is a WSO2 Machine Learner server configuration. For instructions on changing the model storage to HDFS, see Storage Configurations.

For information on a sample demonstrating HDFS support in WSO2 ML, see Generating a Model Using the Logistic Regression Algorithm with HDFS Support.