Exploring Data
WSO2 Machine Learner has two main entities: datasets and projects. A dataset may contain several versions. When you upload a new dataset, WSO2 ML creates the dataset together with a version of it. If the dataset already exists, WSO2 ML creates only a new dataset version out of the uploaded data. This process of uploading a dataset to WSO2 ML is shown in the flow diagram below.
Follow the steps below to generate a dataset using the ML UI.
- Start the WSO2 ML server. For instructions on starting, see Running the Product.
- Access the ML UI from your Web browser using the following URL: https://<ML_HOST>:<ML_PORT>/ml
You can find the URL of the WSO2 ML UI in the server startup logs in the CLI as follows:
INFO {org.wso2.carbon.ml.core.internal.MLCoreDS} - WSO2 Machine Learner UI : https://127.0.0.1:9443/ml
- Log in to the ML UI as a user who is registered in WSO2 ML. For instructions on registering users, see User Management. The home page appears with two blocks, one for datasets and one for projects, as shown below.
- Click the Datasets block to navigate to the DATASETS page as shown below.
- Click CREATE DATASET to create a new dataset.
- Enter the following details of the dataset.
The descriptions of the above fields are as follows.
- Dataset Name - A unique name for the dataset.
- Version - Version of the dataset.
- Description - A description for the dataset.
- Source Type - Type of the source from which the data is retrieved. The following options are supported.
  - File - Retrieve data from the local file system.
  - HDFS - Retrieve data from a Hadoop file system (HDFS). For instructions on providing HDFS support for the ML to retrieve data from it to create the dataset, see HDFS Support.
  - DAS - Retrieve data from a WSO2 DAS table. For instructions on integrating WSO2 DAS for the ML to retrieve data from it to create the dataset, see Integration with WSO2 Data Analytics Server.
- Data Source - Source from which the dataset file is retrieved. The value depends on the selected source type as follows.
  - File - The file to upload.
  - HDFS - The source path in HDFS.
  - DAS - The data table in the Data Access Layer of WSO2 DAS.
  The default limit for the dataset file size is 100 MB. You can increase or decrease this limit by changing the value of the -Dorg.apache.cxf.io.CachedOutputStream.Threshold Java option in the <ML_HOME>/bin/wso2server.sh file. If you get an error like the one below when you try to upload a dataset, the dataset you are trying to upload is larger than the current upload limit.
- Data Format - File type of the dataset: CSV or TSV.
- Column Header Available - Whether the CSV or TSV data file contains headers for its columns.
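Before uploading with the File source type, it can help to check a local file against the size limit and to confirm whether it is CSV or TSV. The following is a minimal sketch and is not part of WSO2 ML. The helper names are hypothetical, the 100 MB default is assumed to mean 100 * 1024 * 1024 bytes, and the format check is a naive heuristic that only compares tab and comma counts in the first line.

```python
import os

# Assumption: the documented 100 MB default is read as 100 * 1024 * 1024 bytes.
DEFAULT_LIMIT_BYTES = 100 * 1024 * 1024


def detect_data_format(first_line):
    """Guess CSV or TSV by comparing tab and comma counts in the first line."""
    return "TSV" if first_line.count("\t") > first_line.count(",") else "CSV"


def within_upload_limit(size_bytes, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Return True when the file size does not exceed the upload limit."""
    return size_bytes <= limit_bytes


def check_dataset_file(path, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Summarize a local file before uploading it with the File source type."""
    with open(path, newline="") as handle:
        first_line = handle.readline().rstrip("\r\n")
    size = os.path.getsize(path)
    return {
        "size_bytes": size,
        "within_limit": within_upload_limit(size, limit_bytes),
        "data_format": detect_data_format(first_line),
        # Inspect this row to decide whether to select Column Header Available.
        "first_row": first_line,
    }
```

Calling check_dataset_file with the path of your data file returns a small summary that you can compare with the values you plan to enter in the form. If you have changed the limit in wso2server.sh, pass the new value as limit_bytes.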
Once the dataset is successfully created, it is listed as follows.
Note that the status of the dataset is displayed as Processing.
- Click REFRESH in the CREATE DATASET page to refresh the page. The dataset will be displayed with the Processed status as follows.
Use the provided options to explore or delete the created dataset, or to create a project from the created dataset.
In order to create a project from a dataset, the status of the dataset should be Processed. When the processing of a dataset is in progress, the status is displayed as Processing. When a dataset is not processed due to an error, the status is displayed as Failed.
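The three statuses described above behave like a small state machine: a dataset stays in Processing until it ends up as either Processed or Failed. The sketch below illustrates that logic as a polling loop. It is only an illustration, not a WSO2 ML API: get_status is a hypothetical callable that you would supply yourself, and the polling interval and retry count are arbitrary.

```python
import time

PROCESSING, PROCESSED, FAILED = "Processing", "Processed", "Failed"


def wait_until_processed(get_status, poll_seconds=2.0, max_polls=30, sleep=time.sleep):
    """Poll until the dataset leaves the Processing status.

    get_status is a hypothetical caller-supplied function that returns the
    current status string. Returns True if the dataset reached Processed
    (a project can be created) and False if it reached Failed.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == PROCESSED:
            return True
        if status == FAILED:
            return False
        # Still Processing: wait before checking again.
        sleep(poll_seconds)
    raise TimeoutError("dataset is still in the Processing status")
```

In the UI, clicking REFRESH plays the role of get_status here: you repeat it until the status is no longer Processing.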
Creating a dataset version
Follow the steps below to create a new version of an existing dataset.
- Log in to the WSO2 ML UI, if you are not already logged in.
- Click DATASETS in the top menu as shown below.
- Click the text that displays the number of versions available for the dataset as shown below.
- Enter a new version number for the dataset, and click CREATE VERSION as shown below.
- Enter the following details of the dataset.
- Click Create Version. Once the new version of the dataset is successfully created, it is listed as follows. Use the provided options to explore or delete the created version, or to create a project from the created version.
Exploring the dataset
Once a dataset is uploaded, you can explore it through multiple visualizations using the data exploration feature of WSO2 ML. Follow the steps below to explore an uploaded dataset.
- Log in to the WSO2 ML UI, if you are not already logged in.
- Click the Datasets button as shown below.
- Click the EXPLORE button of the dataset that you want to explore as shown below.
You can view different perspectives on the dataset through four chart types as follows.
Scatter plot & histogram
A scatter plot visualizes the relationship between two selected features of the dataset. In addition, histograms give a graphical representation of the data distribution for the same two features. The scatter plot user interface allows you to select two numerical features from the dataset to be visualized through a scatter plot and histograms.
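The ML UI draws these charts for you. To make clear what data sits behind them, the following minimal sketch pairs two numerical features into scatter points and counts values into equal-width histogram bins. The function names, the row layout (a list of dictionaries), and the bin count are assumptions made for illustration, not WSO2 ML's implementation.

```python
def scatter_points(rows, x_feature, y_feature):
    """Pair the two selected numerical features into (x, y) points."""
    return [(row[x_feature], row[y_feature]) for row in rows]


def histogram(values, bins=5):
    """Count values into equal-width bins between the minimum and maximum."""
    lowest, highest = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (highest - lowest) / bins or 1.0
    counts = [0] * bins
    for value in values:
        # The maximum value would land one past the end, so clamp it.
        index = min(int((value - lowest) / width), bins - 1)
        counts[index] += 1
    return counts
```

For example, histogram([0, 1, 2, 3, 4], bins=2) splits the range 0 to 4 into two bins of width 2 and returns [2, 3].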
Parallel set
A parallel set is a visualization method used for categorical data. It adopts the layout of parallel coordinates, but replaces the individual data points with a frequency-based representation. This abstract view is combined with a set of interactions, and it supports visual data analysis of large and complex data sets. Using the parallel sets user interface, you can specify which categorical features to use to draw the diagram.
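The frequency-based representation can be thought of as a count of how often each combination of category values occurs. The following is a minimal sketch of that counting step only, with a hypothetical function name and row layout. It does not reproduce how WSO2 ML renders the diagram.

```python
from collections import Counter


def parallel_set_counts(rows, features):
    """Count how often each combination of categorical values occurs.

    Each key is a tuple of values, one per selected feature, and each
    count corresponds to the width of one ribbon in the diagram.
    """
    return Counter(tuple(row[feature] for feature in features) for row in rows)
```

With rows that have, for instance, a "class" and a "sex" feature, parallel_set_counts(rows, ["class", "sex"]) returns one count per (class, sex) pair that appears in the data.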
Trellis chart
A trellis chart is a series of graphs or charts based on the same scale and axes, which allows them to be compared easily. It uses multiple views to show different partitions of a dataset, and is useful for finding the structure and patterns in complex data. The trellis chart user interface allows you to select one categorical feature and multiple numerical features (up to a maximum number) to draw the diagram.
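The two ideas in this description, partitioning by one categorical feature and keeping the same scale across panels, can be sketched as follows. This is an illustration under assumed names and an assumed row layout (a list of dictionaries), not the code WSO2 ML uses.

```python
from collections import defaultdict


def trellis_panels(rows, category, numeric_features):
    """Group the numerical feature values by the chosen categorical feature.

    Returns one panel per category value, each holding a list of values
    for every selected numerical feature.
    """
    panels = defaultdict(lambda: {feature: [] for feature in numeric_features})
    for row in rows:
        for feature in numeric_features:
            panels[row[category]][feature].append(row[feature])
    return dict(panels)


def shared_range(rows, feature):
    """Compute one axis range over all rows so every panel uses the same scale."""
    values = [row[feature] for row in rows]
    return min(values), max(values)
```

Each panel is then drawn with the axis ranges from shared_range rather than with its own minimum and maximum, which is what makes the panels directly comparable.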
Cluster diagram
A cluster diagram is a general type of diagram that depicts one or more clusters in a dataset. A cluster, in general, is a group or collection of discrete points that are close to each other. In the explore view, a cluster diagram provides a perspective on data clusters for two selected numerical features. A popular clustering algorithm is applied on the data sample to derive the data clusters. You can select two numerical features and the number of clusters required through the cluster diagram user interface.
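The text above does not name the clustering algorithm. Purely as an illustration of how clusters can be derived from two numerical features and a requested number of clusters, the sketch below uses k-means, which is an assumption and may differ from what WSO2 ML applies. The initialization (taking the first k points as centroids) is deliberately naive to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster 2-D points with a basic k-means loop.

    points is a list of (x, y) tuples for the two selected features.
    Returns the final centroids and the cluster index of each point.
    """
    # Naive, deterministic initialization: use the first k points.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments
```

The assignments list is what a cluster diagram would use to color each point, with k matching the number of clusters selected in the user interface.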