WSO2 Machine Learner has two main entities: datasets and projects. A dataset may contain several versions. When you upload a new dataset, WSO2 ML creates the dataset and generates a version of it. If the dataset already exists, WSO2 ML generates an additional version from the uploaded data. The flow diagram below shows the process of uploading a dataset to WSO2 ML.
Follow the steps below to generate a dataset using the ML UI.
...
Use the provided options to explore or delete the created dataset, or to create a project from the created dataset.
Info |
---|
In order to create a project from a dataset, the status of the dataset should be Processed. When the processing of a dataset is in progress, the status is displayed as Processing. When a dataset is not processed due to an error, the status is displayed as Failed. |
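The status rule in the note above can be expressed as a small check. This is only an illustrative sketch: `DatasetStatus` and `can_create_project` are hypothetical names, not part of WSO2 ML, and only the three status labels are taken from the text.

```python
from enum import Enum


class DatasetStatus(Enum):
    # The display values come from the note above; the enum itself is hypothetical.
    PROCESSED = "Processed"
    PROCESSING = "Processing"
    FAILED = "Failed"


def can_create_project(status: DatasetStatus) -> bool:
    """Return True only when the dataset has been processed successfully."""
    return status is DatasetStatus.PROCESSED


if __name__ == "__main__":
    for status in DatasetStatus:
        print(f"{status.value}: project creation allowed = {can_create_project(status)}")
```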
Creating a dataset version
Follow the steps below to create a new version of an existing dataset.
- Log in to the WSO2 ML UI, if you are not already logged in.
- Click DATASETS in the top menu as shown below.
- Click on the text which displays the number of versions available on the dataset as shown below.
- Enter a new version number for the dataset, and click CREATE VERSION as shown below.
- Enter the following details of the dataset.
- Click CREATE VERSION. Once the new version of the dataset is successfully created, it is listed as follows. Use the provided options to explore or delete the created version, or to create a project from the created version.
Exploring the dataset
Once a dataset is uploaded, you can explore it through multiple visualizations using the data exploration feature of WSO2 ML. Follow the steps below to explore an uploaded dataset.
- Log in to the WSO2 ML UI, if you are not already logged in.
- Click the DATASETS button as shown below.
- Click the EXPLORE button of the dataset that you want to explore, as shown below.
You can view the dataset from different perspectives through four chart types, as follows.
Scatter plot & histogram
A scatter plot visualizes the relationship between two selected features of the dataset, and histograms show the data distribution of each of those two features. The scatter plot user interface allows you to select two numerical features from the dataset to visualize through a scatter plot and histograms.
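To make the relationship between the two views concrete, the sketch below derives the scatter plot points and the histogram bin counts from the same two numerical features. It does not reflect how WSO2 ML computes its charts. The feature names `age` and `income`, the sample rows, and the equal-width binning are assumptions made for illustration.

```python
def histogram(values, bins=5):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample rows with two numerical features.
rows = [
    {"age": 22, "income": 28.0},
    {"age": 25, "income": 31.5},
    {"age": 31, "income": 40.0},
    {"age": 38, "income": 52.0},
    {"age": 45, "income": 61.0},
    {"age": 52, "income": 58.5},
]

# Scatter plot input: one (x, y) point per row.
points = [(row["age"], row["income"]) for row in rows]

# Histogram input: the distribution of each selected feature on its own.
age_counts = histogram([row["age"] for row in rows], bins=3)
income_counts = histogram([row["income"] for row in rows], bins=3)

if __name__ == "__main__":
    print(points)
    print(age_counts, income_counts)
```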
Parallel set
A parallel set is a visualization method for categorical data. It adopts the layout of parallel coordinates but replaces the individual data points with a frequency-based representation. This abstract view is combined with a set of interactions, and it supports visual analysis of large and complex datasets. Using the parallel sets user interface, you can specify which categorical features to use in the diagram.
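The frequency-based representation can be illustrated by counting how often each combination of category values occurs. A parallel set diagram would draw these counts as ribbon widths. This is a minimal sketch and not WSO2 ML code: the function name, the feature names (`plan`, `region`, `churned`), and the sample rows are all hypothetical.

```python
from collections import Counter


def parallel_set_frequencies(rows, features):
    """Count how often each combination of values of `features` occurs."""
    return Counter(tuple(row[f] for f in features) for row in rows)


# Hypothetical categorical data.
rows = [
    {"plan": "basic", "region": "north", "churned": "yes"},
    {"plan": "basic", "region": "north", "churned": "no"},
    {"plan": "basic", "region": "south", "churned": "yes"},
    {"plan": "premium", "region": "north", "churned": "no"},
    {"plan": "premium", "region": "north", "churned": "no"},
]

# Frequencies for the two selected categorical features.
frequencies = parallel_set_frequencies(rows, ["plan", "region"])

if __name__ == "__main__":
    for combination, count in frequencies.items():
        print(combination, count)
```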
Trellis chart
A trellis chart is a series of graphs or charts that share the same scale and axes, which makes them easy to compare. It uses multiple views to show different partitions of a dataset, and is useful for finding structure and patterns in complex data. The trellis chart user interface allows you to select one categorical feature and multiple numerical features (up to a maximum number) to draw the diagram.
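The partitioning idea behind a trellis chart can be sketched as follows: the rows are grouped into one panel per value of the categorical feature, and a single value range is computed per numerical feature so that every panel can share the same axes. The function name, the feature names, and the data are hypothetical and only illustrate the concept.

```python
from collections import defaultdict


def trellis_panels(rows, category, numeric_features):
    """Group rows by `category` and compute one shared range per numeric feature."""
    panels = defaultdict(list)
    for row in rows:
        panels[row[category]].append([row[f] for f in numeric_features])
    # The ranges are computed over all rows so that every panel uses the same scale.
    shared_ranges = {
        f: (min(row[f] for row in rows), max(row[f] for row in rows))
        for f in numeric_features
    }
    return dict(panels), shared_ranges


# Hypothetical data: one categorical feature and two numerical features.
rows = [
    {"species": "a", "length": 1.0, "width": 0.5},
    {"species": "a", "length": 1.5, "width": 0.7},
    {"species": "b", "length": 3.0, "width": 1.2},
]

panels, shared_ranges = trellis_panels(rows, "species", ["length", "width"])

if __name__ == "__main__":
    print(panels)
    print(shared_ranges)
```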
Cluster diagram
A cluster diagram depicts one or more clusters in a dataset. A cluster is, in general, a group of discrete points that are close to each other. In the explore view, a cluster diagram provides a perspective on data clusters for two selected numerical features. A popular clustering algorithm is applied to a sample of the data to derive the clusters. Through the cluster diagram user interface, you can select two numerical features and the number of clusters required.
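The text does not name the clustering algorithm that is used. As an illustration only, the sketch below assumes k-means (Lloyd's algorithm) on two numerical features with a user-chosen number of clusters. The function name, the deterministic choice of starting centroids, and the sample points are all assumptions.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster 2-D points with a basic k-means loop (Lloyd's algorithm)."""
    centroids = list(points[:k])  # deterministic start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Hypothetical values of two numerical features, forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.8, 1.1), (4.8, 5.1)]
labels, centroids = kmeans(points, k=2)

if __name__ == "__main__":
    print(labels)
    print(centroids)
```

Starting from the first `k` points keeps the sketch reproducible. Practical implementations usually pick the starting centroids at random or with a seeding scheme such as k-means++.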