Machine Learner Algorithms

WSO2 ML uses the following algorithms to create models using the data in a given data set.

Algorithm	Description	Type	Supported Publish/Download Formats	Related Samples	Backward Compatibility when migrated to a ML/DAS version using Apache Spark 1.6.2
`LINEAR REGRESSION`	Linear Regression algorithm trains a Generalized Linear Model that contains a relationship between independent variables (feature values in data) and the dependent variable (response variable in the data).	Numerical prediction	Serialized PMML	Generating a Model Using the Linear Regression Algorithm Generating a Tuned Model Using the Linear Regression Algorithm	Not Backward Compatible
`RIDGE REGRESSION`	Ridge Regression algorithm is a variant of Linear Regression where the loss function is the linear least squares function and the regularization is L2.	Numerical prediction	Serialized PMML	Generating a Model using the Ridge Regression Algorithm Generating a Tuned Model Using the Ridge Regression Algorithm	Not Backward Compatible
`LASSO REGRESSION`	Lasso Regression algorithm is a variant of Linear Regression trained with L1 prior as regularizer.	Numerical Prediction	Serialized PMML	Generating a Model Using the Lasso Regression Algorithm Generating a Tuned Model Using the Lasso Regression Algorithm	Not Backward Compatible
`LOGISTIC REGRESSION`	Logistic Regression algorithm is a Generalized Linear Model which predicts the probability of a binary outcome. The logistic function is used to determine the probabilities of the outcomes.	Binary Classification	Serialized PMML	Generating a Model Using the Logistic Regression Algorithm Generating a Model Using the Logistic Regression Algorithm with HDFS Support Generating a Tuned Model Using the Logistic Regression SGD Algorithm	Not Backward Compatible
`Support Vector Machine`	Support Vector Machine is a non-probabilistic binary classifier. It constructs a hyperplane or set of hyperplanes in a high (or infinite) dimensional space which generates a good separation of data points between classes.	Binary classification	Serialized PMML	Generating a Model using the Support Vector Machine Algorithm Generating a Tuned Model Using the Support Vector Machine Algorithm	Not Backward Compatible
`LOGISTIC REGRESSION L-BFGS`	Binary logistic regression can be generalized into multinomial logistic regression to train and predict multiclass classification problems. For k classes, it treats the first class as one class and the remaining k-1 classes as another class, and the class with the largest probability is chosen as the prediction. L-BFGS (limited-memory BFGS) is used as the optimization technique for faster convergence.	Multiclass Classification	Serialized	Generating a Tuned Model Using the Logistic Regression LBFGS Algorithm	Not Backward Compatible
`DECISION TREE`	Decision Tree algorithm creates a tree-like model that predicts the value of a target variable by learning simple decision rules inferred from the features of the dataset.	Multiclass Classification	Serialized	Generating a Model Using the Decision Tree Algorithm Generating a Tuned Model Using the Decision Tree Algorithm	Backward Compatible
`RANDOM FOREST CLASSIFICATION`	Random Forest Classification algorithm is an ensemble learning method which combines many decision trees in order to reduce the risk of overfitting. Different decision trees are trained with different bootstraps drawn from the dataset (both feature bootstrapping and data point bootstrapping). At prediction, majority vote is taken from the trained decision trees.	Multiclass Classification	Serialized	Generating a Model Using the Random Forest Classification Algorithm Generating a Tuned Model Using the Random Forest Algorithm	Backward Compatible
`RANDOM FOREST REGRESSION`	Random Forest Regression algorithm is an ensemble learning method which combines many decision tree regressors in order to reduce the risk of overfitting. Different decision tree regressors are trained with different bootstraps drawn from the dataset (both feature bootstrapping and data point bootstrapping). The value is predicted to be the average of the tree predictions.	Numerical Prediction	Serialized	Generating a Model Using the Random Forest Regression Algorithm	Backward Compatible
`NAIVE BAYES`	Naive Bayes algorithm assumes independence between every pair of features in the dataset. It computes the conditional probability distribution of each feature given the class label, and then applies Bayes’ theorem to compute the conditional probability distribution of the label given a data point and uses it for prediction. Negative feature values are not allowed when training a Naive Bayes model.	Multiclass Classification	Serialized	Generating a Model Using the Naive Bayes Algorithm Generating a Tuned Model Using the Naive Bayes Algorithm	Not Backward Compatible
`K-MEANS`	K-Means algorithm partitions the data points into a predefined number of clusters (k) in which each data point belongs to the cluster with the nearest mean, serving as a representative (cluster center) of the cluster.	Clustering	Serialized PMML	Generating a Model Using the K-Means Algorithm	Not Backward Compatible
`K-MEANS WITH UNLABELED DATA`	This algorithm runs K-means clustering on the training data. Data points that fall beyond the cluster boundaries (according to a specific percentile value) are detected as anomalies. Labeled data is not required.	Anomaly Detection	Serialized	Generating a Model Using the K Means Anomaly Detection Algorithm with Unlabeled Data	Not Backward Compatible
`K-MEANS WITH LABELED DATA`	This algorithm runs K-means clustering on the training data. Data points that fall beyond the cluster boundaries (according to a specific percentile value) are detected as anomalies. This is used when labels (normal and anomalous) are available.	Anomaly Detection	Serialized	Generating a Model Using the K Means Anomaly Detection Algorithm with Labeled Data Generating a Tuned Model Using the K Means Anomaly Detection Algorithm with Labeled Data	Not Backward Compatible
`STACKED AUTOENCODERS`	The Stacked Autoencoders algorithm is a multi-layer feed-forward artificial neural network that is trained with stochastic gradient descent using back-propagation. The nodes in the input layer represent the features in the dataset and the nodes in the output layer represent the class labels of the outcomes.	Deep Learning	Serialized	Generating a Model Using the Stacked Autoencoders Algorithm	Backward Compatible
`COLLABORATIVE FILTERING (Explicit Data)`	Collaborative Filtering is used in recommendation systems and aims to fill in the missing entries of a user-item association matrix. This algorithm treats entries in the user-item matrix as explicit preferences (ratings) given by the user to the item. Recommendations are based on these explicit ratings.	Recommendation	Serialized	Generating a Model Using the Collaborative Filtering for Explicit Data Algorithm	Not Backward Compatible
`COLLABORATIVE FILTERING (Implicit Feedback Data)`	Collaborative Filtering is used in recommendation systems and aims to fill in the missing entries of a user-item association matrix. This algorithm allows preferences on the products to be expressed as implicit feedback such as views, clicks, purchases, likes, and shares. Recommendations are based on this implicit feedback.	Recommendation	Serialized	Generating a Model Using the Collaborative Filtering for Implicit Feedback Data Algorithm	Not Backward Compatible
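
The two K-means anomaly detection entries in the table describe flagging data points that lie beyond a percentile-based cluster boundary. The following is a minimal sketch of that idea in plain Python, not the WSO2 ML implementation. All function names, the nearest-rank percentile rule, the per-cluster boundary, and the sample data are assumptions made for illustration.

```python
import math


def nearest_center(point, centers):
    """Return (index, distance) of the center closest to point."""
    distances = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=distances.__getitem__)
    return idx, distances[idx]


def k_means(points, centers, iterations=10):
    """Plain Lloyd's iterations starting from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            idx, _ = nearest_center(p, centers)
            groups[idx].append(p)
        # Move each center to the mean of its members; leave empty clusters in place.
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cluster_boundaries(points, centers, pct):
    """Per-cluster boundary: the pct-th percentile of member-to-center distances."""
    per_cluster = [[] for _ in centers]
    for p in points:
        idx, dist = nearest_center(p, centers)
        per_cluster[idx].append(dist)
    return [percentile(d, pct) if d else 0.0 for d in per_cluster]


def is_anomaly(point, centers, boundaries):
    """A point is anomalous if it lies beyond the boundary of its nearest cluster."""
    idx, dist = nearest_center(point, centers)
    return dist > boundaries[idx]


# Hypothetical two-dimensional training data forming two tight groups.
training = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers = k_means(training, centers=[(0.0, 0.0), (10.0, 10.0)])
boundaries = cluster_boundaries(training, centers, pct=95)

print(is_anomaly((1.1, 1.0), centers, boundaries))  # point inside the first group
print(is_anomaly((4.5, 4.5), centers, boundaries))  # point far from both groups
```

In this sketch the first call prints False and the second prints True. Lowering the assumed pct value tightens the boundaries, so more points are reported as anomalies.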

ML 1.1.1 and DAS 3.1.0 use Apache Spark 1.6.2, whereas the previous versions use Spark 1.4.1. Therefore, when you migrate to ML 1.1.1 or DAS 3.1.0 from a previous version, some models will not be backward compatible, depending on the algorithm type used. The Backward Compatibility when migrated to a ML/DAS version using Apache Spark 1.6.2 column in the above table specifies, for each algorithm type, whether models remain backward compatible after migrating to an ML/DAS version that uses Apache Spark 1.6.2.

If you want to use a model that is not backward compatible after it is migrated, you need to rebuild it using the relevant analysis. For more information, see Generating Models - Creating a new model within an analysis.