
Model Evaluation Measures

This section explains how to evaluate the models generated by WSO2 ML with regard to their accuracy. The following topics are covered.

Terminology of Binary Classification Metrics

Binary Classification Metrics refer to the following two formulas used to calculate the reliability of a binary classification model.

Name                                 Formula
True Positive Rate (Sensitivity)     TPR = TP / P = TP / (TP + FN)
True Negative Rate (Specificity)     SPC = TN / N = TN / (TN + FP)

The following table explains the abbreviations used in the above formulas.

Abbreviation    Expansion         Meaning
P               Positives         The total number of positive items (i.e. the total number of items that actually belong to the positive class).
N               Negatives         The total number of negative items (i.e. the total number of items that actually belong to the negative class).
TP              True Positive     Items that actually belong to the positive class and are correctly included in the positive class.
FP              False Positive    Items that actually belong to the negative class but are incorrectly included in the positive class.
TN              True Negative     Items that actually belong to the negative class and are correctly included in the negative class.
FN              False Negative    Items that actually belong to the positive class but are incorrectly included in the negative class.
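
The following is a minimal sketch in plain Python (not part of WSO2 ML) showing how the two rates follow from the counts defined above; the count values are hypothetical.

    # Hypothetical counts from a binary classification result
    tp, fn = 40, 10   # positive items: correctly / incorrectly classified
    tn, fp = 35, 15   # negative items: correctly / incorrectly classified

    tpr = tp / (tp + fn)   # True Positive Rate (Sensitivity) = TP / P
    tnr = tn / (tn + fp)   # True Negative Rate (Specificity) = TN / N
    print(tpr, tnr)        # 0.8 0.7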

Model evaluation measures

The following methods are used to evaluate the performance of models in terms of accuracy.

Measure                 Available for
Confusion Matrix        Binary classification, multi-class classification, anomaly detection, and deep learning models
Accuracy                Binary classification, multi-class classification, anomaly detection, and deep learning models
ROC Curve               Binary classification models
AUC                     Binary classification models
Feature Importance      Binary classification and numerical prediction models
Predicted vs Actual     Binary classification, multi-class classification, and deep learning models
MSE                     Numerical prediction and recommendation models
Residual Plot           Numerical prediction models
Precision and Recall    Anomaly detection models
F1 Score                Anomaly detection models

Confusion Matrix

The confusion matrix is a table layout that visualises the performance of a classification model by displaying the actual and predicted points in the relevant grid. The confusion matrix for a binary classification is as follows:

The confusion matrix for a multi-class classification (with n classes) is as follows:

This matrix allows you to identify which points are correctly classified and which are incorrectly classified. The cells where the actual and predicted classes match contain the correct predictions, and these counts should be maximised for greater accuracy. The green cells in the above images mark the correctly classified points. In an ideal scenario, all other cells should contain zero points.

The following is an example of a confusion matrix with both correctly classified points as well as incorrectly classified points.
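
As a rough illustration of the idea (not how WSO2 ML computes it internally), the following sketch builds a small confusion matrix with scikit-learn; the labels and predictions are hypothetical.

    # Build a 3-class confusion matrix from actual and predicted labels
    from sklearn.metrics import confusion_matrix

    y_actual    = ["A", "A", "B", "B", "B", "C", "C"]
    y_predicted = ["A", "B", "B", "B", "C", "C", "C"]

    # Rows are actual classes, columns are predicted classes;
    # the diagonal holds the correctly classified points.
    print(confusion_matrix(y_actual, y_predicted, labels=["A", "B", "C"]))
    # [[1 1 0]
    #  [0 2 1]
    #  [0 0 2]]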

Accuracy

The accuracy of a model can be calculated using the following formula.

Accuracy = Correctly Classified Points / Total Number of Points

For a binary classification model, this can be calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (TP + TN) / (P + N)

For example, the accuracy can be calculated as follows, based on the confusion matrix example above.

Correctly classified points = 12 + 16 + 16 = 44

Total number of points = 12 + 16 + 16 + 1 + 1 + 1 = 47

Accuracy = 44 / 47 = 93.62%

You can find this metric for classification models in the model summary as shown in the above image.
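
The calculation above can be reproduced with a short plain-Python sketch (illustrative only); the placement of the three misclassified points across the off-diagonal cells is assumed, since only the totals matter here.

    # 3x3 confusion matrix with 44 correctly and 3 incorrectly classified points
    matrix = [
        [12, 1, 0],
        [0, 16, 1],
        [1, 0, 16],
    ]
    correct = sum(matrix[i][i] for i in range(len(matrix)))   # 44
    total = sum(sum(row) for row in matrix)                   # 47
    print(f"Accuracy = {correct / total:.2%}")                # Accuracy = 93.62%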

ROC Curve

This illustrates the performance of a binary classifier model by showing the TPR (True Positive Rate) against the FPR (False Positive Rate, i.e. 1 - Specificity) for different threshold values. A completely accurate model would pass through the (0, 1) coordinate (i.e. an FPR of 0 and a TPR of 1) in the upper left corner of the plot. However, this is not achievable in practical scenarios. Therefore, when comparing models, the model with the ROC curve closest to the (0, 1) coordinate can be considered the best performing model in terms of accuracy. The best threshold for this model is the one associated with the point on the ROC curve that is closest to the (0, 1) coordinate. You can find the ROC curve for a particular binary classification model in the model summary in the WSO2 ML UI.
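
For reference, the sketch below plots an ROC curve for a toy set of scores using scikit-learn and matplotlib (these libraries and the sample values are assumptions for illustration; WSO2 ML draws the curve for you in the model summary).

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve

    y_actual = [0, 0, 1, 1, 0, 1, 1, 0]                     # hypothetical true labels
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.55]   # predicted probabilities

    fpr, tpr, thresholds = roc_curve(y_actual, y_score)
    plt.plot(fpr, tpr, label="ROC curve")
    plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (Sensitivity)")
    plt.legend()
    plt.show()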

AUC

AUC (Area Under the Curve) is another accuracy metric for a binary classification model that is associated with the ROC curve. The closer the AUC (the area under the ROC curve) is to 1, the more accurate the model. Therefore, when comparing the accuracy of multiple models using the AUC, the one with the highest AUC can be considered the best performing model.

You can find the AUC value for a particular model in its ROC curve in the model summary (see the image of the ROC curve in the previous section, labelled ROC Curve (AUC = 0.619)).
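
A minimal sketch of computing the AUC for the same toy scores, using scikit-learn (an assumption for illustration only):

    from sklearn.metrics import roc_auc_score

    y_actual = [0, 0, 1, 1, 0, 1, 1, 0]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.55]

    # 0.875 for this toy example; values closer to 1 indicate a better model
    print(roc_auc_score(y_actual, y_score))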

Feature Importance

This chart visualizes the importance (weight) of each feature according to its significance in the final model. In regression models (numerical predictions), each weight represents the amount by which the response variable changes when the respective predictor variable is increased by one unit. You can use this chart to make feature selection decisions. This chart type is available for both binary classification and numerical prediction models.
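
As a rough illustration of how regression weights relate to feature importance (a plain scikit-learn sketch, not WSO2 ML's internal implementation; the feature values are hypothetical):

    from sklearn.linear_model import LinearRegression

    X = [[1, 10], [2, 12], [3, 15], [4, 18], [5, 22]]   # two hypothetical features
    y = [12, 16, 21, 26, 32]                            # response variable

    model = LinearRegression().fit(X, y)
    for name, weight in zip(["feature_1", "feature_2"], model.coef_):
        # Each weight: change in the response for a one-unit increase in that feature
        print(name, round(weight, 3))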

Predicted vs Actual

This chart plots the data points according to the correctness of the classification. You can select two dataset features to be visualized, and the plot displays the data distribution along with the classification result (correct/incorrect) for each point.
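
The sketch below shows one way such a plot could be drawn with matplotlib (illustrative only; the feature values and correctness flags are hypothetical):

    import matplotlib.pyplot as plt

    feature_1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    feature_2 = [2.5, 1.0, 3.5, 2.0, 4.0, 1.5]
    correct = [True, True, False, True, False, True]   # predicted class == actual class?

    colors = ["green" if ok else "red" for ok in correct]
    plt.scatter(feature_1, feature_2, c=colors)
    plt.xlabel("feature 1")
    plt.ylabel("feature 2")
    plt.title("Predicted vs Actual (green = correct, red = incorrect)")
    plt.show()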

MSE

MSE (Mean Squared Error) is the average of the squared errors of the prediction. An error is the difference between the actual value and the predicted value. Therefore, a better performing model should have a comparatively lower MSE. This metric is widely used to evaluate the accuracy of numerical prediction models. You can find this metric for numerical prediction models in the model summary as shown in the above image.
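
A minimal plain-Python sketch of the MSE calculation (the actual and predicted values are hypothetical):

    actual = [3.0, 5.0, 2.5, 7.0]
    predicted = [2.5, 5.0, 3.0, 8.0]

    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / len(errors)
    print(mse)   # 0.375; lower values indicate a better performing model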

Residual Plot

A residual plot shows the residuals on the y-axis and a predictor variable (feature) on the x-axis. A residual is defined as the difference between the observed (actual) value and the predicted value of the response variable. A model can be considered accurate when the residual points are:

  • Randomly distributed (do not form a pattern)

  • Centered around zero on the vertical axis (indicating that there are equal numbers of positive and negative values)

  • Closely distributed around zero on the vertical axis (indicating that there are no very large positive or negative residual values)

If the above conditions are not satisfied, it is possible that some missing or hidden factors (predictor variables) have not been taken into account. The residual plot is available for numerical prediction models. You can select a dataset feature to be plotted against its residuals.
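
The sketch below draws a simple residual plot with matplotlib (illustrative only; the values are hypothetical):

    import matplotlib.pyplot as plt

    feature = [1, 2, 3, 4, 5, 6, 7, 8]                 # the selected predictor variable
    actual = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
    predicted = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

    residuals = [a - p for a, p in zip(actual, predicted)]
    plt.scatter(feature, residuals)
    plt.axhline(0, linestyle="--")   # residuals should scatter randomly around this line
    plt.xlabel("predictor variable")
    plt.ylabel("residual (actual - predicted)")
    plt.show()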

Precision and Recall

Precision and Recall are performance measures used to evaluate search strategies. They are typically used in document retrieval scenarios.

When a search is carried out on a set of records in a database, some of the records are relevant to the search and the rest are irrelevant. However, the actual set of records retrieved may not perfectly match the set of records that are relevant to the search. Based on this, Precision and Recall can be described as follows.

Measure      Definition                                                                    Formula
Precision    The proportion of selected items that are relevant.                           TP / (TP + FP)
Recall       The proportion of relevant items that are selected (the same as the TPR).     TP / (TP + FN)
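
A minimal plain-Python sketch of both measures computed from hypothetical counts:

    tp, fp, fn = 30, 10, 20

    precision = tp / (tp + fp)   # 0.75: proportion of selected items that are relevant
    recall = tp / (tp + fn)      # 0.60: proportion of relevant items that are selected
    print(precision, recall)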

 

F1 Score

The F1 Score is the harmonic mean of Precision and Recall. It is expressed as a value between 0 and 1, where 0 indicates the worst performance and 1 indicates the best performance.

F1 = 2TP / (2TP + FP + FN)
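
The two equivalent forms of the F1 Score, shown in a minimal plain-Python sketch with hypothetical counts:

    tp, fp, fn = 30, 10, 20

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)

    f1_from_counts = 2 * tp / (2 * tp + fp + fn)
    f1_harmonic = 2 * precision * recall / (precision + recall)
    print(f1_from_counts, f1_harmonic)   # both print 0.666..., the forms are equivalent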
