Creating Spark User Defined Functions

Apache Spark allows you to create UDFs (User Defined Functions) when you need functionality that is not available in Spark by default. WSO2 DAS provides an abstraction layer for generic Spark UDFs, which makes it convenient to introduce new UDFs to the server.

The following query is an example of a custom UDF in use.

SELECT id, concat(firstName, lastName) as fullName, department FROM employees;

The steps to create a custom UDF are as follows.

Step 1: Create the POJO class

The following example shows the UDF POJO for the StringConcatonator custom UDF class. The name of the Spark UDF is the name of the method defined in the class (concat in this example). This name is used when calling the UDF from Spark, e.g., concat("Custom", "UDF") returns the String "CustomUDF".

/**
 * This is a UDF class supporting string concatenation for Spark SQL.
 */
public class StringConcatonator {

    /**
     * This UDF returns the concatenation of two strings.
     */
    public String concat(String firstString, String secondString) {
        return firstString + secondString;
    }
}
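
Once the class is registered (see Step 3 below), the method name concat becomes the function name in Spark SQL. As a quick sanity check, the following query (the employees table is the one from the earlier example) should return the string CustomUDF:

SELECT concat('Custom', 'UDF') AS result FROM employees LIMIT 1;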
  • Apache Spark does not support methods that return primitive data types. Therefore, every method in the POJO class should return the wrapper class of the corresponding primitive type.

    e.g., A method to add two integers should be defined as shown below.

    public Integer addNumbers(Integer a, Integer b) {
        return a + b;
    }

  • Method overloading for UDFs is not supported. Give each UDF a distinct method name to get the expected behaviour, as the sketch after this list shows.
  • If a UDF method uses a data type that is not supported by Apache Spark, the following error appears when you start the DAS server.

    Error initializing analytics executor: Cannot determine the return DataType

    For a list of return types supported by Apache Spark, see Spark SQL and DataFrames and Datasets Guide - Data Types.

    If you need one or more methods that are not UDF methods, and they have return types that are not supported by Apache Spark, define them in a separate class. That class does not have to be added to the <DAS_HOME>/repository/conf/analytics/spark/spark-udf-config.xml file (see the sketch below).
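
Since overloading is not supported and unsupported types must stay out of the registered class, one workable pattern is shown in the sketch below: each variant of an operation gets its own method name, and anything Spark cannot handle lives in a plain helper class. The class and method names (StringConcatUDFs, concatTwo, concatThree, StringHelper) are illustrative assumptions, not part of the DAS API.

/**
 * Illustrative UDF class: overloading is not supported, so each
 * arity gets a distinct method name.
 */
public class StringConcatUDFs {

    /**
     * This UDF concatenates two strings.
     */
    public String concatTwo(String first, String second) {
        return StringHelper.join(first, second);
    }

    /**
     * This UDF concatenates three strings.
     */
    public String concatThree(String first, String second, String third) {
        return StringHelper.join(StringHelper.join(first, second), third);
    }
}

/**
 * Plain helper class. It is not listed in spark-udf-config.xml, so it
 * is free to use types that Spark does not support.
 */
class StringHelper {
    static String join(String a, String b) {
        return a + b;
    }
}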


Step 2: Package the class in a jar

The custom UDF class you created should be bundled as a jar and added to the <DAS_HOME>/repository/components/lib directory.
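
For example, assuming the compiled class from Step 1 is org.james.customUDFs.StringConcatonator and the jar is named custom-udfs.jar (an arbitrary choice), the bundle can be created with the JDK's jar tool and copied into place:

jar cf custom-udfs.jar org/james/customUDFs/StringConcatonator.class
cp custom-udfs.jar <DAS_HOME>/repository/components/lib/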

Step 3: Update the Spark UDF configuration file

Add the class name of the newly created custom UDF to the <DAS_HOME>/repository/conf/analytics/spark/spark-udf-config.xml file, as shown in the example below.

<udf-configuration>
    <custom-udf-classes>
        <class-name>org.james.customUDFs.StringConcatonator</class-name>
        <class-name>org.wso2.carbon.analytics.spark.core.udf.defaults.TimestampUDF</class-name>
    </custom-udf-classes>
</udf-configuration>

This configuration is required for Spark to identify and use the newly defined custom UDF.

Spark adds all the methods in the specified UDF class as custom UDFs.
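
For example, if a reverse method (an assumed name, used here purely for illustration) were added to the StringConcatonator class registered above, Spark would expose both concat and reverse as UDFs:

public String reverse(String input) {
    return new StringBuilder(input).reverse().toString();
}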
