
Apache Spark allows you to create UDFs (User Defined Functions) when you need functionality that is not available in Spark by default. WSO2 DAS provides an abstraction layer for generic Spark UDFs, which makes it convenient to introduce custom UDFs to the server.

The following query is an example of using a custom UDF.

SELECT id, concat(firstName, lastName) as fullName, department FROM employees;

The steps to create a custom UDF are as follows.

Step 1: Create the POJO class

The following example shows the UDF POJO for the StringConcatonator custom UDF class. The name of the Spark UDF is the name of the method defined (concat in this example), and it is used when calling the UDF from Spark SQL. e.g., concat("Custom", "UDF") returns the string "CustomUDF".

/**
 * A UDF class that supports string concatenation for Spark SQL.
 */
public class StringConcatonator {

    /**
     * Returns the concatenation of two strings.
     */
    public String concat(String firstString, String secondString) {
        return firstString + secondString;
    }
}

Method overloading for UDFs is not supported. Give each UDF a different method name to get the expected behaviour.
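As a sketch of how to work around this restriction, each arity can be given its own method name instead of overloading a single name. The class and method names below are hypothetical examples, not part of the DAS distribution.

```java
/**
 * Sketch: distinct method names instead of overloads, so each
 * becomes a separately named Spark UDF.
 */
public class StringUDFs {

    // Called from Spark SQL as concat2(...)
    public String concat2(String first, String second) {
        return first + second;
    }

    // Called from Spark SQL as concat3(...); a distinct name, not an overload of concat2
    public String concat3(String first, String second, String third) {
        return first + second + third;
    }
}
```

With this layout, concat2 and concat3 can both be used in the same query without ambiguity.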

Step 2: Package the class in a jar

The custom UDF class you created should be bundled as a JAR and added to the <DAS_HOME>/repository/components/lib directory.

Step 3: Update the Spark UDF configuration file

Add the newly created custom UDF to the <DAS_HOME>/repository/conf/analytics/spark/spark-udf-config.xml file as shown in the example below.

<udf-configuration>
    <custom-udf-classes>
        <class-name>org.james.customUDFs.StringConcatonator</class-name>
        <class-name>org.wso2.carbon.analytics.spark.core.udf.defaults.TimestampUDF</class-name>
    </custom-udf-classes>
</udf-configuration>

This configuration is required for Spark to identify and use the newly defined custom UDF.

Spark adds all the methods in the specified UDF class as custom UDFs.
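Because every method in the registered class is exposed as a UDF, a class listed in the configuration should contain only methods intended to be called from Spark SQL. The sketch below illustrates this with hypothetical class and method names.

```java
/**
 * Sketch: every method in a registered UDF class becomes its own UDF,
 * so keep only methods meant to be called from Spark SQL in the class.
 */
public class TextUDFs {

    // Exposed in Spark SQL as upperTrim(...)
    public String upperTrim(String input) {
        return input.trim().toUpperCase();
    }

    // This method is also registered as a UDF, whether intended or not.
    public int wordCount(String input) {
        String trimmed = input.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```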
