Implementing a custom UDF in Hive to help validate source reliability over geographic event data

There are many operations you will want to repeat across various data sources and tables in Hive. For this scenario, it makes sense to write your own user-defined function (UDF). You can write your own Java subroutine that operates on Writable input fields and invoke it from Hive queries whenever necessary. This recipe will walk you through the process of creating a very simple UDF that takes a source and returns yes or no for whether that source is reliable.

Getting ready

Make sure you have access to a pseudo-distributed or fully-distributed Hadoop cluster with Apache Hive 0.7.1 installed on your client machine and on the environment path for the active user account.

This recipe depends on having the Nigeria_ACLED_cleaned.tsv dataset loaded into a Hive table named acled_nigeria_cleaned, with the following fields mapped to their respective datatypes.

Issue the following command to the Hive client:

describe acled_nigeria_cleaned;

You should see the following response:

OK
loc  string
event_date  string
event_type  string
actor  string
latitude  double
longitude  double
source  string
fatalities  int

Additionally, you will need to place the following recipe's code into a source package for bundling within a JAR file of your choice. This recipe will use <myUDFs.jar> as a reference point for your custom JAR file and <fully_qualified_path_to_TrustSourceUDF> as a reference point for the Java package your class exists within. An example of a fully qualified path for a class would be java.util.regex.Pattern.

In addition to the core Hadoop libraries, your project will need to have hive-exec and hive-common JAR dependencies on the classpath for this to compile.
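
For reference, a command-line build might look like the following sketch. The JAR names and locations vary by distribution, so treat these paths as placeholders, and note that the hadoop classpath helper may not be present on very old Hadoop releases:

javac -classpath $(hadoop classpath):/path/to/hive-exec.jar:/path/to/hive-common.jar \
    -d build <path_to_source>/TrustSourceUDF.java
jar cf <myUDFs.jar> -C build .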

How to do it...

Perform the following steps to implement a custom Hive UDF:

  1. Open a text editor/IDE of your choice, preferably one with Java syntax highlighting.
  2. Create TrustSourceUDF.java at the desired source package. Your class should exist at some package <fully_qualified_path>.TrustSourceUDF.class.
  3. Enter the following source as the implementation for the TrustSourceUDF class:
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;
    import java.util.HashSet;
    import java.util.Set;
    
    
    public class TrustSourceUDF extends UDF {
    
        private static Set<String> untrustworthySources = new HashSet<String>();
        private Text result = new Text();
    
        static {
         untrustworthySources.add("");
         // This source value contains embedded quotes and a line break in the raw data
         untrustworthySources.add(
             "\"\"\nhttp://www.afriquenligne.fr/3-soldiers\"");
         untrustworthySources.add("Africa News Service");
         untrustworthySources.add("Asharq Alawsat");
         untrustworthySources.add("News Agency of Nigeria (NAN)");
         untrustworthySources.add("This Day (Nigeria)");
        }
    
        // Note: the UDF base class does not declare evaluate(); Hive finds it
        // via reflection, so an @Override annotation would not compile here.
        public Text evaluate(Text source) {
    
             // Treat NULL like an empty source rather than risking an NPE
             if (source == null || untrustworthySources.contains(source.toString())) {
                 result.set("no");
             } else {
                 result.set("yes");
             }
             return result;
        }
    }
  4. Build the containing JAR <myUDFs.jar> and test your UDF through the Hive client. Open a Hive client session through the command shell. Hive should already be on the local user environment path. Invoke the Hive shell with the following command:
    hive
  5. Add the JAR file to the Hive session's classpath:
     add jar /path/to/<myUDFs.jar>;

    You will know that the preceding operation succeeded if you see the following messages indicating that the JAR has been added to the classpath and the distributed cache:

    Added /path/to/<myUDFs.jar> to class path
    Added resource: /path/to/<myUDFs.jar>     
  6. Create the function definition trust_source as an alias to TrustSourceUDF at whatever source package you specified in your JAR:
    create temporary function trust_source as '<fully_qualified_path_to_TrustSourceUDF>';

    The shell should prompt you that the command executed successfully. If you see the following error, it usually indicates that your class was not found on the classpath:

    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
  7. Test the function with the following query. You should see mostly yes printed on each line of the console, with a few no's here and there; for a more compact check, see the aggregate query after this list:
    select trust_source(source) from acled_nigeria_cleaned;
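
As a quick sanity check, grouping by the UDF's output is standard HiveQL, so a query along the following lines tallies the yes and no labels in a single pass:

select trust_source(source), count(*)
from acled_nigeria_cleaned
group by trust_source(source);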

How it works...

The class TrustSourceUDF extends UDF. The base class does not require any specific methods to be implemented; however, in order for the class to function at Hive runtime as a UDF, your subclass must define evaluate(). Hive locates evaluate() through reflection, which is also why you can supply one or more overloaded evaluate() methods with different arguments. Ours only needs to take in a source value to check.
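
To illustrate the overloading point, a second evaluate() signature could sit alongside the first. The following sketch is hypothetical and not drawn from the recipe; it assumes an additional import of org.apache.hadoop.io.IntWritable and an arbitrary fatality threshold:

    // Hypothetical overload: distrust a flagged source, and also any record
    // reporting an implausibly high fatality count (the threshold is arbitrary).
    public Text evaluate(Text source, IntWritable fatalities) {
        if (source == null || untrustworthySources.contains(source.toString())
                || (fatalities != null && fatalities.get() > 10000)) {
            result.set("no");
        } else {
            result.set("yes");
        }
        return result;
    }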

During class initialization, we set up a static instance of the java.util.Set class named untrustworthySources. Within a static initialization block, we set up a few sources by their names to be flagged as unreliable.

Note

The entries here are purely arbitrary and should not be considered reliable or unreliable outside of this recipe.

We flag an empty source as unreliable.

When the function is invoked, it expects a single Text instance to be checked against the sources we've flagged as unreliable. It returns yes or no depending on whether the given source appears in the set of unreliable sources. We set up the private Text instance to be reused every time the function is called, which avoids allocating a new Writable for every row.

Once the JAR file containing the class has been added to the classpath and the temporary function definition is in place, we can use the UDF across many different queries.
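
For example, because the function composes with ordinary HiveQL clauses, a query such as the following, one illustrative possibility, lists only the distrusted sources:

select distinct source
from acled_nigeria_cleaned
where trust_source(source) = 'no';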

There's more...

User-defined functions are a very powerful feature within Hive. The following sections list a bit more information regarding them:

Check out the existing UDFs

The Hive documentation has a great explanation of the built-in UDFs bundled with the language; the write-up is available at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-BuiltinAggregateFunctions%28UDAF%29.

To see which functions are available in your specific version of Hive, issue the following command in the Hive shell:

show functions;

Once you pinpoint a function that looks interesting, learn more information about it from the Hive wiki or directly from the Hive shell by executing the following command:

describe function <func>;
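
Hive also accepts an extended form that prints additional detail, including usage examples for many of the built-ins:

describe function extended <func>;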

User-defined table and aggregate functions

Hive UDFs are not limited to a one-to-one mapping between input rows and output values. The API also allows a single input to generate many output rows (GenericUDTF), as well as custom aggregate functions that reduce a list of input rows to a single value (UDAF).
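
As a rough illustration of the table-generating side, the following minimal GenericUDTF sketch splits a delimited string into one output row per token. The class name, the semicolon delimiter, and the output column name are all assumptions chosen for demonstration:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ExplodeSourcesUDTF extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args)
            throws UDFArgumentException {
        // Declare a single string output column named single_source
        List<String> fieldNames = new ArrayList<String>();
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("single_source");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory
            .getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] record) throws HiveException {
        // Emit one row per semicolon-delimited token of the first argument
        String raw = record[0] == null ? "" : record[0].toString();
        for (String token : raw.split(";")) {
            forward(new Object[] { token.trim() });
        }
    }

    @Override
    public void close() throws HiveException {
        // no buffered rows to flush
    }
}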

Export HIVE_AUX_JARS_PATH in your environment

Adding JAR files dynamically to the classpath is useful for testing and debugging, but can be cumbersome if you have many libraries you repeatedly wish to use. The Hive command line interpreter will automatically look for the existence of HIVE_AUX_JARS_PATH in the executing user's environment. Use this environment variable to set additional JAR paths that will always get loaded in the classpath of new Hive sessions for that client machine.
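
For instance, a line such as the following in your shell profile (the paths are placeholders, and the variable takes a comma-separated list of JARs) makes these libraries available to every new Hive session:

export HIVE_AUX_JARS_PATH=/path/to/<myUDFs.jar>,/path/to/other-udfs.jar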

See also

  • Using Hive date UDFs to transform and sort event dates from geographic event data
  • Using Hive to build a per-month report of fatalities over geographic event data