Hive User-Defined Functions
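When Hive's built-in functions cannot meet your processing needs, consider writing a user-defined function (UDF).
Hive currently supports writing user-defined functions only in Java. If you need another language, such as Python, consider the transform syntax mentioned in the previous section.
Hive supports three kinds of user-defined functions; we cover them one by one.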
UDF
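This is the ordinary user-defined function. It accepts a single row as input and produces a single row as output.
The Java code is as follows: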
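package com.oserp.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class PassExam extends UDF {

    // Assumption: a passing mark of 60 (the threshold is not given in the text).
    private static final int PASS_THRESHOLD = 60;

    public Text evaluate(Integer score) {
        // A NULL score yields a NULL result instead of a NullPointerException.
        if (score == null) {
            return null;
        }
        Text result = new Text();
        if (score < PASS_THRESHOLD)
            result.set("Failed");
        else
            result.set("Pass");
        return result;
    }
}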
Then package it into a .jar file, for example hiveudf.jar.
Run the following statements:
add jar /home/user/hadoop_jar/hiveudf.jar;
create temporary function pass_score as 'com.oserp.hiveudf.PassExam';
select stuNo,pass_score(score) from student;
The output is:
N0101 Pass
N0102 Failed
N0201 Pass
N0103 Pass
N0302 Pass
N0202 Pass
N0203 Pass
N0301 Failed
N0306 Pass
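The first statement registers the jar file; the second gives the user-defined function an alias, binding a function name to the Java class; the third calls the function.
In the Java code, the function's class extends the UDF class and provides an evaluate method. This method takes an integer as its parameter and returns a string (a Text value). The structure is straightforward. The evaluate method is not declared in an interface, because in practice the number and types of a function's parameters vary.
The UDF name above is case-insensitive; for example, calling it as PASS_SCORE also works (it is an alias inside Hive, not a Java class name).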
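Because evaluate is not fixed by an interface, the caller has to locate a suitable overload at run time. The following is a minimal sketch of that idea using plain Java reflection, with no Hive dependency. The class PassExamLike, the helper call, and the passing mark of 60 are all hypothetical names and values chosen for illustration; this is not Hive's actual method resolver, which is considerably more involved.

```java
import java.lang.reflect.Method;

public class EvaluateLookupDemo {

    // Hypothetical stand-in for a UDF class: no interface, just evaluate overloads.
    public static class PassExamLike {
        public String evaluate(Integer score) {
            if (score == null) {
                return null;
            }
            // Assumption: 60 is the passing mark, used for illustration only.
            return score < 60 ? "Failed" : "Pass";
        }

        public String evaluate(Double score) {
            if (score == null) {
                return null;
            }
            return score < 60.0 ? "Failed" : "Pass";
        }
    }

    // Looks up the public evaluate overload whose single parameter type equals
    // the argument's runtime class, then invokes it. arg must be non-null.
    public static Object call(Object udf, Object arg) {
        try {
            Method m = udf.getClass().getMethod("evaluate", arg.getClass());
            return m.invoke(udf, arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching evaluate method", e);
        }
    }

    public static void main(String[] args) {
        PassExamLike udf = new PassExamLike();
        System.out.println(call(udf, 75));    // dispatches to evaluate(Integer)
        System.out.println(call(udf, 42.5));  // dispatches to evaluate(Double)
    }
}
```

The point of the sketch is only that overloads with different parameter lists can coexist on one class, which a single interface method could not express.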
When you are done, run the following statement to drop the function alias:
Drop temporary function pass_score;
UDAF
User-defined aggregate function. It accepts multiple rows as input and produces a single row as output, like the MAX and COUNT functions.
Write the following Java code:
package com.oserp.hiveudf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;

public class HiveAvg extends UDAF {

    public static class AvgEvaluate implements UDAFEvaluator {

        // Partial aggregation state: number of values seen and their running total.
        public static class PartialResult {
            public int count;
            public double total;

            public PartialResult() {
                count = 0;
                total = 0;
            }
        }

        private PartialResult partialResult;

        @Override
        public void init() {
            partialResult = new PartialResult();
        }

        public boolean iterate(IntWritable value) {
            // partialResult must be checked for null here, otherwise an error occurs.
            // The reason is that init is called only once; it is not re-run to
            // initialize each partial aggregation.
            if (partialResult == null) {
                partialResult = new PartialResult();
            }
            if (value != null) {
                partialResult.total = partialResult.total + value.get();
                partialResult.count = partialResult.count + 1;
            }
            return true;
        }

        public PartialResult terminatePartial() {
            return partialResult;
        }

        public boolean merge(PartialResult other) {
            // Nothing to merge if the other side produced no partial result.
            if (other == null) {
                return true;
            }
            if (partialResult == null) {
                partialResult = new PartialResult();
            }
            partialResult.total = partialResult.total + other.total;
            partialResult.count = partialResult.count + other.count;
            return true;
        }

        public DoubleWritable terminate() {
            // With no values aggregated, return NULL rather than NaN.
            if (partialResult == null || partialResult.count == 0) {
                return null;
            }
            return new DoubleWritable(partialResult.total / partialResult.count);
        }
    }
}
Then package it into a jar file, for example hiveudf.jar.
Run the following statements:
add jar /home/user/hadoop_jar/hiveudf.jar;
create temporary function avg_udf as 'com.oserp.hiveudf.HiveAvg';
select classNo, avg_udf(score) from student group by classNo;
The output is as follows:
C01 68.66666666666667
C02 80.66666666666667
C03 73.33333333333333
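With reference to the figure above (from Hadoop: The Definitive Guide), let's look at each function:
• init is similar to a constructor and is used to initialize the function.
Note the init function in the red box in the figure above. At run time, no matter how many parts Hive splits the record set into (for example, file1 and file2 in the figure), init is called only once. The example in the figure is therefore misleading, which is why the code above carries a special comment about it. Put another way, init should not hold logic that initializes partial-aggregation values; it should handle global data logic instead.
• iterate performs the aggregation. It is called each time a new value is aggregated.
• terminatePartial is called after a partial aggregation completes, when Hive wants the aggregate result for a subset of the records.
• merge combines previously obtained partial aggregation results (which can also be understood as the aggregate results of individual chunks of records).
• terminate returns the final aggregation result.
As we can see, the parameter type of merge and the return type of terminatePartial must be the same.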
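To make the call order concrete, here is a minimal sketch that imitates the five methods in plain Java, with no Hive dependency. The class names, the average helper, and the sample values are hypothetical; the driver loop only approximates how partial results from separate chunks might be combined, and it creates a fresh evaluator per chunk, which is an assumption of the sketch rather than a description of what Hive does.

```java
import java.util.Arrays;
import java.util.List;

public class UdafLifecycleDemo {

    // Partial aggregation state, mirroring PartialResult in the article.
    public static class PartialResult {
        public int count;
        public double total;
    }

    // Hypothetical evaluator with the same five-method shape as a UDAF evaluator.
    public static class AvgEvaluator {
        private PartialResult state;

        public void init() {
            state = new PartialResult();
        }

        public boolean iterate(Integer value) {
            if (state == null) {
                state = new PartialResult();
            }
            if (value != null) {
                state.total += value;
                state.count++;
            }
            return true;
        }

        public PartialResult terminatePartial() {
            return state;
        }

        // The parameter type matches the return type of terminatePartial.
        public boolean merge(PartialResult other) {
            if (state == null) {
                state = new PartialResult();
            }
            if (other != null) {
                state.total += other.total;
                state.count += other.count;
            }
            return true;
        }

        public Double terminate() {
            if (state == null || state.count == 0) {
                return null;
            }
            return state.total / state.count;
        }
    }

    // Aggregates each chunk on its own, then merges the partial results.
    public static Double average(List<List<Integer>> chunks) {
        AvgEvaluator finalStage = new AvgEvaluator();
        finalStage.init();
        for (List<Integer> chunk : chunks) {
            AvgEvaluator partialStage = new AvgEvaluator();
            partialStage.init();
            for (Integer value : chunk) {
                partialStage.iterate(value);
            }
            finalStage.merge(partialStage.terminatePartial());
        }
        return finalStage.terminate();
    }

    public static void main(String[] args) {
        List<List<Integer>> chunks = Arrays.asList(
                Arrays.asList(60, 70),
                Arrays.asList(80, 90, 100));
        System.out.println(average(chunks)); // prints 80.0
    }
}
```

Carrying count and total together in the partial result is what lets merge produce a correct average; merging per-chunk averages directly would give the wrong answer when chunks differ in size.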
UDTF
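User-defined table-generating function. It accepts a single row as input and produces multiple rows as output (that is, a table). It is not used very often, so it is not covered in detail here.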
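For orientation only, the one-row-in, many-rows-out shape can be sketched in plain Java without Hive. The names process, forward, and explode below are hypothetical, loosely echoing the process-and-forward style of Hive's GenericUDTF, and the comma-splitting behavior is just an illustrative choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class UdtfSketch {

    // Hypothetical table-generating function: one comma-separated input value
    // is turned into one output row per non-empty item.
    public static void process(String input, Consumer<String> forward) {
        if (input == null) {
            return; // a NULL input produces no rows
        }
        for (String item : input.split(",")) {
            String trimmed = item.trim();
            if (!trimmed.isEmpty()) {
                forward.accept(trimmed);
            }
        }
    }

    // Convenience wrapper that collects the emitted rows into a list.
    public static List<String> explode(String input) {
        List<String> rows = new ArrayList<>();
        process(input, rows::add);
        return rows;
    }

    public static void main(String[] args) {
        for (String row : explode("math, physics,chemistry")) {
            System.out.println(row);
        }
    }
}
```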
