Deduplication with Group By in Hive
When working with Hive, a common requirement comes up:
Group By a shared id, then deduplicate another column. For example, given the data below:
id    pic
1     1.jpg
1     2.jpg
1     1.jpg
Here, neither DISTINCT nor a two-column Group By will do. Instead, we can use the UDAF collect_set(col): within each Group By key it deduplicates the values into a set and returns them as an array.
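For instance, a minimal sketch against the sample data above (assuming the table is named tbl, as in the query further below; note that collect_set does not guarantee element order):

SELECT id, collect_set(pic)
FROM tbl
GROUP BY id;
-- For id = 1 this yields an array such as ["1.jpg","2.jpg"]; the duplicate 1.jpg is gone.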
As another example, we can deduplicate the pic values and then concatenate them:
SELECT id, CONCAT_WS(',', COLLECT_SET(pic)) FROM tbl GROUP BY id
Here CONCAT_WS is a UDF and COLLECT_SET is a UDAF: the latter deduplicates the grouped pic values and converts them into an array, which the UDF can then consume.
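On the sample data above, this query would return one row per id, e.g. (1, '1.jpg,2.jpg'), though again the element order inside the set is not guaranteed.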
PS: if you don't need deduplication, use COLLECT_LIST instead.
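A minimal sketch contrasting the two, again against the assumed tbl above:

SELECT id,
       CONCAT_WS(',', COLLECT_LIST(pic)) AS all_pics,  -- keeps duplicates, e.g. 1.jpg,2.jpg,1.jpg
       CONCAT_WS(',', COLLECT_SET(pic))  AS uniq_pics  -- deduplicated, e.g. 1.jpg,2.jpg
FROM tbl
GROUP BY id;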
For more UDAFs, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Original article: Deduplication with Group By in Hive. Thanks to the original author for sharing.
