
Deduplicating within GROUP BY in Hive

Jun 07, 2016 pm 04:37 PM


When using Hive, we often run into the following requirement:

Group the rows by id, then deduplicate another column. For example, given data like this:

id pic
1.jpg
2.jpg
1.jpg

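Here, neither DISTINCT nor a GROUP BY on both columns will do, because each still returns one row per distinct (id, pic) pair rather than one row per id. Instead we can use the UDAF collect_set(col): within each GROUP BY key it deduplicates the values of col as a set and returns them as an array.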
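To make the difference concrete, here is a minimal plain-Python model of the two behaviours. It is not Hive code, only a sketch of the semantics. The id values and the extra row for id 2 are hypothetical, since the sample above shows only the pic column, and the helper name collect_set_by_id is made up for illustration.

```python
# Plain-Python model of the semantics; this is not Hive code.
# The id values are hypothetical: the sample data only shows the pic column.
rows = [(1, "1.jpg"), (1, "2.jpg"), (1, "1.jpg"), (2, "3.jpg")]

# Model of SELECT DISTINCT id, pic (or GROUP BY id, pic):
# one output row per distinct (id, pic) pair, so an id can still span several rows.
distinct_pairs = sorted(set(rows))


def collect_set_by_id(rows):
    """Model of SELECT id, collect_set(pic) ... GROUP BY id."""
    groups = {}
    for key, pic in rows:
        # A dict used as an ordered set: duplicates collapse, first-seen order is kept.
        groups.setdefault(key, {})[pic] = None
    return {key: list(pics) for key, pics in groups.items()}


collected = collect_set_by_id(rows)

print(distinct_pairs)  # id 1 appears in two rows
print(collected)       # exactly one entry per id, holding an array of unique pics
```

The model keeps first-seen order only because Python dicts do; the order of elements in the array Hive returns should not be relied on.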

As another example, we can deduplicate pic and concatenate the results:
SELECT id, CONCAT_WS(',', COLLECT_SET(pic)) FROM tbl GROUP BY id
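Here CONCAT_WS is a UDF and COLLECT_SET is a UDAF: it deduplicates the pic values within each group and converts them into an array, which is convenient for the UDF to consume.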
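As a rough illustration of what the query above computes, the following plain-Python sketch models the aggregate step (COLLECT_SET) followed by the scalar step (CONCAT_WS). It is not Hive code. The input rows and the function name concat_ws_collect_set are hypothetical, and sorting the values is a choice made here purely so the sketch prints the same thing every time.

```python
from collections import defaultdict

# Hypothetical (id, pic) rows standing in for the table tbl.
rows = [(1, "1.jpg"), (1, "2.jpg"), (1, "1.jpg"), (2, "3.jpg")]


def concat_ws_collect_set(rows, sep=","):
    """Model of SELECT id, CONCAT_WS(sep, COLLECT_SET(pic)) FROM tbl GROUP BY id."""
    unique_pics = defaultdict(set)
    for key, pic in rows:
        # Aggregate step (the UDAF): gather the distinct pic values per id.
        unique_pics[key].add(pic)
    # Scalar step (the UDF): join each group's array into one string.
    # Sorting is an assumption of this sketch, used only for reproducible output.
    return {key: sep.join(sorted(pics)) for key, pics in unique_pics.items()}


result = concat_ws_collect_set(rows)
for key, joined in result.items():
    print(key, joined)
```

If a stable order matters in Hive itself, wrapping the array in sort_array before CONCAT_WS is one option worth checking against the Hive version in use.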

PS: If you don't need deduplication, use COLLECT_LIST instead.
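The contrast between the two aggregates can be sketched in plain Python as follows. Again this models the semantics rather than being Hive code; the rows and the function names are hypothetical.

```python
# Hypothetical (id, pic) rows; the ids are made up for illustration.
rows = [(1, "1.jpg"), (1, "2.jpg"), (1, "1.jpg")]


def collect_list(rows):
    """Model of COLLECT_LIST(pic) ... GROUP BY id: every value is kept."""
    groups = {}
    for key, pic in rows:
        groups.setdefault(key, []).append(pic)
    return groups


def collect_set(rows):
    """Model of COLLECT_SET(pic) ... GROUP BY id: duplicates are dropped."""
    # dict.fromkeys removes repeats while keeping first-seen order.
    return {key: list(dict.fromkeys(pics)) for key, pics in collect_list(rows).items()}


as_list = collect_list(rows)
as_set = collect_set(rows)

print(as_list)  # the repeated 1.jpg is still there
print(as_set)   # the repeated 1.jpg is gone
```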

For more UDAFs, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
