Difference between collect_set and collect_list functions in Hive

collect_set_vs_collect_list_in_hive

Collect_set and collect_list functions in Hive

In this blog post I will explain what is the difference between collect_set and collect_list functions in Hive.

First let us a create a table for the data set shown below.

John 8 160
John 8 160
Karen 9 100
Peter 10 660
Peter 10 600
Karen 1 100
Peter 2 200
Peter 3 600
John 4 80
Karen 5 120
William 6 170
John 7 140

 

hive (sales_db)> CREATE TABLE purchase_transaction(sales_rep_name STRING,purchase_order_id INT,transaction_amt INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’
STORED AS TEXTFILE;
hive (sales_db)> describe purchase_transaction;
OK
sales_rep_id string
purchase_order_id int
transaction_amt int
hive (sales_db)> select * from purchase_transaction;

Output:

John 8 160
John 8 160
Karen 9 100
Peter 10 660
Peter 10 600
Karen 1 100
Peter 2 200
Peter 3 600
John 4 80
Karen 5 120
William 6 170
John 7 140

Use of collect_set function

Function Usage:

Return Type Function Name Description
array collect_set(col) Returns a set of objects with duplicate elements eliminated.
hive (sales_db)> select sales_rep_name , collect_set(transaction_amt) from purchase_transaction group by sales_rep_id;

Output

John [160,80,140]
Karen [100,120]
Peter [660,600,200]
William [170]

Use of collect_list function

Function Usage:

Return Type Function Name Description
array collect_listcol) Returns a list of objects with duplicates.
hive (sales_db)> select sales_rep_name, collect_list(transaction_amt) from purchase_transaction group by sales_rep_name;

Output

John [160,160,80,140]
Karen [100,120]
Peter [660,600,200]
William [170]

Note : The above function collect_list is available only from 0.13.0

Download code

Download data set

Reference documentation


About the course

Big Data Hadoop Training in Bangalore provided by NPN Training is a program designed to help professionals gain proficiency to work with the latest and core components of Hadoop like MapReduce, Hive, HBase, Sqoop, Quartz Scheduler, Pig and more. The course is prepared with inputs from the best in the industry.

 

Related Post
Skipping Header and Footer rows in Hive In this blog post we will explain you how to "skip header and footer rows in hive". In Hive we can ignore N number of rows from top and bottom from...

Naveen P.N

12+ years of experience in IT with vast experience in executing complex projects using Java, Micro Services , Big Data and Cloud Platforms. I found NPN Training Pvt Ltd a India based startup to provide high quality training for IT professionals. I have trained more than 3000+ IT professionals and helped them to succeed in their career in different technologies. I am very passionate about Technology and Training. I have spent 12 years at Siemens, Yahoo, Amazon and Cisco, developing and managing technology.