Skipping Header and Footer rows in Hive

In this blog post we will explain you how to “skip header and footer rows in hive”.

In Hive we can ignore N number of rows from top and bottom from a file using TBLPROPRTIES clause. The TBLPROPERTIES clause provides various feature which can be set as per our need.

 

Sample text file

 

Now assume that we are dealing with above text file which is being generated by an automated tool which adds 3 additional rows at the very beginning of the file and 2 additional rows at the end of the file. These extra rows are added to provide additional details about the file itself.

The first 3 rows contain the below details:

  1. Name of the tool which generates the file,
  2. Date along with time on which the file was generated and
  3. The file number (a sequential number which resets at the start of every month) for that particular month.

Similarly, the last 2 rows contain the below details:

  1. The total number of rows in the file (which can be used to verify missing records) and
  2. The date along with time on which the file was last modified.

Download the data set

The total number of rows in the file (which can be used to verify missing records) and
The date along with time on which the file was last modified.

CREATE DATABASE IF NOT EXISTS testdb;
USE testDB;
 
DROP TABLE IF EXISTS testdb.sample_table;
CREATE TABLE IF NOT EXISTS testdb.sample_table
(
    log_date string,
    row_num int,
    pkt_number string,
    pkt_color string,
    pkt_size string,
    pkt_quality string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n' STORED AS TEXTFILE
TBLPROPERTIES('skip.header.line.count'='3', 'skip.footer.line.count'='2');