SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.
The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De).
What is RegexSerDe
Regex stands for a regular expression. Whenever you want to have a kind of pattern matching, based on the pattern matching, you have to store the fields. RegexSerDe is present in org.apache.hadoop.hive.contrib.serde2.RegexSerDe.
In the SerDeproperties, you have to define your input pattern and output fields.
Let’s consider employee dataset
If you want to extract the column values from the above dataset you can use the below mentioned pattern
‘input.regex’ = ‘(.*)$(.*)|(.*)’
To specify how to store them, you can use
‘output.format.string’ = ‘%1$s%2$s%3$s’;
Create a table
CREATE TABLE employee(id INT,name STRING,exp INT) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex' = '^(\\d+)\\$(.*)\\|(\\d+)');
Load data into a table
load data local inpath '/home/naveen/Desktop/employee.txt' into table employee;
hive> select * from employee; employee.id employee.name employee.exp 1 Ram 2 2 Pavan 6 2 Girish 10
About the course
Big Data Hadoop Training in Bangalore provided by NPN Training is a program designed to help professionals gain proficiency to work with the latest and core components of Hadoop like MapReduce, Hive, HBase, Sqoop, Quartz Scheduler, Pig and more. The course is prepared with inputs from the best in the industry.