Deciding when to use dynamic partitioning can be very challenging. We recommend using dynamic partitioning in these common scenarios:
- Loading from an existing table that is not partitioned: In this scenario, the table isn't partitioned initially because it is expected to remain relatively small. Over time, however, the table grows quite large and performance issues begin to appear. These issues can be corrected with a one-time load that dynamically partitions the table.
- Unknown values for partition columns: In some scenarios, it's difficult to know the unique values of the partition columns without manually inspecting the data. Manual inspection isn't a realistic option when dealing with batch pipelines or terabytes of data. You could write a Hive query to retrieve every unique combination of partition column values, but if the result contains many combinations, you'd then have to create and execute an "ALTER TABLE" statement for each one. Running the query, preparing the statements, and executing them would significantly delay your data pipelines, and the whole process is cumbersome and error-prone. Static partitioning is therefore a poor fit under these circumstances. Dynamic partitioning, however, can come to your rescue by offloading this work to Hive: Hive detects the unique value sets for the partition columns and creates the partitions on the fly.
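As an illustration, a one-time load from an unpartitioned table into a dynamically partitioned one might look like the following sketch (the table and column names are hypothetical):

```sql
-- Hypothetical unpartitioned source table "sales" that has grown large.
-- Create a partitioned copy; note that the partition column (country)
-- is declared only in PARTITIONED BY, not in the regular column list.
CREATE TABLE sales_partitioned (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (country STRING);

-- Allow dynamic partitioning for this session.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- One-time load: no value is given for "country", so Hive detects
-- the distinct values and creates one partition per country.
INSERT OVERWRITE TABLE sales_partitioned PARTITION (country)
SELECT order_id, amount, country FROM sales;
```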
Avoiding Dynamic Partitioning Pitfalls
So far, we’ve covered the advantages of dynamic partitioning. Now, let’s look at some common mistakes and pitfalls that users often encounter. The most common mistake when using dynamic (or even static) partitions is to specify an existing column name as a partition column. In Hive, partition columns are virtual columns, which help organize and fetch the data efficiently. If you see the following error while creating a table with partitions, it means a column in the partition specification is also a regular column in the table schema:
“Error in semantic analysis: Column repeated in partitioning columns”
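For instance, a table definition like the following (names are hypothetical) triggers this error, because the same column appears both in the schema and in the partition specification:

```sql
-- Incorrect: "country" is repeated in the column list and the
-- partition spec, causing "Column repeated in partitioning columns".
CREATE TABLE sales_bad (
  order_id BIGINT,
  country  STRING
)
PARTITIONED BY (country STRING);

-- Correct: declare the partition column only in PARTITIONED BY.
CREATE TABLE sales_ok (
  order_id BIGINT
)
PARTITIONED BY (country STRING);
```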
Some other important points to remember when using dynamic partitions are:
- Typically, data is loaded into a partitioned table in Hive with an “INSERT OVERWRITE TABLE <table_name> PARTITION (col1=value1, col2=value2…) SELECT <columns> FROM <table_name>…” query. With dynamic partitioning, however, values for the partition columns must not be specified; Hive detects the values automatically from the data.
- In order to detect the values for partition columns automatically, partition columns must be specified at the end of the “SELECT” clause in the same order as they are specified in the “PARTITIONED BY” clause while creating the table.
- If the table has only dynamic partition columns, the configuration setting hive.exec.dynamic.partition.mode must be set to nonstrict.
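The session settings and the column ordering described above can be sketched as follows (the table and column names are hypothetical):

```sql
-- Enable dynamic partitioning and relax strict mode, which otherwise
-- requires at least one static partition column in the insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition columns (year, month) come last in the SELECT list,
-- in the same order as in the table's PARTITIONED BY clause.
INSERT OVERWRITE TABLE logs_partitioned PARTITION (year, month)
SELECT log_id, message, year, month FROM logs_staging;
```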
Hive enforces a limit on the number of dynamic partitions it can create. The default is 100 dynamic partitions per node, with a default total limit of 1000 dynamic partitions across all nodes. These limits are configurable, however. If your job tries to create more partitions than allowed, you may see the following exception:
metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to: 100
The setting hive.exec.max.dynamic.partitions controls the total number of dynamic partitions that can be created, whereas hive.exec.max.dynamic.partitions.pernode controls the maximum number of dynamic partitions that can be created by a single node. To fix the above exception, try gradually increasing the number of partitions allowed using these settings:
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=2048;
SET hive.exec.max.dynamic.partitions.pernode=256;