Error Log: java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space.
Solutions: Increase the memory allocated to the container.
set spark.executor.memory=4g;
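If the job is launched with spark-submit instead of from a Hive session, the same setting can be passed on the command line (a minimal sketch; your_job.jar and the main class are placeholders, and the 4g value should be tuned to the cluster):
spark-submit --conf spark.executor.memory=4g --class com.example.YourJob your_job.jar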
Error Log:
22/11/28 08:24:43 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:611)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:813)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:719)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:787)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:148)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:547)
Root Cause: The hash table used by the GroupBy operator consumes too much memory, leading to an Out of Memory (OOM) error.
Solutions:
Set mapreduce.input.fileinputformat.split.maxsize=134217728 or mapreduce.input.fileinputformat.split.maxsize=67108864 in the configuration to reduce the split size, so that each task processes less data.
Increase the number of concurrent tasks by raising the value of spark.executor.instances.
Enhance the memory allocation for Spark executors by adjusting the spark.executor.memory parameter to a higher value.
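A minimal sketch of applying these for the current session (the instance count and memory size are illustrative values, not from the original):
set mapreduce.input.fileinputformat.split.maxsize=134217728;
set spark.executor.instances=8;
set spark.executor.memory=6g;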
Error Log:
FAILED: Execution ERROR, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timeout
Root Cause: The table likely has too many partitions, so the drop operation takes so long that the Hive Metastore client's network connection times out.
Solutions:
Increase the Metastore client socket timeout, then drop the partitions:
set hive.metastore.client.socket.timeout=1200s;
alter table [TableName] DROP IF EXISTS PARTITION (ds<='20220720');
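If a single DROP still times out, the partition range can be dropped in smaller batches (an illustrative sketch; the intermediate date bound is arbitrary):
alter table [TableName] DROP IF EXISTS PARTITION (ds<='20220401');
alter table [TableName] DROP IF EXISTS PARTITION (ds<='20220720');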
Problem Description: A select count(1) query returns an inaccurate result.
Root Cause: The select count(1) query uses Hive table statistics, but the statistics for this table are inaccurate.
Solutions: Modify the configuration to disable the use of statistics:
set hive.compute.query.using.stats=false;
Or use the analyze command to recalculate the table statistics:
analyze table <table_name> compute statistics;
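For a partitioned table, statistics can also be gathered per partition (a sketch; <table_name> and the ds partition key are placeholders):
analyze table <table_name> partition (ds) compute statistics;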
Symptoms: A join job runs slowly or fails because the data is skewed and a few tasks process far more records than the rest.
Solutions:
Enable the skew join optimization:
set hive.optimize.skewjoin=true;
Increase the number of concurrent tasks by raising the value of spark.executor.instances.
Enhance the memory allocation for Spark executors by adjusting the spark.executor.memory parameter to a higher value.
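With skew join optimization enabled, Hive treats a join key as skewed once its row count crosses a threshold, which can also be tuned (a sketch; 100000 is the documented default of hive.skewjoin.key):
set hive.skewjoin.key=100000;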
Problem Description: After creating an external table, the query returns no data.
An example of an external table creation statement is as follows.
CREATE EXTERNAL TABLE storage_log(content STRING) PARTITIONED BY (ds STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION 'hdfs:///your-logs/airtake/pro/storage';
The query returns no data.
select * from storage_log;
Root Cause: Hive does not automatically associate the partition directories under the specified location with the table.
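The root cause can be confirmed by listing the partitions Hive has registered for the table; the list will be empty (show partitions is standard HiveQL):
show partitions storage_log;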
Solutions:
Manually add the partitions to the table.
alter table storage_log add partition(ds=123);
The query now returns data.
select * from storage_log;
The data returned is as follows.
OK
abcd    123
efgh    123
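As an alternative to adding each partition by hand, Hive's msck repair table command can register all partition directories that already exist under the table's location (this assumes the directories follow the ds=<value> layout; it is an alternative not covered in the steps above):
msck repair table storage_log;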