How to understand and analyze Apache Hive query execution plan for performance debugging
The document compares the query execution plans produced by Apache Hive and PostgreSQL. It shows that Hive's old-style execution plans are overly verbose and difficult to understand, providing many low-level details across multiple stages. In contrast, PostgreSQL's plans are more concise and readable, showing the logical query plan in a top-down manner with actual table names and fewer lines of text. The document advocates for Hive to adopt a simpler execution plan format similar to PostgreSQL's.
#2 Hive contributors have striven to improve Hive's performance and functionality. We assert that understanding and analyzing Apache Hive query execution plans is crucial for performance debugging. In this talk, we study why Apache Hive's query execution plans are so difficult to analyze today. We identify a set of pain points reported by our Apache Hive performance engineers and development engineers, as well as by real users and customers. We propose and demonstrate a new presentation data model that addresses these pain points. The three most critical parts of the presentation are (1) the estimated query execution cost, which is the planner's guess at how expensive the query will be to run (measured in number of rows); (2) the orchestration of the operator tree across consecutive M/R (Tez) jobs; and (3) integration and extension support with other presentation tools, e.g., Apache Ambari.
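To make the first two parts concrete, here is a minimal Python sketch of what such a presentation data model could look like. It is an illustration under stated assumptions, not Hive's actual implementation: the `Operator` class, its fields, the `render` function, and the sample plan are all hypothetical. Each operator carries an estimated row count and the name of the task (vertex) that runs it, and the tree is printed top-down, one line per operator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Operator:
    """One node of a hypothetical plan tree (not Hive's real classes)."""
    name: str        # relational operator, e.g. "Join" or "TableScan"
    est_rows: int    # planner's row estimate for this operator
    vertex: str      # task that runs the operator, e.g. "Map 1"
    detail: str = "" # table name, join condition, grouping key, ...
    children: List["Operator"] = field(default_factory=list)


def render(op: Operator, depth: int = 0) -> List[str]:
    """Return one line per operator, parents before children."""
    label = f"{op.name} {op.detail}".strip()
    lines = [f"{'  ' * depth}-> {label} [{op.vertex}] (rows={op.est_rows})"]
    for child in op.children:
        lines.extend(render(child, depth + 1))
    return lines


# Made-up plan and row counts, for illustration only.
plan = Operator("Group By", 10, "Reducer 2", "key", [
    Operator("Join", 500, "Reducer 2", "a.key = b.key", [
        Operator("TableScan", 1000, "Map 1", "a"),
        Operator("TableScan", 2000, "Map 3", "b"),
    ]),
])

print("\n".join(render(plan)))
```

Keeping the vertex name on each line shows how the operator tree is spread across tasks without giving every stage its own block of text.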
#4 We can see that Hive's old-style EXPLAIN output is quite verbose. Is that necessary?
#11 Although the Reduce Output Operator is crucial because it defines the boundary between a map task and a reduce task, we are questioning how much information about the Reduce Output Operator is.... SQL users care more about relational operators than about MR operators.
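One way to act on this observation is to keep the task boundary but drop the operator's details. The Python sketch below is a hypothetical illustration, not something the talk specifies: the `MR_OPERATORS` set, the `summarize` function, and the sample pipeline are assumptions made for this example.

```python
# Assumed set of MR-level operator names to collapse (hypothetical choice).
MR_OPERATORS = {"Reduce Output Operator"}

BOUNDARY = "-- map/reduce boundary --"


def summarize(pipeline):
    """Replace each MR-level operator with a one-line boundary marker."""
    return [BOUNDARY if name in MR_OPERATORS else name for name in pipeline]


# Illustrative operator sequence in the style of old EXPLAIN output.
pipeline = [
    "TableScan",
    "Filter Operator",
    "Reduce Output Operator",
    "Group By Operator",
    "File Output Operator",
]

for line in summarize(pipeline):
    print(line)
```

The reader still sees where the map side ends and the reduce side begins, while the relational operators remain the focus of the listing.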