# SQL Hive中的Select From语句使用详解 ## 1. Hive SQL基础概述 Apache Hive是构建在Hadoop之上的数据仓库工具,它提供了类似SQL的查询语言——HiveQL(简称HQL),允许用户对存储在HDFS中的大规模数据集进行查询和分析。其中`SELECT FROM`语句作为最基础且核心的查询语法,承担着数据检索的重要功能。 ### 1.1 Hive与传统SQL的异同 相同点: - 语法结构高度相似 - 支持标准SQL操作(SELECT, JOIN, WHERE等) - 使用表、列等关系型概念 不同点: - Hive处理PB级数据,传统SQL通常处理GB/TB级 - Hive执行延迟较高(分钟级),传统SQL是毫秒级响应 - Hive底层是MapReduce/Tez/Spark作业 ## 2. 基本SELECT FROM语法结构 ### 2.1 最简查询形式 ```sql SELECT column1, column2 FROM database_name.table_name;
示例:
SELECT employee_id, name FROM hr.employees;
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING having_condition] [ORDER BY col_list] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] [LIMIT number];
-- 选择单列 SELECT product_name FROM products; -- 选择多列 SELECT product_id, category, price FROM products;
-- 选择所有列(生产环境慎用) SELECT * FROM sales_records; -- 选择特定前缀的列 SELECT sales.* FROM sales_2023 sales;
SELECT user_id AS id, registration_date AS reg_date FROM users;
SELECT product_id, quantity, price, quantity * price AS total_amount FROM order_items;
-- 数值比较 SELECT * FROM products WHERE price > 100; -- 字符串比较 SELECT * FROM customers WHERE name = '张三'; -- 日期比较 SELECT * FROM orders WHERE order_date >= '2023-01-01';
-- AND 运算符 SELECT * FROM employees WHERE department = 'Sales' AND salary > 5000; -- OR 运算符 SELECT * FROM products WHERE category = 'Electronics' OR category = 'Appliances'; -- NOT 运算符 SELECT * FROM customers WHERE NOT (age BETWEEN 18 AND 30);
-- IN 操作符 SELECT * FROM products WHERE category IN ('Books', 'Music', 'Movies'); -- BETWEEN 范围查询 SELECT * FROM transactions WHERE amount BETWEEN 100 AND 1000; -- LIKE 模糊匹配 SELECT * FROM users WHERE email LIKE '%@gmail.com'; -- IS NULL 空值判断 SELECT * FROM employees WHERE manager_id IS NULL;
-- 默认升序(ASC) SELECT * FROM products ORDER BY price; -- 显式降序 SELECT * FROM sales ORDER BY sale_date DESC;
SELECT employee_id, name, department, salary FROM employees ORDER BY department ASC, salary DESC;
-- 获取价格最高的10个产品 SELECT * FROM products ORDER BY price DESC LIMIT 10;
-- 按部门统计员工数 SELECT department, COUNT(*) AS emp_count FROM employees GROUP BY department;
常用聚合函数: - COUNT() - 计数 - SUM() - 求和 - AVG() - 平均值 - MAX()/MIN() - 最大/最小值 - VARIANCE()/STDDEV() - 方差/标准差
示例:
SELECT product_category, AVG(price) AS avg_price, MAX(price) AS max_price, COUNT(*) AS product_count FROM products GROUP BY product_category;
-- 筛选平均价格大于100的类别 SELECT category, AVG(price) AS avg_price FROM products GROUP BY category HAVING avg_price > 100;
-- 获取唯一的部门列表 SELECT DISTINCT department FROM employees; -- 多列去重 SELECT DISTINCT department, job_title FROM employees;
SELECT product_id, price, CASE WHEN price > 1000 THEN 'Premium' WHEN price > 500 THEN 'Standard' ELSE 'Budget' END AS price_tier FROM products;
-- 计算部门内薪资排名 SELECT employee_id, name, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank FROM employees;
-- 查询特定分区的数据 SELECT * FROM sales WHERE dt = '2023-01-15' AND region = 'North'; -- 动态分区查询 SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict;
-- 创建分桶表 CREATE TABLE bucketed_users ( id INT, name STRING ) CLUSTERED BY (id) INTO 4 BUCKETS; -- 分桶查询 SELECT * FROM bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
-- 数据块抽样 SELECT * FROM large_table TABLESAMPLE(10 PERCENT); -- 行数抽样 SELECT * FROM large_table TABLESAMPLE(100 ROWS);
SELECT *
SET hive.exec.parallel=true;
-- 分析每月销售趋势 SELECT DATE_FORMAT(order_date, 'yyyy-MM') AS month, COUNT(DISTINCT order_id) AS order_count, SUM(amount) AS total_sales, AVG(amount) AS avg_order_value FROM orders WHERE order_date BETWEEN '2022-01-01' AND '2022-12-31' GROUP BY DATE_FORMAT(order_date, 'yyyy-MM') ORDER BY month;
-- 识别高价值用户 SELECT user_id, COUNT(DISTINCT session_id) AS session_count, SUM(CASE WHEN page_type = 'purchase' THEN 1 ELSE 0 END) AS purchase_count, SUM(duration) AS total_time_spent FROM user_sessions GROUP BY user_id HAVING purchase_count >= 3 ORDER BY total_time_spent DESC LIMIT 100;
数据倾斜:使用SKEW JOIN
优化
SET hive.optimize.skewjoin=true; SET hive.skewjoin.key=100000;
内存不足:调整Reducer数量
SET mapreduce.job.reduces=200;
小文件问题:合并输出
SET hive.merge.mapfiles=true; SET hive.merge.size.per.task=256000000;
Hive的SELECT FROM语句作为数据查询的基石,虽然语法与传统SQL相似,但在大数据环境下有其独特的优化方式和执行特性。掌握其使用方法和优化技巧,能够显著提升海量数据查询的效率。随着Hive版本的迭代,更多现代SQL特性如CTE、高级窗口函数等也被逐步支持,使得HiveQL在处理复杂分析场景时更加得心应手。
注意:实际使用时请根据具体Hive版本调整语法,本文示例基于Hive 3.x版本。 “`
这篇文章共计约2900字,全面介绍了Hive中SELECT FROM语句的使用方法,从基础语法到高级特性,包含了大量实用示例和优化建议。内容采用Markdown格式,包含代码块、列表、标题等标准元素,适合技术文档的编写和阅读。
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。