profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
Inserting the aggregated results of one table into another table:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
JOIN:
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
Multi-table insert (writing one source table into several tables and directories in a single scan):
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
Streaming data through an external script:
hive> FROM invites a INSERT OVERWRITE TABLE events
      SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like Hadoop streaming).
Similarly, streaming can be used on the reduce side (see the Hive Tutorial for examples).
3.2 Partition-based queries
- By default a SELECT query scans the entire table. When a table is created with a PARTITIONED BY clause, queries can take advantage of partition pruning (input pruning) and read only the relevant partitions.
- In the current Hive implementation, partition pruning is applied only if the partition predicates appear in the WHERE clause closest to the FROM clause.
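As a sketch, assuming a table page_views partitioned by a date-string column ds (illustrative names, following the standard Hive tutorial example):

```sql
-- ds is a partition column; the predicate sits in the WHERE clause
-- directly attached to the FROM, so only the matching partitions are read.
SELECT page_views.*
FROM page_views
WHERE page_views.ds >= '2008-03-01' AND page_views.ds <= '2008-03-31';
```

Without the partition predicate in that WHERE clause, Hive would scan every partition of the table.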
3.3 Join
Syntax
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
table_reference: table_factor | join_table
table_factor:
tbl_name [alias]
| table_subquery alias | ( table_references )
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression

- Hive supports only equality joins, outer joins, and left semi joins. Non-equality join conditions are not supported, because they are very hard to translate into map/reduce jobs.
- The LEFT, RIGHT and FULL OUTER keywords control how rows without a match are handled in a join.
- LEFT SEMI JOIN is a more efficient implementation of an IN/EXISTS subquery.
- In each map/reduce stage of a join, the reducer buffers the rows of every table in the join sequence except the last one, then streams the last table through to write the results to the file system. In practice, the largest table should therefore be written last.

A few key points to keep in mind when writing join queries:

- Only equality joins are supported:
  SELECT a.* FROM a JOIN b ON (a.id = b.id);
  SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department);
- More than two tables can be joined, for example:
  SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
  JOIN c ON (c.key = b.key2);
- If every join in the query uses the same join key, the joins are compiled into a single map/reduce job.

LEFT, RIGHT and FULL OUTER:

- Example:
  SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key);
- To restrict the output of a join, the filter can be written either in the WHERE clause or in the join (ON) clause.
- A subtle case arises when the joined tables are partitioned:
  SELECT c.val, d.val FROM c LEFT OUTER JOIN d ON (c.key = d.key)
  WHERE c.ds = '2009-07-07' AND d.ds = '2009-07-07';
- If d has no row matching a row of c, all columns of d are NULL in the output, including ds. The WHERE predicate on d.ds therefore filters out every row for which d had no match, so the LEFT OUTER JOIN effectively behaves like an inner join and the result no longer reflects the intent of the WHERE clause. The solution is to move the partition predicates into the ON clause:
  SELECT c.val, d.val FROM c LEFT OUTER JOIN d
  ON (c.key = d.key AND d.ds = '2009-07-07' AND c.ds = '2009-07-07');

LEFT SEMI JOIN:
- The restriction of LEFT SEMI JOIN is that the right-hand table may only be referenced in the ON clause; it cannot be filtered in the WHERE clause, the SELECT list, or anywhere else. For example:
  SELECT a.key, a.value FROM a
  WHERE a.key IN (SELECT b.key FROM b);
can be rewritten as:
  SELECT a.key, a.value
  FROM a LEFT SEMI JOIN b ON (a.key = b.key);

UNION ALL:
- UNION ALL combines the results of several SELECT queries; each SELECT must return the same number of columns with compatible types.
- Syntax: select_statement UNION ALL select_statement UNION ALL select_statement ...
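A minimal sketch, assuming two tables t1 and t2 with identical (key, value) schemas (illustrative names); note that in older Hive versions a UNION ALL must appear inside a subquery with an alias:

```sql
-- Both branches must return the same columns, in the same order.
SELECT u.key, u.value
FROM (
  SELECT key, value FROM t1
  UNION ALL
  SELECT key, value FROM t2
) u;
```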
4. Habits to change when moving from SQL to HiveQL
1. Hive does not support the implicit (comma-separated) join syntax.
- In SQL, an inner join of two tables can be written as:
  select * from dual a, dual b where a.key = b.key;
- In Hive it must be written with an explicit JOIN ... ON:
  select * from dual a join dual b on a.key = b.key;
rather than the traditional form:
  SELECT t1.a1 AS c1, t2.b1 AS c2 FROM t1, t2 WHERE t1.a2 = t2.b2;
2. The semicolon character
- A semicolon terminates a statement in SQL, and likewise in HiveQL, but HiveQL's parser is not as smart about semicolons that appear inside a statement. For example:
  select concat(key, concat(';', key)) from dual;
- HiveQL fails to parse this statement and reports:
  FAILED: Parse Error: line 0:-1 mismatched input '
- The workaround is to escape the semicolon with its octal ASCII code, rewriting the statement as:
  select concat(key, concat('\073', key)) from dual;
3. IS [NOT] NULL
- In SQL, NULL represents a missing value. Be warned that in HiveQL, if a String column holds an empty string (length 0), an IS NULL test on it evaluates to False.
4. Hive does not support inserting data into an existing table or partition;
it only supports overwriting the whole table, for example:
  INSERT OVERWRITE TABLE t1
  SELECT * FROM t2;
5. Hive does not support INSERT INTO, UPDATE, or DELETE operations.
- As a result, no complex locking mechanism is needed for reading and writing data.
- INSERT INTO syntax is only available starting in version 0.8; INSERT INTO appends data to a table or partition instead of overwriting it.
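A minimal sketch of the append form on Hive 0.8 or later (t1 and t2 are illustrative names):

```sql
-- Requires Hive 0.8+: appends the rows of t2 to t1 instead of overwriting it
INSERT INTO TABLE t1
SELECT * FROM t2;
```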
6. Hive supports embedding map/reduce programs to handle complex logic, for example:

FROM (
  MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
  FROM docs
  CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';

-- doctext: the input column
-- word, cnt: the output of the map program
-- CLUSTER BY: hashes the rows on word and feeds them to the reduce program
The map and reduce programs can also be used independently, for example:

FROM (
  FROM session_table
  SELECT sessionid, tstamp, data
  DISTRIBUTE BY sessionid SORT BY tstamp
) a
REDUCE sessionid, tstamp, data USING 'session_reducer.sh';

-- DISTRIBUTE BY: routes rows to the reduce program, here by sessionid
7. Hive can write transformed data directly into multiple tables, and also into partitions, HDFS files, and local directories, all in a single query.
This avoids the cost of scanning the input table multiple times.