Hive中除了支持和传统数据库中一样的内关联、左关联、右关联、全关联,还支持LEFT SEMI JOIN和CROSS JOIN,但这两种JOIN类型也可以用前面的代替。
如何实现join?下面展示实操记录。
两张表的数据:
1)内关联(JOIN) 只返回能关联上的结果。
SELECT a.id, a.name,b.age FROM student a INNER JOIN studentssb b ON (a.id = b.id);(INNER可以省略)0: jdbc:hive2://hadoop1012:10000> SELECT a.id, a.name,b.age FROM student a INNER JOIN studentssb b ON (a.id = b.id);+-------+------------+--------+| a.id | a.name | b.age |+-------+------------+--------+| 1017 | student17 | 14 || 1018 | student18 | 18 || 1019 | student19 | 24 || 1001 | student1 | 12 || 1002 | student2 | 14 || 1003 | student3 | 18 || 1004 | student4 | 14 || 1005 | student5 | 18 || 1006 | student6 | 24 || 1007 | student7 | 15 || 1008 | student8 | 17 || 1009 | student9 | 18 || 1010 | student10 | 14 || 1011 | student11 | 12 || 1013 | student13 | 14 || 1014 | student14 | 15 || 1015 | student15 | 17 || 1016 | student16 | 18 |+-------+------------+--------+2) 左外关联(LEFT [OUTER] JOIN) 以LEFT [OUTER] JOIN关键字前面的表作为主表,和其他表进行关联,返回记录和主表的记录数一致,关联不上的字段置为NULL。是否指定OUTER关键字,貌似对查询结果无影响。
select a.id,a.name,b.age from student a left join studentssb b ON (a.id=b.id);0: jdbc:hive2://hadoop1012:10000> select a.id,a.name,b.age from student a left join studentssb b ON (a.id=b.id);+-------+------------+--------+| a.id | a.name | b.age |+-------+------------+--------+| 1001 | student1 | 12 || 1002 | student2 | 14 || 1003 | student3 | 18 || 1004 | student4 | NULL || 1005 | student5 | NULL || 1006 | student6 | NULL || 1007 | student7 | NULL || 1008 | student8 | NULL || 1009 | student9 | NULL || 1010 | student10 | NULL || 1011 | student11 | NULL || 1012 | student12 | NULL || 1013 | student13 | NULL || 1014 | student14 | NULL || 1015 | student15 | NULL || 1016 | student16 | NULL |+-------+------------+--------+3)右外关联(RIGHT [OUTER] JOIN) 和左外关联相反,以RIGTH [OUTER] JOIN关键词后面的表作为主表,和前面的表做关联,返回记录数和主表一致,关联不上的字段为NULL。是否指定OUTER关键字,貌似对查询结果无影响。
SELECT a.id, a.name, b.age FROM student a RIGHT OUTER JOIN studentssb b ON (a.id = b.id);0: jdbc:hive2://hadoop1012:10000> SELECT a.id, a.name, b.age FROM student a RIGHT OUTER JOIN studentssb b ON (a.id = b.id);+-------+------------+--------+| a.id | a.name | b.age |+-------+------------+--------+| 1001 | student1 | 12 || 1002 | student2 | 14 || 1003 | student3 | 18 || 1004 | student4 | 14 || 1005 | student5 | 18 || 1006 | student6 | 24 || 1007 | student7 | 15 || 1008 | student8 | 17 || 1009 | student9 | 18 || 1010 | student10 | 14 || 1011 | student11 | 12 || 1013 | student13 | 14 || 1014 | student14 | 15 || 1015 | student15 | 17 || 1016 | student16 | 18 || 1017 | student17 | 14 || 1018 | student18 | 18 || 1019 | student19 | 24 || NULL | NULL | 15 || NULL | NULL | 17 |+-------+------------+--------+4) 全外关联(FULL [OUTER] JOIN) 以两个表的记录为基准,返回两个表的记录去重之和,关联不上的字段为NULL。是否指定OUTER关键字,貌似对查询结果无影响。
注意:FULL JOIN时候,Hive不会使用MapJoin来优化。
SELECT a.id, a.name, b.age FROM student a FULL OUTER JOIN studentssb b ON (a.id = b.id);0: jdbc:hive2://hadoop1012:10000> SELECT a.id, a.name, b.age FROM student a FULL OUTER JOIN studentssb b ON (a.id = b.id);+-------+------------+--------+| a.id | a.name | b.age |+-------+------------+--------+| 1001 | student1 | 12 || 1002 | student2 | 14 || 1003 | student3 | 18 || 1004 | student4 | 14 || 1005 | student5 | 18 || 1006 | student6 | 24 || 1007 | student7 | 15 || 1008 | student8 | 17 || 1009 | student9 | 18 || 1010 | student10 | 14 || 1011 | student11 | 12 || 1012 | student12 | NULL || 1013 | student13 | 14 || 1014 | student14 | 15 || 1015 | student15 | 17 || 1016 | student16 | 18 || 1017 | student17 | 14 || 1018 | student18 | 18 || 1019 | student19 | 24 || NULL | NULL | 15 || NULL | NULL | 17 || 1022 | student22 | NULL |+-------+------------+--------+5) LEFT SEMI JOIN 以LEFT SEMI JOIN关键字前面的表为主表,返回主表的KEY也在副表中的记录。
SELECT a.id, a.name FROM student a LEFT SEMI JOIN studentssb b ON (a.id = b.id);0: jdbc:hive2://hadoop1012:10000> SELECT a.id, a.name FROM student a LEFT SEMI JOIN studentssb b ON (a.id = b.id);+-------+------------+| a.id | a.name |+-------+------------+| 1017 | student17 || 1018 | student18 || 1019 | student19 || 1001 | student1 || 1002 | student2 || 1003 | student3 || 1004 | student4 || 1005 | student5 || 1006 | student6 || 1007 | student7 || 1008 | student8 || 1009 | student9 || 1010 | student10 || 1011 | student11 || 1013 | student13 || 1014 | student14 || 1015 | student15 || 1016 | student16 |+-------+------------+6) 笛卡尔积关联(CROSS JOIN) 返回两个表的笛卡尔积结果,不需要指定关联键。
SELECT a.id, a.name, b.age FROM student a CROSS JOIN studentssb b;