2020年，新年快乐！重新搭建了下博客，开始定期更新模式。

本文是前段时间阅读«hive权威指南»总结的笔记。

hiveql可以用来查询hdfs等分布式文件系统的数据。

我们经常会使用到hiveql来查询数据（在不涉及大量的数据迭代处理操作）。

本文大部分内容引用自«hive权威指南»。

基础 - 数据类型 #

基本数据类型 #

数据类型	长度	例子
tinyint	1byte有符号整数
smallint	2byte有符号整数
int	4byte有符号整数
bigint	8byte有符号整数
boolean	布尔类型
float	单精度浮点数
double	双精度浮点数
string	字符序列
timestamp
binary	字节数组

复合数据类型 #

数据类型	描述	示例
struct
map
array

数据库和表操作 #

数据库操作 - database #

创建数据库 - create #

hive> create database test;

判断是否存在:

hive> create database if not exists test;

为数据库添加描述信息：

hive> create database test comment "This is a example";

为数据库添加key-value属性信息：

hive> create database test
	> with dbproperties ('creator' = 'L0phTg', 'date' = '2019-08-23');

查看所有数据库 - show #

hive> show databases;

正则匹配：

hive> show databases like 'info.*';

查看数据库的描述信息 - describe #

hive> desc database test;

查看更多信息：

hive> desc database extended test;

使用数据库 - use #

即设置目标数据库为当前工作目录。

hive> use test;

删除数据库 - drop #

hive默认不允许用户删除一个包含有表的数据库。

hive> drop database if exists test;

如果想要强制删除库和库中的所有表，可以使用cascade关键字。

hive> drop database if exists test cascade;

表操作 - table #

查看表 #

显示当前工作数据库中的表。
```
hive> show tables;
```
显示指定数据库中的表。
```
hive> show tables in mydb;
```
正则匹配表名。
```
hive> show tables 'employee.*';
```
查看表的详细信息。
```
hive> desc extended mydb.employees;
```

创建表 #

创建。
```
hive> create table mytable;
```

额外的信息。

hive> create table if not exists mydb.employees (
	name	string comment 'Employee name',
	salary	float comment 'employee salary',
	subordinates array<string> comment 'Names of subordinates'
) 
   comment 'Descirption of the table'
   tblproperties ('creator'='L0phTg', 'date'='2019-08-23')
   location '/user/hive/warehouse/mydb.db/employees';

表分区 #

新建分区。

数据分区，依次按照year、month、day对表进行分区。
```
hive> create table AccessRecord (
	name 	string,
    job		string
) partitioned by (year int, month int, day int);
```
分区字段一旦建好，表现得就和普通的字段一样。

分区可以提高查询性能。
查看分区。
```
hive> show partitions AccessRecord;
```

分区过滤查询。

hive> show partitions AccessRecord(year=2019);

删除表 #

hive> drop table if exists mytable;

修改表 #

大多数表属性可以通过Alter table语句来进行修改。这种操作会修改元数据，但不会修改数据本身。

表重命名。

hive> alter table log_messages rename to logmsgs;

增加表分区。

hive> alter table log_messages add if not exists
	> partition (year=2019, month=8, day=1)
	> partition (year=2019, month=8, day=2)
	> partition (year=2019, month=8, day=3);

删除某个分区。

hive> alter table log_messages drop if exists partition(year=2019, month=9, day=3);

修改列信息。（用户可以对某个字段进行重命名，并修改其位置、类型或者注释）。

hive> alter table log_messages
	> change column hms hours_minutes_seconds int
	> comment 'The hours, minutes, and seconds part of the timestamp'
	> after severity; # after表示将字段移动到severity字段的后面

使用first关键字，可以将这个字段移动到第一个位置。

增加列。

hive> alter table log_messages add columns (
	app_name string comment 'Application Name',
	session_id long comment 'The current session id'
);

删除或者替换列。

hive> alter table log_messages replace columns (
	hours_mins_secs int comment 'hour, minute, seconds, from timestamp',
	severity string comment 'The message severity'
    message string comment 'The reset of the message'
);

这条语句重命名了之前的hms字段并且从之前的表定义的模式中移除了字段server和process_id。（因为是alter语句，所以只有表的元数据信息改变了）

数据操作 #

hive没有行级别的数据插入、数据更新和删除操作。那么往表中装载数据的唯一途径就是使用一种大量的数据装载操作。

load - 装载文件数据到内部表中 #

通过关键字overwrite来装载数据到表中。

hive> load data local inpath '$(env:HOME)/california-employees'
	> overwrite into table employees
	> partition (country = 'US', state = 'CA');

通常情况下指定的路径应该是一个目录，而不是单独的文件。

insert - 通过查询语句向表中插入数据 #

insert语句允许用户通过查询语句向目标表中插入数据。

hive> insert overwrite table employees
	> partition (count = 'US', state = 'OR')
	> select * from staged_employees se
	> where se.cnty = 'US' and se.st = 'OR';

这里使用了overwrite关键字，因此之前分区中的内容将会被覆盖掉。

如果这里使用了into关键字，那么hive会以追加的方式写入数据而不会覆盖掉之前存在的内容。

单个查询语句创建表 #

hive> create table ca_employees
	> as select name, salary, address
	> from employees
	> where se.state = 'CA';

Select 查询操作 #

select … from … #

hive> select name, salary from employees;
John Doe 	10000.0
Mary Smith 	8000.0
Todd Jones  7000.0
Bill King   6000.0

表别名查询 #

hive> select e.name, e.salary from employees e;

引用集合数据类型的元素 #

hive> select name, subordinate from employees;
John Doe 		["Mary Smith", "Todd Jones"]
Mary Smith		["Bill King"]
Todd Jones		[]
Bill King		[]
hive> select name, subordinate[0] from employees;
John Doe 		Mary Smith
Mary Smith		Bill King
Todd Jones		NULL
Bill King		NULL

引用Map元素 #

hive> select name, deductions["State Taxes"] from employees;
John Doe 		0.05
Mary Smith		0.05
Todd Jones 		0.03
Bill King		0.03

引用Struct中的元素 #

hive> select name, address.city from employees;
John Doe 		Chicago
Mary Smith		Chicago
Todd Jones		Oak Park
Bill King		Obscuria

使用正则表达式指定列 #

hive> select symbol, 'price.*' from stocks;
AAPL 143.33 123.33 132.43 133.34 155.33
AAPL ...
...
...

使用列值进行计算 #

用户不但可以选择表中的列，还可以使用函数和算术表达式来操作列值。

hive> select upper(name), salary, deductions("Federal Taxes"),
	> round(salary * (1 - deductions["Federal Taxes"])) from employees;
JOHN DOE 	10000.0		0.2 	8000  # 数据是我随便写的,
MARY SMITH	8000.0		0.2		6400
TODD JONES  7000.0 		0.15 	5900
BILL KING   6000.0		0.15	5100

算术表达式 #

运算符	类型	描述
A + B	数值	相加
A - B	数值	相减
A * B	数值	相乘
A / B	数值	相除
A % B	数值	取余
A & B	数值	与
A \| B	数值	或
A ^ B	数值	亦或
~A	数值	取反

函数 #

数学函数

返回值类型样式描述

bigint round(double d)

double round(double d, int n)

bigint floor(double d)

double rand() rand(int seed)

double exp(double d)

聚合函数

聚合函数是一类比较特殊的函数，可以对多行进行一些计算，然后得到一个结果值。

hive> select count(*), avg(salary) from employees;
4	77500.0

返回值类型	样式	描述
bigint	count(*)	计算总行数，包括含有NULL的行
bigint	count(expr)	计算提供的expr表达式的值非NULL的行数
bigint	count(distinct expr[,expr_.])	计算expr表达式的值去重后非NULL的行数
double	sum(col)	计算指定列的和
double	sum(distinct col)	计算指定列去重后的和
double	avg(col)	计算指定列的值的平均值
double	avg(distinct col)	计算指定列去重后的平均值
double	min(col)	计算指定列的最小值
double	max(col)	计算指定列的最大值



array	collect_set(col)	返回col的集合的数组

表生成函数（将单列扩展成多列或者多行）。

hive> select explode(subordinates) as sub from employees;
Mary Smith
Todd Jones
Bill King

如果某行subordinates的字段为空的话，将不会产生新的记录。

当使用表生成函数时，Hive要求使用列别名。

返回值类型	样式	描述
N行结果	explode(array arr)
N行结果	explode(map mapp)
数组的类型	explode(array a)
结果插入表中	inline(array<struct[,struct]>)
tuple	json_tuple(string jsonStr, p1, p2, …, pn)
tuple	parse_url_tuple(url, partname1, partname2, …, partNameN), N >= 1
N行结果	stack(int n, col1, …, colM)

其它内置函数。

返回值类型	样式	描述
string	ascii(string s)
string	base64(binary bin)
binary	binary(string s) / binary(binary b)
type定义的类型	cast(< expr >) as ()
string	concat(binary s1, binary s2, …)
string	concat(string s1, string s2, …)
string	concat_ws(string separator, string s1, string s2, …)
string	concat_ws(binary separator, binary s1, binary s2, …)
	context_ngrams
string	decode(binary bin, string charset)
binary	encode(string src, string charset)
int	find_in_set(string s, string commaSeparatedString)	返回s出现的位置，没有找到则返回NULL
string	format_number(number x, int d)	将数值转换为"#,###,###“格式字符串，并保留d位小数。
string	get_json_object(string json_string, string path)	从给定路径上的json字符串中抽取json对象，并返回这个对象的json字符串形式
boolean	in	test in (val1, val2,)
boolean	in_file(string s, string filename)	如果file中有完整的s匹配的话，则返回true
int	instr(string str, string substr)
int	length(string s)	计算字符串s的长度
int	locate(string substr,string str[,int pos])	查询substr在strpos后出现的位置
string	lower(string s)
string	upper(string s)
string	ltrim(string s)	将s前面出现的空格全部去除
string	parse_url(string url, string partname[,string key])

string	regexp_extract(string subject, string regex_pattern, string index)	抽取subject中符合正则表达式regex_pattern的第index个部分的子字符串
string	regex_replace(string s, string regex, string replacement)	按照java正则表达式regex将字符串s中符合条件的部分替换为replacement所指定的字符串a.
string	repeat(string s, int n)	重复输出n次字符串s
string	reverse(string s)	反转字符串

string	rtrim(string s)	将s后面出现的空格全部去除掉
string	trim(string s)	将s前后面出现的空格全部取出掉
string	from_unixtime(bigint unixtime[,string format])	将时间秒数转换为utc时间, 并用字符串表示, 可以通过format指定时间格式
bigint	unix_timestamp()
bigint	unix_timestamp(string data)	输入字符串格式"yyyy-MM-dd HH:mm:ss”
string	to_date(string timestamp)	返回yyyy-MM-dd部分
timestamp	from_utc_timestamp(timestamp timestamp, string timezone)	将其转换为指定时区的时间戳
timestamp	to_utc_timestamp(timestamp timestamp, string timezone)	转换为utc下的时间戳
int	year(string date)
int	month(string date)
int	day(strig date)
int	hour(string date)
int	minute(string date)
int	second(string date)

limit - 限制输出 #

hive> select name, salary from employees limit 2;
john doe 	11000.0
mary smith	8000.0

where - 过滤字段 #

select语句用于选取字段，where语句用于过滤条件。

where语句使用谓词表达式。有几种谓词表达式可以使用and和or相连接，当谓词表达式的计算结果为true时，相应的行被保留并输出。

hive> select * from employees
	> where country = 'US' and state = 'CA';

谓词操作符 #

操作符	支持的数据类型	描述
A = B	基本数据类型
A <=> B	基本数据类型	如果A和B都为NULL则返回NULL
A == B	没有
A <> B, A != B	基本数据类型
A < B	基本数据类型
A <= B	基本数据类型
A > B	基本数据类型
A >= B	基本数据类型
A [not] between B and C	基本数据类型
A is null	所有数据类型
A is not null	所有数据类型
A [not] like B	string类型
A rlike B, A regexp B	string类型

like 和 rlike #

like 全文匹配

rlike search匹配

group by - 分组 #

group by语句通常会和聚合函数一起使用，按照一个或者多个列对结果进行分组,然后对每个组执行聚合操作。

hive> select year(ymd), avg(price_close) from stocks
	> where exchange = 'NASDAQ' and symbol = 'AAPL'
	> group by year(ymd);
1984 	25.56777
1985	20.23454
1986	32.43563
1987 	45.12341

having #

对group by语句产生的分组进行条件过滤。

hive> select year(ymd), avg(price_close) from stocks
	> where exchange = "NASDAQ" and symbol = "AAPL"
	> group by year(ymd)
	> having avg(price_close) > 50.0;
1987		53.889
1991		52.495
1992		54.803
1999		57.748
2000		71.748

join - 多个表查询 #

inner join #

内连接：只有两个表都存在与连接标准相匹配的数据才会保留下来。

hive> select a.ymd, a.price_close, b.price_close
	> from stocks a join stocks b on a.ymd = b.ymd
	> where a.symbol = 'AAPL' and b.symbol = 'IBM';
2019-08-01		XXX 	XXX
2019-08-02		XXX		XXX
2019-08-03		XXX		XXX

on 子句支持 and。

left outer join #

hive> select s.ymd, s.symbol, s.price_close, d.dividend
	> from stocks s left outer join dividends d on s.ymd = d.ymd on s.symbol = d.symbol
	> where s.symbol = 'AAPL';

在这种join连接时，join操作符左边表的符合where子句的所有记录将会被返回。join操作符右边表中如果没有符合on后面连接条件的记录时，那么从右边表指定列选择的值将会是NULL。

right outer join #

full outer join #

order by和sort by #

Hive中order by语句和其他sql中的定义是一样的。其会对查询结果集执行一个全局排序。

Hive中提供了一个可供选择的方式，就是sort by。其会在每个reducer中对数据进行排序，也就是执行一个局部排序过程。这可以保证每个reducer的输出数据都是有序的（但并非全局有序）。

ASC表示升序排序。DESC表示降序排序。

hive> select s.ymd, s.price_close
	> from stocks s
	> order by s.ymd ASC, s.symbol DESC;

hive> select s.ymd, s.price_close
	> from stocks s
	> sort by s.ymd ASC, s.symbol DESC;

Cluster by #

Union all #

union all可以将2个或多个表进行合并。

视图 #

索引 #

Hive中没有普通关系型数据库中键的概念，但是还是可以对一些字段建立索引。一张表的索引数据存储在另外一张表中。

函数 #

发现和描述函数 #

hive> show functions;
abs
acos
and
array

展示对应函数的介绍：

hive> describe function concat;
concat(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN

展示详细文档：

hive> describe function extended concat;

调用函数 #

在查询中调用函数名即可。

同样，udf同样可以返回一个复杂的对象

标准函数 #

UDF：表示一行数据中的一列或者多列数据作为参数，然后返回结果是一个值

聚合函数 #

UDAF：接受从零行到多行的零个列到多个列，然后返回单一值。

聚合函数通常和group by结合使用

表生成函数 #

UDTF：接受零个或多个输入，然后产生多列或者多行输出。

查询实例 #

解析json查询 #

json为字符串数组格式。

explode： hive的表生成函数只是生成了一种数据的展示方式，而无法在表中产生一个其他的列。

需求：

json拆分为多个列json_tuple

select  json_name, json_value
from database1.tableName
lateral view json_tuple(json_string, "name", "value") jsonTable as json_name, json_value
where
	year = 2019
limit 10;

json数组查询 #

需求：

对数组中每个json做处理，做比较。（先explode，再json_tuple）

select json_name, json_value
from database1.tableName
lateral view explode(str_array) explode_table as str_json
lateral view json_tuple(str_json, "name", "value") jsonTable as json_name, json_value
where 
	year = 2019
limit 10;

返回值类型	样式	描述
bigint	round(double d)
double	round(double d, int n)
bigint	floor(double d)

double	rand() rand(int seed)
double	exp(double d)