Hive常用窗口函数 | 江影不沉浮

Hive常用窗口函数

2020-06-05|大数据

字数总计: 809|阅读时长: 3 分钟

在SQL处理中，窗口函数都是最后一步执行，而且仅位于order by子句之前。

窗口函数	含义	用法示例	注意事项
row_num()	从1开始，按照顺序，生成分组内记录的序列,row_number()的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列	row_num() over (partition by poi_id order by type desc) as row_num	不支持window子句
rank()	生成数据项在分组中的排名，排名相等会在名次中留下空位	rank() over (partition by poi_id order by type desc) as rank_num	不支持window子句
dense_rank()	生成数据项在分组中的排名，排名相等在名次中不会留下空位	dense_rank() over (partition by poi_id order by type desc) as rank_num	不支持window子句
ntile(n)	将分组的数据按顺序切成n片，返回当前切片编号，如果切片不均匀，默认增加第一个切片的分布	ntile(3) over(partition by poi_id order by type) as sample	不支持window子句
lag(col, n, default)	返回当前行的上n行对应col字段的值，如果不存在就返回default默认值	lag(num, 1, 999) over(partition by poi_id order by type) as sample	不支持window子句
lead(col,n,default)	返回当前行的下n行对应col字段的值，如果不存在就返回default默认值lead(num,3,999) over(partition by poi_id order by type) as sample	不支持window子句
first_value(col)	返回当前行所在的分组的第一行的col字段值	first_value(num) over(partition by poi_id order by type) as sample
last_value(col)	返回当前行所在的分组的最后一行的col字段值	last_value(num) over(partition by poi_id order by type) as sample
percent_rank()	分组内当前行的(RANK值-1)/(分组内总行数-1)，排名百分比	percent_rank() over(partition by poi_id order by type) as sample	不支持window子句
cume_dist()	小于等于当前值的行数/分组内总行数	cume_dist() over(partition by poi_id order by type) as sample	不支持window子句
count,sum,min,max,avg	常见聚合函数	count(1) over(partition by poi_id order by type rows between unbounded and current row)

窗口子句的使用方法：

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] PRECEDING AND (UNBOUNDED | [num]) FOLLOWING
注：也可以ROWS 1 PRECEDING这种方式使用，表示以当前行为起点/终点的附近几行。

关键字含义：

PRECEDING：往前
FOLLOWING：往后
CURRENT ROW：当前行
UNBOUNDED：起点，UNBOUNDED PRECEDING 表示从前面的起点， UNBOUNDED FOLLOWING：表示到后面的终点

默认行为：

当ORDER BY后面缺少窗口从句条件，窗口规范默认是：RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW （分组起点到当前行）
当ORDER BY和窗口从句都缺失，窗口规范默认是：ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING （分组起点到终点）

注意事项：

Hive2.1.0及以后版本支持distinct：在聚合函数(sum, count, avg)中支持distinct，但是在order by或者窗口限制中不支持。eg：conut(distinct a) over(partition by c)
Hive2.1.0以后支持在OVER从句中支持聚合函数：select rank() over(order by sum(b))
Hive2.2.0中在使用ORDER BY和窗口限制时支持distinct：count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)

文章作者: Jack.Charles

文章链接: https://blog.zjee.me/2020/06/05/hive-aggr-function/

版权声明: 本博客所有文章除特别声明外，均采用 CC BY-NC-SA 4.0 许可协议。转载请注明来自江影不沉浮！

本地搜索

由 hexo-generator-search 提供支持