Hive常用窗口函数

在SQL处理中,窗口函数都是最后一步执行,而且仅位于order by子句之前。

窗口函数 含义 用法示例 注意事项
row_num() 从1开始,按照顺序,生成分组内记录的序列,row_number()的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列 row_num() over (partition by poi_id order by type desc) as row_num 不支持window子句
rank() 生成数据项在分组中的排名,排名相等会在名次中留下空位 rank() over (partition by poi_id order by type desc) as rank_num 不支持window子句
dense_rank() 生成数据项在分组中的排名,排名相等在名次中不会留下空位 dense_rank() over (partition by poi_id order by type desc) as rank_num 不支持window子句
ntile(n) 将分组的数据按顺序切成n片,返回当前切片编号,如果切片不均匀,默认增加第一个切片的分布 ntile(3) over(partition by poi_id order by type) as sample 不支持window子句
lag(col, n, default) 返回当前行的上n行对应col字段的值,如果不存在就返回default默认值 lag(num, 1, 999) over(partition by poi_id order by type) as sample 不支持window子句
lead(col,n,default) 返回当前行的下n行对应col字段的值,如果不存在就返回default默认值lead(num,3,999) over(partition by poi_id order by type) as sample 不支持window子句
first_value(col) 返回当前行所在的分组的第一行的col字段值 first_value(num) over(partition by poi_id order by type) as sample
last_value(col) 返回当前行所在的分组的最后一行的col字段值 last_value(num) over(partition by poi_id order by type) as sample
percent_rank() 分组内当前行的(RANK值-1)/(分组内总行数-1),排名百分比 percent_rank() over(partition by poi_id order by type) as sample 不支持window子句
cume_dist() 小于等于当前值的行数/分组内总行数 cume_dist() over(partition by poi_id order by type) as sample 不支持window子句
count,sum,min,max,avg 常见聚合函数 count(1) over(partition by poi_id order by type rows between unbounded and current row)

窗口子句的使用方法:

  1. (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
  2. (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
  3. (ROWS | RANGE) BETWEEN [num] PRECEDING AND (UNBOUNDED | [num]) FOLLOWING
     注: 也可以ROWS 1 PRECEDING这种方式使用,表示以当前行为起点/终点的附近几行。

关键字含义:

  1. PRECEDING:往前
  2. FOLLOWING:往后
  3. CURRENT ROW:当前行
  4. UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点

    默认行为:

  • 当ORDER BY后面缺少窗口从句条件,窗口规范默认是:RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (分组起点到当前行)
  • 当ORDER BY和窗口从句都缺失,窗口规范默认是:ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING (分组起点到终点)

注意事项:

  1. Hive2.1.0及以后版本支持distinct:在聚合函数(sum, count, avg)中支持distinct,但是在order by或者窗口限制中不支持。eg:conut(distinct a) over(partition by c)
  2. Hive2.1.0以后支持在OVER从句中支持聚合函数:select rank() over(order by sum(b))
  3. Hive2.2.0中在使用ORDER BY和窗口限制时支持distinct:count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)
文章作者: Jack.Charles
文章链接: https://blog.zjee.me/2020/06/05/hive-aggr-function/
版权声明: 本博客所有文章除特别声明外,均采用 CC BY-NC-SA 4.0 许可协议。转载请注明来自 江影不沉浮