在SQL处理中,窗口函数都是最后一步执行,而且仅位于order by子句之前。
窗口函数 | 含义 | 用法示例 | 注意事项 |
---|---|---|---|
row_num() | 从1开始,按照顺序,生成分组内记录的序列,row_number()的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列 | row_num() over (partition by poi_id order by type desc) as row_num | 不支持window子句 |
rank() | 生成数据项在分组中的排名,排名相等会在名次中留下空位 | rank() over (partition by poi_id order by type desc) as rank_num | 不支持window子句 |
dense_rank() | 生成数据项在分组中的排名,排名相等在名次中不会留下空位 | dense_rank() over (partition by poi_id order by type desc) as rank_num | 不支持window子句 |
ntile(n) | 将分组的数据按顺序切成n片,返回当前切片编号,如果切片不均匀,默认增加第一个切片的分布 | ntile(3) over(partition by poi_id order by type) as sample | 不支持window子句 |
lag(col, n, default) | 返回当前行的上n行对应col字段的值,如果不存在就返回default默认值 | lag(num, 1, 999) over(partition by poi_id order by type) as sample | 不支持window子句 |
lead(col,n,default) | 返回当前行的下n行对应col字段的值,如果不存在就返回default默认值lead(num,3,999) over(partition by poi_id order by type) as sample | 不支持window子句 | |
first_value(col) | 返回当前行所在的分组的第一行的col字段值 | first_value(num) over(partition by poi_id order by type) as sample | |
last_value(col) | 返回当前行所在的分组的最后一行的col字段值 | last_value(num) over(partition by poi_id order by type) as sample | |
percent_rank() | 分组内当前行的(RANK值-1)/(分组内总行数-1),排名百分比 | percent_rank() over(partition by poi_id order by type) as sample | 不支持window子句 |
cume_dist() | 小于等于当前值的行数/分组内总行数 | cume_dist() over(partition by poi_id order by type) as sample | 不支持window子句 |
count,sum,min,max,avg | 常见聚合函数 | count(1) over(partition by poi_id order by type rows between unbounded and current row) |
窗口子句的使用方法:
- (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
- (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
- (ROWS | RANGE) BETWEEN [num] PRECEDING AND (UNBOUNDED | [num]) FOLLOWING
注: 也可以ROWS 1 PRECEDING这种方式使用,表示以当前行为起点/终点的附近几行。
关键字含义:
PRECEDING
:往前FOLLOWING
:往后CURRENT ROW
:当前行UNBOUNDED
:起点,UNBOUNDED PRECEDING
表示从前面的起点,UNBOUNDED FOLLOWING
:表示到后面的终点
默认行为:
- 当ORDER BY后面缺少窗口从句条件,窗口规范默认是:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
(分组起点到当前行) - 当ORDER BY和窗口从句都缺失,窗口规范默认是:
ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
(分组起点到终点)
注意事项:
- Hive2.1.0及以后版本支持distinct:在聚合函数(sum, count, avg)中支持distinct,但是在order by或者窗口限制中不支持。eg:conut(distinct a) over(partition by c)
- Hive2.1.0以后支持在OVER从句中支持聚合函数:select rank() over(order by sum(b))
- Hive2.2.0中在使用ORDER BY和窗口限制时支持distinct:count(distinct a) over (partition by c order by d rows between 1 preceding and 1 following)