OceanBase 分析函數(shù)

2021-06-09 16:01 更新

簡介

分析函數(shù)(某些數(shù)據(jù)庫下也叫做窗口函數(shù))與聚合函數(shù)類似,計算總是基于一組行的集合,不同的是,聚合函數(shù)一組只能返回一行,而分析函數(shù)每組可以返回多行,組內(nèi)每一行都是基于窗口的邏輯計算的結(jié)果。分析函數(shù)可以顯著優(yōu)化需要 self-join 的查詢。

分析函數(shù)語法

“窗口”也稱為 FRAME,OceanBase 數(shù)據(jù)庫同時支持 ROWS 與 RANGE 兩種 FRAME 語義,前者是基于物理行偏移的窗口,后者則是基于邏輯值偏移的窗口。

分析函數(shù)語法如下:

analytic_function:
  analytic_function([ arguments ]) OVER (analytic_clause)

analytic_clause:
  [ query_partition_clause ] [ order_by_clause [ windowing_clause ] ]

query_partition_clause:
  PARTITION BY { expr[, expr ]... | ( expr[, expr ]... ) }

order_by_clause:
  ORDER [ SIBLINGS ] BY{ expr | position | c_alias } [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, { expr | position | c_alias } [ ASC | DESC ][ NULLS FIRST | NULLS LAST ]]...

windowing_clause:
  { ROWS | RANGE } { BETWEEN { UNBOUNDED PRECEDING | CURRENT ROW | value_expr {
  PRECEDING | FOLLOWING } } AND{ UNBOUNDED FOLLOWING | CURRENT ROW | value_expr { 
  PRECEDING | FOLLOWING } } | { UNBOUNDED PRECEDING | CURRENT ROW| value_expr 
  PRECEDING}}

SUM/MIN/MAX/COUNT/AVG

聲明

SUM 的語法為:SUM([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

MIN 的語法為:MIN([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

MAX 的語法為:MAX([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

COUNT 的語法為:COUNT({ * | [ DISTINCT | ALL ] expr }) [ OVER (analytic_clause) ]

AVG 的語法為:AVG([ DISTINCT | ALL ] expr) [ OVER(analytic_clause) ]

說明

以上分析函數(shù)都有對應(yīng)的聚合函數(shù),其中,SUM 返回 expr 的和,MIN/MAX 返回 expr 的最小值/最大值,COUNT 返回窗口中查詢的行數(shù),AVG 返回 expr 的平均值。

對于 COUNT 函數(shù),如果指定了 expr,即返回 expr 不為 NULL 的統(tǒng)計個數(shù),如果指定 COUNT(*) 返回所有行的統(tǒng)計數(shù)目。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.17 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.03 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient>select last_name, sum(salary) over(partition by job_id) totol_s, min(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s, count(*) over(partition by job_id) count_s from exployees;
+-----------+---------+-------+-------+---------+
| last_name | totol_s | min_s | max_s | count_s |
+-----------+---------+-------+-------+---------+
| jim       |    2000 |  2000 |  2000 |       1 |
| mike      |   36000 | 11000 | 13000 |       3 |
| lily      |   36000 | 11000 | 13000 |       3 |
| tom       |   36000 | 11000 | 13000 |       3 |
+-----------+---------+-------+-------+---------+
4 rows in set (0.01 sec)

NTH_VALUE/FIRST_VALUE/LAST_VALUE

聲明

NTH_VALUE 的語法為:NTH_VALUE (measure_expr, n) [ FROM { FIRST | LAST } ] [ { RESPECT | IGNORE } NULLS ] OVER (analytic_clause)

FIRST_VALUE 的語法為:FIRST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)

LAST_VALUE 的語法為:LAST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)

說明

NTH_VALUE 函數(shù)表示第幾個值,方向由 [ FROM { FIRST | LAST } ] 確定,默認為 FROM FIRST,含有是否忽略 NULL 值的標志。其窗口為統(tǒng)一的 analytic_clause。這里 n 應(yīng)該是正數(shù),如果 n 是 NULL,函數(shù)將返回錯誤;如果 n 大于窗口內(nèi)所有的行數(shù),此函數(shù)將返回 NULL。

FIRST_VALUE 和 LAST_VALUE 表示從第一個開始計數(shù)或者是從最后一個開始計數(shù)。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> select last_name, first_value(salary) over(partition by job_id) totol_s, last_value(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s from exployees;
+-----------+---------+-------+-------+
| last_name | totol_s | min_s | max_s |
+-----------+---------+-------+-------+
| jim       |    2000 |  2000 |  2000 |
| mike      |   12000 | 11000 | 13000 |
| lily      |   12000 | 11000 | 13000 |
| tom       |   12000 | 11000 | 13000 |
+-----------+---------+-------+-------+
4 rows in set (0.01 sec)

LEAD/LAG

聲明

LEAD 的語法為:LEAD { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)

LAG 的語法為:LAG { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)

說明

LEAD 和 LAG 含義為可以在一次查詢中取出當前行的同一個字段的前面或后面第 N 行的數(shù)據(jù),這種操作可以使用相同表的自連接來實現(xiàn),但 LEAD/LAG 窗口函數(shù)有更高的效率。

其中,value_expr 是要做比對的字段,offset 是 value_expr 的偏移量,default 參數(shù)的默認值為 NULL,即如果在 LEAD/LAG 沒有顯示的設(shè)置 default 值的情況下,返回值為 NULL。例如:對 LAG 來說,當前行為 4,offset 值為 6,這時候所要找的數(shù)據(jù)就是第 -2 行,不存在此行即返回 default 的值。

[ { RESPECT | IGNORE } NULLS ] 的語法為是否考慮 NULL 值,默認為 RESPECT,考慮 NULL 值。

注意 LEAD/LAG 兩個函數(shù)后必須有 order_by_clause,數(shù)據(jù)應(yīng)該在一個列上排序之后才能有前多少行后多少行的概念。query_partition_clause 是可選的,如果沒有 query_partition_clause,就是全局的數(shù)據(jù)。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, lead(salary) over(order by salary) lead, lag(salary) over(order by salary) lag from exployees;
+-----------+-------+-------+
| last_name | lead  | lag   |
+-----------+-------+-------+
| jim       | 11000 |  NULL |
| tom       | 12000 |  2000 |
| mike      | 13000 | 11000 |
| lily      |  NULL | 12000 |
+-----------+-------+-------+
4 rows in set (0.01 sec)

STDDEV/VARIANCE/STDDEV_SAMP/STDDEV_POP

聲明

VARIANCE 的語法為:VARIANCE([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

STDDEV 的語法為:STDDEV([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

STDDEV_SAMP 的語法為:STDDEV_SAMP(expr) [ OVER (analytic_clause) ]

STDDEV_POP 的語法為:STDDEV_POP(expr) [ OVER (analytic_clause) ]

說明

VARIANCE 返回的是 expr 的方差,expr 可能是數(shù)值類型或者可以轉(zhuǎn)換成數(shù)值類型的類型,方差的類型和輸入的值的類型相同。

STDDEV 返回的是 expr 的標準差,參數(shù)類型方面和 VARIANCE 的相同。

STDDEV_SAMP 返回的是樣本標準差。

STDDEV_POP 返回的是總體標準差。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, stddev(salary) over(order by salary) std, variance(salary) over(order by salary) var, stddev_pop(salary) over() std_pop, stddev_samp(salary) over() from exployees;
+-----------+-------------------+--------------------+-------------------+----------------------------+
| last_name | std               | var                | std_pop           | stddev_samp(salary) over() |
+-----------+-------------------+--------------------+-------------------+----------------------------+
| jim       |                 0 |                  0 | 4387.482193696061 |          5066.228051190222 |
| tom       |              4500 |           20250000 | 4387.482193696061 |          5066.228051190222 |
| mike      | 4496.912521077347 | 20222222.222222224 | 4387.482193696061 |          5066.228051190222 |
| lily      | 4387.482193696061 |           19250000 | 4387.482193696061 |          5066.228051190222 |
+-----------+-------------------+--------------------+-------------------+----------------------------+
4 rows in set (0.00 sec)

NTILE

聲明

NTILE(expr) OVER ([ query_partition_clause ] order_by_clause)

說明

NTILE 函數(shù)將分區(qū)中已經(jīng)排序的行劃分為大小盡可能相同的指定數(shù)量的分組,并返回給每行組號。expr 如果是 NULL,則返回 NULL。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, ntile(10) over(partition by job_id order by salary) ntl from exployees;
+-----------+------+
| last_name | ntl  |
+-----------+------+
| jim       |    1 |
| tom       |    1 |
| mike      |    2 |
| lily      |    3 |
+-----------+------+
4 rows in set (0.01 sec)

ROW_NUMBER

聲明

ROW_NUMBER( ) OVER ([ query_partition_clause ] order_by_clause)

說明

ROW_NUMBER 函數(shù)按照 order_by_clause 子句中指定的行的順序,為每一行分配一個編號。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, row_number() over(partition by job_id order by salary) ntl from exployees;
+-----------+------+
| last_name | ntl  |
+-----------+------+
| jim       |    1 |
| tom       |    1 |
| mike      |    2 |
| lily      |    3 |
+-----------+------+
4 rows in set (0.00 sec)

RANK/DENSE_RANK/PERCENT_RANK

RANK 的語法為:RANK( ) OVER ([ query_partition_clause ] order_by_clause)

DENSE_RANK 的語法為:DENSE_RANK( ) OVER([ query_partition_clause ] order_by_clause)

PERCENT_RANK 的語法為:PERCENT_RANK( ) OVER ([ query_partition_clause ] order_by_clause)

說明

RANK 計算每一行數(shù)據(jù)在某列上的排序,該列由 order_by_clause 中的列決定。例如,按照 salary 排序可以看出員工的收入排名。

DENSE_RANK 的語義基本和 RANK 函數(shù)相同,但是 RANK 的排序中間會有‘跳過’,但是 DENSE_RANK 中不會有。

PERCENT_RANK 的語義基本和 RANK 函數(shù)相同,但是 PERCENT_RANK 排序的結(jié)果是百分比,計算的是給定行的百分比。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.10 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, rank() over(partition by job_id order by salary) rank, dense_rank() over(partition by job_id order by salary) dense_rank, percent_rank() over(partition by job_id order by salary) percent_rank from exployees;
+-----------+------+------------+----------------------------------+
| last_name | rank | dense_rank | percent_rank                     |
+-----------+------+------------+----------------------------------+
| jim       |    1 |          1 | 0.000000000000000000000000000000 |
| tom       |    1 |          1 | 0.000000000000000000000000000000 |
| mike      |    2 |          2 | 0.500000000000000000000000000000 |
| lily      |    3 |          3 | 1.000000000000000000000000000000 |
+-----------+------+------------+----------------------------------+
4 rows in set (0.01 sec)

CUME_DIST

聲明

CUME_DIST() OVER ([ query_partition_clause ] order_by_clause)

說明

該函數(shù)計算一個值的分布,返回值為大于 0 小于等于 1 的值。作為一個分析函數(shù),CUME_DIST 在升序情況下計算比當前行的特定列小的數(shù)據(jù)的占比。例如如下例子中,按 job_id 分組并在薪水排序的情況下,每行數(shù)據(jù)在窗口內(nèi)的排序列上的占比。

例子

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.10 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, cume_dist() over(partition by job_id order by salary) cume_dist from exployees;
+-----------+----------------------------------+
| last_name | cume_dist                        |
+-----------+----------------------------------+
| jim       | 1.000000000000000000000000000000 |
| tom       | 0.333333333333333333333333333333 |
| mike      | 0.666666666666666666666666666667 |
| lily      | 1.000000000000000000000000000000 |
+-----------+----------------------------------+
4 rows in set (0.01 sec)


以上內(nèi)容是否對您有幫助:
在線筆記
App下載
App下載

掃描二維碼

下載編程獅App

公眾號
微信公眾號

編程獅公眾號