MySQL翻译系列-Window Function Frame Specification

注: frame这里翻译为帧, 取其在行数据上滚动如同胶卷滚动前行之义.

(以下内容实际上陈述的是如何定义滚动窗口的数据范围)

The definition of a window used with a window function can include a frame clause. A frame is a subset of the current partition and the frame clause specifies how to define the subset.

与窗口函数一起使用的窗口的定义可以包括帧子语句, 帧是当前分区的子集,帧语句确定如何定义这个子集.

Frames are determined with respect to the current row, which enables a frame to move within a partition depending on the location of the current row within its partition. Examples:

帧是根据当前行确定的,这使得帧可以根据当前行在分区中的位置在分区中移动, 例如:

  • By defining a frame to be all rows from the partition start to the current row, you can compute running totals for each row.
  • 通过将一个帧定义为从分区开始到当前行的所有行,可以计算针对每一行的整体结果.
  • By defining a frame as extending N rows on either side of the current row, you can compute rolling averages.
  • 通定义帧使用定义当前行和 N行之间, 可以进行滚动的计算.(滑动窗口)

The following query demonstrates the use of moving frames to compute running totals within each group of time-ordered level values, as well as rolling averages computed from the current row and the rows that immediately precede and follow it:

下面的查询演示了如何使用移动帧来计算每组按时间顺序排列的“水平”值中的运行总数,以及从当前行和紧随其后的行计算的滚动平均值:

mysql> SELECT
         time, subject, val,
         SUM(val) OVER (PARTITION BY subject ORDER BY time
                        ROWS UNBOUNDED PRECEDING)
           AS running_total,
         AVG(val) OVER (PARTITION BY subject ORDER BY time
                        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
           AS running_average
       FROM observations;
+----------+---------+------+---------------+-----------------+
| time     | subject | val  | running_total | running_average |
+----------+---------+------+---------------+-----------------+
| 07:00:00 | st113   |   10 |            10 |          9.5000 |
| 07:15:00 | st113   |    9 |            19 |         14.6667 |
| 07:30:00 | st113   |   25 |            44 |         18.0000 |
| 07:45:00 | st113   |   20 |            64 |         22.5000 |
| 07:00:00 | xh458   |    0 |             0 |          5.0000 |
| 07:15:00 | xh458   |   10 |            10 |          5.0000 |
| 07:30:00 | xh458   |    5 |            15 |         15.0000 |
| 07:45:00 | xh458   |   30 |            45 |         20.0000 |
| 08:00:00 | xh458   |   25 |            70 |         27.5000 |
+----------+---------+------+---------------+-----------------+

For the running_average column, there is no frame row preceding the first one or following the last. In these cases, AVG() computes the average of the rows that are available.

(为什么这些描述这么晦涩?)

对于running_average列, 假如帧(ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING, 这里的数据范围是前一行后一行以及自身)的前后(第一行没有后一行, 最前面的一行没有前面的一行)出现没有数据的情况, avg函数将根据获取到的具体行数来计算数值.

Aggregate functions used as window functions operate on rows in the current row frame, as do these nonaggregate window functions:

通常聚合函数用于窗口窗口函数, 以下这些是非聚合函数.

FIRST_VALUE()
LAST_VALUE()
NTH_VALUE()

Standard SQL specifies that window functions that operate on the entire partition should have no frame clause. MySQL permits a frame clause for such functions but ignores it. These functions use the entire partition even if a frame is specified:

标准SQL语句要求的是窗口函数应该对整个分区进行操作, 不应该有帧子语句. MySQL支持帧子语句, 但是在这些函数中, 子语句将被忽略, 那怕这些子语句存在.

这段话的理解就是, 这些语句是针对全局的, 例如排名, 显然不能再针对分区排名时, 再出现对子部分进行排名的可能, 所以MySQL会忽略这些子子语句, 而作用于全局.

# 作用于整个分区
CUME_DIST()
DENSE_RANK()
LAG()
LEAD()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()
pSgb0YD.png

当再前面使用count(), sum()之类的, 指定的滚动窗口范围才会生效

The frame clause, if given, has this syntax:

语法结构:

frame_clause:
    frame_units frame_extent

frame_units:
    {ROWS | RANGE}

In the absence of a frame clause, the default frame depends on whether an ORDER BY clause is present, as described later in this section.

在没有帧语句得情况下, 默认得帧将取决于order by是否存在.

The frame_units value indicates the type of relationship between the current row and frame rows:

帧的单元结构, 和当前行的关系

  • ROWS: The frame is defined by beginning and ending row positions. Offsets are differences in row numbers from the current row number.
  • 行, 这容易理解, 就是起始行, 结尾行, 根据偏移值来划定帧的范围
  • RANGE: The frame is defined by rows within a value range. Offsets are differences in row values from the current row value.
  • 根据一定的数值范围来划定帧的范围, 例如日期, 数值等.

The frame_extent value indicates the start and end points of the frame. You can specify just the start of the frame (in which case the current row is implicitly the end) or use BETWEEN to specify both frame endpoints:

帧得起始和结尾得定义方式

frame_extent:
    {frame_start | frame_between}

frame_between:
    BETWEEN frame_start AND frame_end

frame_start, frame_end: {
    CURRENT ROW
  | UNBOUNDED PRECEDING
  | UNBOUNDED FOLLOWING
  | expr PRECEDING
  | expr FOLLOWING
}

With BETWEEN syntax, frame_start must not occur later than frame_end.

The permitted frame_start and frame_end values have these meanings:

  • CURRENT ROW: For ROWS, the bound is the current row. For RANGE, the bound is the peers of the current row.

  • 当前行

  • UNBOUNDED PRECEDING: The bound is the first partition row.

  • 第一行(分区的起始边界)

  • UNBOUNDED FOLLOWING: The bound is the last partition row.

  • 最后一行(分区的最后的边界)

  • *expr* PRECEDING: For ROWS, the bound is expr rows before the current row. For RANGE, the bound is the rows with values equal to the current row value minus expr; if the current row value is NULL, the bound is the peers of the row.

    For *expr* PRECEDING (and *expr* FOLLOWING), expr can be a ? parameter marker (for use in a prepared statement), a nonnegative numeric literal, or a temporal interval of the form INTERVAL *val* *unit*. For INTERVAL expressions, val specifies nonnegative interval value, and unit is a keyword indicating the units in which the value should be interpreted. (For details about the permitted units specifiers, see the description of the DATE_ADD() function in Section 12.7, “Date and Time Functions”.)

    常量得使用, 如日期之间得间隔范围.

    RANGE on a numeric or temporal expr requires ORDER BY on a numeric or temporal expression, respectively.

    Examples of valid *expr* PRECEDING and *expr* FOLLOWING indicators:

    10 PRECEDING
    INTERVAL 5 DAY PRECEDING
    5 FOLLOWING
    INTERVAL '2:30' MINUTE_SECOND FOLLOWING
    
  • *expr* FOLLOWING: For ROWS, the bound is expr rows after the current row. For RANGE, the bound is the rows with values equal to the current row value plus expr; if the current row value is NULL, the bound is the peers of the row.

  • 注意null的存在.

    For permitted values of expr, see the description of *expr* PRECEDING.

The following query demonstrates FIRST_VALUE(), LAST_VALUE(), and two instances of NTH_VALUE():

演示例子

mysql> SELECT
         time, subject, val,
         FIRST_VALUE(val)  OVER w AS 'first',
         LAST_VALUE(val)   OVER w AS 'last',
         NTH_VALUE(val, 2) OVER w AS 'second',
         NTH_VALUE(val, 4) OVER w AS 'fourth'
       FROM observations
       WINDOW w AS (PARTITION BY subject ORDER BY time
                    ROWS UNBOUNDED PRECEDING);
+----------+---------+------+-------+------+--------+--------+
| time     | subject | val  | first | last | second | fourth |
+----------+---------+------+-------+------+--------+--------+
| 07:00:00 | st113   |   10 |    10 |   10 |   NULL |   NULL |
| 07:15:00 | st113   |    9 |    10 |    9 |      9 |   NULL |
| 07:30:00 | st113   |   25 |    10 |   25 |      9 |   NULL |
| 07:45:00 | st113   |   20 |    10 |   20 |      9 |     20 |
| 07:00:00 | xh458   |    0 |     0 |    0 |   NULL |   NULL |
| 07:15:00 | xh458   |   10 |     0 |   10 |     10 |   NULL |
| 07:30:00 | xh458   |    5 |     0 |    5 |     10 |   NULL |
| 07:45:00 | xh458   |   30 |     0 |   30 |     10 |     30 |
| 08:00:00 | xh458   |   25 |     0 |   25 |     10 |     30 |
+----------+---------+------+-------+------+--------+--------+

Each function uses the rows in the current frame, which, per the window definition shown, extends from the first partition row to the current row. For the NTH_VALUE() calls, the current frame does not always include the requested row; in such cases, the return value is NULL.

更多具体得例子见MySQL窗口函数(Window Function)详解这篇文章.

In the absence of a frame clause, the default frame depends on whether an ORDER BY clause is present:

  • With ORDER BY: The default frame includes rows from the partition start through the current row, including all peers of the current row (rows equal to the current row according to the ORDER BY clause). The default is equivalent to this frame specification:

    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    # 这需要知道等价于这个语句(显式声明)即可
    
  • Without ORDER BY: The default frame includes all partition rows (because, without ORDER BY, all partition rows are peers). The default is equivalent to this frame specification:

    (注意项): order by的作用, 默认帧包含所有的分区行, 没有order by将所有得行都是为等同得行.

    RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    

Because the default frame differs depending on presence or absence of ORDER BY, adding ORDER BY to a query to get deterministic results may change the results. (For example, the values produced by SUM() might change.) To obtain the same results but ordered per ORDER BY, provide an explicit frame specification to be used regardless of whether ORDER BY is present.

注意帧因order by的存在而不同, 得到的结果会产生差异, 例如sum函数返回的结果. 为了获得相同的结果, 但是需要使用order by进行排序的, 提供显示声明针对帧的具体范围.

The meaning of a frame specification can be nonobvious when the current row value is NULL. Assuming that to be the case, these examples illustrate how various frame specifications apply:

mysql> select *, sum(v) over (order by v asc) from test_h;
+------+------------------------------+
| v    | sum(v) over (order by v asc) |
+------+------------------------------+
| NULL |                         NULL |
| NULL |                         NULL |
|    1 |                            1 |
|    2 |                            3 |
|    3 |                            6 |
|    4 |                           10 |
|    5 |                           15 |
|    6 |                           21 |
|    7 |                           28 |
|    8 |                           36 |
|    9 |                           45 |
|   11 |                           56 |
|   12 |                           68 |
|   13 |                           81 |
|   14 |                           95 |
+------+------------------------------+
15 rows in set (0.00 sec)

mysql> select *, sum(v) over (order by v asc range between 10 following and 15 following) from test_h;
+------+--------------------------------------------------------------------------+
| v    | sum(v) over (order by v asc range between 10 following and 15 following) |
+------+--------------------------------------------------------------------------+
| NULL |                                                                     NULL |
| NULL |                                                                     NULL |
|    1 |                                                                       50 |
|    2 |                                                                       39 |
|    3 |                                                                       27 |
|    4 |                                                                       14 |
|    5 |                                                                     NULL |
|    6 |                                                                     NULL |
|    7 |                                                                     NULL |
|    8 |                                                                     NULL |
|    9 |                                                                     NULL |
|   11 |                                                                     NULL |
|   12 |                                                                     NULL |
|   13 |                                                                     NULL |
|   14 |                                                                     NULL |
+------+--------------------------------------------------------------------------+
15 rows in set (0.00 sec)

mysql> select *, sum(v) over (order by v rows between 10 following and 15 following) from test_h;
+------+---------------------------------------------------------------------+
| v    | sum(v) over (order by v rows between 10 following and 15 following) |
+------+---------------------------------------------------------------------+
| NULL |                                                                  59 |
| NULL |                                                                  50 |
|    1 |                                                                  39 |
|    2 |                                                                  27 |
|    3 |                                                                  14 |
|    4 |                                                                NULL |
|    5 |                                                                NULL |
|    6 |                                                                NULL |
|    7 |                                                                NULL |
|    8 |                                                                NULL |
|    9 |                                                                NULL |
|   11 |                                                                NULL |
|   12 |                                                                NULL |
|   13 |                                                                NULL |
|   14 |                                                                NULL |
+------+---------------------------------------------------------------------+
15 rows in set (0.00 sec)

简而言之, null的存在增加了运算的不确定性.

当前行值假设为为'NULL 时,帧规范的含义可能失效(异常). 详情见下面例子(这些例子演示的是null值得存在, 对于运算得影响.)

  • ORDER BY X ASC RANGE BETWEEN 10 FOLLOWING AND 15 FOLLOWING, 1

    The frame starts at NULL and stops at NULL, thus includes only rows with value NULL.

    帧开始于NULL,结束于NULL, 因此只包含值为NULL的行.

  • ORDER BY X ASC RANGE BETWEEN 10 FOLLOWING AND UNBOUNDED FOLLOWING, 2

    The frame starts at NULL and stops at the end of the partition. Because an ASC sort puts NULL values first, the frame is the entire partition.

    帧开始于null值, 终止于分区边界(BETWEEN 10 FOLLOWING AND UNBOUNDED FOLLOWING). 因为ASC排序, 将null值放置到分区的第一行, 实际使得这个语句具体执行范围是整个分区(等价于RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

  • ORDER BY X DESC RANGE BETWEEN 10 FOLLOWING AND UNBOUNDED FOLLOWING, 3

    The frame starts at NULL and stops at the end of the partition. Because a DESC sort puts NULL values last, the frame is only the NULL values.

    类似于 语句 1

  • ORDER BY X ASC RANGE BETWEEN 10 PRECEDING AND UNBOUNDED FOLLOWING, 4

    The frame starts at NULL and stops at the end of the partition. Because an ASC sort puts NULL values first, the frame is the entire partition.

    类似于 语句 2

  • ORDER BY X ASC RANGE BETWEEN 10 PRECEDING AND 10 FOLLOWING, 5

    The frame starts at NULL and stops at NULL, thus includes only rows with value NULL.

    类似于 语句 1

  • ORDER BY X ASC RANGE BETWEEN 10 PRECEDING AND 1 PRECEDING, 6

    The frame starts at NULL and stops at NULL, thus includes only rows with value NULL.

    类似于 语句 1

  • ORDER BY X ASC RANGE BETWEEN UNBOUNDED PRECEDING AND 10 FOLLOWING, 7

    The frame starts at the beginning of the partition and stops at rows with value NULL. Because an ASC sort puts NULL values first, the frame is only the NULL values.

    类似于 语句 1