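AnalyticDB for PostgreSQL V7.0 supports column group statistics. You can collect joint statistics on multiple specified columns so that cost estimation no longer has to assume that those columns are independently distributed. This makes cost estimates more accurate and, in turn, improves query performance.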

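Usage notes

Column group statistics are currently supported only by the Planner optimizer (formerly the Legacy optimizer). For information about how to set the optimizer, see Parameter configuration.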
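
As a minimal sketch of switching a session to the Planner optimizer, the statements below assume that the instance exposes the Greenplum-style optimizer parameter. This topic does not confirm that assumption, so check the Parameter configuration topic before relying on it.

```sql
-- Assumption: the Greenplum-style "optimizer" parameter is available.
-- off selects the Planner (Postgres query optimizer); on selects ORCA.
SHOW optimizer;
SET optimizer = off;
```

The Optimizer line in EXPLAIN output, for example "Optimizer: Postgres query optimizer" in the plans shown in this topic, indicates which optimizer produced a plan.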

Syntax

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON column_name, column_name [, ...]
    FROM table_name

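Parameters

Parameter        Description
statistics_name  The name of the column group statistics object.
statistics_kind  The kinds of statistics to collect. The following three kinds are supported:
  • ndistinct: the number of distinct values among the combined values of the columns.
  • dependencies: the functional dependencies between the columns, which can be understood as the correlation between the columns.
  • mcv: the most common combined values of the columns and their frequencies.
If this parameter is omitted, all three kinds of statistics are collected.
column_name      The names of the columns covered by the column group statistics. Specify at least two columns.
table_name       The name of the table that the columns belong to.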
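
To illustrate how the parameters combine, here is a short sketch. The table orders_demo, its columns, and the statistics names are hypothetical and exist only for this illustration.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE orders_demo (city text, zip_code text, state text)
    DISTRIBUTED BY (zip_code);

-- Collect only functional-dependency and MCV statistics on (city, zip_code).
CREATE STATISTICS IF NOT EXISTS s_city_zip (dependencies, mcv)
    ON city, zip_code
    FROM orders_demo;

-- Omit statistics_kind to collect ndistinct, dependencies, and mcv together.
CREATE STATISTICS IF NOT EXISTS s_state_city
    ON state, city
    FROM orders_demo;

-- CREATE STATISTICS only defines the object; ANALYZE gathers the data.
ANALYZE orders_demo;
```

Listing only the kinds you need keeps the amount of work done by ANALYZE smaller, while omitting the list is the simplest choice when you are unsure which kind the planner will benefit from.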

Examples

  1. Create a table named ndistinct and insert data.
    CREATE TABLE ndistinct(a int, b int, c int, d int) DISTRIBUTED BY (d);
    INSERT INTO ndistinct(a, b, c, d) SELECT i/100, i/200, i/100, i FROM generate_series(1, 10000000) i;

    The values of columns a and b are strongly correlated, so they do not satisfy the assumption of independent distribution.

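  2. Create a function that reports the estimated and actual row counts of a query.
    CREATE FUNCTION check_estimated_rows(text) returns table (estimated int, actual int)
    LANGUAGE plpgsql AS
    $$
    DECLARE
        ln text;                    -- one line of EXPLAIN ANALYZE output
        tmp text[];                 -- captured row counts: estimated, actual
        first_row bool := true;
    BEGIN
        -- Run EXPLAIN ANALYZE on the query passed in as $1.
        FOR ln in
            execute format('explain analyze %s', $1)
        LOOP
            -- Only the first line (the top plan node) is parsed.
            IF first_row then
                first_row := false;
                -- First rows= is the estimate, second rows= is the actual count.
                tmp := regexp_match(ln, 'rows=(\d*) .* rows=(\d*)');
                return query select tmp[1]::int, tmp[2]::int;
            END IF;
        END LOOP;
    END;
    $$;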
  3. View the estimated and actual row counts.
    SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');

    The following result is returned:

     estimated | actual
    -----------+--------
       1000000 | 100001
    (1 row)

    In this example, the estimated row count is 1000000 and the actual row count is 100001, so the estimate is roughly ten times too high.

  4. View the execution plan.
    EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;

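    The following execution plan is returned. It serves as a baseline for comparison with the plan generated after the column group statistics are created: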

                                                                            QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=137814.33..154481.00 rows=1000000 width=20) (actual time=3986.143..4014.601 rows=100001 loops=1)
       ->  HashAggregate  (cost=137814.33..141147.67 rows=333333 width=20) (actual time=3977.037..4002.222 rows=33661 loops=1)
             Group Key: a, b, c
             Peak Memory Usage: 0 kB
             ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..104481.00 rows=3333333 width=12) (actual time=0.079..1492.983 rows=3366100 loops=1)
                   Hash Key: a, b, c
                   ->  Seq Scan on ndistinct  (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.050..1101.632 rows=3334839 loops=1)
     Planning Time: 0.161 ms
       (slice0)    Executor memory: 12336K bytes.
       (slice1)    Executor memory: 13884K bytes avg x 3 workers, 13899K bytes max (seg0).  Work_mem: 16401K bytes max.
       (slice2)    Executor memory: 37K bytes avg x 3 workers, 37K bytes max (seg0).
     Memory used:  128000kB
     Optimizer: Postgres query optimizer
     Execution Time: 4041.613 ms
    (14 rows)
  5. Create column group statistics, and then run ANALYZE to collect them.
    CREATE STATISTICS s_a_b(ndistinct) ON a,b FROM ndistinct;
    ANALYZE ndistinct;
  6. View the estimated and actual row counts again.
    SELECT * FROM check_estimated_rows('SELECT COUNT(*) FROM ndistinct GROUP BY a, b');

    The following result is returned:

     estimated | actual
    -----------+--------
         99431 | 100001
    (1 row)

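    In this example, the estimated row count is 99431 and the actual row count is 100001. The estimate is much more accurate than it was before the column group statistics were created.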

  7. View the execution plan.
    EXPLAIN ANALYZE SELECT count(*) FROM ndistinct GROUP BY a,b,c;

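    The following execution plan is returned. Because the estimate is more accurate, the plan now includes a Partial HashAggregate, which shortens the query time (Execution Time drops from 4041.613 ms to 2893.470 ms in this example):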

                                                                               QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------
     Gather Motion 3:1  (slice1; segments: 3)  (cost=75124.91..76782.09 rows=99431 width=20) (actual time=2854.765..2879.734 rows=100001 loops=1)
       ->  Finalize HashAggregate  (cost=75124.91..75456.34 rows=33144 width=20) (actual time=2853.610..2868.194 rows=33661 loops=1)
             Group Key: a, b, c
             Peak Memory Usage: 0 kB
             ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=71147.67..74130.60 rows=99431 width=20) (actual time=2269.435..2759.413 rows=100983 loops=1)
                   Hash Key: a, b, c
                   ->  Partial HashAggregate  (cost=71147.67..72141.98 rows=99431 width=20) (actual time=2744.039..2794.808 rows=100001 loops=1)
                         Group Key: a, b, c
                         Peak Memory Usage: 0 kB
                         ->  Seq Scan on ndistinct  (cost=0.00..37814.33 rows=3333333 width=12) (actual time=0.028..454.030 rows=3334839 loops=1)
     Planning Time: 0.173 ms
       (slice0)    Executor memory: 4670K bytes.
       (slice1)    Executor memory: 3134K bytes avg x 3 workers, 3149K bytes max (seg0).  Work_mem: 5649K bytes max.
       (slice2)    Executor memory: 13848K bytes avg x 3 workers, 13848K bytes max (seg0).  Work_mem: 14353K bytes max.
     Memory used:  128000kB
     Optimizer: Postgres query optimizer
     Execution Time: 2893.470 ms
    (17 rows)
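
As a follow-up to the steps above, the following sketch checks why the independence assumption fails for this data and then inspects the statistics object. It assumes that the instance exposes the PostgreSQL pg_statistic_ext catalog with the stxname, stxkeys, and stxkind columns, as PostgreSQL does. This topic does not confirm that assumption.

```sql
-- Count the distinct values of each column and of the (a, b) pair.
SELECT count(*) AS distinct_a   FROM (SELECT a FROM ndistinct GROUP BY a) t;
SELECT count(*) AS distinct_b   FROM (SELECT b FROM ndistinct GROUP BY b) t;
SELECT count(*) AS distinct_a_b FROM (SELECT a, b FROM ndistinct GROUP BY a, b) t;

-- Assumption: the catalog matches PostgreSQL's pg_statistic_ext.
-- stxkeys lists the column numbers, stxkind lists the collected kinds.
SELECT stxname, stxkeys, stxkind
FROM pg_statistic_ext
WHERE stxname = 's_a_b';

-- Optional cleanup once the statistics object is no longer needed.
DROP STATISTICS IF EXISTS s_a_b;
```

In this data, b is fully determined by a, because i/200 equals (i/100)/2 under integer division. The number of distinct (a, b) pairs therefore equals the number of distinct values of a, which is the 100001 actual rows reported above, rather than anything close to the product of the two per-column counts.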