Skip to content

大基数roaringbitmap值频繁重复detoast的优化

Chen Huajun edited this page Dec 18, 2022 · 2 revisions

问题

表中roaringbitmap类型的字段大小超过2K时会以toast形式存储,SQL中使用这些roaringbitmap字段时需要进行detoast。 如果一个SQL中detoast roaringbitmap数据的次数非常多,可能非常影响SQL性能。 下面是一个例子。

表定义

create table label as
  select grp1,rb_build_agg((random()*100000000)::int) member_ids
    from generate_series(1,1000000),generate_series(1,10)grp1
    group by grp1;

create table fact as
  select id,(random()*10)::int grp1, random() value1, random() value2
    from generate_series(1,10000) id;

SQL

select a.grp1,rb_cardinality(rb_build_agg(id)), sum(value1) sum_value1,sum(value2) sum_value2
  from fact a join label b on(a.grp1 = b.grp1)
  where rb_contains(b.member_ids,id)
  group by a.grp1;

这个SQL执行时间是9秒

postgres=# explain analyze
select a.grp1,rb_cardinality(rb_build_agg(id)), sum(value1) sum_value1,sum(value2) sum_value2
  from fact a join label b on(a.grp1 = b.grp1)
  where rb_contains(b.member_ids,id)
  group by a.grp1;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=920.57..2259.19 rows=200 width=28) (actual time=935.632..8999.550 rows=10 loops=1)
   Group Key: a.grp1
   ->  Merge Join  (cost=920.57..2043.51 rows=21268 width=24) (actual time=53.328..8998.784 rows=88 loops=1)
         Merge Cond: (b.grp1 = a.grp1)
         Join Filter: rb_contains(b.member_ids, a.id)
         Rows Removed by Join Filter: 9406
         ->  Sort  (cost=88.17..91.35 rows=1270 width=36) (actual time=0.025..0.036 rows=10 loops=1)
               Sort Key: b.grp1
               Sort Method: quicksort  Memory: 25kB
               ->  Seq Scan on label b  (cost=0.00..22.70 rows=1270 width=36) (actual time=0.012..0.015 rows=10 loops=1)
         ->  Sort  (cost=832.40..857.52 rows=10048 width=24) (actual time=5.266..23.894 rows=10000 loops=1)
               Sort Key: a.grp1
               Sort Method: quicksort  Memory: 1166kB
               ->  Seq Scan on fact a  (cost=0.00..164.48 rows=10048 width=24) (actual time=0.014..2.643 rows=10000 loops=1)
 Planning Time: 0.168 ms
 Execution Time: 8999.643 ms
(16 rows)

原因

通过perf分析,可以看出99%的时间都耗在对大基数roaringbitmap字段member_ids的detoast上,detoast了1w次。

-   99.05%     0.00%  postgres  roaringbitmap.so  [.] rb_exsit
     rb_exsit
   - pg_detoast_datum
      - detoast_attr
         - toast_fetch_datum
            - 98.67% table_relation_fetch_toast_slice
               + 61.83% heap_fetch_toast_slice
                 36.83% __memcpy_ssse3_back

优化方式1

SQL中反复对同一个大基数的roaringbitmap值进行解序列化,可以在roaringbitmap函数内部将解序列化的结果缓存起来。 pg_roaringbitmaprb_cache2分支中实现了这个原型。使用示例如下:

select a.grp1,rb_cardinality(rb_build_agg(id)), sum(value1) sum_value1,sum(value2) sum_value2
  from fact a join label b on(a.grp1 = b.grp1)
  where rb_cache_contains(b.member_ids,id,b.grp1::text)
  group by a.grp1;

rb_cache_contains中需要传入缓存的key,下次以相同的key调用这个函数时,直接使用第一次缓存的值。

这个SQL的执行时间是87毫秒

优化方式2

可以先detoast大基数roaringbitmap数据再通过CTE物化,将roaringbitmap字段值加载到内存里,避免重复detoast。

WITH b as MATERIALIZED(
  select grp1, member_ids::bytea::roaringbitmap from label
)
select a.grp1,rb_cardinality(rb_build_agg(id)), sum(value1) sum_value1,sum(value2) sum_value2
  from fact a join b on(a.grp1 = b.grp1)
  where rb_contains(b.member_ids,id)
  group by a.grp1;

这个SQL的执行时间是98毫秒。效果和方式一相同,但不需要修改插件代码。 优化后SQL的执行计划如下:

postgres=# explain analyze
WITH b as MATERIALIZED(
  select grp1, member_ids::bytea::roaringbitmap from label
)
select a.grp1,rb_cardinality(rb_build_agg(id)), sum(value1) sum_value1,sum(value2) sum_value2
  from fact a join b on(a.grp1 = b.grp1)
  where rb_contains(b.member_ids,id)
  group by a.grp1;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=945.13..2274.57 rows=11 width=28) (actual time=81.166..93.566 rows=10 loops=1)
   Group Key: a.grp1
   CTE b
     ->  Seq Scan on label  (cost=0.00..25.88 rows=1270 width=36) (actual time=2.997..22.095 rows=10 loops=1)
   ->  Merge Join  (cost=919.26..2036.86 rows=21167 width=24) (actual time=79.159..93.398 rows=88 loops=1)
         Merge Cond: (b.grp1 = a.grp1)
         Join Filter: rb_contains(b.member_ids, a.id)
         Rows Removed by Join Filter: 9406
         ->  Sort  (cost=90.87..94.05 rows=1270 width=36) (actual time=73.392..77.978 rows=10 loops=1)
               Sort Key: b.grp1
               Sort Method: external merge  Disk: 19600kB
               ->  CTE Scan on b  (cost=0.00..25.40 rows=1270 width=36) (actual time=4.396..44.820 rows=10 loops=1)
         ->  Sort  (cost=828.39..853.39 rows=10000 width=24) (actual time=5.556..7.226 rows=10000 loops=1)
               Sort Key: a.grp1
               Sort Method: quicksort  Memory: 1166kB
               ->  Seq Scan on fact a  (cost=0.00..164.00 rows=10000 width=24) (actual time=0.029..2.741 rows=10000 loops=1)
 Planning Time: 0.214 ms
 Execution Time: 98.528 ms
(18 rows)

CTE中的member_ids::bytea::roaringbitmap类型转换是必须的,这会促使CTE物化时进行detoast。

去掉类型转换,性能和原始的SQL一样差。

postgres=# explain analyze
WITH b as MATERIALIZED(
  select grp1, member_ids from label
)
select a.grp1,rb_cardinality(rb_build_agg(id)), sum(value1) sum_value1,sum(value2) sum_value2
  from fact a join b on(a.grp1 = b.grp1)
  where rb_contains(b.member_ids,id)
  group by a.grp1;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=941.96..2271.39 rows=11 width=28) (actual time=739.516..7810.720 rows=10 loops=1)
   Group Key: a.grp1
   CTE b
     ->  Seq Scan on label  (cost=0.00..22.70 rows=1270 width=36) (actual time=0.009..0.011 rows=10 loops=1)
   ->  Merge Join  (cost=919.26..2036.86 rows=21167 width=24) (actual time=55.051..7809.806 rows=88 loops=1)
         Merge Cond: (b.grp1 = a.grp1)
         Join Filter: rb_contains(b.member_ids, a.id)
         Rows Removed by Join Filter: 9406
         ->  Sort  (cost=90.87..94.05 rows=1270 width=36) (actual time=0.022..0.028 rows=10 loops=1)
               Sort Key: b.grp1
               Sort Method: quicksort  Memory: 25kB
               ->  CTE Scan on b  (cost=0.00..25.40 rows=1270 width=36) (actual time=0.012..0.016 rows=10 loops=1)
         ->  Sort  (cost=828.39..853.39 rows=10000 width=24) (actual time=4.950..20.384 rows=10000 loops=1)
               Sort Key: a.grp1
               Sort Method: quicksort  Memory: 1166kB
               ->  Seq Scan on fact a  (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..2.527 rows=10000 loops=1)
 Planning Time: 0.147 ms
 Execution Time: 7810.785 ms
(18 rows)

小结

  • 通过CTE物化 + ::bytea::roaringbitmap类型转换,可以有效避免大基数roaringbitmap值的重复detoast,提高SQL性能