Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorized aggregation with grouping by one fixed-size column #7341

Merged
merged 68 commits into from
Jan 2, 2025

Conversation

akuzm
Copy link
Member

@akuzm akuzm commented Oct 14, 2024

The implementation uses the Postgres simplehash hash table for by-value fixed-size compressed columns.

The biggest improvement on a "sensible" query is about 90%, and a couple of queries show bigger improvements but these are very synthetic cases that don't make much sense:
https://grafana.ops.savannah-dev.timescale.com/d/fasYic_4z/compare-akuzm?orgId=1&var-branch=All&var-run1=3815&var-run2=3816&var-threshold=0.02&var-use_historical_thresholds=true&var-threshold_expression=2%20%2A%20percentile_cont%280.90%29&var-exact_suite_version=false&from=now-2d&to=now

Copy link

codecov bot commented Oct 14, 2024

Codecov Report

Attention: Patch coverage is 93.40659% with 24 lines in your changes missing coverage. Please review.

Project coverage is 82.34%. Comparing base (59f50f2) to head (2bcef48).
Report is 672 commits behind head on main.

Files with missing lines Patch % Lines
tsl/src/nodes/vector_agg/grouping_policy_hash.c 91.13% 3 Missing and 11 partials ⚠️
tsl/src/nodes/vector_agg/plan.c 88.88% 2 Missing and 3 partials ⚠️
...nodes/vector_agg/function/agg_many_vector_helper.c 95.00% 0 Missing and 1 partial ⚠️
...rc/nodes/vector_agg/hashing/batch_hashing_params.h 85.71% 0 Missing and 1 partial ⚠️
...rc/nodes/vector_agg/hashing/hash_strategy_common.c 94.44% 0 Missing and 1 partial ⚠️
.../src/nodes/vector_agg/hashing/hash_strategy_impl.c 98.18% 0 Missing and 1 partial ⚠️
..._agg/hashing/hash_strategy_impl_single_fixed_key.c 95.65% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7341      +/-   ##
==========================================
+ Coverage   80.06%   82.34%   +2.27%     
==========================================
  Files         190      238      +48     
  Lines       37181    43722    +6541     
  Branches     9450    10970    +1520     
==========================================
+ Hits        29770    36002    +6232     
- Misses       2997     3385     +388     
+ Partials     4414     4335      -79     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

}
}

/*
* Currently the only grouping policy we use is per-batch grouping.
* Determine which grouping policy we are going to use.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity: Why is the grouping policy decided at execution time and not plan time? Should it not affect the plan and cost calc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it all to plan time, although as we discussed today on call, it doesn't affect the costs yet.

tsl/src/nodes/vector_agg/grouping_policy_hash.c Outdated Show resolved Hide resolved
tsl/src/nodes/vector_agg/grouping_policy_hash.h Outdated Show resolved Hide resolved
tsl/src/nodes/vector_agg/grouping_policy_hash.h Outdated Show resolved Hide resolved
@akuzm akuzm enabled auto-merge (squash) January 2, 2025 19:45
@akuzm akuzm merged commit 11e866e into timescale:main Jan 2, 2025
49 of 50 checks passed
@akuzm akuzm deleted the hash-simple branch January 2, 2025 19:46
pallavisontakke added a commit to pallavisontakke/timescaledb that referenced this pull request Jan 16, 2025
This release contains performance improvements and bug fixes since
the 2.17.2 release. We recommend that you upgrade at the next
available opportunity.

**Features**
* timescale#6901: Add hypertable support for transition tables.
* timescale#7104: Hypercore table access method.
* timescale#7271: Push down `order by` in real-time continuous aggregate queries.
* timescale#7295: Support `alter table set access method` on hypertable.
* timescale#7341: Vectorized aggregation with grouping by one fixed-size by-value compressed column
* timescale#7390: Disable custom `hashagg` planner code.
* timescale#7411: Change parameter name to enable hypercore table access method.
* timescale#7412: Add GUC for `hypercore_use_access_method` default.
* timescale#7413: Add GUC for segmentwise recompression.
* timescale#7433 Add support for merging chunks
* timescale#7436 Add index creation on orderby columns
* timescale#7443: Add hypercore function and view aliases.
* timescale#7455: Support `drop not null` on compressed hypertables.
* timescale#7458: Support vecorized aggregation with aggregate `filter` clauses that are also vectorizable.
* timescale#7482: Optimize recompression of partially compressed chunks.
* timescale#7486: Prevent building against postgres versions with broken ABI.
* timescale#7521 Add optional `force` argument to `refresh_continuous_aggregate`
* timescale#7528 Transform sorting on `time_bucket` to sorting on time for compressed chunks in some cases.
* timescale#7565 Add hint when hypertable creation fails
* timescale#7587 Add `include_tiered_data` parameter to `add_continuous_aggregate_policy` API

**Bugfixes**
* timescale#7378: Remove obsolete job referencing `policy_job_error_retention`.
* timescale#7409: Update `bgw_job` table when altering procedure.
* timescale#7410: Fix the `aggregated compressed column not found` error on aggregation query.
* timescale#7426: Fix `datetime` parsing error in chunk constraint creation.
* timescale#7432: Verify that the heap tuple is valid before using.
* timescale#7434: Fixes the segfault when internally setting the replica identity for a given chunk.
* timescale#7488: Emit error for transition table trigger on chunks.
* timescale#7514: Fix the error: `invalid child of chunk append`.
* timescale#7517 Fixes performance regression on `cagg_migrate` procedure
* timescale#7527 Restart scheduler on error
* timescale#7557: Fix null handling for in-memory tuple filtering.
* timescale#7566 Improve transaction check in CAgg refresh
* timescale#7584 Fix NaN-handling for vectorized aggregation

**Thanks**
* @bharrisau for reporting the segfault when creating chunks.
* @k-rus for suggesting the improvement
* @pgloader for reporting the issue in an internal background job.
* @staticlibs for sending PR to improve transaction check in CAgg refresh
* @uasiddiqi for reporting the `aggregated compressed column not found` error.
pallavisontakke added a commit to pallavisontakke/timescaledb that referenced this pull request Jan 17, 2025
This release contains performance improvements and bug fixes since
the 2.17.2 release. We recommend that you upgrade at the next
available opportunity.

**Features**
* timescale#6901: Add hypertable support for transition tables.
* timescale#7104: Hypercore table access method.
* timescale#7271: Push down `order by` in real-time continuous aggregate queries.
* timescale#7295: Support `alter table set access method` on hypertable.
* timescale#7341: Vectorized aggregation with grouping by one fixed-size by-value compressed column
* timescale#7390: Disable custom `hashagg` planner code.
* timescale#7411: Change parameter name to enable hypercore table access method.
* timescale#7412: Add GUC for `hypercore_use_access_method` default.
* timescale#7413: Add GUC for segmentwise recompression.
* timescale#7433 Add support for merging chunks
* timescale#7436 Add index creation on orderby columns
* timescale#7443: Add hypercore function and view aliases.
* timescale#7455: Support `drop not null` on compressed hypertables.
* timescale#7458: Support vecorized aggregation with aggregate `filter` clauses that are also vectorizable.
* timescale#7482: Optimize recompression of partially compressed chunks.
* timescale#7486: Prevent building against postgres versions with broken ABI.
* timescale#7521 Add optional `force` argument to `refresh_continuous_aggregate`
* timescale#7528 Transform sorting on `time_bucket` to sorting on time for compressed chunks in some cases.
* timescale#7565 Add hint when hypertable creation fails
* timescale#7587 Add `include_tiered_data` parameter to `add_continuous_aggregate_policy` API

**Bugfixes**
* timescale#7378: Remove obsolete job referencing `policy_job_error_retention`.
* timescale#7409: Update `bgw_job` table when altering procedure.
* timescale#7410: Fix the `aggregated compressed column not found` error on aggregation query.
* timescale#7426: Fix `datetime` parsing error in chunk constraint creation.
* timescale#7432: Verify that the heap tuple is valid before using.
* timescale#7434: Fixes the segfault when internally setting the replica identity for a given chunk.
* timescale#7488: Emit error for transition table trigger on chunks.
* timescale#7514: Fix the error: `invalid child of chunk append`.
* timescale#7517 Fixes performance regression on `cagg_migrate` procedure
* timescale#7527 Restart scheduler on error
* timescale#7557: Fix null handling for in-memory tuple filtering.
* timescale#7566 Improve transaction check in CAgg refresh
* timescale#7584 Fix NaN-handling for vectorized aggregation

**Thanks**
* @bharrisau for reporting the segfault when creating chunks.
* @k-rus for suggesting the improvement
* @pgloader for reporting the issue in an internal background job.
* @staticlibs for sending PR to improve transaction check in CAgg refresh
* @uasiddiqi for reporting the `aggregated compressed column not found` error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants