Avoid decoding long runs in a single thread #16304

Draft
wants to merge 27 commits into
base: branch-24.10

gerashegalov
Contributor

@gerashegalov gerashegalov commented Jul 18, 2024

Description

Split long runs among threads

Lead Co-authored-by: @abellina
Co-authored-by: @gerashegalov
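
For readers unfamiliar with the idea, here is a minimal host-side sketch of what "split long runs among threads" can mean for an RLE run. It is a plain-Python model, not the kernel code from this PR: `split_run` and `decode_run_parallel` are hypothetical names, and the even split is an assumed policy chosen for illustration.

```python
def split_run(run_length, num_threads):
    """Partition one run of `run_length` repeated values into contiguous
    (start, count) slices, one per thread, with sizes differing by at most 1.

    Hypothetical helper for illustration; the PR's kernel may split differently.
    """
    base, extra = divmod(run_length, num_threads)
    slices = []
    start = 0
    for t in range(num_threads):
        # The first `extra` threads take one additional element.
        count = base + (1 if t < extra else 0)
        slices.append((start, count))
        start += count
    return slices


def decode_run_parallel(value, run_length, num_threads=32):
    """Model of decoding a single run where each loop iteration stands in
    for one thread writing its own slice of the output."""
    out = [None] * run_length
    for start, count in split_run(run_length, num_threads):
        for i in range(start, start + count):
            out[i] = value
    return out
```

With a single-threaded decode, one thread would write all `run_length` outputs while the others idle; in the split version each thread writes roughly `run_length / num_threads` outputs.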

Benchmarks

Generated files, each with a single integer column of logically ~1 billion rows all valued 1, using these page layouts:

  • 4 pages, with 250 million rows per page
  • 32 pages, with 33 million rows per page
  • 1024 pages, with 1 million rows per page
  • 4475 pages, with 240 thousand rows per page
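
As a quick sanity check on the "~1 billion rows" figure, the page layouts above can be multiplied out. The per-page counts used here are the rounded values from the list, so the totals are approximate; `total_rows` is a name introduced only for this sketch.

```python
# (pages, rows per page) as listed above; rows per page are rounded figures.
LAYOUTS = [
    (4, 250_000_000),
    (32, 33_000_000),
    (1024, 1_000_000),
    (4475, 240_000),
]


def total_rows(pages, rows_per_page):
    """Logical row count of one generated file."""
    return pages * rows_per_page


totals = [total_rows(p, r) for p, r in LAYOUTS]
for (pages, rows), total in zip(LAYOUTS, totals):
    print(f"{pages:>5} pages x {rows:>11,} rows = {total:,}")
```

Each layout lands within roughly 10% of one billion rows, so the files differ mainly in how the same logical data is divided into pages.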

The benchmark Spark app iterates over these files and, for each one, executes

spark.read.parquet(path).selectExpr("SUM(ones)") 

nsys results for gpuDecodePageDataGeneric, PR branch vs branch-24.10:

| branch | time | registers per thread | shared mem executed | theoretical occupancy | latency |
|---|---|---|---|---|---|
| 24.10 | 1.762 s | 72 | 32,768 | 87.5 % | 9.632 μs |
| PR | 1.911 s | 64 | 65,536 | 100 % | 10.028 μs |
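
The occupancy figures are consistent with a register-limited estimate. The sketch below is an assumption-heavy model: the GPU is not named in this PR, and the limits used (65,536 registers per SM, 32 resident warps per SM, registers allocated per warp in units of 256, warps allocated in groups of 4) are simply one set of values that reproduces 87.5 % and 100 %. Shared memory and block-size limits are ignored, and `theoretical_occupancy` is a hypothetical helper, not a profiler API.

```python
import math


def theoretical_occupancy(
    regs_per_thread,
    regs_per_sm=65_536,        # assumed register file size per SM
    max_warps_per_sm=32,       # assumed resident-warp limit per SM
    warp_size=32,
    reg_alloc_unit=256,        # assumed per-warp register allocation unit
    warp_alloc_granularity=4,  # assumed warp allocation granularity
):
    """Register-limited occupancy estimate under the assumed limits above."""
    # Registers are allocated per warp, rounded up to the allocation unit.
    regs_per_warp = (
        math.ceil(regs_per_thread * warp_size / reg_alloc_unit) * reg_alloc_unit
    )
    # How many warps fit in the register file, rounded down to the granularity.
    warps = regs_per_sm // regs_per_warp
    warps -= warps % warp_alloc_granularity
    return min(warps, max_warps_per_sm) / max_warps_per_sm


print(theoretical_occupancy(72))  # branch-24.10: 72 registers per thread
print(theoretical_occupancy(64))  # PR: 64 registers per thread
```

Under these assumptions, dropping from 72 to 64 registers per thread lets 32 warps fit instead of 28, which matches the reported move from 87.5 % to 100 %.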

ncu:

| branch | compute throughput | memory throughput |
|---|---|---|
| 24.10 | 0.21 | 0.21 |
| PR | 0.24 | 0.24 |

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

abellina and others added 16 commits May 14, 2024 11:30
Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Gera Shegalov <[email protected]>
…fixed_ukernel_rlestream_24.06_rebase_load_balancing
…fixed_ukernel_rlestream_24.08_rebase_load_balancing
…_rlestream_24.06_rebase_load_balancing' into gerashegalov/fixed_ukernel_rlestream_24.08_rebase_load_balancing
…fixed_ukernel_rlestream_24.08_rebase_load_balancing
…v/fixed_ukernel_rlestream_24.08_rebase_load_balancing
…v/fixed_ukernel_rlestream_24.08_rebase_load_balancing
Signed-off-by: Gera Shegalov <[email protected]>
…fixed_ukernel_rlestream_24.08_rebase_load_balancing
…fixed_ukernel_rlestream_24.08_rebase_load_balancing
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov gerashegalov added the 4 - Needs Review, cuIO, and Performance labels Jul 18, 2024
@gerashegalov gerashegalov self-assigned this Jul 18, 2024

copy-pr-bot bot commented Jul 18, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf label Jul 18, 2024
@gerashegalov gerashegalov changed the title Gerashegalov/fixed ukernel rlestream 24.08 rebase load balancing Avoid decoding long runs in a single thread Jul 18, 2024
@gerashegalov gerashegalov added the Spark, non-breaking, and feature request labels Jul 18, 2024
…fixed_ukernel_rlestream_24.08_rebase_load_balancing
Signed-off-by: Gera Shegalov <[email protected]>
…fixed_ukernel_rlestream_24.08_rebase_load_balancing
…fixed_ukernel_rlestream_24.10_rebase_load_balancing
…fixed_ukernel_rlestream_24.10_rebase_load_balancing
…alov/fixed_ukernel_rlestream_24.10_rebase_load_balancing
@gerashegalov gerashegalov changed the base branch from branch-24.08 to branch-24.10 September 12, 2024 18:44
…fixed_ukernel_rlestream_24.08_rebase_load_balancing
…fixed_ukernel_rlestream_24.08_rebase_load_balancing
don't forget to undo
Signed-off-by: Gera Shegalov <[email protected]>
@gerashegalov
Contributor Author

gerashegalov commented Sep 24, 2024

Investigating a bug in this PR where batch_len goes negative on a 33M-row run:

DEBUG thread=96 warp_id=3 warp_lane=0 => batch_len=-8 negative=1 (size=33554423 remaining=96 max_count=526848, last_run_pos=526856)
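
The logged values are consistent with batch_len being the difference max_count - last_run_pos (526848 - 526856 = -8). That relationship is inferred from the numbers alone; the kernel code is not shown in this thread, and the function names below are hypothetical. A minimal sketch of how such a difference goes negative, and of a generic clamp that would guard against it:

```python
def batch_len_unclamped(max_count, last_run_pos):
    """Assumed relationship inferred from the DEBUG line, not from the PR's code:
    values left to emit before reaching max_count."""
    return max_count - last_run_pos


def batch_len_clamped(max_count, last_run_pos):
    """Generic guard: never report a negative batch when the run position
    has already passed max_count. The actual fix in the PR may differ."""
    return max(0, max_count - last_run_pos)


# Values copied from the DEBUG line above.
print(batch_len_unclamped(526848, 526856))
print(batch_len_clamped(526848, 526856))
```

The negative value appears as soon as last_run_pos overshoots max_count, which is the case a long run split across threads has to handle explicitly.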

@gerashegalov
Contributor Author

Negative batch fixed with 09dd99e

@gerashegalov
Contributor Author

If the benchmark is scaled 100x by replicating the 4-pages-by-250-million-rows file, the PR branch's performance drops significantly:

Base wall clock for the query: 360 seconds
PR wall clock: 433 seconds
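
To put the numbers in relative terms, a small calculation over the figures reported in this thread (`relative_change` is a name introduced for this sketch):

```python
def relative_change(base, new):
    """Fractional change of `new` relative to `base` (positive means slower)."""
    return (new - base) / base


# 100x-scaled query wall clock, seconds (base vs PR).
wall_clock = relative_change(360, 433)
# gpuDecodePageDataGeneric kernel time from the nsys table, seconds.
kernel_time = relative_change(1.762, 1.911)

print(f"wall clock:  {wall_clock * 100:+.1f}%")
print(f"kernel time: {kernel_time * 100:+.1f}%")
```

The scaled query is about 20% slower on the PR branch, noticeably more than the roughly 8.5% difference in kernel time seen in the smaller benchmark.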

Labels
4 - Needs Review, cuIO, feature request, libcudf, non-breaking, Performance, Spark

3 participants