
# [DRAFT] refactor to grid stride for data_type_detection kernel #14694

## Conversation

**@DanialJavady96** (Contributor) commented Jan 2, 2024:

Just a small draft to make sure I am not misunderstanding anything before I modify more code. Please let me know if this is not how the grid stride is meant to be implemented. Addresses #14066.

This benchmark was run with the higher-occupancy configuration, using the row count as a guide for grid size:

## csv_read_data_type
### [0] NVIDIA GeForce RTX 3080

| data_type |      io       | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |     45x | 335.173 ms | 5.05% | 335.136 ms | 5.05% |        800975115 |         1.202 GiB |       668.564 MiB |
|     FLOAT | DEVICE_BUFFER |     35x | 430.598 ms | 3.25% | 430.556 ms | 3.25% |        623463015 |         1.041 GiB |       713.885 MiB |
|   DECIMAL | DEVICE_BUFFER |     26x | 594.638 ms | 3.35% | 594.632 ms | 3.35% |        451431261 |         1.501 GiB |       783.420 MiB |
| TIMESTAMP | DEVICE_BUFFER |     22x | 686.076 ms | 3.85% | 686.071 ms | 3.85% |        391264596 |         2.281 GiB |       814.268 MiB |
|  DURATION | DEVICE_BUFFER |     18x | 839.339 ms | 3.25% | 839.347 ms | 3.25% |        319814809 |         2.588 GiB |       971.320 MiB |
|    STRING | DEVICE_BUFFER |     76x | 198.889 ms | 4.69% | 198.884 ms | 4.69% |       1349709651 |       859.526 MiB |       277.082 MiB |

Below is the run with a static grid_size of 256, which gives much lower occupancy (15.7%):

# Benchmark Results
## csv_read_data_type
### [0] NVIDIA GeForce RTX 3080

| data_type |      io       | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
|  INTEGRAL | DEVICE_BUFFER |     53x | 287.857 ms | 7.67% | 287.807 ms | 7.67% |        932692483 |         1.202 GiB |       668.564 MiB |
|     FLOAT | DEVICE_BUFFER |     46x | 330.720 ms | 6.69% | 330.697 ms | 6.69% |        811726551 |         1.041 GiB |       713.885 MiB |
|   DECIMAL | DEVICE_BUFFER |     34x | 449.532 ms | 6.43% | 449.526 ms | 6.43% |        597151885 |         1.501 GiB |       783.420 MiB |
| TIMESTAMP | DEVICE_BUFFER |     30x | 514.682 ms | 2.84% | 514.679 ms | 2.84% |        521559027 |         2.281 GiB |       814.268 MiB |
|  DURATION | DEVICE_BUFFER |     24x | 630.456 ms | 3.94% | 630.444 ms | 3.94% |        425788224 |         2.588 GiB |       971.320 MiB |
|    STRING | DEVICE_BUFFER |     90x | 166.848 ms | 2.63% | 166.836 ms | 2.63% |       1608974352 |       859.526 MiB |       277.082 MiB |

For reference, cudaOccupancyMaxPotentialBlockSize reports that a block size of 128 with a grid size of 612 achieves maximum occupancy on this GPU.
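
For context, this is the shape of the query that produces those numbers (a standalone sketch against a stub kernel; the real figures come from the actual data_type_detection kernel in csv_gpu.cu, whose register and shared-memory usage drive the result):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Stub standing in for the real data_type_detection kernel in csv_gpu.cu;
// occupancy results depend on the actual kernel's resource usage.
__global__ void data_type_detection_stub() {}

int main()
{
  int min_grid_size = 0;  // smallest grid size that can still reach max occupancy
  int block_size    = 0;  // block size that maximizes occupancy
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, data_type_detection_stub);
  // For the real kernel on the RTX 3080 above, this reported
  // block_size = 128 and min_grid_size = 612.
  std::printf("block_size=%d min_grid_size=%d\n", block_size, min_grid_size);
  return 0;
}
```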

Some findings:

On my 3080, I am seeing better performance at 15.7% occupancy than at 65-75% occupancy. I do not have access to a workstation GPU, so I cannot profile what would be a practical scenario for a cuIO user. Given that this kernel is largely memory bound, it is hard for me to simulate a real scenario, so I hesitate to draw conclusions from benchmarks on my consumer GPU.

I am also a bit curious what kind of performance gains should be expected from transitioning to a grid stride. From my understanding, the main problems (which compound on each other) are:

  1. Non-coalesced memory access patterns, a consequence of having a single thread process an entire row. This scales with average row size; I think multiple threads per row would help.
  2. Suboptimal caching.
  3. Heavy reliance on atomicAdd operations.
  4. Heavy divergence: since the kernel is generic, it has to handle a variety of types, and many optimizations are lost to this generic design.
  5. Work balance between threads varies too much, also a consequence of variable row lengths.

So I'm curious: would it make more sense to prioritize modernizing the CSV reader instead? I don't have perspective on the timeline for that, so I can't comment.

From this paper, IIUC, there is a lot of interest in transitioning towards DFAs and in converting to columnar format. There are interesting strategies such as using bitmaps to store the positions of different symbols, scanning them to find which column and row each symbol belongs to, and then using a stable radix sort to convert to columnar format. This would solve a lot of the problems I mentioned earlier, because the data layout would be converted into something much easier to operate on. That said, I imagine the DFA will need work and will be a challenge to get right. Since I lack context and experience with these workflows, I'm curious how progress is faring on that and whether I could help somehow.
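
To check my own understanding, here is a toy, sequential CPU sketch of that pipeline (my own illustration with made-up names, not the paper's parallel GPU implementation, which would build the bitmaps with a DFA and do the scans and a radix sort on the device):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

int main()
{
  std::string const csv = "1,a\n2,b\n3,c\n";

  // Pass 1: find field boundaries. The paper builds bitmaps of symbol
  // positions and scans them in parallel; here it is a sequential walk
  // that assigns each field a (row, column) pair.
  struct Field { int row; int col; std::size_t begin; std::size_t end; };
  std::vector<Field> fields;
  int row = 0, col = 0;
  std::size_t begin = 0;
  for (std::size_t i = 0; i < csv.size(); ++i) {
    if (csv[i] == ',' || csv[i] == '\n') {
      fields.push_back({row, col, begin, i});
      begin = i + 1;
      if (csv[i] == ',') { ++col; } else { ++row; col = 0; }
    }
  }

  // Pass 2: a stable sort by column index groups each column's fields
  // together while preserving row order -- i.e. the columnar layout.
  std::stable_sort(fields.begin(), fields.end(),
                   [](Field const& a, Field const& b) { return a.col < b.col; });

  for (auto const& f : fields) {
    std::printf("col %d, row %d: %s\n", f.col, f.row,
                csv.substr(f.begin, f.end - f.begin).c_str());
  }
  return 0;
}
```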

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@DanialJavady96 requested review from a team as code owners on January 2, 2024 18:10
**copy-pr-bot** (bot) commented Jan 2, 2024:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions (bot) added the **libcudf** and **CMake** labels on Jan 2, 2024
@PointKernel added the **cuIO**, **tech debt**, **improvement**, and **Performance** labels on Jan 2, 2024
**@PointKernel** (Member) commented:

/ok to test

@PointKernel added the **non-breaking** label on Jan 2, 2024
**@PointKernel** (Member) commented:

@DanialJavady96 Can you update the copyright year to 2024 to unblock the CI?

An inline review comment on the CMake diff:

    @@ -658,6 +658,21 @@ set_source_files_properties(
      PROPERTIES COMPILE_DEFINITIONS "_FILE_OFFSET_BITS=64"
    )

    set_source_files_properties(

A Contributor commented:

I don't think you've changed anything in csv_gpu.cu that requires CMake alterations. Was there an improperly resolved merge conflict somewhere?

**@bdice** (Contributor) left a comment:

Currently the grid size is chosen such that the number of threads is equal to the row count. This is the core problem we want to address. We want to enable kernels to act on tables larger than the maximum CUDA grid size (currently this is 2^31 - 1 blocks for 1D grids, and blocks have a max of 1024 threads). See #10368.

It looks like the main change so far is to use a for loop with the grid stride, which is half the solution. The other half of the solution is to change the way that the kernel is launched so that it doesn't depend on the row count -- just the kernel itself, to maximize throughput. You may need to do some benchmarking to determine the best strategy, but I think you'll want cudaOccupancyMaxPotentialBlockSize. Search the codebase for grid_size and you'll see various ways to pick a grid size. It's okay to benchmark on your RTX 3080 for testing purposes, and we can validate the final code for any perf changes on a larger GPU once the PR is ready to merge.
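
For reference, a minimal sketch of both halves together; the kernel signature here is illustrative, not the exact csv_gpu.cu interface:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Grid-stride kernel: any num_rows is covered even when the grid is much
// smaller than the row count.
__global__ void data_type_detection(std::size_t num_rows /*, column data, parse options */)
{
  auto const stride = static_cast<std::size_t>(blockDim.x) * gridDim.x;
  for (auto row = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       row < num_rows; row += stride) {
    // examine the characters of row `row` and tally candidate types
  }
}

// Host side: the launch configuration is chosen for occupancy and no longer
// depends on the row count.
void launch_detection(std::size_t num_rows, cudaStream_t stream)
{
  int min_grid_size = 0;
  int block_size    = 0;
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, data_type_detection);
  data_type_detection<<<min_grid_size, block_size, 0, stream>>>(num_rows);
}
```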

First, compute a grid size with cudaOccupancyMaxPotentialBlockSize (or some other approach).

    // Calculate actual block count to use based on records count
    int const block_size = csvparse_block_dim;
    int const grid_size = (row_starts.size() + block_size - 1) / block_size;

Second, remove this comment once it's no longer true:

     * Data is processed in one row/record at a time, so the number of total
     * threads (tid) is equal to the number of rows.

**@DanialJavady96** (Contributor, Author) replied:

> Currently the grid size is chosen such that the number of threads is equal to the row count. This is the core problem we want to address. [...]

Ah yeah, as I mentioned in my OP, I did use cudaOccupancyMaxPotentialBlockSize to see what configuration it would give me, and it reported 612 thread blocks with a block size of 128 threads for my 3080. I should've left that as a comment on that line of code; my bad. I benchmarked with both a constant-sized grid and one that isn't dependent on the row size. The PR currently only shows the version where the number of threads deployed is unchanged.

The benchmark I included shows that the much lower occupancy configuration (15% vs. ~75%) performs better, and I wanted to rule out spec differences, since my 3080 has much less memory than, for example, a V100.

**@bdice** (Contributor) commented Jan 3, 2024:

Ah, thanks. I read the PR description but got a little lost along the way. I appreciate the clarification.

It surprises me that a lower occupancy grid/block configuration is performing better for you. @vuule Would you have any insights on this? I don't know the CSV kernels very well.

**@DanialJavady96** (Contributor, Author) replied:

> Ah, thanks. I read the PR description but got a little lost along the way. [...]

All good! I get tunnel-visioned easily and definitely could have worded this better and more concisely. Something for me to work on 😅

**@GregoryKimball** (Contributor) commented:

Thank you @DanialJavady96 for studying this issue. For now we should focus on preventing thread-indexing overflow, not on performance tuning. Please also see #10368 and #13910 for reference.

@vyasr added the **4 - Needs Review** label and removed the **tech debt** label on Feb 23, 2024