WIP: Testing-only PR to check maint-19.0.1 status #45401

Open
wants to merge 10 commits into
base: maint-19.0.0

Conversation

amoeba
Member

@amoeba amoeba commented Jan 31, 2025

Caution

Do not merge this PR.

This PR is only to pre-check the first RC for 19.0.1 with crossbow and should not be merged.

zeroshade and others added 8 commits January 31, 2025 13:46
### Rationale for this change
If the value for Decimal32 or Decimal64 is `INT32_MIN` or `INT64_MIN` respectively, then UBSAN reports an issue when calling Negate on them due to overflow. 

### What changes are included in this PR?
Have the `Negate` methods of Decimal32 and Decimal64 use `arrow::internal::SafeSignedNegate`.
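
As a rough illustration of why plain negation overflows here, the sketch below models 32-bit two's-complement arithmetic in Python. The helper names `to_int32` and `safe_signed_negate32` are hypothetical, and the sketch assumes that `arrow::internal::SafeSignedNegate` negates through the unsigned representation so that the result wraps instead of triggering undefined behaviour; it is not the Arrow implementation itself.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1
MASK32 = 0xFFFFFFFF


def to_int32(bits: int) -> int:
    """Reinterpret the low 32 bits of `bits` as a signed two's-complement value."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits


def safe_signed_negate32(value: int) -> int:
    """Negate through the unsigned representation (~u + 1), wrapped to 32 bits."""
    unsigned = value & MASK32
    return to_int32((~unsigned + 1) & MASK32)


# The mathematical negation of INT32_MIN does not fit in a signed 32-bit value,
# which is the overflow UBSAN reports for a plain `-value`.
negation_fits = -INT32_MIN <= INT32_MAX

# The wrapping negation is well defined and maps INT32_MIN back to itself.
wrapped = safe_signed_negate32(INT32_MIN)
```

The same reasoning applies to `INT64_MIN` with a 64-bit mask.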

### Are these changes tested?
Unit tests were added for both cases which were able to reproduce the problem when UBSAN was on without the fix.

### Are there any user-facing changes?
No.

* OSS-Fuzz issue: https://issues.oss-fuzz.com/issues/371239168

* GitHub Issue: #45180

Authored-by: Matt Topol <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
…blems (#45228)

This adds a LinkedIn icon social link to the main docs pages and fixes the layout of the top nav, which previously looked bad
at widths between 960px and 1200px.

* GitHub Issue: #45230
#45108)

### Rationale for this change

#44513 triggers two distinct overflow issues within swiss join, both happening when the build side table contains a large enough number of rows or distinct keys. (Hash join build sides of this size are rather rare, so we haven't seen them reported until now.)

1. The first issue: our swiss table implementation takes the higher `N` bits of the 32-bit hash value as the index into a buffer storing "block"s (a block contains `8` key ids, which some code also refers to as "group" ids). This `N`-bit number is then multiplied by the size of a block, which also depends on `N`. In the case of #44513, `N` is `26` and a block takes `40` bytes, so the multiplication can produce a number over `1 << 31` (negative when interpreted as signed 32-bit). In our AVX2 specialization of accessing the block buffer https://github.com/apache/arrow/blob/0a00e25f2f6fb927fb555b69038d0be9b9d9f265/cpp/src/arrow/compute/key_map_internal_avx2.cc#L404 , an issue like #41813 (comment) shows up. This is the actual issue that directly produced the segfault in #44513.
2. The other issue: we take the `7` bits of the 32-bit hash value after `N` as a "stamp" (to quickly fail the hash comparison). But when `N` is greater than `25`, arithmetic code like https://github.com/apache/arrow/blob/0a00e25f2f6fb927fb555b69038d0be9b9d9f265/cpp/src/arrow/compute/key_map_internal.cc#L397 (`bits_hash_` is `constexpr 32`, `log_blocks_` is `N`, `bits_stamp_` is `constexpr 7`; this retrieves the stamp from a hash) produces `hash >> -1` aka `hash >> 0xFFFFFFFF` aka `hash >> 31` (the leading `1`s are trimmed), so the stamp value is wrong and results in false-mismatched rows. This is the reason for my false positive run in #44513 (comment) .
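
The arithmetic behind both issues can be checked with a short sketch. The constants come from the description above (`N = 26`, `40`-byte blocks, 32-bit hashes, 7-bit stamps); the variable names and the `to_int32` helper are hypothetical, and the masking of the shift count to its low 5 bits is an assumption matching the `hash >> 31` behaviour described.

```python
LOG_BLOCKS = 26   # N in the description
BLOCK_BYTES = 40  # size of one block for this N, per the description
BITS_HASH = 32
BITS_STAMP = 7


def to_int32(bits: int) -> int:
    """Reinterpret the low 32 bits of `bits` as a signed two's-complement value."""
    bits &= 0xFFFFFFFF
    return bits - (1 << 32) if bits & 0x80000000 else bits


# Issue 1: the largest block id times the block size no longer fits in 31 bits,
# so a signed 32-bit byte offset goes negative.
max_block_id = (1 << LOG_BLOCKS) - 1
max_offset = max_block_id * BLOCK_BYTES
offset_overflows = max_offset >= (1 << 31)
offset_as_int32 = to_int32(max_offset)

# Issue 2: the shift that should isolate block id + stamp becomes negative.
stamp_shift = BITS_HASH - LOG_BLOCKS - BITS_STAMP
# Assumption: only the low 5 bits of the count are used, hence "hash >> 31".
effective_shift = stamp_shift & 31
```

With a 64-bit offset the product in issue 1 is representable, which is why a 64-bit index gather avoids the problem.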

### What changes are included in this PR?

For issue 1, use 64-bit index gather intrinsic to avoid the offset overflow.

For issue 2, do not right-shift the hash if `N + 7 >= 32`. This effectively allows the bits of the block id (the `N` bits) and the stamp (the `7` bits) to overlap. Though this may introduce more false-positive hash comparisons (and thus worsen performance), I think it is still more reasonable than failing outright for `N > 25`. I introduce two members, `bits_shift_for_block_and_stamp_` and `bits_shift_for_block_`, which are derived from `log_blocks_` (in particular, set to `0` and `32 - N` when `N + 7 >= 32`); this avoids branching like `if (log_blocks_ + bits_stamp_ > bits_hash_)` in tight loops.
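
A minimal sketch of how the two precomputed shifts could be derived and used, written in Python for brevity. The function names `shifts_for` and `split_hash` are hypothetical, and the values for the ordinary case (`32 - N - 7` and `7`) are an assumption inferred from the description rather than copied from the patch.

```python
BITS_HASH = 32
BITS_STAMP = 7


def shifts_for(log_blocks: int) -> tuple[int, int]:
    """Return (shift_for_block_and_stamp, shift_for_block) for a given N."""
    if log_blocks + BITS_STAMP >= BITS_HASH:
        # Not enough hash bits: keep the whole hash and let block id and stamp overlap.
        return 0, BITS_HASH - log_blocks
    # Ordinary case (assumed): the top N + 7 bits hold block id followed by stamp.
    return BITS_HASH - log_blocks - BITS_STAMP, BITS_STAMP


def split_hash(hash32: int, log_blocks: int) -> tuple[int, int]:
    """Extract (block_id, stamp) from a 32-bit hash without branching on N."""
    shift_block_and_stamp, shift_block = shifts_for(log_blocks)
    block_and_stamp = hash32 >> shift_block_and_stamp
    stamp = block_and_stamp & ((1 << BITS_STAMP) - 1)
    block_id = block_and_stamp >> shift_block
    return block_id, stamp


# N = 26: the stamp is the low 7 bits of the hash and shares one bit with the block id.
block_id, stamp = split_hash(0x80000001, 26)
```

Computing the shifts once per table size keeps the per-row work to two shifts and a mask, which matches the stated goal of avoiding a branch in the tight loop. At `N = 25` both branches give the same pair, so `>=` and `>` are interchangeable at the boundary.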

### Are these changes tested?

The fix was manually tested with the original case on my local machine. (I do have a concrete C++ unit test to verify the fix, but it requires too many resources and runs for too long, so it is impractical to run in any reasonable CI environment.)

### Are there any user-facing changes?

None.

* GitHub Issue: #44513

Lead-authored-by: Rossi Sun <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Rossi Sun <[email protected]>
…5285)

The level histogram of size statistics can be omitted if its max level is 0. We haven't implemented this yet and enforce the histogram size to be equal to `max_level + 1`. As a result, an exception is thrown when reading a Parquet file whose level histogram is omitted.

Omit level histogram when max level is 0.
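
The rule can be sketched as follows. Both functions are hypothetical illustrations of the check described above, not the Parquet C++ code: an empty histogram is accepted when the max level is 0, and otherwise the size must be `max_level + 1`.

```python
def is_valid_level_histogram(histogram: list[int], max_level: int) -> bool:
    """Check a level histogram read from a file against the column's max level."""
    if max_level == 0 and not histogram:
        # The histogram may be omitted entirely when there is only one level.
        return True
    return len(histogram) == max_level + 1


def histogram_to_write(histogram: list[int], max_level: int) -> list[int]:
    """Writer side: omit the histogram when the max level is 0."""
    return [] if max_level == 0 else list(histogram)


omitted_ok = is_valid_level_histogram([], max_level=0)
wrong_size = is_valid_level_histogram([3, 1], max_level=2)
```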

Yes, a test case has been added to reflect the change.

No.
* GitHub Issue: #45283

Lead-authored-by: Gang Wu <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
…d MinIO (#45310)

### Rationale for this change

Some AWS SDK versions have faulty chunked encoding when the body is 0 bytes:
aws/aws-sdk-cpp#3259

### What changes are included in this PR?

Work around faulty chunked encoding implementation by only setting a body stream if non-empty.

### Are these changes tested?

Locally for now, but will be picked up by CI (and conda-forge) at some point.

### Are there any user-facing changes?

No.

* GitHub Issue: #45304

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change

See #45120

### What changes are included in this PR?
Disable pointless test

### Are these changes tested?

N/A

### Are there any user-facing changes?
No
* GitHub Issue: #45357

Lead-authored-by: David Li <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
…nd multiple row groups (#45350)

### Rationale for this change

The logic for loading `arrow::ArrayStatistics` depends on `parquet::ColumnChunkMetaData`.

We can't get `parquet::ColumnChunkMetaData` when requested row groups are empty because no associated row group and column chunk exist.

We can't use multiple `parquet::ColumnChunkMetaData`s for now because we don't have statistics merge logic. So we can't load statistics when we use multiple row groups. 

### What changes are included in this PR?

* Don't load statistics when no row groups are used
* Don't load statistics when multiple row groups are used
* Add `parquet::ArrowReaderProperties::{set_,}should_load_statistics()` to enforce loading statistics by loading row groups one by one
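
The resulting behaviour can be modelled with a small sketch. The function names are hypothetical and the code only mirrors the rules listed above: statistics come from a single column chunk, so they are available only when exactly one row group is read, and forcing statistics means reading the row groups one at a time.

```python
def can_load_statistics(row_groups: list[int]) -> bool:
    """Statistics are loaded only when exactly one row group backs the array."""
    return len(row_groups) == 1


def plan_reads(row_groups: list[int], should_load_statistics: bool) -> list[list[int]]:
    """Split a request into reads; one row group per read when statistics are required."""
    if not row_groups:
        return []
    if should_load_statistics:
        return [[index] for index in row_groups]
    return [list(row_groups)]


default_plan = plan_reads([0, 1, 2], should_load_statistics=False)
forced_plan = plan_reads([0, 1, 2], should_load_statistics=True)
```

In the default plan the single read covers three row groups, so no statistics are attached; in the forced plan every read covers one row group.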

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #45339

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
…pandas>=2.3 (#45383)

The option already exists in pandas 2.2, but for that version our code does not work, so restricting it to pandas >= 2.3

* GitHub Issue: #45296

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
@amoeba
Member Author

amoeba commented Jan 31, 2025

@github-actions crossbow submit --group verify-rc-source

@amoeba
Member Author

amoeba commented Jan 31, 2025

@github-actions crossbow submit --group packaging

github-actions bot added the "awaiting review" label Jan 31, 2025

Revision: e696b26

Submitted crossbow builds: ursacomputing/crossbow @ actions-0ef1a4e171

Task Status
verify-rc-source-cpp-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-cpp-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-cpp-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-cpp-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-cpp-macos-amd64 GitHub Actions
verify-rc-source-cpp-macos-arm64 GitHub Actions
verify-rc-source-cpp-macos-conda-amd64 GitHub Actions
verify-rc-source-csharp-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-csharp-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-csharp-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-csharp-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-csharp-macos-amd64 GitHub Actions
verify-rc-source-csharp-macos-arm64 GitHub Actions
verify-rc-source-integration-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-integration-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-integration-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-integration-macos-amd64 GitHub Actions
verify-rc-source-integration-macos-arm64 GitHub Actions
verify-rc-source-integration-macos-conda-amd64 GitHub Actions
verify-rc-source-js-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-js-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-js-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-js-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-js-macos-amd64 GitHub Actions
verify-rc-source-js-macos-arm64 GitHub Actions
verify-rc-source-python-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-python-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-python-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-python-macos-amd64 GitHub Actions
verify-rc-source-python-macos-arm64 GitHub Actions
verify-rc-source-python-macos-conda-amd64 GitHub Actions
verify-rc-source-ruby-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-ruby-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-ruby-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-ruby-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-ruby-macos-amd64 GitHub Actions
verify-rc-source-ruby-macos-arm64 GitHub Actions
verify-rc-source-windows GitHub Actions


Revision: e696b26

Submitted crossbow builds: ursacomputing/crossbow @ actions-aae58ed5ac

Task Status
almalinux-8-amd64 GitHub Actions
almalinux-8-arm64 GitHub Actions
almalinux-9-amd64 GitHub Actions
almalinux-9-arm64 GitHub Actions
amazon-linux-2023-amd64 GitHub Actions
amazon-linux-2023-arm64 GitHub Actions
centos-7-amd64 GitHub Actions
centos-8-stream-amd64 GitHub Actions
centos-8-stream-arm64 GitHub Actions
centos-9-stream-amd64 GitHub Actions
centos-9-stream-arm64 GitHub Actions
conan-maximum GitHub Actions
conan-minimum GitHub Actions
debian-bookworm-amd64 GitHub Actions
debian-bookworm-arm64 GitHub Actions
debian-trixie-amd64 GitHub Actions
debian-trixie-arm64 GitHub Actions
matlab GitHub Actions
nuget GitHub Actions
python-sdist GitHub Actions
r-binary-packages GitHub Actions
test-debian-12-docs GitHub Actions
ubuntu-focal-amd64 GitHub Actions
ubuntu-focal-arm64 GitHub Actions
ubuntu-jammy-amd64 GitHub Actions
ubuntu-jammy-arm64 GitHub Actions
ubuntu-noble-amd64 GitHub Actions
ubuntu-noble-arm64 GitHub Actions
wheel-macos-monterey-cp310-cp310-amd64 GitHub Actions
wheel-macos-monterey-cp310-cp310-arm64 GitHub Actions
wheel-macos-monterey-cp311-cp311-amd64 GitHub Actions
wheel-macos-monterey-cp311-cp311-arm64 GitHub Actions
wheel-macos-monterey-cp312-cp312-amd64 GitHub Actions
wheel-macos-monterey-cp312-cp312-arm64 GitHub Actions
wheel-macos-monterey-cp313-cp313-amd64 GitHub Actions
wheel-macos-monterey-cp313-cp313-arm64 GitHub Actions
wheel-macos-monterey-cp313-cp313t-amd64 GitHub Actions
wheel-macos-monterey-cp313-cp313t-arm64 GitHub Actions
wheel-macos-monterey-cp39-cp39-amd64 GitHub Actions
wheel-macos-monterey-cp39-cp39-arm64 GitHub Actions
wheel-manylinux-2-28-cp310-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313-amd64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313-arm64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313t-amd64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313t-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp313-cp313-amd64 GitHub Actions
wheel-manylinux-2014-cp313-cp313-arm64 GitHub Actions
wheel-manylinux-2014-cp313-cp313t-amd64 GitHub Actions
wheel-manylinux-2014-cp313-cp313t-arm64 GitHub Actions
wheel-manylinux-2014-cp39-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp313-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

…stics (#45202)

We found out in #45085 that there is a non-trivial overhead when writing size statistics is enabled.

Dramatically reduce overhead by speeding up def/rep levels histogram updates.

Performance results on the author's machine:
```
------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                   8103053 ns      8098569 ns           86 bytes_per_second=1003.26Mi/s items_per_second=129.477M/s output_size=537.472k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>            8153499 ns      8148492 ns           86 bytes_per_second=997.117Mi/s items_per_second=128.683M/s output_size=537.488k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>     8212560 ns      8207754 ns           83 bytes_per_second=989.918Mi/s items_per_second=127.754M/s output_size=537.502k page_index_size=47

BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                 10405020 ns     10400775 ns           67 bytes_per_second=444.142Mi/s items_per_second=100.817M/s output_size=848.305k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>          10464784 ns     10460778 ns           66 bytes_per_second=441.594Mi/s items_per_second=100.239M/s output_size=848.325k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>   10469832 ns     10465739 ns           67 bytes_per_second=441.385Mi/s items_per_second=100.191M/s output_size=848.344k page_index_size=48

BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                       13004962 ns     12992678 ns           52 bytes_per_second=657.101Mi/s items_per_second=80.7052M/s output_size=617.464k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>                13718352 ns     13705599 ns           50 bytes_per_second=622.921Mi/s items_per_second=76.5071M/s output_size=617.486k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>         13845553 ns     13832138 ns           52 bytes_per_second=617.222Mi/s items_per_second=75.8072M/s output_size=617.506k page_index_size=54

BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                      15715263 ns     15702707 ns           44 bytes_per_second=320.449Mi/s items_per_second=66.7768M/s output_size=927.326k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>               16507328 ns     16493800 ns           43 bytes_per_second=305.079Mi/s items_per_second=63.5739M/s output_size=927.352k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>        16575359 ns     16561311 ns           42 bytes_per_second=303.836Mi/s items_per_second=63.3148M/s output_size=927.377k page_index_size=55
```

Performance results without this PR:
```
------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                      Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                   8042576 ns      8037678 ns           87 bytes_per_second=1010.86Mi/s items_per_second=130.458M/s output_size=537.472k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>            9576627 ns      9571279 ns           73 bytes_per_second=848.894Mi/s items_per_second=109.554M/s output_size=537.488k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>     9570204 ns      9563595 ns           73 bytes_per_second=849.576Mi/s items_per_second=109.642M/s output_size=537.502k page_index_size=47

BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType>                 10165397 ns     10160868 ns           69 bytes_per_second=454.628Mi/s items_per_second=103.197M/s output_size=848.305k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>          11662568 ns     11657396 ns           60 bytes_per_second=396.265Mi/s items_per_second=89.9494M/s output_size=848.325k page_index_size=34
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>   11657135 ns     11653063 ns           60 bytes_per_second=396.412Mi/s items_per_second=89.9829M/s output_size=848.344k page_index_size=48

BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type>                       13182006 ns     13168704 ns           51 bytes_per_second=648.318Mi/s items_per_second=79.6264M/s output_size=617.464k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type>                16438205 ns     16421762 ns           43 bytes_per_second=519.89Mi/s items_per_second=63.8528M/s output_size=617.486k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type>         16424615 ns     16409032 ns           42 bytes_per_second=520.293Mi/s items_per_second=63.9024M/s output_size=617.506k page_index_size=54

BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType>                      15387808 ns     15373086 ns           46 bytes_per_second=327.32Mi/s items_per_second=68.2086M/s output_size=927.326k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType>               18319628 ns     18302938 ns           37 bytes_per_second=274.924Mi/s items_per_second=57.29M/s output_size=927.352k page_index_size=35
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType>        18346665 ns     18329336 ns           37 bytes_per_second=274.528Mi/s items_per_second=57.2075M/s output_size=927.377k page_index_size=55
```
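
To put the two tables side by side, the sketch below computes the relative overhead of `SizeStatisticsLevel::ColumnChunk` over `SizeStatisticsLevel::None` from the `Time` column for the `Int64Type` rows. The numbers are copied from the benchmark output above; the dictionary layout and the `relative_overhead` helper are only illustrative.

```python
def relative_overhead(baseline_ns: int, with_stats_ns: int) -> float:
    """Fractional slowdown of writing with size statistics versus without."""
    return with_stats_ns / baseline_ns - 1.0


# (time with SizeStatisticsLevel::None, time with SizeStatisticsLevel::ColumnChunk), in ns
timings = {
    "BM_WritePrimitiveColumn<Int64Type>": {
        "with this PR": (8103053, 8153499),
        "without this PR": (8042576, 9576627),
    },
    "BM_WriteListColumn<Int64Type>": {
        "with this PR": (13004962, 13718352),
        "without this PR": (13182006, 16438205),
    },
}

overheads = {
    (benchmark, variant): relative_overhead(baseline, column_chunk)
    for benchmark, variants in timings.items()
    for variant, (baseline, column_chunk) in variants.items()
}

for (benchmark, variant), overhead in overheads.items():
    print(f"{benchmark} {variant}: {overhead:.1%}")
```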

Tested by existing tests, validated by existing benchmarks.

No.

* GitHub Issue: #45201

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@amoeba
Member Author

amoeba commented Jan 31, 2025

After my initial push, I noticed we needed to cherry-pick a 20.0.0 issue, #45201, in order to get a commit we already cherry-picked for this release to compile. I moved that issue into the 19.0.1 milestone and ran:

git checkout maint-19.0.1
git cherry-pick 2b5f56ca999678411f35862539f4f4a53b38de5a
git push apache maint-19.0.1

resulting in cherry-picked commit 1b9079c.

@amoeba
Member Author

amoeba commented Feb 1, 2025

There's a failure in the regular CI checks that looks worth investigating. From https://github.com/apache/arrow/actions/runs/13081610957/job/36506197170?pr=45401:

Test failure output
2025-01-31T22:29:22.3034360Z opt/conda/envs/arrow/lib/python3.11/site-packages/pyarrow/tests/test_fs.py:395: 
2025-01-31T22:29:22.3034525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-01-31T22:29:22.3034740Z pyarrow/_fs.pyx:625: in pyarrow._fs.FileSystem.delete_dir
2025-01-31T22:29:22.3034841Z     ???
2025-01-31T22:29:22.3035048Z pyarrow/error.pxi:89: in pyarrow.lib.check_status
2025-01-31T22:29:22.3035131Z     ???
2025-01-31T22:29:22.3035376Z pyarrow/_fs.pyx:1529: in pyarrow._fs._cb_delete_dir
2025-01-31T22:29:22.3035475Z     ???
2025-01-31T22:29:22.3035917Z opt/conda/envs/arrow/lib/python3.11/site-packages/pyarrow/fs.py:366: in delete_dir
2025-01-31T22:29:22.3036047Z     self.fs.rm(path, recursive=True)
2025-01-31T22:29:22.3036298Z opt/conda/envs/arrow/lib/python3.11/site-packages/fsspec/asyn.py:118: in wrapper
2025-01-31T22:29:22.3036506Z     return sync(self.loop, func, *args, **kwargs)
2025-01-31T22:29:22.3036771Z opt/conda/envs/arrow/lib/python3.11/site-packages/fsspec/asyn.py:103: in sync
2025-01-31T22:29:22.3036940Z     raise return_result
2025-01-31T22:29:22.3037184Z opt/conda/envs/arrow/lib/python3.11/site-packages/fsspec/asyn.py:56: in _runner
2025-01-31T22:29:22.3037339Z     result[0] = await coro
2025-01-31T22:29:22.3037553Z opt/conda/envs/arrow/lib/python3.11/site-packages/s3fs/core.py:2052: in _rm
2025-01-31T22:29:22.3037771Z     out = await _run_coros_in_chunks(
2025-01-31T22:29:22.3038066Z opt/conda/envs/arrow/lib/python3.11/site-packages/fsspec/asyn.py:268: in _run_coros_in_chunks
2025-01-31T22:29:22.3038239Z     result, k = await done.pop()
2025-01-31T22:29:22.3038496Z opt/conda/envs/arrow/lib/python3.11/site-packages/fsspec/asyn.py:245: in _run_coro
2025-01-31T22:29:22.3038684Z     return await asyncio.wait_for(coro, timeout=timeout), i
2025-01-31T22:29:22.3038938Z opt/conda/envs/arrow/lib/python3.11/asyncio/tasks.py:452: in wait_for
2025-01-31T22:29:22.3039115Z     return await fut
2025-01-31T22:29:22.3039489Z opt/conda/envs/arrow/lib/python3.11/site-packages/s3fs/core.py:2026: in _bulk_delete
2025-01-31T22:29:22.3039642Z     out = await self._call_s3(
2025-01-31T22:29:22.3039880Z opt/conda/envs/arrow/lib/python3.11/site-packages/s3fs/core.py:371: in _call_s3
2025-01-31T22:29:22.3040064Z     return await _error_wrapper(
2025-01-31T22:29:22.3040246Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-01-31T22:29:22.3040252Z 
2025-01-31T22:29:22.3040857Z func = <bound method ClientCreator._create_api_method.<locals>._api_call of <aiobotocore.client.S3 object at 0x7fa910662fe0>>
2025-01-31T22:29:22.3040865Z 
2025-01-31T22:29:22.3041066Z     async def _error_wrapper(func, *, args=(), kwargs=None, retries):
2025-01-31T22:29:22.3041219Z         if kwargs is None:
2025-01-31T22:29:22.3041312Z             kwargs = {}
2025-01-31T22:29:22.3041548Z         for i in range(retries):
2025-01-31T22:29:22.3041655Z             try:
2025-01-31T22:29:22.3041828Z                 return await func(*args, **kwargs)
2025-01-31T22:29:22.3041956Z             except S3_RETRYABLE_ERRORS as e:
2025-01-31T22:29:22.3042084Z                 err = e
2025-01-31T22:29:22.3042285Z                 logger.debug("Retryable error: %s", e)
2025-01-31T22:29:22.3042477Z                 await asyncio.sleep(min(1.7**i * 0.1, 15))
2025-01-31T22:29:22.3042594Z             except ClientError as e:
2025-01-31T22:29:22.3042797Z                 logger.debug("Client error (maybe retryable): %s", e)
2025-01-31T22:29:22.3042904Z                 err = e
2025-01-31T22:29:22.3043103Z                 wait_time = min(1.7**i * 0.1, 15)
2025-01-31T22:29:22.3043238Z                 if "SlowDown" in str(e):
2025-01-31T22:29:22.3043410Z                     await asyncio.sleep(wait_time)
2025-01-31T22:29:22.3043547Z                 elif "reduce your request rate" in str(e):
2025-01-31T22:29:22.3043707Z                     await asyncio.sleep(wait_time)
2025-01-31T22:29:22.3043852Z                 elif "XAmzContentSHA256Mismatch" in str(e):
2025-01-31T22:29:22.3044068Z                     await asyncio.sleep(wait_time)
2025-01-31T22:29:22.3044173Z                 else:
2025-01-31T22:29:22.3044316Z                     break
2025-01-31T22:29:22.3044432Z             except Exception as e:
2025-01-31T22:29:22.3044586Z                 logger.debug("Nonretryable error: %s", e)
2025-01-31T22:29:22.3048335Z                 err = e
2025-01-31T22:29:22.3048443Z                 break
2025-01-31T22:29:22.3048510Z     
2025-01-31T22:29:22.3048607Z         if "'coroutine'" in str(err):
2025-01-31T22:29:22.3048776Z             # aiobotocore internal error - fetch original botocore error
2025-01-31T22:29:22.3049022Z             tb = err.__traceback__
2025-01-31T22:29:22.3049112Z             while tb.tb_next:
2025-01-31T22:29:22.3049191Z                 tb = tb.tb_next
2025-01-31T22:29:22.3049259Z             try:
2025-01-31T22:29:22.3049369Z                 await tb.tb_frame.f_locals["response"]
2025-01-31T22:29:22.3049454Z             except Exception as e:
2025-01-31T22:29:22.3049533Z                 err = e
2025-01-31T22:29:22.3049621Z         err = translate_boto_error(err)
2025-01-31T22:29:22.3049693Z >       raise err
2025-01-31T22:29:22.3050132Z E       OSError: [Errno 5] An error occurred (MissingContentMD5) when calling the DeleteObjects operation: Missing required header for this request: Content-Md5.
2025-01-31T22:29:22.3050140Z 
2025-01-31T22:29:22.3050356Z opt/conda/envs/arrow/lib/python3.11/site-packages/s3fs/core.py:146: OSError
2025-01-31T22:29:22.3050801Z _ ERROR at teardown of test_open_output_stream_metadata[PyFileSystem(FSSpecHandler(s3fs.S3FileSystem()))] _
2025-01-31T22:29:22.3050807Z 
2025-01-31T22:29:22.3051156Z func = <bound method ClientCreator._create_api_method.<locals>._api_call of <aiobotocore.client.S3 object at 0x7fa910662fe0>>
2025-01-31T22:29:22.3051161Z 
2025-01-31T22:29:22.3051318Z     async def _error_wrapper(func, *, args=(), kwargs=None, retries):
2025-01-31T22:29:22.3051399Z         if kwargs is None:
2025-01-31T22:29:22.3051592Z             kwargs = {}
2025-01-31T22:29:22.3051677Z         for i in range(retries):
2025-01-31T22:29:22.3051744Z             try:
2025-01-31T22:29:22.3051842Z >               return await func(*args, **kwargs)

@amoeba
Member Author

amoeba commented Feb 1, 2025

The stack trace in the errors doesn't look like it would implicate PyArrow since they're testing fsspec/s3fs, but I thought I'd check it out since this release contains an S3 change (even though it looks unrelated). I was able to reproduce the error locally, but it started to seem more like a minio bug. So I updated minio from 20250118003137 to 20250120144907 (via brew) and the tests all started passing. CI uses an even older version, minio.RELEASE.2024-09-13T20-26-02Z, so we might see the failures again.

Edit: This appears to have been fixed in RELEASE.2025-01-20T14-49-07Z as indicated in minio/minio#20845 (comment) so we can look at upgrading minio on CI.
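
For checking whether a given minio build predates the fix, the release tags can be compared as timestamps. This is a minimal sketch assuming tags of the form `RELEASE.YYYY-MM-DDTHH-MM-SSZ`, as in the two tags quoted above; `parse_minio_release` is a hypothetical helper, not part of any minio tooling.

```python
from datetime import datetime, timezone


def parse_minio_release(tag: str) -> datetime:
    """Parse the timestamp that follows 'RELEASE.' in a minio release tag."""
    stamp = tag.split("RELEASE.", 1)[1]
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H-%M-%SZ")
    return parsed.replace(tzinfo=timezone.utc)


fixed_in = parse_minio_release("RELEASE.2025-01-20T14-49-07Z")
ci_version = parse_minio_release("minio.RELEASE.2024-09-13T20-26-02Z")

# True when the CI build is older than the release that contains the fix.
ci_needs_upgrade = ci_version < fixed_in
```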

cc @pitrou @h-vetinari just FYI in case knowing about minio's incompatibility with the newer AWS SDK versions is useful, see minio/minio#20845. Edit: Never mind on that FYI, I see #45305 now.

@amoeba
Member Author

amoeba commented Feb 1, 2025

@github-actions crossbow submit --group verify-rc-source

@amoeba
Member Author

amoeba commented Feb 1, 2025

@github-actions crossbow submit --group packaging


github-actions bot commented Feb 1, 2025

Revision: 1b9079c

Submitted crossbow builds: ursacomputing/crossbow @ actions-631ed85276

Task Status
verify-rc-source-cpp-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-cpp-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-cpp-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-cpp-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-cpp-macos-amd64 GitHub Actions
verify-rc-source-cpp-macos-arm64 GitHub Actions
verify-rc-source-cpp-macos-conda-amd64 GitHub Actions
verify-rc-source-csharp-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-csharp-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-csharp-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-csharp-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-csharp-macos-amd64 GitHub Actions
verify-rc-source-csharp-macos-arm64 GitHub Actions
verify-rc-source-integration-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-integration-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-integration-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-integration-macos-amd64 GitHub Actions
verify-rc-source-integration-macos-arm64 GitHub Actions
verify-rc-source-integration-macos-conda-amd64 GitHub Actions
verify-rc-source-js-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-js-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-js-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-js-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-js-macos-amd64 GitHub Actions
verify-rc-source-js-macos-arm64 GitHub Actions
verify-rc-source-python-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-python-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-python-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-python-macos-amd64 GitHub Actions
verify-rc-source-python-macos-arm64 GitHub Actions
verify-rc-source-python-macos-conda-amd64 GitHub Actions
verify-rc-source-ruby-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-ruby-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-ruby-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-ruby-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-ruby-macos-amd64 GitHub Actions
verify-rc-source-ruby-macos-arm64 GitHub Actions
verify-rc-source-windows GitHub Actions


github-actions bot commented Feb 1, 2025

Revision: 1b9079c

Submitted crossbow builds: ursacomputing/crossbow @ actions-b32d59baa6

Task Status
almalinux-8-amd64 GitHub Actions
almalinux-8-arm64 GitHub Actions
almalinux-9-amd64 GitHub Actions
almalinux-9-arm64 GitHub Actions
amazon-linux-2023-amd64 GitHub Actions
amazon-linux-2023-arm64 GitHub Actions
centos-7-amd64 GitHub Actions
centos-8-stream-amd64 GitHub Actions
centos-8-stream-arm64 GitHub Actions
centos-9-stream-amd64 GitHub Actions
centos-9-stream-arm64 GitHub Actions
conan-maximum GitHub Actions
conan-minimum GitHub Actions
debian-bookworm-amd64 GitHub Actions
debian-bookworm-arm64 GitHub Actions
debian-trixie-amd64 GitHub Actions
debian-trixie-arm64 GitHub Actions
matlab GitHub Actions
nuget GitHub Actions
python-sdist GitHub Actions
r-binary-packages GitHub Actions
test-debian-12-docs GitHub Actions
ubuntu-focal-amd64 GitHub Actions
ubuntu-focal-arm64 GitHub Actions
ubuntu-jammy-amd64 GitHub Actions
ubuntu-jammy-arm64 GitHub Actions
ubuntu-noble-amd64 GitHub Actions
ubuntu-noble-arm64 GitHub Actions
wheel-macos-monterey-cp310-cp310-amd64 GitHub Actions
wheel-macos-monterey-cp310-cp310-arm64 GitHub Actions
wheel-macos-monterey-cp311-cp311-amd64 GitHub Actions
wheel-macos-monterey-cp311-cp311-arm64 GitHub Actions
wheel-macos-monterey-cp312-cp312-amd64 GitHub Actions
wheel-macos-monterey-cp312-cp312-arm64 GitHub Actions
wheel-macos-monterey-cp313-cp313-amd64 GitHub Actions
wheel-macos-monterey-cp313-cp313-arm64 GitHub Actions
wheel-macos-monterey-cp313-cp313t-amd64 GitHub Actions
wheel-macos-monterey-cp313-cp313t-arm64 GitHub Actions
wheel-macos-monterey-cp39-cp39-amd64 GitHub Actions
wheel-macos-monterey-cp39-cp39-arm64 GitHub Actions
wheel-manylinux-2-28-cp310-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313-amd64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313-arm64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313t-amd64 GitHub Actions
wheel-manylinux-2-28-cp313-cp313t-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp313-cp313-amd64 GitHub Actions
wheel-manylinux-2014-cp313-cp313-arm64 GitHub Actions
wheel-manylinux-2014-cp313-cp313t-amd64 GitHub Actions
wheel-manylinux-2014-cp313-cp313t-arm64 GitHub Actions
wheel-manylinux-2014-cp39-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp313-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

Use latest Minio server release, which includes a fix for minio/minio#20845

This allows us to remove the boto3 version constraint.

Yes, by existing CI tests.

Yes.
* GitHub Issue: #45305

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
@amoeba amoeba requested a review from assignUser as a code owner February 1, 2025 01:38
@h-vetinari
Contributor

> cc @pitrou @h-vetinari just FYI in case knowing about minio's incompatibility with the newer AWS SDK versions is useful, see minio/minio#20845. Edit: Never mind on that FYI, I see #45305 now.

Yeah, this needed two fixes, one in aws-sdk and one in minio. Both landed, and we've since been building arrow 19 with newer aws (1.11.489) and testing together with the newest minio, and that combination works fine.

@amoeba
Member Author

amoeba commented Feb 1, 2025

@github-actions crossbow submit ubuntu*

@amoeba
Member Author

amoeba commented Feb 1, 2025

@github-actions crossbow submit wheel-windows*

@amoeba
Member Author

amoeba commented Feb 1, 2025

@github-actions crossbow submit verify-rc-source-integration-linux-ubuntu-22.04-amd64

github-actions bot commented Feb 1, 2025

Revision: a37799a

Submitted crossbow builds: ursacomputing/crossbow @ actions-3b40239530

Task Status
ubuntu-focal-amd64 GitHub Actions
ubuntu-focal-arm64 GitHub Actions
ubuntu-jammy-amd64 GitHub Actions
ubuntu-jammy-arm64 GitHub Actions
ubuntu-noble-amd64 GitHub Actions
ubuntu-noble-arm64 GitHub Actions

github-actions bot commented Feb 1, 2025

Revision: a37799a

Submitted crossbow builds: ursacomputing/crossbow @ actions-0b3880e77c

Task Status
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp313-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

github-actions bot commented Feb 1, 2025

Revision: a37799a

Submitted crossbow builds: ursacomputing/crossbow @ actions-d511691e0f

Task Status
verify-rc-source-integration-linux-ubuntu-22.04-amd64 GitHub Actions

@amoeba
Member Author

amoeba commented Feb 1, 2025

@github-actions crossbow submit debian*

github-actions bot commented Feb 1, 2025

Revision: a37799a

Submitted crossbow builds: ursacomputing/crossbow @ actions-3de905e799

Task Status
debian-bookworm-amd64 GitHub Actions
debian-bookworm-arm64 GitHub Actions
debian-trixie-amd64 GitHub Actions
debian-trixie-arm64 GitHub Actions
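
For readers unfamiliar with the `crossbow submit` commands in this thread: arguments such as `ubuntu*`, `wheel-windows*` and `debian*` behave like shell-style globs over task names, which is why each one expanded to the task lists the bot posted. The sketch below only illustrates that matching with Python's standard `fnmatch` module; it is not crossbow's actual code, `select_tasks` is a hypothetical helper, and the task list is a small sample copied from the tables above.

```python
from fnmatch import fnmatchcase

# A small sample of task names copied from the tables in this thread.
TASKS = [
    "debian-bookworm-amd64",
    "debian-trixie-arm64",
    "ubuntu-focal-amd64",
    "ubuntu-noble-arm64",
    "wheel-windows-cp39-amd64",
    "wheel-windows-cp313-amd64",
    "wheel-manylinux-2014-cp39-cp39-amd64",
    "verify-rc-source-integration-linux-ubuntu-22.04-amd64",
]


def select_tasks(pattern, tasks=TASKS):
    """Return task names matching a shell-style glob (hypothetical helper)."""
    # fnmatchcase matches the whole name, so "ubuntu*" only selects
    # names that start with "ubuntu".
    return [name for name in tasks if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    for pattern in ("ubuntu*", "wheel-windows*", "debian*"):
        print(pattern, "->", select_tasks(pattern))
```

Because the glob is matched against the whole name, `ubuntu*` does not pick up `verify-rc-source-integration-linux-ubuntu-22.04-amd64`, which is consistent with that task needing its own explicit submit command earlier in the thread.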
