Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove arrow dependency #16640

Merged
merged 52 commits into from
Aug 27, 2024
Merged

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Aug 22, 2024

Description

This PR removes libarrow as a dependency of libcudf since we no longer use any of its APIs in our C++ code. The following places remain dependent on libarrow:

  • tests: We have tests demonstrating how to interoperate with libarrow objects, as well as other tests that leverage Arrow for I/O.
  • examples: We have an example demonstrating interop with libarrow arrays.
  • JNI: The JNI is still using libarrow to handle ingestion or production of Arrow buffers.

In all three cases above, we are now statically linking libarrow. We also always pull it in via CPM, which means that we never require libarrow to exist on the user's system anymore. Of the above three cases, we should expect the first two to persist indefinitely. The JNI could be updated to use nanoarrow instead if desired, but that is not critical since the primary benefit of removing libarrow as a direct dependency is to remove it as a constraint for package managers such as conda in environments where we must match the version of Arrow required by other dependencies.

pyarrow remains a dependency of the cudf Python packages. For now, this PR retains the tight pinning on 16.1 since we know that this version works. A future PR will loosen this pinning since we are no longer constrained to ABI-compatible versions and can support a range of pyarrow versions that support the necessary Python APIs (I believe pyarrow>=13 will work, but that remains to be tested).

Resolves #15193

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vyasr vyasr self-assigned this Aug 22, 2024
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue Java Affects Java cuDF API. labels Aug 22, 2024
@vyasr vyasr added improvement Improvement / enhancement to an existing function and removed libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. CMake CMake build issue Java Affects Java cuDF API. labels Aug 22, 2024
@vyasr
Copy link
Contributor Author

vyasr commented Aug 27, 2024

Waiting on some final confirmation that NVIDIA/spark-rapids-jni#2357 works for the plugin and then this should be good to go.

@vyasr
Copy link
Contributor Author

vyasr commented Aug 27, 2024

/merge

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, @vyasr. Approving, on behalf of CUDF JNI.

@rapids-bot rapids-bot bot merged commit 1a2aad2 into rapidsai:branch-24.10 Aug 27, 2024
86 checks passed
@vyasr vyasr deleted the feat/remove_arrow_linkage branch August 27, 2024 20:46
rapids-bot bot pushed a commit to rapidsai/cuspatial that referenced this pull request Aug 29, 2024
Contributes to rapidsai/build-planning#33.

Proposes the following for `cuspatial` wheels:

* add build and runtime dependencies on `libcudf` wheels
* stop vendoring copies of `libcudf.so`, `libnvcomp.so`, `libnvcomp_bitcomp.so`, and `libnvcomp_gdeflate.so`
  - *(load `libcudf.so` dynamically at runtime instead)*

And other related changes for development/CI:

* combine all `pip install` calls into 1 in wheel-testing scripts
  - *like rapidsai/cudf#16575
  - *to improve the chance that packaging issues are discovered in CI*
* `dependencies.yaml` changes:
   - more use of YAML anchors = less duplication
   - use dedicated `depends_on_librmm` and `depends_on_libcudf` groups
* explicitly pass a package type to `gha-tools` wheel uploading/downloading scripts

## Notes for Reviewers

### Benefits of these changes

Unblocks CI in this repo (ref: #1444 (comment), #1441 (comment)).

Reduces wheel sizes for `cuspatial` wheels by about 125MB 😁 

| wheel          | size (before)  | size (this PR) |
|:-----------:|-------------:|---------------:|
| `cuspatial` |   146.0M        |   21M               |
| `cuproj `     |       0.9M       |   0.9M              |
|**TOTAL**   |  **146.9M** | **21.9M**        |

*NOTES: size = compressed, "before" = 2024-08-21 nightlies (c60bd4d), CUDA = 12, Python = 3.11*

<details><summary>how I calculated those (click me)</summary>

```shell
# note: 2024-08-21 because that was the most recent date with
#           successfully-built cuspatial nightlies
#
docker run \
    --rm \
    -v $(pwd):/opt/work:ro \
    -w /opt/work \
    --network host \
    --env RAPIDS_NIGHTLY_DATE=2024-08-21 \
    --env RAPIDS_NIGHTLY_SHA=c60bd4d \
    --env RAPIDS_PR_NUMBER=1447 \
    --env RAPIDS_PY_CUDA_SUFFIX=cu12 \
    --env RAPIDS_REPOSITORY=rapidsai/cuspatial \
    --env WHEEL_DIR_BEFORE=/tmp/wheels-before \
    --env WHEEL_DIR_AFTER=/tmp/wheels-after \
    -it rapidsai/ci-wheel:cuda12.5.1-rockylinux8-py3.11 \
    bash

mkdir -p "${WHEEL_DIR_BEFORE}"
mkdir -p "${WHEEL_DIR_AFTER}"

py_projects=(
    cuspatial
    cuproj
)

for project in "${py_projects[@]}"; do
    # before
    RAPIDS_BUILD_TYPE=nightly \
    RAPIDS_PY_WHEEL_NAME="${project}_${RAPIDS_PY_CUDA_SUFFIX}" \
    RAPIDS_REF_NAME="branch-24.10" \
    RAPIDS_SHA=${RAPIDS_NIGHTLY_SHA} \
        rapids-download-wheels-from-s3 python "${WHEEL_DIR_BEFORE}"

    # after
    RAPIDS_BUILD_TYPE=pull-request \
    RAPIDS_PY_WHEEL_NAME="${project}_${RAPIDS_PY_CUDA_SUFFIX}" \
    RAPIDS_REF_NAME="pull-request/${RAPIDS_PR_NUMBER}" \
        rapids-download-wheels-from-s3 python "${WHEEL_DIR_AFTER}"
done

du -sh ${WHEEL_DIR_BEFORE}/*
du -sh ${WHEEL_DIR_BEFORE}
du -sh ${WHEEL_DIR_AFTER}/*
du -sh ${WHEEL_DIR_AFTER}
```

</details>

Reduces the amount of additional work required to start shipping `libcuspatial` wheels.

### Background

This is part of ongoing work towards packaging `libcuspatial` as a wheel.

relevant prior work:

* packaging `libcudf` wheels: rapidsai/cudf#15483
* consolidating `pip install` calls in CI scripts for `cudf`: rapidsai/cudf#16575
* `cudf` dropping its Arrow library dependency: rapidsai/cudf#16640

### How I tested this

Confirmed in local builds and CI logs that `cudf` is being *found*, not *built*, in `cuspatial` builds.

```text
-- CPM: Using local package [email protected]
```

([build link](https://github.com/rapidsai/cuspatial/actions/runs/10602971716/job/29386288614?pr=1447#step:9:23472))

Built `cuspatial` wheels locally and ran all the unit tests, without issue.

#

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #1447
rapids-bot bot pushed a commit that referenced this pull request Sep 3, 2024
Fixes #16678.  Adds the boost-devel package to the Java CI Docker environment now that the Boost headers are not being picked up implicitly after libcudf dropped the Arrow dependency in #16640.  libcudfjni still requires Arrow for now, and thus requires Boost headers.

Authors:
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Alessandro Bellina (https://github.com/abellina)
  - Bradley Dice (https://github.com/bdice)

URL: #16707
@ttnghia
Copy link
Contributor

ttnghia commented Sep 4, 2024

I still see cudf/cpp/cmake/thirdparty/get_arrow.cmake. Is this needs to be removed too?

@vyasr
Copy link
Contributor Author

vyasr commented Sep 4, 2024

No. We still use arrow in tests, examples, and the JNI, and that CMake code is used in all three. In all three cases we now build and statically link to Arrow as part of the build, though, so there is no longer any library dependency in those cases. The critical change in this PR is removing the dependency from libcudf and the cudf Python modules.

vyasr added a commit to rapidsai/devcontainers that referenced this pull request Sep 23, 2024
There have been a number of changes to RAPIDS builds over the course of
this release and not all changes were fully propagated to the
devcontainers repo. This repo addresses the following:
- As of rapidsai/kvikio#369, kvikio produces
wheels, and rapidsai/kvikio#439 contains
critical fixes that allow the kvikio Python wheel to use the C++
libkvikio wheel. In RAPIDS Python builds we have consistently removed
support for the Python build triggering the C++ build as we have created
C++ wheels since in both conda and pip environments we now expect the
library to be found and we do not need to automatically support the more
esoteric use case of someone turning off build isolation but not having
the C++ library available (devs can handle this case themselves if they
wish). As a result, once rapidsai/kvikio#466 is
merged, the `FIND_KVIKIO_CPP` variable will be completely superfluous
and we can remove that here.
- As of rapidsai/cudf#16640 libcudf no longer
links to libarrow and `USE_LIBARROW_FROM_PYARROW` is no longer used.
- The libcudf and libcuspatial Python package builds in the devcontainer
should (like all other Python packages) omit the CUDA version suffix.
For that, they need to use the `rapids_build_backend_args`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Reduce arrow library dependencies in cudf
7 participants