Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TypeError: Unsupported type <class 'numpy.ndarray'> #1405

Open
trivialfis opened this issue Nov 19, 2024 · 6 comments
Open

TypeError: Unsupported type <class 'numpy.ndarray'> #1405

trivialfis opened this issue Nov 19, 2024 · 6 comments

Comments

@trivialfis
Copy link
Member

Script:

import dask
from dask import array as da
from dask import dataframe as dd
from dask_cuda import LocalCUDACluster
from distributed import Client, LocalCluster
import numpy as np


def main(client: Client) -> None:
    rng = da.random.default_rng(1994)
    X = rng.random(size=(2048, 4))

    df = dd.from_dask_array(X, columns=[f"f{i}" for i in range(4)])
    df["qid"] = rng.integers(low=0, high=4, size=(2048, ), dtype=np.int64)
    s = da.cumsum(df.groupby("qid").qid.count().to_dask_array(lengths=True)).compute()
    print(s)


if __name__ == "__main__":
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            with dask.config.set(
                {"array.backend": "cupy", "dataframe.backend": "cudf"}
            ):
                main(client)

Version:

  • dask, version 2024.9.0
  • dask_cuda: 24.10.00
  • cupy: 13.3.0
  • cudf: 24.10.01
  • python: 3.12.0
@pentschev
Copy link
Member

I'm no expert when it comes to Dask dataframes, after some investigation I've found the problem with the code above is to_dask_array does not automatically ensure meta is converted to match CuPy, adding meta=cp.array(()) resolves the problem. Here's the complete code with the modification that fixes the issue:

import dask
from dask import array as da
from dask import dataframe as dd
from dask_cuda import LocalCUDACluster
from distributed import Client, LocalCluster
import numpy as np
import cupy as cp


def main(client: Client) -> None:
    rng = da.random.default_rng(1994)
    X = rng.random(size=(2048, 4))

    df = dd.from_dask_array(X, columns=[f"f{i}" for i in range(4)])
    df["qid"] = rng.integers(low=0, high=4, size=(2048, ), dtype=np.int64)
    s = da.cumsum(df.groupby("qid").qid.count().to_dask_array(lengths=True, meta=cp.array(()))).compute()
    print(s)


if __name__ == "__main__":
    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            with dask.config.set(
                {"array.backend": "cupy", "dataframe.backend": "cudf"}
            ):
                main(client)

During the investigation I could not find many uses of to_dask_array, there are some in the Dask repository and in one test in Distributed but that's it, there are none in the cuDF repository. This seems to suggest there other more appropriate ways of rewritting the code without the need for to_dask_array, perhaps @rjzamora can give some ideas here.

In any case, specifying meta appropriately resolves the original issue, but perhaps not in the desired fashion.

@trivialfis
Copy link
Member Author

Thank you for looking into this @pentschev ! I will leave this open in case this is considered a bug.

@rjzamora
Copy link
Member

Thanks for raising this @trivialfis - There are some known rough edges in DataFrame <-> Array conversion. I believe this is indeed a bug caused by the ongoing removal of legacy Dask DataFrame (dask/dask-expr#1168).

@pentschev
Copy link
Member

I believe this is indeed a bug caused by the ongoing removal of legacy Dask DataFrame (dask/dask-expr#1168).

FWIW, this is reproducible with DASK_DATAFRAME__QUERY_PLANNING=False too.

@rjzamora
Copy link
Member

FWIW, this is reproducible with DASK_DATAFRAME__QUERY_PLANNING=False too.

Okay, thanks - That's useful info. The fix will require an upstream change to to_dask_array or a custom method in Dask cuDF. So, either way we will need to target 25.02.

@jakirkham
Copy link
Member

Would this just be the inverse of the logic in from_dask_array?

xref: dask/dask#9579

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants