[BUG] AssertionError when cudf.pandas is used with ydata_profiling #14516

Open
galipremsagar opened this issue Nov 28, 2023 · 1 comment
Labels: 0 - Backlog (in queue waiting for assignment), bug (something isn't working), cudf.pandas (issues specific to cudf.pandas), Python (affects Python cuDF API)

galipremsagar commented Nov 28, 2023

I am getting an AssertionError when running the ydata-profiling test notebook with cudf.pandas enabled: https://github.com/ydataai/ydata-profiling/blob/develop/tests/notebooks/meteorites.ipynb

from pathlib import Path

import numpy as np
import pandas as pd
import requests
from IPython.display import display
from IPython.utils.capture import capture_output

import ydata_profiling
from ydata_profiling.utils.cache import cache_file

file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)

df = pd.read_csv(file_name)

# Note: pandas' default nanosecond-resolution timestamps cannot represent dates
# before 1677-09-21, so out-of-range years are coerced to NaT and ignored here
df["year"] = pd.to_datetime(df["year"], errors="coerce")

# Example: Constant variable
df["source"] = "NASA"

# Example: Boolean variable
df["boolean"] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
df["mixed"] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df["reclat_city"] = df["reclat"] + np.random.normal(scale=5, size=(len(df)))

# Example: Duplicate observations
duplicates_to_add = pd.DataFrame(df.iloc[0:10])
duplicates_to_add["name"] = duplicates_to_add["name"] + " copy"

df = pd.concat([df, duplicates_to_add], ignore_index=True)

# Inline report without saving
with capture_output() as out:
    pr = df.profile_report(
        sort=None,
        html={"style": {"full_width": True}},
        progress_bar=False,
        minimal=True,
    )
    display(pr)

assert len(out.outputs) == 2
assert out.outputs[0].data["text/plain"] == "<IPython.core.display.HTML object>"
assert out.outputs[1].data["text/plain"] == ""
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[3], line 11
      3     pr = df.profile_report(
      4         sort=None,
      5         html={"style": {"full_width": True}},
      6         progress_bar=False,
      7         minimal=True,
      8     )
      9     display(pr)
---> 11 assert len(out.outputs) == 2
     12 assert out.outputs[0].data["text/plain"] == "<IPython.core.display.HTML object>"
     13 assert out.outputs[1].data["text/plain"] == ""

AssertionError: 
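
To see which outputs were actually captured when the length check fails, a small diagnostic along these lines may help. This is only a sketch: summarize_outputs is a hypothetical helper, and the SimpleNamespace objects are stand-ins for the captured outputs, assuming each one exposes a data dict keyed by MIME type, as the assertions above access it.

```python
from types import SimpleNamespace


def summarize_outputs(outputs):
    """Return (index, text/plain value) pairs for captured outputs.

    Hypothetical helper: assumes each output has a `data` dict keyed by
    MIME type, as accessed by the assertions in the notebook.
    """
    return [(i, o.data.get("text/plain")) for i, o in enumerate(outputs)]


# Stand-ins for `out.outputs`; in the notebook, pass `out.outputs` instead.
fake_outputs = [
    SimpleNamespace(data={"text/plain": "<IPython.core.display.HTML object>"}),
    SimpleNamespace(data={"text/plain": ""}),
]

summary = summarize_outputs(fake_outputs)
for index, text in summary:
    print(index, repr(text))
```

Printing the summary before the first assert shows how many outputs were captured and what each one contains, which narrows down whether an output is missing or an extra one (for example a warning) was emitted.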
galipremsagar added the bug and Needs Triage labels on Nov 28, 2023
GregoryKimball added the 0 - Backlog and Python labels and removed the Needs Triage label on Dec 14, 2023
galipremsagar added the cudf.pandas label on Apr 12, 2024
galipremsagar added this to the Profiler - cudf.pandas milestone on Apr 12, 2024
Matt711 self-assigned this on Aug 16, 2024
Matt711 commented Sep 5, 2024

xref #16420 and #12052

The problem comes from the call pd.Timestamp.to_pydatetime(series.min()), which fails under cudf.pandas because series.min() returns a np.datetime64 object rather than a pd.Timestamp. For example:

# with cudf.pandas enabled
import pandas as pd

series = pd.Series([pd.Timestamp(1)])
pd.Timestamp.to_pydatetime(series.min())
# AttributeError: 'numpy.datetime64' object has no attribute 'month'

This issue should be closed by #16450.
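
One way to sidestep the conversion failure described above on the caller's side is to normalize the value to a pd.Timestamp before converting it. The sketch below is an assumed workaround, not the change made in #16450: to_pydatetime_safe is a hypothetical helper name, and the np.datetime64 literal only simulates what series.min() is reported to return under cudf.pandas.

```python
import datetime

import numpy as np
import pandas as pd


def to_pydatetime_safe(value):
    """Convert a pd.Timestamp or np.datetime64 to datetime.datetime.

    Hypothetical helper: wraps the value in pd.Timestamp first, so the
    bound to_pydatetime() method is always called on a Timestamp.
    """
    return pd.Timestamp(value).to_pydatetime()


# Simulate the np.datetime64 that series.min() is reported to return.
raw_min = np.datetime64("2023-11-28T00:00:00")
converted = to_pydatetime_safe(raw_min)
print(type(converted).__name__, converted)
```

Calling the bound method on a freshly constructed pd.Timestamp avoids passing a foreign object as self to the unbound pd.Timestamp.to_pydatetime, which is the pattern that triggers the error in the example.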
