Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When column is of type int, histogram acts different from pandas dataframe #22

Open
itamar-otonomo opened this issue Nov 24, 2019 · 0 comments

Comments

@itamar-otonomo
Copy link

When I summon a hist from a Pandas column (Series) containing integers I get a proper histogram where the x axis is divided to bins of value ranges.
When I do the same using a handy DataFrame I get a categorical histogram.

I dug into the code and the reason for the way handy acts is that the column of integers is not defined as a member of the self._continuous group of columns.

hist uses the continuous list as an indication of using categorical for non continuous. This is why a hist of integers in handy is not what one would expect from a hist of integers in Pandas.

a workaround is to cast the integer column to floats. I think this is a bug (couldn't find anything in the docs).

Here's a quick repro code..

pdf = pd.DataFrame({'bobo': np.random.randint(0, 100, 5000)})
df = spark.createDataFrame(pdf).withColumn('float_bobo', F.col('bobo').astype('float'))
hdf = df.toHandy()
pdf.bobo.hist()
hdf.cols['bobo'].hist()
hdf.cols['float_bobo'].hist()

I forgot to congratulate you on this great lib, it really is cool!

Itamar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant