
Update has_dtypes to allow functions as dict values of parameter items #37

Open
topper-123 wants to merge 8 commits into master

Conversation

@topper-123 (Contributor) commented May 27, 2017

A change to allow the ``has_dtypes`` parameter ``items`` to have functions as dict values. See #36 for the issue.

A few discussion points:

First, I check for ``FunctionType`` in the if-statement, not whether the input is callable. This is because various types are themselves callables (``int``, ``float``, etc.), so checking for callability would be ambiguous in many cases.

Secondly, the check function must return ``True`` to pass; a truthy value is not enough. I'm thinking that users will sometimes return the series or dtype from the check function, and explicitly requiring ``True`` will minimize some errors (or so my thinking goes).

Thirdly, the check function checks ``series.dtype``, not the series itself, as the function name implies. This means the check functions can't check for orderedness and categories on a categorical, but those are not part of the dtype and can be checked with the ``checks.verify*`` functions.

I'm open to reverting the above points if you want me to. I will write tests and update the docs if this proposal is accepted.
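
A minimal sketch of the dispatch described above (illustrative only; ``_check_dtype`` is a hypothetical helper name, not code from this PR):

from types import FunctionType

def _check_dtype(dtype, constraint):
    # A function value is detected via FunctionType (not callable()),
    # and must return exactly True; truthy values are not enough.
    if isinstance(constraint, FunctionType):
        return constraint(dtype) is True
    return dtype == constraint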


See #36 for the issue.
Small cleanup
check ``pd.Series.dtype``, not ``pd.Series``.
@TomAugspurger (Collaborator)

Thanks for this, and sorry for not getting around to reviewing it. I'll have more time later.

I think accepting a function is fine, but an even simpler fix may be to replace

if not dtypes[k] == v:

with

from pandas.api.types import is_dtype_equal

if not is_dtype_equal(dtypes[k], v):

That at least covers the basics, like ``int`` being equal to ``np.int64``, while more precise types like ``np.int32`` and ``np.int64`` are not equal to each other. Does that solve any of your issues?
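
For illustration (these examples are mine, not from the thread):

from pandas.api.types import is_dtype_equal
import numpy as np

is_dtype_equal(np.dtype('int64'), 'int64')   # True: exact dtype match
is_dtype_equal(np.dtype('int64'), np.int64)  # True: the scalar type resolves to its dtype
is_dtype_equal(np.dtype('int32'), 'int64')   # False: precise widths differ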

@topper-123 (Contributor, Author) commented Jun 6, 2017

Thanks for replying.

I think this doesn't really solve the issue, as dtype checks still have to be precise, and my issue is broader dtype checks (i.e. I check for the broad dtype int, as an example, but do not care which int sub-dtype I'm dealing with: int16, int32, int64, it doesn't matter).

To check for broader dtypes, you have to use the functions in pd.api.types (as far as I can tell), so I think in the context of engarde the way to go would be:

from pandas.api import types as pat
import engarde.decorators as ed

@ed.has_dtypes(items={'A': pat.is_integer_dtype,
                      'B': pat.is_float_dtype})
def t():
    return some_df  # some DataFrame...

This cannot be done by using ``is_dtype_equal`` only, AFAICT.

I've made my proposal a bit clearer w.r.t. the requirement that the check function return booleans.

Clearer errors for users.
@topper-123 (Contributor, Author)
BTW, pd.api.types is new in pandas 0.20, right? That would mean that using pd.api.types.is_dtype_equal in has_dtypes would make engarde require pandas >= 0.20.

@topper-123 (Contributor, Author)

pd.api.types is not new in pandas 0.20; it was just that the API was formalized there.

I've made some additions for tests and docs and I think I'm finished unless you see anything.

@topper-123 (Contributor, Author)

Hi @TomAugspurger, have you looked further at this?

On a second look, a solution that IMO would be cleaner would be for the ``items`` values to be strings corresponding to the possible output values of ``pd.api.types.infer_dtype``, and then use ``infer_dtype`` to check validity. The decorator would look like this:

import engarde.decorators as ed

@ed.has_dtypes(items={'A': "integer",
                      'B': "floating",
                      'C': "string"})
def t():
    return some_df  # some DataFrame...

I like this better than the function-based solution, as you don't have to import from ``pandas.api.types`` in the user-facing code.
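
A minimal sketch of the internal check under this proposal (``_has_inferred_dtypes`` is a hypothetical name, not code from this PR):

import pandas as pd

def _has_inferred_dtypes(df, items):
    # Compare infer_dtype's label for each column against the
    # requested string, e.g. "integer", "floating" or "string".
    for col, expected in items.items():
        inferred = pd.api.types.infer_dtype(df[col])
        assert inferred == expected, (
            "Column {!r} inferred as {!r}, check requires {!r}"
            .format(col, inferred, expected))
    return df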

@TomAugspurger (Collaborator)

That would mean that using pd.api.types.is_dtype_equal in has_dtypes would make engarde require pandas >= 0.20.

I'm fine with only supporting the latest version of pandas.

as dtype checks still have to be precise and my issue is broader dtype checks (i.e. I check for broad dtype int (as an example), but do not care about which int sub-dtype I'm dealing with (int16, int32, int64 - doesn't matter)

is_dtype_equal will handle this for us. If the user provides a specific type, like 'int64', that will be checked precisely; otherwise, it's not:

In [20]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int')
Out[20]: True

In [21]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int64')
Out[21]: True

In [22]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int32')
Out[22]: False

So has_dtypes would take a dictionary mapping column names to strings (dtype names). Internally we'll compare each Series' .dtype with that string using pd.api.types.is_dtype_equal(series.dtype, val).
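
A sketch of that internal loop (illustrative; not engarde's actual code):

from pandas.api.types import is_dtype_equal

def has_dtypes(df, items):
    # Assert that each column's dtype matches the requested dtype name.
    for col, val in items.items():
        assert is_dtype_equal(df[col].dtype, val), (
            "Column {!r} has dtype {!r}, check requires {!r}"
            .format(col, df[col].dtype, val))
    return df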

Are there any other cases where you see that not working?

@topper-123 (Contributor, Author) commented Jun 16, 2017

Hmm, strange. I get:

In [2]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int')
Out[2]: False

Are you sure you get True here? If it returns True, it would indeed solve my issue.

It is very strange that we get different return values. I'm on pandas 0.20.2 and Python 3.6.1 on Windows 10.

EDIT: I can add that a specific dtype works as expected:

In [3]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int64')
Out[3]: True

@topper-123 (Contributor, Author) commented Jun 17, 2017

The specific dtypes of the two examples are different:

The return value of np.dtype('int') is dtype('int32') (on my system; maybe it's dtype('int64') on your system?). The return value of np.dtype('int64') is dtype('int64').

The definition of is_dtype_equal in pandas is this (docstrings removed):

def is_dtype_equal(source, target):
    try:
        source = _get_dtype(source)
        target = _get_dtype(target)   
        return source == target
    except (TypeError, AttributeError):
        return False

So in this case we are just doing

dtype('int32') == dtype('int64')

which evaluates to False.

Hence using is_dtype_equal doesn't allow for checks where all int dtypes (int16, int32, int64, etc.) evaluate to True.

Do you not agree?

@topper-123 (Contributor, Author)

Hi @TomAugspurger. Do you not agree that

>>> pd.api.types.is_dtype_equal(np.dtype('int64'), 'int')

returns False, and not True?

I'm on pandas 0.20.2, Python 3.6.1 and Windows 10, in case you're on non-Windows and this is a platform-dependent difference.

@TomAugspurger (Collaborator) commented Jun 20, 2017 via email

@topper-123 (Contributor, Author)

Oh, that makes sense now that you explain it.

This does make is_dtype_equal(..., 'int') behave a bit randomly and fail differently on different machines, which isn't so good for a validation framework.
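
(For illustration, not from the thread: np.dtype('int') resolves to NumPy's default integer, which follows the platform's C long.)

import numpy as np

# The default integer tracks the platform's C long:
# dtype('int32') on 64-bit Windows, dtype('int64') on 64-bit Linux/macOS.
print(np.dtype('int'))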

I'll lay off this issue and let you decide what you want to do.

@TomAugspurger (Collaborator)

This does make is_dtype_equal(..., 'int') behave a bit randomly and fail differently on different machines, which isn't so good for a validation framework.

Yeah, that was my initial thought too. I just want to think more about whether there are cases where we would want to follow the platform integer size (I haven't come up with any yet though).

@topper-123 (Contributor, Author) commented Jun 22, 2017

I promised to stay away, but I couldn't and tried to solve this... :-)

The fundamental issue with testing for dtype hierarchies is that numpy is quite illogical in its own dtype hierarchy checking:

>>> import numpy as np
>>> np.issubdtype(np.dtype('int32'), np.int64)
False
>>> np.issubdtype(np.dtype('int32'), 'int64')  # note: a string
True  # WTF

This behaviour doesn't make any sense w.r.t. type checking. (Given a string or dtype as the second argument, np.issubdtype compares against mro()[1] of the dtype's scalar type, i.e. np.signedinteger for int64, which is why the string variant surprisingly returns True.) I don't know if there are other reasons for the implementation in numpy, but I doubt it; probably it's the way it is for legacy reasons.

I've asked about this in numpy issue #3218.

Because of the above, the solution for sane dtype hierarchy checking in pandas (see below) is a bit wordy. Using this solution, though, we get maximum usability and can check both for specific dtypes and for dtype classes, e.g.:

@has_dtypes({           # each line shows one alternative value per column
    'A': 'int',          # any int
    'A': np.int,         # same as above
    'A': 'int32',        # int32 specifically, fails on int64 etc.
    'A': np.int32,       # same as above
    'A': np.number,      # any int, float or other numpy number
    'A': np.generic,     # any numpy object
    'B': object,         # any Python object
    'B': 'category',     # pandas categorical
    'C': bool,           # a boolean
    'D': 'datetime64',   # np.datetime64 object
    'D': np.datetime64,  # same as above
})
def check():
    return some_df

I think this is quite nice and clean code from the user's perspective.

The code for the new has_dtypes is below. It passes all of engarde's tests. Several of the checks above are not possible with engarde today, so the test suite should be expanded.

import numpy as np
import pandas as pd

def has_dtypes(df, items):
    is_dtype_equal = pd.api.types.is_dtype_equal

    def issubdtype(arg1, arg2):
        """Mimics np.issubdtype, but without the .mro() nonsense."""
        if np.issubclass_(arg2, np.generic):
            return issubclass(np.dtype(arg1).type, arg2)
        return issubclass(np.dtype(arg1).type, np.dtype(arg2).type)

    for k, v in items.items():
        d = df.dtypes[k]
        try:
            assert issubdtype(d, v), ("Column {!r} has the wrong dtype {!r}. Check requires {!r}"
                                      .format(k, d, v))
        except TypeError:
            assert is_dtype_equal(d, v), ("Column {!r} has the wrong dtype {!r}. Check requires {!r}"
                                          .format(k, d, v))
    return df
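
For example (illustrative data, not from the PR's tests):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.array([1, 2], dtype='int32'),
                   'B': [1.5, 2.5]})

has_dtypes(df, {'A': np.integer, 'B': np.floating})  # passes: any int/float width
has_dtypes(df, {'A': 'int64'})  # raises AssertionError: column 'A' is int32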

Would you be willing to use this in engarde?

I also wonder whether something from the code above should be added to pandas' type checking, unless np.issubdtype gets fixed (which it won't be, as that would probably break a lot of numpy code). Sane dtype hierarchy checking obviously shouldn't need to be this complex.

@topper-123 (Contributor, Author)

Hi @TomAugspurger,

Have you considered a solution to this? It would be very nice if has_dtypes could accept np.integer, np.number etc. IMO the idiomatic way to do this would be to use the functions in pd.api.types as that is what's officially supported by pandas for dtype checking.

That would mean something similar to the code I've submitted, though I do think an approach like the above has_dtypes would be 'better'; it's just not the pandas way. I've opened a ticket (#16768) on the pandas issue tracker about this; we'll see if it'll be accepted.

Regards.

@TomAugspurger (Collaborator) commented Jun 28, 2017 via email
