
Update has_dtypes to allow functions as dict values of parameter items #37

Open
topper-123 wants to merge 8 commits into master

Conversation

@topper-123 (Contributor) commented May 27, 2017

A change to allow the ``has_dtypes`` parameter ``items`` to have functions as dict values. See #36 for the issue.

A few discussion points:

First, I check for ``FunctionType`` in the if-statement, not whether the input is callable. This is because various types are themselves callables (``int``, ``float``, etc.), so checking for callability would be ambiguous in many cases.

Secondly, the check function must return ``True`` to pass; a truthy value is not enough. I'm thinking that users will sometimes return the series or dtype from the check function, and explicitly requiring ``True`` will minimize some errors (or so my thinking goes).

Thirdly, the check function checks ``series.dtype``, not the series itself, as the function name implies. This means the check functions can't check for orderedness and categories on a categorical, but those are not part of the dtype and can be checked with the ``checks.verify*`` functions.

I'm open to reverting the above points if you want me to. I will write tests and update the docs if this proposal is accepted.
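
A minimal sketch of the dispatch described above (illustrative only; ``_check_dtype`` is a hypothetical helper name, not code from this PR):

from types import FunctionType

def _check_dtype(dtype, constraint):
    # A function value is detected via FunctionType (not callable()),
    # and must return exactly True; truthy values are not enough.
    if isinstance(constraint, FunctionType):
        return constraint(dtype) is True
    return dtype == constraint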


See #36 for the issue.
Small cleanup
check ``pd.Series.dtype``, not ``pd.Series``.
@TomAugspurger (Collaborator)

Thanks for this, and sorry for not getting around to reviewing it. I'll have more time later.

I think accepting a function is fine, but an even simpler fix may be to replace

if not dtypes[k] == v:

with

from pandas.api.types import is_dtype_equal

if not is_dtype_equal(dtypes[k], v):

That at least covers the basics, like ``int`` being equal to ``np.int64``, while more precise types like ``np.int32`` and ``np.int64`` are not equal to each other. Does that solve any of your issues?
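
For illustration (these examples are mine, not from the thread):

from pandas.api.types import is_dtype_equal
import numpy as np

is_dtype_equal(np.dtype('int64'), 'int64')   # True: exact dtype match
is_dtype_equal(np.dtype('int64'), np.int64)  # True: the scalar type resolves to its dtype
is_dtype_equal(np.dtype('int32'), 'int64')   # False: precise widths differ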

@topper-123 (Contributor, Author) commented Jun 6, 2017

Thanks for replying.

I think this doesn't really solve the issue, as dtype checks still have to be precise, and my issue is broader dtype checks (i.e. I check for the broad dtype int, as an example, but do not care which int sub-dtype I'm dealing with: int16, int32, int64, it doesn't matter).

To check for broader dtypes, you have to use the functions in pd.api.types (as far as I can tell), so I think in the context of engarde the way to go would be:

from pandas.api import types as pat
import engarde.decorators as ed

@ed.has_dtypes(items={'A': pat.is_integer_dtype,
                      'B': pat.is_float_dtype})
def t():
    return some_df  # some DataFrame...

This cannot be done by using ``is_dtype_equal`` only, AFAICT.

I've made my proposal a bit clearer w.r.t. the requirement that the check function return booleans.

Clearer errors for users.
@topper-123 (Contributor, Author)
BTW, pd.api.types is new in pandas 0.20, right? That would mean that using pd.api.types.is_dtype_equal in has_dtypes would make engarde require pandas >= 0.20.

@topper-123 (Contributor, Author)

pd.api.types is not new in pandas 0.20; it was just that the API was formalized there.

I've made some additions for tests and docs and I think I'm finished unless you see anything.

@topper-123 (Contributor, Author)

Hi @TomAugspurger, have you looked further at this?

On a second look, a solution that IMO would be cleaner would be for the ``items`` values to be strings corresponding to the possible output values of ``pd.api.types.infer_dtype``, and then use ``infer_dtype`` to check validity. The decorator would look like this:

import engarde.decorators as ed

@ed.has_dtypes(items={'A': "integer",
                      'B': "floating",
                      'C': "string"})
def t():
    return some_df  # some DataFrame...

I like this better than the function-based solution, as you don't have to import from ``pandas.api.types`` in the user-facing code.
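
A minimal sketch of the internal check under this proposal (``_has_inferred_dtypes`` is a hypothetical name, not code from this PR):

import pandas as pd

def _has_inferred_dtypes(df, items):
    # Compare infer_dtype's label for each column against the
    # requested string, e.g. "integer", "floating" or "string".
    for col, expected in items.items():
        inferred = pd.api.types.infer_dtype(df[col])
        assert inferred == expected, (
            "Column {!r} inferred as {!r}, check requires {!r}"
            .format(col, inferred, expected))
    return df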

@TomAugspurger (Collaborator)

That would mean that using pd.api.types.is_dtype_equal in has_dtypes would make engarde require pandas >= 0.20.

I'm fine with only supporting the latest version of pandas.

as dtype checks still have to be precise and my issue is broader dtype checks (i.e. I check for broad dtype int (as an example), but do not care about which int sub-dtype I'm dealing with (int16, int32, int64 - doesn't matter)

is_dtype_equal will handle this for us. If the user provides a specific type, like 'int64', that will be checked precisely; otherwise, it's not:

In [20]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int')
Out[20]: True

In [21]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int64')
Out[21]: True

In [22]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int32')
Out[22]: False

So has_dtypes would take a dictionary mapping column names to strings (dtype names). Internally we'll compare each Series' .dtype with that string using pd.api.types.is_dtype_equal(series.dtype, val).
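
A sketch of that internal loop (illustrative; not engarde's actual code):

from pandas.api.types import is_dtype_equal

def has_dtypes(df, items):
    # Assert that each column's dtype matches the requested dtype name.
    for col, val in items.items():
        assert is_dtype_equal(df[col].dtype, val), (
            "Column {!r} has dtype {!r}, check requires {!r}"
            .format(col, df[col].dtype, val))
    return df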

Are there any other cases where you see that not working?

@topper-123 (Contributor, Author) commented Jun 16, 2017

Hmm, strange. I get:

In [2]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int')
Out[2]: False

Are you sure you get True here? If it returns True, it would indeed solve my issue.

It is very strange that we get different return values. I'm on pandas 0.20.2 and Python 3.6.1 on Windows 10.

EDIT: I can add that a specific dtype works as expected:

In [3]: pd.api.types.is_dtype_equal(np.dtype('int64'), 'int64')
Out[3]: True

@topper-123 (Contributor, Author) commented Jun 17, 2017

The specific dtypes of the two examples are different:

The return value of np.dtype('int') is dtype('int32') (on my system; maybe it's dtype('int64') on your system?). The return value of np.dtype('int64') is dtype('int64').

The definition of is_dtype_equal in pandas is this (docstrings removed):

def is_dtype_equal(source, target):
    try:
        source = _get_dtype(source)
        target = _get_dtype(target)   
        return source == target
    except (TypeError, AttributeError):
        return False

So in this case we are just doing

dtype('int32') == dtype('int64')

which evaluates to False.

Hence using is_dtype_equal doesn't allow for checks where all int dtypes (int16, int32, int64, etc.) evaluate to True.

Do you not agree?

@topper-123 (Contributor, Author)

Hi @TomAugspurger. Do you not agree that

>>> pd.api.types.is_dtype_equal(np.dtype('int64'), 'int')

returns False, and not True?

I'm on pandas 0.20.2, Python 3.6.1 and Windows 10, in case you're on non-Windows and this is a platform-dependent difference.

@TomAugspurger (Collaborator) commented Jun 20, 2017 via email

@topper-123 (Contributor, Author)

Oh, that makes sense now that you explain it.

This does make is_dtype_equal(..., 'int') behave a bit randomly and fail differently on different machines, which isn't so good for a validation framework.
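
(For illustration, not from the thread: np.dtype('int') resolves to NumPy's default integer, which follows the platform's C long.)

import numpy as np

# The default integer tracks the platform's C long:
# dtype('int32') on 64-bit Windows, dtype('int64') on 64-bit Linux/macOS.
print(np.dtype('int'))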

I'll lay off this issue and let you decide what you want to do.

@TomAugspurger (Collaborator)

This does make is_dtype_equal(..., 'int') behave a bit randomly and fail differently on different machines, which isn't so good for a validation framework.

Yeah, that was my initial thought too. I just want to think more about whether there are cases where we would want to follow the platform integer size (I haven't come up with any yet though).

@topper-123 (Contributor, Author) commented Jun 22, 2017

I promised to stay away, but I couldn't and tried to solve this... :-)

The fundamental issue with testing for dtype hierarchies is that numpy is quite illogical in its own dtype hierarchy checking:

>>> import numpy as np
>>> np.issubdtype(np.dtype('int32'), np.int64)
False
>>> np.issubdtype(np.dtype('int32'), 'int64')  # note: a string
True  # WTF

This behaviour doesn't make any sense w.r.t. type checking. (Given a string or dtype as the second argument, np.issubdtype compares against mro()[1] of the dtype's scalar type, i.e. np.signedinteger for int64, which is why the string variant surprisingly returns True.) I don't know if there are other reasons for the implementation in numpy, but I doubt it; probably it's the way it is for legacy reasons.

I've asked about this in numpy issue #3218.

Because of the above, the solution for sane dtype hierarchy checking in pandas (see below) is a bit wordy. Using this solution, though, we get maximum usability and can check both for specific dtypes and for dtype classes, e.g.:

@has_dtypes({           # each line shows one alternative value per column
    'A': 'int',          # any int
    'A': np.int,         # same as above
    'A': 'int32',        # int32 specifically, fails on int64 etc.
    'A': np.int32,       # same as above
    'A': np.number,      # any int, float or other numpy number
    'A': np.generic,     # any numpy object
    'B': object,         # any Python object
    'B': 'category',     # pandas categorical
    'C': bool,           # a boolean
    'D': 'datetime64',   # np.datetime64 object
    'D': np.datetime64,  # same as above
})
def check():
    return some_df

I think this is quite nice and clean code from the user's perspective.

The code for the new has_dtypes is below. It passes all of engarde's tests. Several of the checks above are not possible with engarde today, so the test suite should be expanded.

import numpy as np
import pandas as pd

def has_dtypes(df, items):
    is_dtype_equal = pd.api.types.is_dtype_equal

    def issubdtype(arg1, arg2):
        """Mimics np.issubdtype, but without the .mro() nonsense."""
        if np.issubclass_(arg2, np.generic):
            return issubclass(np.dtype(arg1).type, arg2)
        return issubclass(np.dtype(arg1).type, np.dtype(arg2).type)

    for k, v in items.items():
        d = df.dtypes[k]
        try:
            assert issubdtype(d, v), ("Column {!r} has the wrong dtype {!r}. Check requires {!r}"
                                      .format(k, d, v))
        except TypeError:
            assert is_dtype_equal(d, v), ("Column {!r} has the wrong dtype {!r}. Check requires {!r}"
                                          .format(k, d, v))
    return df
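
For example (illustrative data, not from the PR's tests):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.array([1, 2], dtype='int32'),
                   'B': [1.5, 2.5]})

has_dtypes(df, {'A': np.integer, 'B': np.floating})  # passes: any int/float width
has_dtypes(df, {'A': 'int64'})  # raises AssertionError: column 'A' is int32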

Would you be willing to use this in engarde?

I also wonder whether something from the code above should be added to pandas' type checking, unless np.issubdtype gets fixed (which it won't be, as that would probably break a lot of numpy code). Sane dtype hierarchy checking obviously shouldn't need to be this complex.

@topper-123 (Contributor, Author)

Hi @TomAugspurger,

Have you considered a solution to this? It would be very nice if has_dtypes could accept np.integer, np.number etc. IMO the idiomatic way to do this would be to use the functions in pd.api.types as that is what's officially supported by pandas for dtype checking.

That would mean something similar to the code I've submitted, though I do think an approach like the above has_dtypes would be 'better'; it's just not the pandas way. I've opened a ticket (#16768) on the pandas issue tracker about this; we'll see if it'll be accepted.

Regards.

@TomAugspurger (Collaborator) commented Jun 28, 2017 via email
