FIX: use str dtype without size information #8932
I have no idea, either (but I also didn't do an exhaustive search). @seberg, do you know which of the changes to the (I'll try to investigate / write a reproducer that uses just |
Hmmm, thanks for the ping. Puzzles me on first sight. A lot of the string functions (i.e. in Let me ping @lysnikolaou who has pushed and taken the helm for many of these string improvements. |
Thanks @seberg for chiming in. I've tracked the code further, we are using
I can't immediately see a problem inside |
What I find strange is that the line just before seems to test a very similar function, that would seem to go through a I wonder if we have a very subtle bug completely unrelated. I.e. if you are not careful in NumPy, it is possible to forget to make a new dtype and accidentally modify an existing one. If it is something like that, it may be hard to isolate. |
the pure
import numpy as np
arr = np.array(["SOme wOrd DŽ ß ᾛ ΣΣ ffi⁵Å Ç Ⅰ"]).astype(np.str_)
f = str.casefold
np.vectorize(f, otypes=[arr.dtype])(arr)
FWIW, the new string dtype doesn't have that issue, and |
Unfortunately not casefold yet but it's on our list. Adding new string ufuncs isn't terribly hard, we'd be happy to add more or review contributions if there are missing ones. I already know that pandas has some functions that wrap the regex module. Adding a regex engine to numpy is probably not going to happen but other similar things can be added. For casefolding we'll need the unicode database CPython uses. |
Could it be related to numpy/numpy#26136? |
Yes it was. Nathan already pushed a fix. |
@kmuehlbauer, should we close this, now that it has been fixed upstream in |
Thanks for looking into this, Kai. It's not easy! |
Indeed, apply_ufunc triggered my interest. Nevertheless, good it could be fixed upstream. |
Aims to resolve parts of #8844.
I'm not sure this is the right location for the fix, but at least it fixes those errors. AFAICT this is some issue somewhere inside apply_ufunc where the string dtype size is kept. So this fix removes the size information from the dtype (actually recreating it).
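To illustrate what "removing the size information" can look like, here is a minimal sketch in plain NumPy. It is not the code from this PR: the helper name `unsized_str_dtype` is hypothetical, and recreating the dtype from its scalar type is only one assumed way to drop the length. The sketch also avoids `np.vectorize` and `apply_ufunc` on purpose, so it only shows why a sized string dtype is a problem when an operation such as `str.casefold` makes a string longer.

```python
import numpy as np


def unsized_str_dtype(dtype):
    """Return `dtype` without its length if it is a str/bytes dtype.

    Hypothetical helper for illustration only; other dtypes pass through.
    """
    dtype = np.dtype(dtype)
    if dtype.kind in "SU":
        # Recreating the dtype from its scalar type yields a flexible
        # dtype with itemsize 0, so NumPy picks the length from the data.
        return np.dtype(dtype.type)
    return dtype


arr = np.array(["ß"])  # one character, so the dtype carries length 1
folded = [s.casefold() for s in arr.tolist()]  # casefold turns "ß" into "ss"

# Reusing the sized input dtype silently truncates the longer result.
truncated = np.array(folded, dtype=arr.dtype)

# Using the unsized dtype lets NumPy choose a length that fits.
resized = np.array(folded, dtype=unsized_str_dtype(arr.dtype))

print(truncated, truncated.dtype)
print(resized, resized.dtype)
```

The design choice here is to keep the kind of the dtype (str or bytes) and drop only its length, so the output is still a NumPy string array but is no longer tied to the width of the input.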