Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: use str dtype without size information #8932

Closed
wants to merge 1 commit into from

Conversation

kmuehlbauer
Copy link
Contributor

@kmuehlbauer kmuehlbauer commented Apr 12, 2024

Aims to resolve parts of #8844.

xarray/tests/test_accessor_str.py::test_case_str: AssertionError: assert dtype('<U26') == dtype('<U30')

I'm not sure this is the right location for the fix, at least it fixes those errors. AFAICT this is some issue somewhere inside apply_ufunc where the string dtype size is kept. So this fix removes the size information from the dtype (actually recreating it).

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

@keewis
Copy link
Collaborator

keewis commented Apr 12, 2024

I have no idea, either (but I also didn't do an exhaustive search).

@seberg, do you know which of the changes to the 2.0 branch of numpy in the past ~2-3 weeks might have caused this (or longer, we're using the scientific-python-nightly-wheels repo)

(I'll try to investigate / write a reproducer that uses just numpy so you don't have to look through our code to understand what's going on)

@seberg
Copy link

seberg commented Apr 12, 2024

Hmmm, thanks for the ping. Puzzles me on first sight. A lot of the string functions (i.e. in np.char.... did see changes, but the code change looks more fundamental, which I find odd. A "NumPy repro" would be awesome.

Let me ping @lysnikolaou who has pushed and taken the helm for many of these string improvements.

@kmuehlbauer
Copy link
Contributor Author

Thanks @seberg for chiming in. I've tracked the code further, we are using

np.vectorize along the process with the otypes kwarg set to something like dtype('<U26') (the number depends on the size of the string) and the string size changes in the vectorized function.

I can't immediately see a problem inside np.vectorize as it uses a similar approach to get rid of any size information. So the issue might still be somewhere in our code.

@seberg
Copy link

seberg commented Apr 12, 2024

What I find strange, is that the line just before seems to test a very similar function, that would seems to go through a lambda x: x.capitalize() while this is just lambda x: x.casefold(), if I understand that right.

I wonder if we have a very subtle bug completely unrelated. I.e. if you are not careful in NumPy, it is possible to forget to make a new dtype and accidentally modify an existing one. If it is something like that, it may be hard to isolate.

@keewis
Copy link
Collaborator

keewis commented Apr 12, 2024

the pure numpy reproducer is (which was kinda what @kmuehlbauer described, but now you can try yourself):

import numpy as np

arr = np.array(["SOme wOrd DŽ ß ᾛ ΣΣ ffi⁵Å Ç Ⅰ"]).astype(np.str_)
f = str.casefold
np.vectorize(f, otypes=[arr.dtype])(arr)

str.casefold returns a string that's four characters longer, and before numpy<2 the dtype would be extended to fit the string (<30U) while with numpy>=2 the old dtype would be used, truncating the string.

str.capitalize does not change the length of the string, which is why that doesn't fail.

FWIW, the new string dtype doesn't have that issue, and numpy.strings has a lot of the vectorized functions we've been creating using np.vectorize (i.e. those will most likely be much faster).

@ngoldbaum
Copy link

FWIW, the new string dtype doesn't have that issue, and numpy.strings has a lot of the vectorized functions we've been creating using np.vectorize (i.e. those will most likely be much faster).

Unfortunately not casefold yet but it's on our list.

Adding new string ufuncs isn't terribly hard, we'd be happy to add more or review contributions if there are missing ones. I already know that pandas has some functions that wrap the regex module. Adding a regex engine to numpy is probably not going to happen but other similar things can be added. For casefolding we'll need the unicode database CPython uses.

@mathause
Copy link
Collaborator

Could it be related to numpy/numpy#26136?

@seberg
Copy link

seberg commented Apr 12, 2024

Yes it wasNathan already pushed a fix.

@keewis
Copy link
Collaborator

keewis commented Apr 13, 2024

@kmuehlbauer, should we close this, now that it has been fixed upstream in numpy?

@dcherian
Copy link
Contributor

Thanks for looking in to this Kai. It's not easy!

@kmuehlbauer
Copy link
Contributor Author

Thanks for looking in to this Kai. It's not easy!

Indeed, apply_ufunc triggered my interest. Nevertheless, good it could be fixed upstream.

@kmuehlbauer kmuehlbauer deleted the string_dtype branch April 15, 2024 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-upstream Run upstream CI
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants