PyArrayLike type introduced #383

Merged 1 commit into PyO3:main on Oct 7, 2023
Conversation

@124C41p (Contributor) commented Jul 6, 2023:

Summary

Extracts a read-only reference if the correct NumPy array type is given. Tries to convert the input into the correct type otherwise.

Resolves #382
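For context, a minimal usage sketch of such an extractor in a #[pyfunction] signature; the PyArrayLike1 alias and the AllowTypeChange marker correspond to the names eventually merged, but treat the exact spelling as an assumption and check the crate docs:

use numpy::{AllowTypeChange, PyArrayLike1};
use pyo3::prelude::*;

// Accepts a 1-D NumPy array of f64 directly, or anything numpy.asarray can
// turn into one (lists, sequences, arrays of other dtypes, ...).
#[pyfunction]
fn sum_up<'py>(_py: Python<'py>, values: PyArrayLike1<'py, f64, AllowTypeChange>) -> f64 {
    values.as_array().sum()
}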

Disclaimer

This implementation seems to work as intended and covers my personal use case. However, I am not sure how well it performs (you certainly want it to be not significantly slower than calling np.asarray(...) on the Python side and then extracting a PyReadonlyArray<...> on the Rust side). There are also several tasks remaining, such as documentation and proper error handling.

Do you want to take over at this point for finalizing the branch in your own style?

@adamreichold (Member) left a comment:


> Do you want to take over at this point for finalizing the branch in your own style?

I would like to encourage you to continue working on this until it is ready to be merged. Projects like this benefit from more people with different styles working on them.

I added some smaller comments inline, but the main point I am wondering about now that I have looked at the pull request is whether the converted case should be a plain ndarray::Array or rather a NumPy array as well. This would have the benefit of a more uniform API and would allow the result to be passed to functions expecting a NumPy array. I am not sure if the cost of creating a NumPy array versus an ndarray::Array is prohibitive though.

I think, spelled out, this would look something like

pub struct PyArrayLike<'py, T, D>(PyReadonlyArray<'py, T, D>);

impl<'py, T, D> Deref for PyArrayLike<'py, T, D> {
  type Target = PyReadonlyArray<'py, T, D>;

  fn deref(&self) -> &Self::Target {
    &self.0
  }
}

impl<'py, T, D> FromPyObject<'py> for PyArrayLike<'py, T, D> {
   // ... as before, just constructing `PyArray` ...
}

What do you think?

[Inline review comments on src/lib.rs and src/array_like.rs — outdated/resolved]
@124C41p (Contributor, Author) commented Jul 7, 2023:

> [...] the main point I am wondering about [...] is whether the converted case should be a plain ndarray::Array or rather a NumPy array as well. This would have the benefit of a more uniform API and would allow the result to be passed to functions expecting a NumPy array. [...] I think spelled out this would look something like [the `PyArrayLike(PyReadonlyArray)` sketch above]. What do you think?

I like this idea, and I just published a commit implementing it. There might be one downside, however: if the array is created by the np.asarray fallback, we only get a PyReadonlyArray, but we should actually own a PyArray. If the user calls .to_owned_array() afterwards, we are creating a copy of a copy. Maybe it would be better to define it this way?

pub struct PyArrayLike<T, D>(Cow<PyArray<T, D>>);

@adamreichold (Member):

> If the array is created by the np.asarray fallback, we only get a PyReadonlyArray, but we should actually own a PyArray.

PyReadonlyArray<'py, T, D> is just a wrapper around a &'py PyArray<T, D> and derefs into that. A bare reference is as much ownership as Rust code can get for Python types. So the call to to_owned_array will do the same whether applied to PyReadonlyArray or &PyArray.

(Note that even constructors like PyArray::from_vec return only &PyArray, not an owned PyArray in the Rust sense since this type always lives on the Python heap.)
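A small illustration of that point, assuming the rust-numpy API of that era (PyArray1::from_vec, readonly, to_owned_array):

use numpy::{PyArray1, PyReadonlyArray1};
use pyo3::prelude::*;

fn demo(py: Python<'_>) {
    // from_vec hands back a &PyArray1 living on the Python heap.
    let array: &PyArray1<f64> = PyArray1::from_vec(py, vec![1.0, 2.0, 3.0]);
    // The read-only wrapper borrows that same reference...
    let readonly: PyReadonlyArray1<'_, f64> = array.readonly();
    // ...so either call copies the same Python-side buffer into an ndarray::Array1.
    let _copy_a = readonly.to_owned_array();
    let _copy_b = array.to_owned_array();
}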

Review comment from @adamreichold (Member) on the line `let numpy_module = get_array_module(py)?;`:


I think it would be nice to cache the method reference using GILOnceCell.

Also, do we need to set the dtype keyword argument? I think this would try harder than your previous code would?
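A sketch of that caching, assuming PyO3 0.19 where GILOnceCell lives in pyo3::sync (earlier releases expose it as pyo3::once_cell::GILOnceCell); the asarray helper is illustrative, not the PR's code:

use pyo3::prelude::*;
use pyo3::sync::GILOnceCell;

// Caches the numpy.asarray callable so the module import and attribute lookup
// happen only once per process.
static ASARRAY: GILOnceCell<PyObject> = GILOnceCell::new();

fn asarray(py: Python<'_>) -> PyResult<&PyAny> {
    let func = ASARRAY.get_or_try_init(py, || -> PyResult<PyObject> {
        Ok(py.import("numpy")?.getattr("asarray")?.to_object(py))
    })?;
    Ok(func.as_ref(py))
}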

@124C41p (Contributor, Author) replied:

> I think it would be nice to cache the method reference using GILOnceCell.

That's nice! I was not aware of that Cell type.

> Also, do we need to set the dtype keyword argument? I think this would try harder than your previous code would?

I think it is essential to set the dtype. E.g. if the input array is an ndarray of dtype int32 then np.asarray(...) wouldn't do anything, so we wouldn't be able to extract a PyReadonlyArray<f64, D> from it.

@adamreichold (Member) replied:

> E.g. if the input array is an ndarray of dtype int32 then np.asarray(...) wouldn't do anything, so we wouldn't be able to extract a PyReadonlyArray<f64, D> from it.

But this is exactly what I meant: your direct extraction code using .extract::<Vec<T>>() will also not convert i32 to f64 if T is f64. Hence I would like to avoid making PyArrayLike do two different things: turning non-array storage into array storage and converting element types.

@124C41p (Contributor, Author) commented Jul 7, 2023:

> > E.g. if the input array is an ndarray of dtype int32 then np.asarray(...) wouldn't do anything, so we wouldn't be able to extract a PyReadonlyArray<f64, D> from it.
>
> But this is exactly what I meant: your direct extraction code using .extract::<Vec<T>>() will also not convert i32 to f64 if T is f64. Hence I would like to avoid making PyArrayLike do two different things: turning non-array storage into array storage and converting element types.

Actually, .extract::<Vec<f64>>() does convert i32 to f64, and everything else which is meaningful (it will not convert f64 to i32, for example).

For my personal use case, the whole point of the PyArrayLike type is to obviate the need for calling np.asarray on the Python side before calling a PyO3 function. So it should be able to match against anything which can reasonably be converted into the target type.

@adamreichold (Member) commented Jul 7, 2023:

> Actually, .extract::<Vec<f64>>() does convert i32 to f64, and everything else which is meaningful (it will not convert f64 to i32, for example).

This particular case is an idiosyncrasy of PyFloat_AsDouble falling back to __index__, but FromPyObject will generally not do lossy conversions or coerce types, whereas

>>> np.asarray([1.5, 2.5, 3.5], dtype=np.int32)
array([1, 2, 3], dtype=int32)

happily throws away the fractional parts. As it is written now, passing [1.5, 2.5, 3.5] as PyArrayLike1<i32> will fall through the Vec extraction just to be made to work by reaching the call to asarray with dtype set.

I am not saying that coercing the data type has no use, but the above seems inconsistent to me. If you really need lists of floating-point numbers as arrays of integers, I would propose separate types for the coercing and non-coercing variants, leaving out the .extract-based code path for the coercing one.

I am not sure about reasonable names though. Since our MSRV allows const generics by now, we could even use a const COERCE: bool generic parameter.
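To make the const-generic idea concrete, a minimal sketch; the alias names here are made up for illustration, and the merged code ultimately went with marker types (the C: Coerce parameter mentioned later in the thread) rather than a const bool:

use ndarray::Dimension;
use numpy::{Element, PyReadonlyArray};

// One extractor type, parameterized over whether lossy dtype coercion is allowed.
pub struct PyArrayLike<'py, T, D, const COERCE: bool>(PyReadonlyArray<'py, T, D>)
where
    T: Element,
    D: Dimension;

// Only matches inputs whose element type already fits (no dtype= coercion).
pub type PyArrayLikeStrict<'py, T, D> = PyArrayLike<'py, T, D, false>;

// Also accepts inputs that numpy.asarray(obj, dtype=...) can convert, lossily if need be.
pub type PyArrayLikeLossy<'py, T, D> = PyArrayLike<'py, T, D, true>;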

@adamreichold (Member) replied:

> (note that the (Python) users of my library cannot even see/manipulate the COERCE flag)

Having the element type as a generic parameter is a core design principle for the current rust-numpy and it definitely does chafe against Python API conventions, but it allows the Rust code to be statically checked and optimized.

In practice, this leads to entry points which dispatch into different instantiations of one or more generic functions, which is also how a COERCE generic parameter could be reified as a function parameter on the Python side.

@124C41p (Contributor, Author) replied:

> Having the element type as a generic parameter is a core design principle for the current rust-numpy and it definitely does chafe against Python API conventions, but it allows the Rust code to be statically checked and optimized.

I was not criticizing that design at all. I just wanted to point out that the COERCE flag is opaque to my users, so I would rather not put potentially "dangerous" code behind it.

> I can only re-iterate: I do not want to maintain rust-numpy-specific semantics for PyArrayLike.

Ok, thanks for clarifying. But please understand that I don't see the point in developing a feature that I would probably never use. I would rather create my own crate (I am not using internals of rust-numpy after all) and maybe publish it some day.

I am still very grateful for your patience and all your help: I certainly learned a lot from our discussions. Thanks a lot!

@adamreichold (Member) replied:

> I was not criticizing that design at all. I just wanted to point out that the COERCE flag is opaque to my users, so I would rather not put potentially "dangerous" code behind it.

I did not interpret your statements that way. I made that reference to point out that the T and D in a #[pyfunction] taking a PyArray<T, D> argument are similarly not visible to Python callers. If you want to provide a Python API which does handle them dynamically, you need multiple instantiations (or the IxDyn overhead). COERCE is no different in that regard.

> I am still very grateful for your patience and all your help: I certainly learned a lot from our discussions. Thanks a lot!

Fair enough. Do you mind if I try to bring this into a shape I am happy to merge and maintain even if you do not use it yourself?

@124C41p (Contributor, Author) replied:

That's perfectly fine with me. Please go ahead as you like.

@adamreichold (Member) replied:

Ok, I pushed what I think I will go with. Still needs docs and examples though.

[Inline review comments on src/array_like.rs and src/array.rs — outdated/resolved]
@124C41p (Contributor, Author) left a comment:

Just out of curiosity: You mentioned ob.downcast::<PyArray<T, D>>()?.readonly() being faster than ob.extract::<PyReadonlyArray<T, D>>(). But is there any downside? Is it less safe? Is there any situation where you would prefer the second over the first?

@adamreichold (Member):

> Just out of curiosity: You mentioned ob.downcast::<PyArray<T, D>>()?.readonly() being faster than ob.extract::<PyReadonlyArray<T, D>>(). But is there any downside? Is it less safe? Is there any situation where you would prefer the second over the first?

Functionally they are equivalent. There is one performance paper cut: if you do not use the error returned by ob.extract::<PyReadonlyArray<T, D>>(), e.g. in

if let Ok(array) = ob.extract::<PyReadonlyArray<T, D>>() {
    return Ok(Self(array));
}

then this is measurably slower than ob.downcast::<PyArray<T, D>>()?.readonly(), because extract always returns Result<_, PyErr>, which means the PyDowncastError returned by downcast (which extract uses internally) needs to be wrapped up in a PyErr, only to be thrown away again.

This is a general PyO3 issue which we have not been able to resolve thus far. (FromPyObject will probably need an associated type for the error, but combining multiple FromPyObject implementations into new ones still needs to be possible. No rocket science, but a lot of code churn over the whole ecosystem to make it happen.)

You can also see rust-numpy trying to work around this here:

rust-numpy/src/array.rs, lines 135 to 137 at 3843fa9:

fn is_type_of(ob: &PyAny) -> bool {
    Self::extract::<IgnoreError>(ob).is_ok()
}
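A sketch of that downcast-first fast path, assuming the rust-numpy/PyO3 APIs of that era; try_readonly is an illustrative helper, not code from this PR:

use ndarray::Dimension;
use numpy::{Element, PyArray, PyReadonlyArray};
use pyo3::prelude::*;

// A failed downcast returns a lightweight PyDowncastError without allocating a
// full Python exception, unlike extract, which wraps the failure in a PyErr.
fn try_readonly<'py, T, D>(ob: &'py PyAny) -> Option<PyReadonlyArray<'py, T, D>>
where
    T: Element,
    D: Dimension,
{
    ob.downcast::<PyArray<T, D>>()
        .ok()
        .map(|array| array.readonly())
}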

@124C41p requested a review from @adamreichold on July 11, 2023, 20:41.

@124C41p (Contributor, Author) commented Jul 11, 2023:

The review request was a misclick, sorry.

@kngwyu (Member) commented Jul 12, 2023:

@124C41p Thank you for this PR, and @adamreichold thank you for your encouraging tutoring.
I have one question about the design: does this really need to be a struct? I mean, if we don't need the C: Coerce information after extracting, maybe it's better to just have a function, say, extract_arraylike(obj, allow_typecast).

@124C41p (Contributor, Author) commented Jul 12, 2023:

@kngwyu To my understanding, having a struct is necessary for conveniently using it in the signature of a #[pyfunction]. Also, since the struct has only two fields, one being a phantom type, the wrapper should get optimized away by the compiler anyway. So extracting a PyArrayLike should not be more expensive (in the fast path) than extracting a PyReadonlyArray.
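For concreteness, the shape being described here is a newtype over the read-only borrow plus a zero-sized marker for the coercion policy; a sketch of that layout (not necessarily the exact merged definition):

use std::marker::PhantomData;

use ndarray::Dimension;
use numpy::{Element, PyReadonlyArray};

// Two fields: the borrowed array and a PhantomData marker for the coercion
// policy C. The marker is zero-sized, so the wrapper adds no runtime cost.
pub struct PyArrayLike<'py, T, D, C>(PyReadonlyArray<'py, T, D>, PhantomData<C>)
where
    T: Element,
    D: Dimension;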

@kngwyu (Member) commented Jul 12, 2023:

I see, you're right that we definitely need the struct to use it as a function argument.

@kngwyu (Member) commented Jul 14, 2023:

BTW, I checked the asarray and _array_fromobject_generic implementations, finding that it just tries some faster paths (e.g., obj is already an array) and then calls PyArray_CheckFromAny_int, which is an internal version of PyArray_CheckFromAny. So I think that using PyArray_CheckFromAny is better because it is an exposed C API, if we can handle the faster paths correctly.
Now the question is whether we can handle the faster paths simply enough. The code was a bit complex for me to read, but it looks like it only handles the cases where the given object is already a NumPy array or an instance of an array subclass. Maybe it's worth considering using it?

@adamreichold (Member):

> BTW, I checked the asarray and _array_fromobject_generic implementations [...] So I think that using PyArray_CheckFromAny is better because it is an exposed C API, if we can handle the faster paths correctly. [...] Maybe it's worth considering using it?

I agree in principle and also started looking into which FFI code paths this eventually ends up in. But as you say, it is somewhat convoluted and not at all obvious how to call PyArray_CheckFromAny. Therefore, I would suggest finishing this using the asarray fallback and then working on the FFI integration as an optimization follow-up. (Basically, this needs docs and examples to be ready for review, but I have not had the time to write those yet.)

@kngwyu (Member) commented Jul 14, 2023:

I agree, but let me confirm one thing...

> not at all obvious how to call PyArray_CheckFromAny

We already have it, right? Do you mean that it's not obvious how to use it?

@adamreichold (Member):

> We already have it, right? Do you mean that it's not obvious how to use it?

Yes, I am not referring to the signature, just to how to use it correctly: which code paths to handle ourselves, which parameters to pass, and so on.

@adamreichold (Member):

@kngwyu I added docs and examples and think this is good to go now. Could you have another look? Thanks!

@kngwyu (Member) left a comment:

Thanks, only one comment about the documentation.

[Inline review comment on src/array_like.rs — resolved. The doc text under discussion:]

> Extracts a read-only reference if the correct NumPy array type is given.
> Tries to convert the input into the correct type using `numpy.asarray`
> otherwise.
@adamreichold merged commit 1671b00 into PyO3:main on Oct 7, 2023. 30 of 31 checks passed.

Successfully merging this pull request may close: Feature request: implicit conversion (#382).

3 participants