-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The assumed matrix layout for correlation is unintuitive #81
Comments
I've been somewhat unsatisfied with this for a while, too. It's designed after /// Returns the covariance matrix for a 2-dimensional array of observations.
///
/// `axis` specifies the axis of the array to sum over. In other words, if `axis` is `Axis(0)`,
/// then each row is a single observation; if `axis` is `Axis(1)`, then each column
/// is a single observation. This is analogous to the `axis` parameter of [`ArrayBase::mean_axis`].
fn cov(&self, axis: Axis, ddof: A) -> Result<Array2<A>, EmptyInput>
where
A: Float + FromPrimitive,
{} This would also generalize well to inputs with more than 2 dimensions in the future. What do you think? |
On further thought, it seems that languages determine their However if you aren't willing to change the API in this way (which like I said, could wait until the next major release as it would be a breaking change), then we have two options. The first option is to use an axis parameter as you suggested (though ideally its type should be an Thoughts? |
The docs for the correlation methods say:
What this implicitly says is that "M should be a matrix with r rows, corresponding to random variables, and o columns, corresponding to observations". We know this because
ndarray
has an explicit definition for rows and columns, whereby the first axis refers to the rows and the second axis is called the column axis. For example refer tonrows
andncols
functions.However I find this assumption is counter-intuitive. The convention in my experience is to use the "tidy" layout which is that each row corresponds to an observation and each column corresponds to a variable. I refer here to Hadley Wickham's work, and this figure (e.g. here):
.
Also this is how R works:
Thirdly, in terms of the Rust data science ecosystem, note that
polars
(as far as I know, the best supported data frame library in Rust) outputs matricies with the same assumptions. If you create a DataFrame with 2 series (which correspond to variables) and 3 rows, and run.to_ndarray()
, you will get a (3, 2)ndarray
. Then when you call.cov()
on it, you will get something that is not the covariance matrix that you are after.One argument in the defence of the current method is
numpy.cov
, which makes the same assumption, as it takes:My suggestions is therefore to consider reversing the assumed dimensions for these methods in the next major (breaking) release. I realise that using
.t()
is not a difficult thing to do, but unfortunately forgetting to do this in your code will result in a valid matrix that may continue into downstream code without the user realising that it is not the correct covariance matrix. This happened to me and I'd like to spare other users from this issue.The text was updated successfully, but these errors were encountered: