Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extract_u_nk return current state of the file #115

Closed
xiki-tempula opened this issue Mar 7, 2021 · 1 comment
Closed

extract_u_nk return current state of the file #115

xiki-tempula opened this issue Mar 7, 2021 · 1 comment

Comments

@xiki-tempula
Copy link
Collaborator

xiki-tempula commented Mar 7, 2021

I'm working on a workflow for ABFE calculations #111 #114 and is currently working on the preprocessing.subsampling part.
The subsampling method dhdl needs to decorrelate the u_nk according to the column of the current state. However, the data frame returned by alchemlyb.parsing.gmx.extract_u_nk doesn't contain the information with regard to the current state.

I noticed that the alchemlyb.parsing.gmx.extract_u_nk does read state from the file so I wonder if it is possible for the extract_u_nk to return the current state of the dataframe.

I have several thoughts but I want to get the opinion from the community and possible issues with this.

The first is to set the state as metadata of the dataframe but not many people might know this usage. (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html)

u_k.attrs['state'] = state

The second is to set this information into the name since it is currently set to 'u_nk', which I think is not that useful

u_k.name = 'u_nk state: {}'.format(state)

The third option is to return the metadata directly, which will break the current API

def extract_u_nk(xvg, T):
    return u_k, {'state': state}

Obviously, one could also recover the state by using the row name
state = u_k.columns.values.tolist().index(u_k.index.values[0][1:])

@orbeckst
Copy link
Member

I don't like breaking the API and I am not a big fan of using column names for numerical values (although we might already be doing this somewhere).

The metadata approach seems elegant and does not break anything. If this is a feature of pd.DataFrame then we can use it because we explicitly use DataFrames as our internal data format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants