Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parser and update readers for new options.yaml #30

Merged
merged 5 commits into from
Jan 19, 2024

Conversation

PicoCentauri
Copy link
Contributor

@PicoCentauri PicoCentauri commented Jan 13, 2024

This is still work in progress, but should already contain the full functionality of the options parser and the corresponding readers.

TODO

  • docs
  • test

📚 Documentation preview 📚: https://metatensor-models--30.org.readthedocs.build/en/30/

@PicoCentauri PicoCentauri force-pushed the train-set-parser branch 4 times, most recently from 901d2bc to 9a359c5 Compare January 15, 2024 16:51
Comment on lines +5 to +6
# Last position of the _self_ this entry defines that default options will be
# overwritten by this config.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand. What is self for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a hydra thing _self_ represents the current config. Imagine you have the following default

defaults:
  - architecture: soap_bpnn
  - _self_

Then the current config overwrites options in soap_bpnn. But, if you have the following

defaults:
  - _self_
  - architecture: soap_bpnn

soap_bpnn will overwrite the current options. Of course this is not what we want but hydra has this possibility and will throw a waring if you do not define the _self_ entry in the default list. I will clarify this in the docs!

Comment on lines +244 to +261
subsets = torch.utils.data.random_split(
dataset=train_dataset,
lengths=[
fraction_train_set,
fraction_test_set,
fraction_validation_set,
],
generator=generator,
)

train_dataset = subsets[0]
if fraction_test_set and not fraction_validation_set:
test_dataset = subsets[1]
elif not fraction_validation_set and fraction_validation_set:
validation_dataset = subsets[1]
else:
test_dataset = subsets[1]
validation_dataset = subsets[2]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm too lazy to think about the code, what happens here if the user specifies a validation set but then sets a test set of 0.1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then 90% of the original train set goes into training and 10% of the original training set into validation. Does this makes sense for you?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, perfect


if target["stress"] and target["virial"]:
raise ValueError(
"Cannot add gradient displacement gradient for stress and virial!"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most people wouldn't understand this. How about "cannot use stress and virial at the same time" and/or "only one of the two is allowed at once"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I will change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: I think I'll add an explicit gradient w.r.t. "strain" (i.e. -stress) in rascaline, to resolve an ambiguity in the "cell" gradients. A similar naming could be used internally here! (no comment on the user facing error message)

def _read_virial_stress_ase(
filename: str,
key: str,
is_virial: bool = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this function should have a default for this argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See your point. For energy we could set it to "energy", "forces" "virial" and "stress". But, for this private function I would leave it mandatory.

Comment on lines +146 to +147
Labels(["direction_1"], torch.arange(3).reshape(-1, 1)),
Labels(["direction_2"], torch.arange(3).reshape(-1, 1)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still haven't agreed as a group on what a good naming convention is here... essentially the first one should index the three cell vectors, while the second one would index their individual xyz components. TBD

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay let's keep this in mind. Can you open an issue once we merged this PR to keep this in mind.

if is_virial:
values *= -1
else: # is stress
values *= torch.tensor([f.cell.tolist() for f in frames])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think values should multiplied by the volume of the cell (ASE might have a function or property for that). I'm not sure what this operation is doing. It would also be great to raise a good error if there is no cell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh yes you are completely, right!

result = read_stress_ase(filename=filename, key="stress-3x3")

expected = torch.tensor(structures.info["stress-3x3"])
expected *= torch.tensor(structures.cell.tolist())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here, shouldn't this be the cell volume (a single scalar)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep!

@PicoCentauri PicoCentauri marked this pull request as ready for review January 16, 2024 21:59
@PicoCentauri
Copy link
Contributor Author

Sorry for the long delay and thanks @frostedoyster for the review. Maybe @frostedoyster and @Luthaf you can add a look the tutorial if this clear and you understand how you can construct the dataset section. Also please check if this follows our convention that we have defined. Thanks in advance.

@PicoCentauri PicoCentauri force-pushed the train-set-parser branch 2 times, most recently from 4661be2 to deb9adf Compare January 18, 2024 12:43
@frostedoyster frostedoyster merged commit 08844c5 into main Jan 19, 2024
7 checks passed
@frostedoyster frostedoyster deleted the train-set-parser branch January 19, 2024 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants