Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keeping QCSchema in sync with QCElemental #68

Open
mattwelborn opened this issue Jan 23, 2020 · 20 comments · May be fixed by #77
Open

Keeping QCSchema in sync with QCElemental #68

mattwelborn opened this issue Jan 23, 2020 · 20 comments · May be fixed by #77

Comments

@mattwelborn
Copy link
Collaborator

mattwelborn commented Jan 23, 2020

QCSchema lags behind what's actually implemented AND DOCUMENTED in QCElemental. The QCSchema docs are where people look for info, so this creates a misleading impression about what QCSchema is/does in practice.

Some possible solutions:

  1. QCSchema/docs is always what you get from running .schema()/autodoc on the myriad models in QCElemental
  2. QCElemental models represent schema in development, and from time to time we concretize its state into release versions of QCSchema.
  3. QCElemental models are not allowed to change until QCSchema changes.

My favorite option is 1. I'm okay with 2. I think 3 is a bad idea.

@dgasmith @bennybp

@loriab
Copy link
Collaborator

loriab commented Jan 23, 2020

+1 option 1

extras field is effectively the "scratch space"/"in development" role

@ghutchis
Copy link
Collaborator

ghutchis commented Jan 23, 2020

As an external user, I would strongly push for choice 1.

Promoting option 1, IMHO sends a strong signal that this schema is only appropriate for QCElemental, and it's no longer intended as a cross-program interchange.

Option 3 sends the signal that the community input is driving the schema, but I can understand that's frustrating for QCelemental.

So I would strongly push for choice 2.

@loriab
Copy link
Collaborator

loriab commented Jan 23, 2020

Promoting option 1, IMHO sends a strong signal that this schema is only appropriate for QCElemental, and it's no longer intended as a cross-program interchange.

That's certainly not what I'd want signaled. Rather, I like option 1 for its single-source-of-truth property. Do you think calling qcel .schema() the docs/source with qcel as a reference implementation and doing the part of option 2 of periodically concretizing and publishing to QCSc versions would assuage that impression?

PRs suggesting changes to the data layout of qcel.models could be better marked with a "Schema" label and held open for comment to make sure changes don't sneak up on the community.

@ghutchis
Copy link
Collaborator

I feel pretty strongly about this.

This is an independent repo with a set of existing issues, PR, etc. for the schema. IMHO, QCElemental should be a client of this repo (i.e., it depends on the schema). The single source of truth should be here.

Frankly, I'm unlikely to use QCElement because every project already has similar functionality (cclib, Open Babel, Avogadro, etc.). Until I saw this issue, I frankly didn't know it existed.

This repo was created for the express purpose (and declared by its description) of being the home of the QC schema.

Consider also that https://molssi-qc-schema.readthedocs.io/en/latest/auto_topology.html is not a subset of the QCElemental docs.

Please don't make the schema subservient to another project. If you want to use QCElemental to work on development versions, that's fine. But the standard version should be declared here and IMHO, all issues, discussion relevant to the schema should be in this repo, not via some tag in a different repo.

@dgasmith
Copy link
Collaborator

I think a clear distinction here is that developing in QCElemental with option 1/2 doesn't necessitate using QCElemental as we will be exporting JSON-Schema to a repo (likely here). QCElemental is used to compose the schema, provide a Python-based structure for those who want it, or generate JSON-Schema like we do here. In the above scenarios I would much prefer 2) over 1) to for the comments listed above.

In general, we are finding QCElemental as a driving force for cross-program interchange because there is an implementation behind it that does something useful. A concrete schema implementation that has the bells and whistles needed forms a cornerstone of an ecosystem. We can now compute a dozen programs using QCSchema with QCEngine and I know of ~40M results computed with QCFractal using the Schema. There are now active learning programs, force field fitting toolkits, visualization tools, etc all built on top of the Schema and data repos like Fractal.

For example it would be great if Avogadro picked up QCEngine (which uses CCLib) instead of writing their own quantum chemistry compute routines. You probably need a few more programs with orbitals built in, but thats easily done. Come talk to us.

@cryos
Copy link
Collaborator

cryos commented Jan 24, 2020

I think if we are to drive adoption then option 3 is the best, with 2 being a more pragmatic second best. I don't think we are going to drop directly using QM codes in Avogadro, but adding something like QCEngine as another option would work assuming it offers the basic capabilities we need. I see the most value in codes adopting QC JSON natively, but there is a lot of value in something like QCEngine doing the translation.

As you know Chemical JSON was developed using something akin to option 1 from the start. I think it has utility, but it is something different to what we set out to achieve with QC JSON. It would be a shame to see the goal of forming a community schema recede, but we knew from the start it was tough to do. MolSSI is the organization with the funding and time to put in to efforts like this.

@ghutchis
Copy link
Collaborator

@dgasmith - while I agree that you don't have to use QCElemental under option one, that choice implies that the 'ground truth' for the schema falls to that project, not to this repo. I'm repeatedly telling you that sends the wrong message to the community.

If you want to get more community feedback, I'm happy to ping others from the meeting to come here and comment.

Doing something useful with the schema is incredibly important, as is continual development.

I have a pretty big pool of calculations myself (~20M) and Harvard Clean Energy is.. 150M? That shouldn't factor into where and how the schema is standardized.

IMHO option 2 sounds like something you'll accept, and I think that's the pragmatic way forward.

@loriab
Copy link
Collaborator

loriab commented Jan 24, 2020

My primary goals are to:

  • Not write schema docs & types twice, where one gets autotranslated into proper JSON and basic sanity checked with pydantic and the other sits here with perhaps my "number"/"float"-type mistakes uncaught.
  • Collect issues and discussion on schema where the opinions are. That seemed to have dwindled in this repo, but if people are here, so let the discussion be also.

So option 2 is fine by me.

@dgasmith
Copy link
Collaborator

I see the most value in codes adopting QC JSON natively, but there is a lot of value in something like QCEngine doing the translation.

I completely agree, but this is a very slow process as noted. Just to echo: QCEngine exists to help move the process along and provide demonstrable use cases.

I'm repeatedly telling you that sends the wrong message to the community.

I get that, but I am confused of why this is the case. A move to Elemental seems more like "MolSSI is building tools for a community built schema, let us build out a Python ecosystem for communication that can also be used in any language." Which seems like a reasonable message otherwise we are saying "here is a schema that has no uses or community".

I have a pretty big pool of calculations myself (~20M) and Harvard Clean Energy is.. 150M? That shouldn't factor into where and how the schema is standardized.

Definitely, which is why the number was a pretty minor part of a paragraph. I was attempting to paint a picture that schema is beginning to be accepted by some of the community by having useful libraries that support it.

Collect issues and discussion on schema where the opinions are. That seemed to have dwindled in this repo, but if people are here, so let the discussion be also.

I echo this quite strongly, Elemental has a strong user base that we just haven't been able to build here. If we can kick this repo along let us do it, but we so far haven't been successful at that.


I should note here that we have largely been taking strategy 3 so far where we add first to this repo and then to QCElemental. I think the only place where this isn't the case and we take strategy 2 is a geometry optimization input/output schema where we currently acknowledge that this repo is the single source of truth and we may have to change.

@mattwelborn
Copy link
Collaborator Author

mattwelborn commented Jan 24, 2020

Below is a proposal that tries to strike a balance between @loriab's points about correctness and de-duplication, and @ghutchis's points about community engagement, a single source of truth, and the QCSchema existing separate from QCElemental. It sort of falls between options 2 and 3.

Community input

1: The MolSSI/QCSchema repository MUST continue to exist.
1.1: Input from the community about QCSchema MAY be filed as issues on MolSSI/QCSchema.
1.1.1: These issues MUST be linked as issues in MolSSI/QCElemental.
1.1.2: The maintainers of MolSSI/QCSchema MAY encourage discussion to be moved to MolSSI/QCElemental if it is more model-related.
1.2: Input from the community about QCSchema MAY be filed as issues on MolSSI/QCElemental.
1.2.1: These issues MUST be linked as issues in MolSSI/QCSchema if they affect data layout/schema.
1.2.2: These issues MAY be linked as issues in MolSSI/QCSchema if they affect implementation details.
1.2.3: The maintainers of MolSSI/QCElemental MAY encourage discussion to be moved to MolSSI/QCSchema if it is more schema-related.

Releases

2: MolSSI/QCSchema MUST have versioned releases.

QCElemental models

3: Models in MolSSI/QCElemental MUST be marked as to whether they are part of QCSchema.
3.1: Models marked as part of QCSchema implemented in MolSSI/QCElemental/master are the single source of truth. The actual JSONSchema files in MolSSI/QCSchema/master MUST be generated from these models.
3.1.1: Accepted changes to the QCSchema (see section 1) MUST be implemented in MolSSI/QCElemental, and MolSSI/QCSchema MUST be subsequently updated.
3.1.2: PRs that change MolSSI/QCSchema MUST originate in MolSSI/QCElemental.
3.1.2.1: These PRs MUST be cross-linked as issues in MolSSI/QCSchema.

Documentation

3.1.2: The documentation of QCSchema in MolSSI/QCSchema MUST be auto-generated from the QCSchema models in MolSSI/QCElemental.
4: The default documentation on RTD MUST correspond to the latest released version of MolSSI/QCSchema.
4.1: The RTD corresponding to MolSSI/QCSchema/master MAY also be made available.


Items 3.1 and 3.1.1 are intended to establish a single source of truth while keeping MolSSI/QCSchema and MolSSI/QCElemental up to date and synchronized. Ideally, MolSSI/QCSchema would be the single source of truth, but models are a superset of the schema, and the thing that actually gets tested. As a result, we are likely to end up with correct models and incorrect schema.

@loriab
Copy link
Collaborator

loriab commented Jan 25, 2020

So now we do line-item voting, Security Council veto rules?

Just kidding. Matt's proposal sounds good to me. Only a couple items I want to make sure we're on the same page about:

  • 1.2.1 QCEl MUST cross link to QCSc on when QCEl issues/PRs arise affecting data layout and MAY cross link on those affecting implementation details. So I'd expect Result --> SingleResult QCElemental#155 to need an issue opened here but probably not Sparse Molecules QCElemental#169. For PRs, I can probably put in a linter that checks if PRs will affect the rendered schema, so a cross post can be made. Does this sound right or is it wanted that implementation issues/PRs cross-link, too?
  • Difference between schema stages.
    • preliminary schema, those being pre-implementation like Charges (AKA populations) #67, whose discussion and any PRs (about which we can be more relaxed about strict JSON formatting) belong here in QCSchema
    • operational schema, those that are adopted by QCSc and implemented in QCEl, can have issues in either repo (section 1 cross-link guidelines apply) but PRs should be solely in QCEl.
    • qcel schema, those like the OptimizationResult that are useful for QCEl and dependent projects but won't or haven't yet been broached to QCSc.

@ghutchis
Copy link
Collaborator

+1 I think this seems like a pragmatic approach.

@ghutchis
Copy link
Collaborator

@loriab - If you can add a linter or Github action to check if PRs affect the schema, that would be a great solution.

@cryos
Copy link
Collaborator

cryos commented Jan 25, 2020

Sounds like a reasonable and pragmatic approach to me too. I personally never thought this would be easy, but I am passionate about seeing it move forward and gain wider adoption as features are added.

@cryos cryos closed this as completed Jan 25, 2020
@cryos cryos reopened this Jan 25, 2020
@cryos
Copy link
Collaborator

cryos commented Jan 25, 2020

Apologies for accidentally closing this - mouse slippage...

@bennybp
Copy link
Contributor

bennybp commented Jan 25, 2020

I am not ignoring this on purpose. I do want to weigh in on this soon, however I am currently traveling and teaching (and there is a lot here to absorb). I will try to come up with some coherent thoughts tonight.

@awvwgk
Copy link

awvwgk commented Jan 27, 2020

Thanks for this issue! Similar to @ghutchis I didn't got the connection between QCElemental and QCSchema in the first place. It would be really helpful to have the list from above, or whatever the final form will be, as a contributing guideline for this repository or in some visable place. I for one didn't consider contributing to QCSchema so far because I didn't really knew how.

@mattwelborn
Copy link
Collaborator Author

So now we do line-item voting, Security Council veto rules?

Just kidding. Matt's proposal sounds good to me. Only a couple items I want to make sure we're on the same page about:

  • 1.2.1 QCEl MUST cross link to QCSc on when QCEl issues/PRs arise affecting data layout and MAY cross link on those affecting implementation details. So I'd expect MolSSI/QCElemental#155 to need an issue opened here but probably not MolSSI/QCElemental#169. For PRs, I can probably put in a linter that checks if PRs will affect the rendered schema, so a cross post can be made. Does this sound right or is it wanted that implementation issues/PRs cross-link, too?

  • Difference between schema stages.

    • preliminary schema, those being pre-implementation like Charges (AKA populations) #67, whose discussion and any PRs (about which we can be more relaxed about strict JSON formatting) belong here in QCSchema
    • operational schema, those that are adopted by QCSc and implemented in QCEl, can have issues in either repo (section 1 cross-link guidelines apply) but PRs should be solely in QCEl.
    • qcel schema, those like the OptimizationResult that are useful for QCEl and dependent projects but won't or haven't yet been broached to QCSc.

Yes to all of this. I have updated my earlier comment accordingly.

We should look into making a bot that creates cross-linked issues. For example, if an issue is made on MolSSI/QCSchema titled "Foo", then a corresponding issue gets made on MolSSI/QCElemental titled "QCSchema issue #123: Foo". This issue's body would then just be a link to MolSSI/QCSchema#123. On the QCEl side, we would do the same thing, but only for issues that have the "Schema" tag.

@loriab
Copy link
Collaborator

loriab commented Jan 28, 2020

We should look into making a bot that creates cross-linked issues. For example, if an issue is made on MolSSI/QCSchema titled "Foo", then a corresponding issue gets made on MolSSI/QCElemental titled "QCSchema issue #123: Foo". This issue's body would then just be a link to MolSSI/QCSchema#123. On the QCEl side, we would do the same thing, but only for issues that have the "Schema" tag.

I wasn't going to bring it up until it was proof-of-principle, but since Matt mentioned it, I'm working on MolSSI/QCElemental#204.

The notion is that there's a GHA that runs on every PR of qcel and generates the json schema files from a list of models (provenance, molecule, etc.). Then it runs a diff against qcsk (QCSchema) master and if there's a change, creates a branch (working up to this point) and PRs that to qcsk and returns red X to qcel. The PR here can be a place for discussion, any changes can be added to the qcel PR, and when auto PR is merged here, then qcel will show green check. This is my first foray into GHA, so I or GHA may still fail, but it's going easily so far. How does the plan sound?

A side issue is that the QCSchema repo is presently a python module. This is for historical reasons of testing and documentation. Since the former is handled at qcel and the latter can be reworked, what does everyone think of the qcsk repo being JSON files instead, like https://github.com/loriab/QCSchema/tree/py2json/qcschema_json ?

@coltonbh
Copy link

Adding here a link to a PR I created a while back that seeks to standardize the QCSchema models to a common interface. I'm in favor of implementing the models only once, and exporting that json schema from the pydantic models. What I propose in this (draft) PR would be a suggestion for what v2 of the schema could look like: MolSSI/QCElemental#287

Open to comments/suggestions. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants