Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Request DOI metadata as CSL-JSON after BibTeX #1073

Merged
merged 21 commits into from
Apr 10, 2024

Conversation

agoose77
Copy link
Contributor

@agoose77 agoose77 commented Apr 8, 2024

The bug in #1072 appears to be caused by the key passed to citation-js:

@article{EFSA Panel on Nutrition, Novel Foods and Food Allergens (NDA)_Turck_Castenmiller_de Henauw_Hirsch‐Ernst_Kearney_Knutsen_Maciuk_Mangelsdorf_McArdle_et al._2019,
	title        = {Dietary reference values for chloride},
	author       = {EFSA Panel on Nutrition, Novel Foods and Food Allergens (NDA) and Turck, Dominique and Castenmiller, Jacqueline and de Henauw, Stefaan and Hirsch‐Ernst, Karen‐Ildico and Kearney, John and Knutsen, Helle Katrine and Maciuk, Alexandre and Mangelsdorf, Inge and McArdle, Harry J and Pelaez, Carmen and Pentieva, Kristina and Siani, Alfonso and Thies, Frank and Tsabouri, Sophia and Vinceti, Marco and Aggett, Peter and Fairweather‐Tait, Susan and Martin, Ambroise and Przyrembel, Hildegard and de Sesmaisons‐Lecarré, Agnès and Naska, Androniki},
	year         = 2019,
	month        = {Sep},
	journal      = {EFSA Journal},
	volume       = 17,
	number       = 9,
	doi          = {10.2903/j.efsa.2019.5779},
	issn         = {18314732, 18314732},
	url          = {https://data.europa.eu/doi/10.2903/j.efsa.2019.5779},
	place        = {IT},
	abstractnote = {This publication is linked to the following EFSA Supporting Publications articles: http://onlinelibrary.wiley.com/doi/10.2903/sp.efsa.2019.EN-1679/full, http://onlinelibrary.wiley.com/doi/10.2903/sp.efsa.2017.e15121/full This publication is linked to the following EFSA Journal article: http://onlinelibrary.wiley.com/doi/10.2903/j.efsa.2019.5778/full}
}

This is not valid BibTeX, and therefore does not parse. One solution is to request CSL-JSON via content negotiation. This trades one class of bugs (invalid BibTeX markup) for another (invalid CSL structures). I think this is a better part of the problem space to find oneself in; we can more easily attempt to fix-up bad CSL than we can bad BibTeX.

Indeed, it looks like using CSL-JSON from content-negotiation isn't foolproof either; some DOIs like The Higgs Boson Paper report a name instead of a literal, or define a type of journal-article instead of article-journal. It looks like crossref is returning "extended citeproc": CrossRef/rest-api-doc#580 rather than true CSL-JSON.

Knowing that, we can probably do a reasonable job of coercing the result to the CSL-JSON that citation-js expects, but for now this PR does the minimum from testing the original DOI and the Higg's DOI.

I'm hoping that this change is more robust than the BibTeX approach, but we will need to try more DOIs than I've thrown at it.

Fixes #1072

@agoose77 agoose77 changed the title wip: fix up citations 🐛 Request DOI metadata as CSL-JSON Apr 8, 2024
@agoose77 agoose77 force-pushed the agoose77/fix-citation-parsing branch from 62476ac to 23adc24 Compare April 8, 2024 14:53
@rowanc1
Copy link
Member

rowanc1 commented Apr 8, 2024

I don't have any data on which format is going to be better supported by cross-ref. I am pretty sure that bibtex has been around longer? My hesitation on bringing this on in the current state (or my understanding of it at least) is that it is really hard to know what our success rate is for the DOI parsing without this in the wild - I would rather an approach that improves/iterates the existing bibtex implementation rather than fully changes the implementation to using csl.

What about keeping the existing bibtex default and adding CSL content negotiation/parsing as a fallback if the bibtex fails to parse or is unavailable from doi.org? That should catch the case referenced and be a bit safer to roll out.

@agoose77
Copy link
Contributor Author

agoose77 commented Apr 9, 2024

Good point! I'll rebase on main with the recommended changes.

@agoose77 agoose77 force-pushed the agoose77/fix-citation-parsing branch from 23adc24 to 75b8a6a Compare April 9, 2024 12:04
@agoose77 agoose77 marked this pull request as ready for review April 9, 2024 14:14
@agoose77 agoose77 changed the title 🐛 Request DOI metadata as CSL-JSON 🐛 Request DOI metadata as CSL-JSON after BibTeX Apr 9, 2024
}

export async function parseCSLJSON(source: object[]): Promise<CSL[]> {
return Promise.resolve(cleanCSL(source));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't validate the CSL, but does try to "clean" it. Should we also use Cite.async here too?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So - in parseBibTeX, Cite.async gives us bibtex ->CSL, but here, it would just give us CSL -> CSL so maybe unnecessary and just "cleaning" is all that's required?

If all that is true, then I guess the only reason to use Cite.async is if doi.org gives an invalid CSL response...? Not totally out of the question I suppose... but then Cite.async is another mostly unnecessary API request?

Any other nuance to this decision?

return Promise.resolve(cleanCSL(source));
}

export async function getCitationRenderers(data: CSL[]): Promise<CitationRenderer> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now this function creates renderers for CSL (i.e. offloads parsing)

Comment on lines +6 to +21
{
'container-title': 'Monthly Weather Review',
author: [
{ given: 'C. H. B.', family: 'PRIESTLEY' },
{ given: 'R. J.', family: 'TAYLOR' },
],
DOI: '10.1175/1520-0493(1972)100<0081:otaosh>2.3.co;2',
type: 'article-journal',
issue: '2',
issued: { 'date-parts': [[1972, 2]] },
page: '81-92',
publisher: 'American Meteorological Society',
title: 'On the Assessment of Surface Heat Flux and Evaporation Using Large-Scale Parameters',
volume: '100',
},
];
Copy link
Contributor Author

@agoose77 agoose77 Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified this to the subset of equal-fields common to both resolvers. I think that's OK; things like the URL and ISBN slightly differ, but I think that's a consequence of the different metadata that's registered.

@rowanc1
Copy link
Member

rowanc1 commented Apr 9, 2024

Works well in web mode, cache seems good and doesn't conflict. From my testing there is a current bug in writing out the bibtex: writeBibtexFromCitationRenderers errors out when there is a CSL input here.

image

I think this could/should be isolated into a test around falling back to CSL and then writing bibtex for the DOI of 10.2903/j.efsa.2019.5779.

Copy link
Collaborator

@fwkoch fwkoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great - keeping bibtex as the default but adding a CSL fallback, explicitly making CSL the internal representation.

I don't have any big concerns with the changes to exported functions - there are a couple dependent packages that will need to update, but giving citation-js-utils a major bump makes that pretty clear.

I do think we need to fix the writeBibtexFromCitationRenderers functionality before releasing this - Probably for now just a simple function we can call if the bibtex entry is not found buried in the cite._graph - just take the CSL and return a bibtex string with title/author/year/journal/doi filled in.

packages/myst-cli/src/transforms/dois.ts Outdated Show resolved Hide resolved
packages/citation-js-utils/src/index.ts Show resolved Hide resolved
}

export async function parseCSLJSON(source: object[]): Promise<CSL[]> {
return Promise.resolve(cleanCSL(source));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So - in parseBibTeX, Cite.async gives us bibtex ->CSL, but here, it would just give us CSL -> CSL so maybe unnecessary and just "cleaning" is all that's required?

If all that is true, then I guess the only reason to use Cite.async is if doi.org gives an invalid CSL response...? Not totally out of the question I suppose... but then Cite.async is another mostly unnecessary API request?

Any other nuance to this decision?

packages/citation-js-utils/types/citation-js/index.d.ts Outdated Show resolved Hide resolved
packages/myst-cli/src/transforms/dois.ts Show resolved Hide resolved
packages/myst-cli/src/transforms/dois.ts Show resolved Hide resolved
@agoose77 agoose77 force-pushed the agoose77/fix-citation-parsing branch from 6f0751d to 68538b8 Compare April 10, 2024 09:34
@agoose77
Copy link
Contributor Author

Good spot on the BibTeX extraction; I'd assumed that we were using citation-js for this and hadn't noticed that instead we were pulling from the graph.

This PR now uses citation-js to build a BibTeX output. I'm assuming that we don't need to do anything to tweak the citation key, as it should match the id that citation-js reports (I think). We might need to be careful to ensure this.

In general, I think this PR represents a hardening of our BibTeX handling in addition to now providing a fallback for bad-BibTeX responses; we don't rely on implementation details from citation-js to store the original BibTeX, for example.

There may be some non-round-tripping of certain BibTeX entries, but I think that should be very uncommon; BibTeX is a core plugin for citation-js.

@agoose77 agoose77 closed this Apr 10, 2024
@agoose77 agoose77 reopened this Apr 10, 2024
@agoose77
Copy link
Contributor Author

@fwkoch

So - in parseBibTeX, Cite.async gives us bibtex ->CSL, but here, it would just give us CSL -> CSL so maybe unnecessary and just "cleaning" is all that's required?

If all that is true, then I guess the only reason to use Cite.async is if doi.org gives an invalid CSL response...? Not totally out of the question I suppose... but then Cite.async is another mostly unnecessary API request?

Any other nuance to this decision?

Yes, even the BibTeX method doesn't need to be async. I suppose I was anticipating that we might want to support async resolvers in future. I'll move them to sync for now.

Copy link
Collaborator

@fwkoch fwkoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the bibtex writing - using citation-js for that is much better. 🙂

I noticed a couple more things when testing today:

First, I don't think this warning needs to be raised:
image
I don't think the user will care about implementation details of the underlying citation data format?

Also, totally non-existent citations were still causing a crash:
image
This is related to return undefined vs throwing an error

I went ahead and fixed these up in a commit: 559cbef

Hopefully that all looks ok? Otherwise, I'm happy with everything.

@fwkoch fwkoch force-pushed the agoose77/fix-citation-parsing branch from dd9db8e to cec2d31 Compare April 10, 2024 23:14
@fwkoch
Copy link
Collaborator

fwkoch commented Apr 10, 2024

One more bug I found on tex exports:
image

cite nodes resolved from CSL-JSON were not getting a label. Fixed here: ba9d0b9

@fwkoch fwkoch merged commit 5c9338a into main Apr 10, 2024
4 checks passed
@fwkoch fwkoch deleted the agoose77/fix-citation-parsing branch April 10, 2024 23:24
@agoose77 agoose77 self-assigned this Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fatal error when apparently invalid doi is encountered
3 participants