
Annotation of footnoted references for training custom Grobid models #1171

cboulanger opened this issue Sep 23, 2024 · 8 comments

@cboulanger

cboulanger commented Sep 23, 2024

I want to extract reference data from articles that use footnotes instead of a bibliography. The footnotes contain the references, mixed with additional commentary. Since Grobid was not trained on this kind of messy data, it does not perform very well when confronted with this type of source material.

I have a dataset of high-quality annotations in a different TEI format which I want to convert into the schema that Grobid uses for training. I have used the 'createTraining' command to produce the training files for the reference model and am currently trying to convert my data into something that fits the structure of the xml that I see there.

One major problem is of course that the bibl nodes in my data are not placed in the TEI/text/back/listBibl node as in the existing training data, but in individual TEI/text/body/note nodes. The question is what structure Grobid expects to be trained on. Should a bibliographic section be "faked" from the given material, or can the structure be kept? See an example.

The references.referenceSegmenter model actually recognizes footnotes very well, but classifies them as bibl, i.e. <bibl><label>39</label> ...</bibl>, instead of the "TEI"-ish way <note n="39" type="footnote" place="bottom"><bibl>... </bibl></note>. Since Grobid is not for TEI annotation but for metadata extraction, I don't care too much, but the question remains whether my material needs to be transformed or whether Grobid can (and should) be trained to differentiate bibliographies from footnotes in this way.
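For what it's worth, the transformation from my note-based structure into a listBibl-based one could be sketched roughly like this (a minimal illustration using Python's standard ElementTree; the function name and the simplified document shape are my own assumptions, not anything Grobid prescribes):

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

def notes_to_listbibl(root):
    """Move <bibl> children of body footnote <note>s into back/listBibl,
    mimicking the structure of Grobid's existing training files."""
    text = root.find("tei:text", NS)
    body = text.find("tei:body", NS)
    back = text.find("tei:back", NS)
    if back is None:
        back = ET.SubElement(text, f"{{{TEI}}}back")
    list_bibl = ET.SubElement(back, f"{{{TEI}}}listBibl")
    for note in body.findall("tei:note", NS):
        for bibl in note.findall("tei:bibl", NS):
            list_bibl.append(bibl)  # detach from the note, append to listBibl
        body.remove(note)           # drop the emptied footnote wrapper
    return root
```

This of course throws away the footnote numbers and positions, which is exactly the information loss I would like to avoid.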

Happy to hear your thoughts.

@cboulanger
Author

To add to this: I assume that making Grobid learn the concept of footnotes containing full bibliographic references requires retraining the page segmentation model...

@lfoppiano
Collaborator

I think the footnotes are already recognised by the segmentation model. Do you have one or two example PDFs that I can have a look at?

IMHO the challenge is more to understand when the references are in the footnotes and when they are not, and to pass the right pieces through the bibliographic reference parser.

@kermitt2
Owner

Hello @cboulanger !

A superficial way to support bibliographical references given as footnotes is to encode them as independent reference sections in the segmentation training file, not as footnotes:

bla bla^5 bla
</body>

<listBibl>5. See for example Boulanger, C. 2024. "The problem of encoding references in footnote". Journal of complex issues, Volume 1</listBibl>

<page>2</page>

<body>
bla bla

When processing a file, the reference areas then appear not as one block (the bibliographical section) but as multiple blocks (several footnotes), which are combined to form a list of bibliographical references that is then processed as usual. These reference segments are further segmented and parsed, and will appear together in the final TEI, no longer as footnotes. Nothing needs to change in Grobid for this to work.

The problems are that this does not work well when the reference is mixed with other text, as is often the case in the Humanities and Law, and that we lose the fact that it was a footnote and that the footnote marker in the text body was in fact a reference marker. To go beyond this superficial way of dealing with references in footnotes, we might need to introduce two types of footnotes in the segmentation model, the normal one and the "reference footnote" as a new label, which would trigger a different process.

@cboulanger
Author

cboulanger commented Sep 25, 2024

Hi, thank you for your responses and feedback!

Unfortunately, the PDFs that I need to work with are not Open Access, but hopefully you have access to them through your institutions. Let me know if not, I could send them to you by email.

Let's take one example with the DOI https://doi.org/10.1111/1467-6478.00057. This is an English-language article with endnotes. My annotations were originally made for AnyStyle, so for every PDF I have an XML file and a text file where each line is tagged (for copyright reasons, I can only publish a truncated version).

AnyStyle uses a very simple format for its training files compared to Grobid, which makes it easy to annotate; that is the reason I originally chose it. It carries no real information about the page layout and could only be translated into page-segmentation ground truth using some clever heuristics. However, the bibliographic segmentation data can be translated relatively easily. I tried to make it as TEI-conformant as possible, using <bibl> elements (example). According to what I found in the TEI specs, footnotes or endnotes are to be encoded as <note> elements. (There are some problems with whitespace, which is inconsistently encoded, but let's ignore that for now.) The <bibl> elements can then be further translated into <biblStruct> using existing tools (example).
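As a rough illustration of the conversion direction, assuming AnyStyle's `<sequence>` training elements, a field-by-field mapping onto TEI `<bibl>` children might look like this (the tag mapping here is partial and made up for illustration; a real converter would need the full AnyStyle tag inventory):

```python
import xml.etree.ElementTree as ET

# Illustrative, incomplete mapping from AnyStyle field tags to TEI <bibl>
# children; unknown tags fall back to <note>.
TAG_MAP = {
    "author": "author",
    "editor": "editor",
    "title": "title",
    "journal": "title",
    "date": "date",
    "volume": "biblScope",
    "pages": "biblScope",
    "publisher": "publisher",
}

def sequence_to_bibl(sequence):
    """Convert one AnyStyle <sequence> element into a TEI-ish <bibl>."""
    bibl = ET.Element("bibl")
    for field in sequence:
        child = ET.SubElement(bibl, TAG_MAP.get(field.tag, "note"))
        child.text = (field.text or "").strip()
    return bibl
```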

In addition to this article (English, endnotes), I have prepared and manually checked three other annotations which are representative of the larger training corpus.

I do not have the resources to re-annotate Grobid training files from scratch. Instead, I want to be able to use my existing data. The pragmatic way would be to simply mimic the structure Grobid has already learned, and it would be enough to just get the reference data out of new material. It would be nice, however, to be able to teach Grobid some new tricks and have the result be more TEI-conformant, so that in the future some more fine-grained analysis of scholarly articles would be possible, for example keeping the citation context (analysis of whether a footnote is supportive, contradicting, etc.). That's why I am unsure how to proceed at this moment.

@cboulanger
Author

A similar question of "forward-compatibility" for ML-based TEI annotation concerns footnotes containing back-references such as "id, p. 56" or "See Doe, op. cit (n. 5), p 45", which carry no new bibliographic information as such, but annotating them would provide rich information that could be harvested in later analyses. I know that this is out of the scope of what Grobid is made for, but if Grobid could be trained to recognize these patterns, that would really open up some new research avenues.

@cboulanger
Author

cboulanger commented Oct 1, 2024

Hi, maybe it makes more sense to start small: instead of thinking about the footnotes in the page context, I should probably first focus on the main problem, namely whether Grobid can be trained to deal with messy strings that contain more than one reference. That has nothing to do with the footnote as such, which might just as well contain a single well-formatted reference that Grobid has no problem parsing.

Here are two examples of extremely messy reference strings, which I have passed to Grobid's "processCitation" service.
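For reference, this is roughly how I call the service (a sketch assuming a local Grobid server on the default port 8070; the helper names are mine):

```python
from urllib import parse, request

GROBID = "http://localhost:8070"  # default local Grobid port; adjust as needed

def build_payload(raw_citation):
    """Form-encode a raw citation string for the processCitation service."""
    return parse.urlencode({"citations": raw_citation.strip()}).encode()

def process_citation(raw_citation):
    """POST one raw citation string and return the TEI <biblStruct> result."""
    req = request.Request(f"{GROBID}/api/processCitation",
                          data=build_payload(raw_citation))
    with request.urlopen(req) as resp:
        return resp.read().decode()
```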

Example 1: English footnote

3 R. Goff, ‘The Search for Principle’ (1983) Proceeedings of the British Academy 169, at 171. This is an amplification of Dicey’s remark that ‘[b]y adequate study and careful thought whole departments of law can . . . be reduced to order and exhibited under the form of a few principles which sum up the effect of a hundred cases . . .’. A. Dicey, Can English Law be taught at the Universities? (1883) 20.

Result:

<biblStruct>
	<monogr>
		<title level="m" type="main">Proceeedings of the British Academy 169, at 171. This is an amplification of Dicey&#8217;s remark that &#8216;[b]y adequate study and careful thought whole departments of law can . . . be reduced</title>
		<author>
			<persName><forename type="first">R</forename><surname>Goff</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1983">1983</date>
		</imprint>
	</monogr>
	<note>The Search for Principle. to order and exhibited under the form of a few principles which sum up the effect of a hundred cases . . .&#8217;. A. Dicey, Can English Law be taught at the Universities? (1883) 20</note>
</biblStruct>

It recognizes the first reference okay-ish but does not know how to deal with the comment on Dicey. The second reference is just put into a note.

Example 2: German footnote

11 Dazu Grimshau, Comparative Sociology - In What Ways Different From Other Sociologies?, in: Armer/Grimshaw 3 (18). Auch der Oberbegriff „cross System comparison" wird vorgeschlagen, Tomasson, Introduction; Comparative Sociology — The State of the Art, in: Tomasson (Hrsg.), Comparative Studies in Sociology Vol. 1 (Greenwich, Conn. 1978) 1. — Über die Methoden interkultureller und internationaler Vergleiche ist inzwischen so viel geschrieben worden, daß nicht nur die Fülle des Materials schon wieder abschreckend wirkt, sondern daß es auch genügt, im Rahmen dieses Aufsatzes nur einige wichtige Aspekte anzusprechen. Bibliographien finden sich etwa bei Rokkan/Verba/Viet/Almasy 117 ff.; Vallier 423 ff.; Almasy/Balandier/Delatte, Comparative Survey Analysis — An Annotated Bibliography 1967 — 1973 (Beverly Hills, London 1976) sowie bei Marsh, Comparative Sociology (New York, Chicago, San Francisco, Atlanta 1967) 375 ff.

Grobid does not know what to do with that kind of footnote from hell:

<biblStruct>
	<analytic>
		<title level="a" type="main">&#220;ber die Methoden interkultureller und internationaler Vergleiche ist inzwischen so viel geschrieben worden, da&#223; nicht nur die F&#252;lle des Materials schon wieder abschreckend wirkt, sondern da&#223; es auch gen&#252;gt, im Rahmen dieses Aufsatzes nur einige wichtige Aspekte anzusprechen. Bibliographien finden sich etwa bei Rokkan/Verba/Viet/Almasy 117</title>
		<author>
			<persName><forename type="first">Dazu</forename><surname>Grimshau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Comparative Survey Analysis - An Annotated Bibliography 1967 - 1973</title>
		<title level="s">Comparative Studies in Sociology</title>
		<editor>
			<persName><forename type="first">/</forename><surname>Almasy</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">/</forename><surname>Balandier</surname></persName>
		</editor>
		<editor>
			<persName><surname>Delatte</surname></persName>
		</editor>
		<meeting><address><addrLine>Greenwich, Conn; Beverly Hills, London; New York, Chicago, San Francisco, Atlanta</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1967">1978. 1976. 1967</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page">375</biblScope>
		</imprint>
	</monogr>
	<note>sowie bei Marsh, Comparative Sociology</note>
</biblStruct>

Here is a gold standard that shows how the result should (more or less) look, generated from this hand-annotated <bibl>.

Now the question is: do you think Grobid can be trained to recognize these kinds of messy patterns as it is (requiring only the right kind of training data), or would that require changing the code?

@cboulanger
Author

cboulanger commented Oct 4, 2024

Maybe this makes my point clearer: Here's an example with two input strings and the parsed output biblStruct (ignore the particular markup, I am not yet sure how to best encode this with TEI):

        <listBibl type="footnote" n="21">
            <desc type="input-full">21 C. Pateman, The Sexual Contract (1988). For further analysis of the marriage
                contract, see K. O&#8217;Donovan, Family Matters (1993), especially 43&#8211;59.
            </desc>
            <desc type="input-segmented">
                <bibl>21 C. Pateman, The Sexual Contract (1988).</bibl>
                <desc>For further analysis of the marriage contract, see</desc>
                <bibl>K. O&#8217;Donovan, Family Matters (1993), especially 43&#8211;59.</bibl>
            </desc>
            <biblStruct xmlns="http://www.tei-c.org/ns/1.0" n="26">
                <monogr>
                    <title level="m">The Sexual Contract</title>
                    <author>
                        <persName>
                            <forename>C.</forename>
                            <surname>Pateman</surname>
                        </persName>
                    </author>
                    <imprint>
                        <date>1988</date>
                    </imprint>
                </monogr>
            </biblStruct>
            <biblStruct xmlns="http://www.tei-c.org/ns/1.0" n="27">
                <monogr>
                    <title level="m">Family Matters</title>
                    <author>
                        <persName>
                            <forename>K.</forename>
                            <surname>O&#8217;Donovan</surname>
                        </persName>
                    </author>
                    <imprint>
                        <date>1993</date>
                    </imprint>
                </monogr>
            </biblStruct>
        </listBibl>

<desc type="input-full"> contains the raw string, with two references and commentary. In order for Grobid to be able to parse it, I need to extract the <bibl> elements from <desc type="input-segmented"> and have them parsed individually. I therefore need a way to get from the full raw string to a segmented string, in which the parseable elements are isolated and the noise removed. Do you think this requires adding an additional segmentation model to Grobid, and is that something you would be interested in adding?
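To illustrate what I mean by segmentation, here is a deliberately crude sketch (the regexes and the function name are made up; a real solution would presumably need a trained sequence-labelling model rather than heuristics like this):

```python
import re

# Classify a fragment as a reference if it contains a parenthesized
# four-digit year, e.g. "(1988)". Purely illustrative.
YEAR = re.compile(r"\(\d{4}\)")

def split_footnote(raw):
    """Split a footnote on sentence boundaries that follow ")." (the
    typical end of a "(1988)."-style imprint) and separate reference-like
    fragments from commentary."""
    fragments = re.split(r"(?<=\)\.)\s+(?=[A-Z])", raw.strip())
    refs = [f for f in fragments if YEAR.search(f)]
    noise = [f for f in fragments if not YEAR.search(f)]
    return refs, noise
```

On the Pateman example above this at least separates the first reference from the rest, but it would obviously fail on the German "footnote from hell".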

@lfoppiano
Collaborator

Hi @cboulanger, and sorry for the late answer. It took me quite a bit to get an idea of the type of data you are dealing with. I am, however, not familiar with the digital-humanities type of citations.

IMHO, you could first assess how much impact the addition of your training data has on the citation model. I fear it might well reduce the overall model performance.
However, it might be more pragmatic to train a specialized model for humanistic citations from scratch and evaluate how it performs on its own. This could also become a citation-model flavor that could be triggered for specific documents via #1151.

Regarding your question about segmenting the messy citation into blocks before parsing them: from what I've seen in your examples, IMHO you might get away without this additional step, but I'm not sure how variable and messy the data can get. 😄

Finally, regarding the recognition of footnote citations, I would vote for updating the training data as Patrice suggested in an earlier comment; however, this would require considerable effort and time.
