Many cultural institutions have made large digitized visual collections available online, often under permissive re-use licences. Creating interfaces for exploring and searching these collections is difficult, particularly in the absence of granular metadata. In this paper, we introduce a method for using state-of-the-art multimodal large language models (LLMs) to enable an open-ended, explainable search and discovery interface for visual collections. We show how our approach can create novel clustering and recommendation systems that avoid common pitfalls of methods based directly on visual embeddings. Of particular interest is the ability to offer concrete textual explanations of each recommendation without the need to preselect the features of interest. Together, these features can create a digital interface that is more open-ended and flexible while also being better suited to addressing privacy and ethical concerns. Through a case study using a collection of documentary photographs, we provide several metrics showing the efficacy and possibilities of our approach.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper102.md b/content/papers/paper102.md
new file mode 100644
index 0000000..c2ccab0
--- /dev/null
+++ b/content/papers/paper102.md
@@ -0,0 +1,56 @@
+
+
+
+
Greatest Hits Versus Deep Cuts: Exploring Variety in Setlists Across Artists and Musical Genres
Live music concert analysis provides an opportunity to explore cultural and historical trends. The art of set-list construction, deciding which songs to play, involves many considerations for an artist, and the question of how much variety different artists play is an interesting one. Online communities provide rich crowd-sourced encyclopaedic repositories of live concert set-list data, facilitating quantitative analysis of live music concerts. In this paper, we describe the acquisition and processing of musical artists' tour histories and propose an approach for analysing and exploring the notion of variety at the level of individual tours, at the level of artist careers, and for comparisons across a corpus of artists from different musical genres. We propose the notions of a shelf and a tail as means of exploring tour variety, and show how they can be used to define a single metric of variety at both tour and artist level. Our analysis highlights the wide diversity among artists in terms of their inclination toward variety, whilst correlation analysis demonstrates that our measure of variety remains robust across differing artist attributes, such as the number of tours and show lengths.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper104.md b/content/papers/paper104.md
new file mode 100644
index 0000000..31093c4
--- /dev/null
+++ b/content/papers/paper104.md
@@ -0,0 +1,56 @@
+
+
+
+
Textual Transmission without Borders: Multiple Multilingual Alignment and Stemmatology of the 'Lancelot en prose'
+
+
+
(long paper)
+
Authors: Lucence Ing, Matthias Gille Levenson and Jean-Baptiste Camps
This study focuses on the problem of multilingual medieval text alignment, which presents specific challenges due to the absence of modern punctuation in the texts and the non-standard forms of medieval languages. In order to align several witnesses from the multilingual tradition of the prose Lancelot, we first develop an automatic text segmenter based on BERT and then align the produced segments using Bertalign. This alignment is then used to produce stemmatological hypotheses using phylogenetic methods. The aligned sequences are clustered independently by two human annotators and a clustering algorithm (DBSCAN), and the resulting variant tables are submitted to maximum parsimony analysis in order to produce trees. The trees are then compared and discussed in light of philological knowledge. Results tend to show that automatically clustered sequences can provide results comparable to those of human annotation.
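
As a rough illustration of the clustering step, the sketch below groups variant readings with DBSCAN over multilingual sentence embeddings; the model name and the readings are placeholders, not the authors' data or code.

```python
# Illustrative only: cluster variant readings via multilingual sentence
# embeddings + DBSCAN. Model name and readings are placeholder assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

readings = ["si grant doel", "si grant duel", "tel duel", "moult grant doel"]
model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
embeddings = model.encode(readings)

labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(embeddings)
print(labels)  # readings sharing a label count as one variant in the table
```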
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper106.md b/content/papers/paper106.md
new file mode 100644
index 0000000..b822d06
--- /dev/null
+++ b/content/papers/paper106.md
@@ -0,0 +1,56 @@
+
+
+
+
Patterns of Quality: Comparing Reader Reception Across Fanfiction and Published Literature
+
+
+
(long paper)
+
Authors: Mia Jacobsen, Pascale Moreira, Kristoffer Nielbo and Yuri Bizzoni
Recent work on the textual features linked to literary quality has primarily focused on commercially published literature, such as canonical or best-selling novels, which are systematically filtered by editorial and market mechanisms. However, the biggest repositories of fiction texts currently in existence are free fanfiction websites, where fans post fictional stories about their favorite characters for the pleasure of writing and engaging with others. This makes them a particularly interesting domain in which to study the patterns of perceived quality ``in the wild'', where text-reader relations are less filtered. Moreover, since fanfiction is a community-built domain with its own conventions, comparing it to published literature can more generally provide insights into the reception and perceived quality of published literature itself. Taking a novel approach to the study of fanfiction, we observe whether three textual features associated with perceived literary quality in published texts are also relevant in the context of fanfiction. Using different reception proxies, we find that despite the differences between fanfiction and published literature, some ``patterns of quality'' associated with positive reception appear to have similar effects in both of these contexts of literary production.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper110.md b/content/papers/paper110.md
new file mode 100644
index 0000000..9849dce
--- /dev/null
+++ b/content/papers/paper110.md
@@ -0,0 +1,56 @@
+
+
+
+
Steps Towards Mining Manuscript Images for Untranscribed Texts: A Case Study from the Syriac Collection at the Vatican Library
+
+
+
(long paper)
+
Authors: Luigi Bambaci, George Kiraz, Christine M. Roughan, Daniel Stökl Ben Ezra and Matthieu Freyder
Digital libraries and databases of texts are invaluable resources for researchers, yet their reliance on printed editions can lead to significant gaps and potentially exclude works without printed reproductions. The Simtho database of Syriac serves as a pertinent example: it is derived primarily from OCR of scholarly editions, but how representative are these of the language's extensive literary tradition, transmitted and preserved in manuscript form for centuries? Taking the Simtho database and a selection of the Vatican Library's Syriac manuscript collection as a case study, we propose a pipeline that aligns a corpus of e-texts with a set of digitised manuscript images, in order to ascertain the presence or absence of texts between the e-text and manuscript corpora and thus contribute to their enrichment. We delve into the complexities of this task, evaluating both effective tools for alignment and approaches to detect factors that can contribute to alignment failures. This case study is intended as a first step towards foundational methodologies applicable to larger-scale manuscript processing efforts.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper119.md b/content/papers/paper119.md
new file mode 100644
index 0000000..57abeca
--- /dev/null
+++ b/content/papers/paper119.md
@@ -0,0 +1,56 @@
+
+
+
+
On Classification with Large Language Models in Cultural Analytics
+
+
+
(long paper)
+
Authors: David Bamman, Kent Chang, Li Lucy and Naitian Zhou
In this work, we survey the way in which classification is used as a sensemaking practice in cultural analytics, and assess where large language models can fit into this landscape. We identify ten tasks supported by publicly available datasets on which we empirically assess the performance of LLMs compared to traditional supervised methods, and explore the ways in which LLMs can be employed for sensemaking goals beyond mere accuracy. We find that prompt-based LLMs are competitive with traditional supervised models for established tasks, but perform less well on de novo tasks. In addition, LLMs can assist sensemaking by acting as an intermediary input to formal theory testing.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper121.md b/content/papers/paper121.md
new file mode 100644
index 0000000..4ccb736
--- /dev/null
+++ b/content/papers/paper121.md
@@ -0,0 +1,56 @@
+
+
+
+
Promises from an Inferential Approach in Classical Latin Authorship Attribution
Applying stylometry to authorship attribution requires distilling the elements of an author's style sufficient to recognise their mark in anonymous documents. Often, this is accomplished by contrasting the frequency of selected features across the authors' works. A recent approach, CP2D, uses innovation processes to infer the author's identity, accounting for their propensity to introduce new elements. In this paper, we apply CP2D to a corpus of Classical Latin texts to test its effectiveness in a new context and explore the additional insight it can offer the scholar. We show its effectiveness and how, moving beyond maximum likelihood, we can visualise stylistic relationships and gather additional information on the relationships among documents.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper122.md b/content/papers/paper122.md
new file mode 100644
index 0000000..cba53ee
--- /dev/null
+++ b/content/papers/paper122.md
@@ -0,0 +1,56 @@
+
+
+
+
A Preliminary Analysis of ChatGPT's Poetic Style
+
+
+
(long paper)
+
Authors: Melanie Walsh, Anna Preus and Elizabeth Gronski
Generating poetry has become a popular application of LLMs, perhaps especially of OpenAI's widely-used chatbot ChatGPT. What kind of poet is ChatGPT? Does ChatGPT have its own poetic style? Can it successfully produce poems in different styles? To answer these questions, we prompt the GPT-3.5 and GPT-4 models to generate English-language poems in 24 different poetic forms and styles, about 40 different subjects, and in response to 3 different writing prompt templates. We then analyze the resulting 5.7k poems, comparing them to a sample of 3.7k poems from the Poetry Foundation and the Academy of American Poets. We find that the GPT models, especially GPT-4, can successfully produce poems in a range of both common and uncommon English-language forms in superficial yet noteworthy ways, such as by producing poems of appropriate lengths for sonnets (14 lines), villanelles (19 lines), and sestinas (39 lines). But the GPT models also exhibit their own distinct stylistic tendencies, both within and outside of these specific forms. Our results show that GPT poetry is much more constrained and uniform than human poetry, showing a strong penchant for rhyme, quatrains (4-line stanzas), iambic meter, first-person plural perspectives (we, us, our), and specific vocabulary like ``heart,'' ``embrace,'' ``echo,'' and ``whisper.''
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper123.md b/content/papers/paper123.md
new file mode 100644
index 0000000..a5a16a7
--- /dev/null
+++ b/content/papers/paper123.md
@@ -0,0 +1,56 @@
+
+
+
+
Deciphering Still Life Artworks with Linked Open Data
The still life genre is a good example of how even the simplest elements depicted in an artwork can be carriers of deeper, symbolic meanings that influence the overall artistic interpretation of the work. In this paper, we present an ongoing study on the use of linked open data (LOD) to quantitatively analyze the symbolic meanings of still life paintings. In particular, we propose two different experiments based on (i) the theory of the art historian Bergström, and (ii) the impact of the Floriography movement on still life. To do so, we extract and combine data from Wikidata, HyperReal, IICONGRAPH, and the ODOR dataset. This work shows promising results for the use of LOD in art-historical quantitative research, as we are able to confirm Bergström's theory and to pinpoint outliers in the Floriography context that can be the objects of specific, qualitative studies. We conclude the paper by reflecting on the current limitations surrounding art-historical data.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper124.md b/content/papers/paper124.md
new file mode 100644
index 0000000..4afd4cc
--- /dev/null
+++ b/content/papers/paper124.md
@@ -0,0 +1,56 @@
+
+
+
+
Once More, With Feeling: Measuring Emotion of Acting Performances in Contemporary American Film
Narrative film is a composition of writing, cinematography, editing, and performance. While much computational work has focused on the writing or visual style in film, we conduct in this paper a computational exploration of acting performance. Applying speech emotion recognition models and a variationist sociolinguistic analytical framework to a corpus of popular, contemporary American film, we find narrative structure, diachronic shifts, and genre- and dialogue-based constraints located in spoken performances.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper128.md b/content/papers/paper128.md
new file mode 100644
index 0000000..455bedc
--- /dev/null
+++ b/content/papers/paper128.md
@@ -0,0 +1,56 @@
+
+
+
+
Automatic Translation Alignment Pipeline for Multilingual Digital Editions of Literary Works
This paper investigates the application of translation alignment algorithms in the creation of a Multilingual Digital Edition (MDE) of Alessandro Manzoni's Italian novel I promessi sposi (``The Betrothed''), with translations in eight languages (English, Spanish, French, German, Dutch, Polish, Russian and Chinese) from the 19th and 20th centuries. We identify key requirements for the MDE to improve both the reader experience and support for translation studies. Our research highlights the limitations of current state-of-the-art algorithms when applied to the translation of literary texts and outlines an automated pipeline for MDE creation. This pipeline transforms raw texts into web-based, side-by-side representations of original and translated texts with different rendering options. In addition, we propose new metrics for evaluating the alignment of literary translations and suggest visualization techniques for future analysis.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper13.md b/content/papers/paper13.md
new file mode 100644
index 0000000..42a46e5
--- /dev/null
+++ b/content/papers/paper13.md
@@ -0,0 +1,56 @@
+
+
+
+
Beyond the Register: Demographic Modeling of Arrest Patterns in 1879-1880 Brussels
+
+
+
(long paper)
+
Authors: Folgert Karsdorp, Mike Kestemont and Margo De Koster
Unseen species models from ecology have recently been applied to censored historical cultural datasets to estimate unobserved populations. We extend this approach to historical criminology, analyzing the police registers of Brussels' Amigo prison (1879-1880) using the Generalized Chao estimator. Our study aims to quantify the `dark number' of unarrested perpetrators and model demographic biases in policing efforts. We investigate how factors such as age, gender, and origin influence arrest vulnerability. While all examined covariates contribute positively to our model, their small effect sizes limit the model's predictive performance. Our findings largely align with prior historical scholarship but suggest that demographic factors alone may insufficiently explain arrest patterns. The Generalized Chao estimator modestly improves population size estimates compared to simpler models. However, our results indicate that more refined models or additional data may be necessary for robust estimates in historical criminological studies. This work contributes to the growing field of computational methods in humanities research and offers insights into the challenges of quantifying hidden populations in historical datasets.
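
For readers unfamiliar with this family of estimators, the sketch below computes the classic bias-corrected Chao1 lower bound, a simpler relative of the Generalized Chao estimator used in the paper; the arrest counts are invented for illustration.

```python
# Minimal sketch of the classic Chao1 estimator (a simpler relative of the
# Generalized Chao estimator), applied to hypothetical arrest counts: how many
# distinct perpetrators, arrested or not, were there in total?
from collections import Counter

def chao1(counts):
    """Bias-corrected Chao1 lower bound on total population size."""
    s_obs = len(counts)          # distinct individuals observed in the register
    freq = Counter(counts)
    f1 = freq.get(1, 0)          # individuals arrested exactly once
    f2 = freq.get(2, 0)          # individuals arrested exactly twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical register: arrest counts per observed individual.
arrests_per_person = [1, 1, 1, 1, 2, 2, 3, 1, 1, 5]
print(chao1(arrests_per_person))  # estimate includes the unarrested 'dark number'
```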
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper130.md b/content/papers/paper130.md
new file mode 100644
index 0000000..0c27902
--- /dev/null
+++ b/content/papers/paper130.md
@@ -0,0 +1,56 @@
+
+
+
+
Characterizing the Subversion of Social Relationships on Television
Television is often seen as a site for subcultural identification and subversive fantasy, including in queer cultures. How might we measure subversion, or the degree to which the depiction of a social relationship within a dyad (e.g., two characters who are colleagues) deviates from its typical representation on TV? To explore this question, we introduce the task of stereotypic relationship extraction. Building on cognitive stylistics, linguistic anthropology, and dialogue relation extraction, we attempt to model the cognitive process of stereotyping TV characters in dialogic interactions: given a dyad, we want to predict what social relationship the speakers exhibit through their words. Subversion is then characterized by the discrepancy between the distribution of the model's predictions and the ground truth labels. To demonstrate the usefulness of this task and gesture at a methodological intervention, we include four case studies characterizing the representation of queer relationalities in The Big Bang Theory, Frasier, and Gilmore Girls, as we explore suspicious and reparative modes of reading with our computational methods.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper132.md b/content/papers/paper132.md
new file mode 100644
index 0000000..98b2c8d
--- /dev/null
+++ b/content/papers/paper132.md
@@ -0,0 +1,56 @@
+
+
+
+
Treating Games as Plays? Computational Approaches to the Detection of Scenes in Game Dialogs
+
+
+
(short paper)
+
Authors: Martin Schlenk, Thomas Efer and Manuel Burghardt
Digital games are a complex multimodal phenomenon that is examined in a variety of ways by the highly interdisciplinary field of game studies. In this article, we focus on the structural aspect of the diegetic language of games and examine the extent to which established methods of computational drama analysis can also be successfully applied to digital games. Initial experiments show that both games and drama texts have an inventory of characters that drive the plot forward. In dramas, this plot is usually subdivided into individual acts and scenes. In games, however, such systematic segmentation is the exception rather than the rule, and where it is present, it is implemented very differently from game to game. In this paper, we therefore focus on exploring alternative ways of making scene-like structures in game dialogs identifiable with the help of computers. These experiments open up exciting future perspectives and raise the question of whether computer-aided methods of scene recognition, inspired by media such as games and films, can in turn be applied to classical dramas in order to fundamentally re-examine their historical-editorial scene classification.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper135.md b/content/papers/paper135.md
new file mode 100644
index 0000000..218736b
--- /dev/null
+++ b/content/papers/paper135.md
@@ -0,0 +1,56 @@
+
+
+
+
Early Modern Book Catalogues and Multilingualism: Identifying Multilingual Texts and Translations Using Titles
With this paper we aim to assess whether Early Modern book titles can be exploited to track two aspects of multilingualism in book publishing: publications featuring multiple languages and the distinction between editions of works in their original language and in translation. To this end, we leverage the manually annotated language information available in two book catalogs: the Collectio Academica Antiqua, recording publications of scholars of the Old University of Leuven (1425-1797), and a subset of the Eighteenth Century Collections Online, namely publications of Ancient Greek and Latin works. We evaluate three different approaches: we train a simple tf-idf based support vector classifier, we fine-tune a multilingual transformer model (BERT), and we use a few-shot approach with a pre-trained sentence transformer model. In order to get a better understanding of the results, we make use of SHAP, a library for explaining the output of any machine learning model. We conclude that while few-shot prediction is not currently usable for this task, the tf-idf approach and BERT fine-tuning are comparable and both usable. BERT shows better results for the task of identifying translations and when generalizing across different datasets.
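
A minimal sketch of what such a tf-idf based support vector classifier can look like in scikit-learn; the titles, labels, and n-gram settings below are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of the tf-idf + linear SVM baseline; data is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

titles = ["Disputatio de anima", "A Treatise concerning Government",
          "Oratio graeco-latina in laudem urbis"]
labels = ["original", "translation", "multilingual"]   # hypothetical classes

clf = make_pipeline(
    # Character n-grams are robust for short, multilingual titles.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
clf.fit(titles, labels)
print(clf.predict(["Institutiones linguae graecae"]))
```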
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper137.md b/content/papers/paper137.md
new file mode 100644
index 0000000..99eaf74
--- /dev/null
+++ b/content/papers/paper137.md
@@ -0,0 +1,56 @@
+
+
+
+
On the Unity of Literary Change: The Development of Emotions in German Poetry, Prose, and Drama between 1850 and 1920 as a Test Case
+
+
+
(long paper)
+
Authors: Leonard Konle, Merten Kröncke, Fotis Jannidis and Simone Winko
In this study, we use the development of emotions in German-language poetry, drama, and prose from 1850 to 1920 to informally test three hypotheses about literature: (1) Literature is a unified field, and therefore genres develop similarly at the same time. (2) The development of literature is led by one genre while the others follow. (3) The three main genres have very different developments without any relation to each other. We look at the development of emotions in these genres in general, and then at more fine-grained levels: polarity, six groups of emotions, and the group of love emotions. In the end, our data cannot confirm any of these hypotheses, but do show a closer relationship between poetry and prose, while drama shows a very distinct development. Only in some specific cases, such as the representation of lust and of love, can we see a closer relationship between the genres in general.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper141.md b/content/papers/paper141.md
new file mode 100644
index 0000000..91a81b7
--- /dev/null
+++ b/content/papers/paper141.md
@@ -0,0 +1,56 @@
+
+
+
+
Computational Segmentation of Wayang Kulit Video Recordings using a Cross-Attention Temporal Model
+
+
+
(short paper)
+
Authors: Shawn Hong Wei Liew and Miguel Escobar Varela
We report preliminary findings on a novel approach to automatically segment Javanese wayang kulit (traditional leather puppet) performances using computational methods. We focus on identifying comic interludes, which have been the subject of scholarly debate regarding their increasing duration. Our study employs action segmentation techniques from a Cross-Attention Temporal Model, adapting methods from computer vision to the unique challenges of wayang kulit videos. We manually labelled 100 video recordings of performances to create a dataset for training and testing our model. These videos, which are typically 7 hours long, were sampled from our comprehensive dataset of 12,638 videos uploaded to a video platform between 03 Jun 2012 and 30 Dec 2023. The resulting algorithm achieves an accuracy of 89.06 % in distinguishing between comic interludes and regular performance segments, with F1-scores of 96.53 %, 95.91 %, and 92.47 % at overlapping thresholds of 10 %, 25 %, and 50 % respectively. This work demonstrates the potential of computational approaches in analyzing traditional performing arts and other video material, offering new tools for quantitative studies of audiovisual cultural phenomena, and provides a foundation for future empirical research on the evolution of wayang kulit performances.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper15.md b/content/papers/paper15.md
new file mode 100644
index 0000000..5a39552
--- /dev/null
+++ b/content/papers/paper15.md
@@ -0,0 +1,56 @@
+
+
+
+
Abbreviation Application: A Stylochronometric Study of Abbreviations in the Oeuvre of Herne’s Speculum Scribe
This research examines the Carthusian monastery of Herne, a major cultural hotspot during the Middle Ages. Between 1350 and 1400, the monks residing in Herne produced an impressive 46 production units, with 40 of them written in the Middle Dutch vernacular. Focusing on the monastery's most productive scribe, known as the Speculum Scribe, this case study employs methods from the field of scribal modelling to achieve two main objectives: first, to evaluate the potential for chronologically ordering the Speculum Scribe’s works based on his use of abbreviations, and second, to investigate whether there was a convergence in scribal practices, such as the use of abbreviations, among the scribes living in Herne. Although a complete chronological order of the Speculum Scribe's works could not be determined, we were able to establish his first work. Furthermore, the findings show evidence that cautiously supports the second goal, suggesting that the scribes in Herne indeed converged in their scribal habits by learning from each other.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper17.md b/content/papers/paper17.md
new file mode 100644
index 0000000..6f74b59
--- /dev/null
+++ b/content/papers/paper17.md
@@ -0,0 +1,56 @@
+
+
+
+
Integrating Visual and Textual Inputs for Searching Large-Scale Map Collections with CLIP
Despite the prevalence and historical importance of maps in digital collections, current methods of navigating and exploring map collections are largely restricted to catalog records and structured metadata. In this paper, we explore the potential for interactively searching large-scale map collections using natural language inputs (``maps with sea monsters''), visual inputs (i.e., reverse image search), and multimodal inputs (an example map + ``more grayscale''). As a case study, we adopt 562,842 images of maps publicly accessible via the Library of Congress's API. To accomplish this, we use the multimodal Contrastive Language-Image Pre-training (CLIP) machine learning model to generate embeddings for these maps, and we develop code to implement exploratory search capabilities with these input strategies. We present results for example searches created in consultation with staff in the Library of Congress's Geography and Map Division and describe the strengths, weaknesses, and possibilities for these search queries. Moreover, we introduce a fine-tuning dataset of 10,504 map-caption pairs, along with an architecture for fine-tuning a CLIP model on this dataset. To facilitate re-use, we provide all of our code in documented, interactive Jupyter notebooks and place all code into the public domain. Lastly, we discuss the opportunities and challenges for applying these approaches across both digitized and born-digital collections held by galleries, libraries, archives, and museums.
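
The core retrieval mechanism can be sketched as follows with an off-the-shelf CLIP model; the checkpoint and file names are placeholders rather than the authors' released notebooks.

```python
# Hedged sketch of CLIP-based natural-language search over a map collection.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["map_001.jpg", "map_002.jpg"]]  # hypothetical files
with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=["maps with sea monsters"],
                                                  return_tensors="pt", padding=True))

# Cosine similarity ranks the collection against the text query.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (img_emb @ txt_emb.T).squeeze(1)
print(scores.argsort(descending=True))  # indices of best-matching maps
```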
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper18.md b/content/papers/paper18.md
new file mode 100644
index 0000000..67ca8d1
--- /dev/null
+++ b/content/papers/paper18.md
@@ -0,0 +1,56 @@
+
+
+
+
Quantifying Linguistic and Cultural Change in China, 1900-1950
This paper presents a quantitative approach to studying linguistic and cultural change in China during the first half of the twentieth century, a period that remains understudied in computational humanities research. The dramatic changes in Chinese language and culture during this time call for greater reflection on the tools and methods used for text analysis. This preliminary study offers a framework for analyzing Chinese texts from the late nineteenth and twentieth centuries, demonstrating how established methods such as word counts and word embeddings can provide new historical insights into the complex negotiations between Western modernity and Chinese cultural discourse.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper19.md b/content/papers/paper19.md
new file mode 100644
index 0000000..2ff4fb6
--- /dev/null
+++ b/content/papers/paper19.md
@@ -0,0 +1,56 @@
+
+
+
+
Literary Time Travel: Distinguishing Past and Contemporary Worlds in Danish and Norwegian Fiction
+
+
+
(long paper)
+
Authors: Jens Bjerring-Hansen, Ali Al-Laith, Daniel Hershcovich, Alexander Conroy and Sebastian Ørtoft Rasmussen
The classification of historical and contemporary novels is a nuanced task that has traditionally relied on expert literary analysis. This paper introduces a novel dataset comprising Danish and Norwegian novels from the last 30 years of the 19th century, annotated by literary scholars to distinguish between historical and contemporary works. While this manual classification is time-consuming and subjective, our approach leverages pre-trained language models to streamline and potentially standardize this process. We evaluate their effectiveness in automating this classification by examining their performance on titles and the first few sentences of each novel. After fine-tuning, the models show good performance but fail to fully capture the nuanced understanding exhibited by literary scholars. This research underscores the potential and limitations of NLP in literary genre classification and suggests avenues for further improvement, such as incorporating more sophisticated model architectures or hybrid methods that blend machine learning with expert knowledge. Our findings contribute to the broader field of computational humanities by highlighting the challenges and opportunities in automating literary analysis.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper20.md b/content/papers/paper20.md
new file mode 100644
index 0000000..1dceb1a
--- /dev/null
+++ b/content/papers/paper20.md
@@ -0,0 +1,56 @@
+
+
+
+
Viability of Zero-Shot Classification and Search of Historical Photos
+
+
+
(long paper)
+
Authors: Erika Maksimova, Mari-Anna Meimer, Mari Piirsalu and Priit Järv
Multimodal neural networks are models that learn concepts in multiple modalities. Such models can perform tasks like zero-shot classification: associating images with textual labels without task-specific training. This promises both easier and more flexible use of digital photo archives, e.g. for annotating and searching. We investigate whether existing multimodal models can perform these tasks when the data differs from typical computer vision training sets: historical photos from a cultural context outside the English-speaking world.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper21.md b/content/papers/paper21.md
new file mode 100644
index 0000000..8b26a1d
--- /dev/null
+++ b/content/papers/paper21.md
@@ -0,0 +1,56 @@
+
+
+
+
The Birth of French Orthography: A Computational Analysis of French Spelling Systems in Diachrony
The 17th century is crucial for the French language, as it sees the creation of a strict orthographic norm that largely persists to this day. Despite its significance, the history of spelling systems remains an overlooked area in French linguistics, for two reasons. On the one hand, spelling is made up of micro-changes, which requires a quantitative approach; on the other hand, no suitable corpus is available, owing to editorial interventions in almost all the texts already published. In this paper, we therefore propose a new corpus allowing such a study, as well as the extraction and analysis tools necessary for our research. By comparing the text extracted with OCR and a version automatically aligned with contemporary French spelling, we extract variant zones, categorise these variants, and observe their frequency to study (ortho)graphic change during the 17th century.
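
A minimal sketch of variant-zone extraction by aligning an OCR'd text with its modernised counterpart, here using Python's standard difflib; the example strings are invented, and the authors' actual alignment tooling may differ.

```python
# Sketch: extract variant zones where the 17th-century spelling diverges from
# the automatically modernised version. Example strings are invented.
from difflib import SequenceMatcher

ocr = "ie scay bien que vous estes mon amy".split()
mod = "je sais bien que vous etes mon ami".split()

matcher = SequenceMatcher(None, ocr, mod)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":  # a variant zone: old spelling vs. modern spelling
        print(ocr[i1:i2], "->", mod[j1:j2])
```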
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper30.md b/content/papers/paper30.md
new file mode 100644
index 0000000..ab342fc
--- /dev/null
+++ b/content/papers/paper30.md
@@ -0,0 +1,56 @@
+
+
+
+
Does Context Matter? Enhancing Handwritten Text Recognition with Metadata in Historical Manuscripts
The digitization of historical manuscripts has significantly advanced in recent decades, yet many documents remain as images without machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into text, facilitating large-scale analysis of historical collections. In 2024, the CATMuS Medieval dataset was released, featuring extensive diachronic coverage and a variety of languages and script types. Previous research indicated that model performance degraded on the best manuscripts over time as more data was incorporated, likely due to over-generalization. This paper investigates the impact of incorporating contextual metadata in training HTR models using the CATMuS Medieval dataset to mitigate this effect. Our experiments compare the performance of various model architectures, focusing on Conformer models with and without contextual inputs, as well as Conformer models trained with auxiliary classification tasks. Results indicate that Conformer models utilizing semantic contextual tokens (Century, Script, Language) outperform baseline models, particularly on challenging manuscripts. The study underscores the importance of metadata in enhancing model accuracy and robustness across diverse historical texts.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper35.md b/content/papers/paper35.md
new file mode 100644
index 0000000..e284f5e
--- /dev/null
+++ b/content/papers/paper35.md
@@ -0,0 +1,56 @@
+
+
+
+
Enhancing Arabic Maghribi Handwritten Text Recognition with RASAM 2: A Comprehensive Dataset and Benchmarking
+
+
+
(long paper)
+
Authors: Chahan Vidal-Gorène, Clément Salah, Noëmie Lucas, Aliénor Decours-Perez and Antoine Perrier
Recent advancements in handwritten text recognition (HTR) for historical documents have demonstrated high performance on cursive Arabic scripts, achieving accuracy comparable to Latin scripts. The initial RASAM dataset, focused on three Arabic Maghribi manuscripts, facilitated rapid coverage of new documents via fine-tuning. However, HTR application for Arabic scripts remains constrained due to the vast diversity in spellings, ambiguities, and languages. To overcome these challenges, we present RASAM 2, an extended dataset with 3,750 lines from 15 manuscripts in the BULAC library, showcasing various hands, layouts, and texts in Arabic Maghribi script. RASAM 2 aims to establish a new benchmark for HTR model training for both Maghribi and Oriental scripts, covering text recognition and layout analysis. Preliminary experiments using a word-based CRNN approach indicate significant model versatility, with a nearly 40 % reduction in Character Error Rate (CER) across new in-domain and out-of-domain manuscripts.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper36.md b/content/papers/paper36.md
new file mode 100644
index 0000000..33961a3
--- /dev/null
+++ b/content/papers/paper36.md
@@ -0,0 +1,56 @@
+
+
+
+
Global Coherence, Local Uncertainty: Towards a Theoretical Framework for Assessing Literary Quality
+
+
+
(short paper)
+
Authors: Yuri Bizzoni, Pascale Moreira and Kristoffer Nielbo
A theoretical framework for evaluating literary quality through analyzing narrative structures using simplified narrative representations in the form of story arcs is presented. This framework proposes two complementary models: the first employs Approximate Entropy to measure local unpredictability, while the second utilizes fractal analysis to assess global coherence. When applied to a substantial corpus of 9,089 novels, the findings indicate that narratives characterized by high literary quality, as indicated by reader ratings, exhibit a balance of local unpredictability and global coherence. This dual approach provides a formal and empirical basis for assessing literary quality and emphasizes the importance of considering intrinsic properties and reader perception in literary studies.
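
As a hedged illustration of the first model, the sketch below computes Approximate Entropy for a toy story arc; the parameters m and r follow common defaults and are not necessarily the authors' settings.

```python
# Minimal NumPy sketch of Approximate Entropy (local unpredictability) applied
# to a toy sentiment arc; m and r are common defaults, not the paper's values.
import numpy as np

def approx_entropy(x, m=2, r=0.2):
    x = np.asarray(x, dtype=float)
    r = r * x.std()                      # tolerance scaled to the arc's variability
    def phi(m):
        n = len(x) - m + 1
        windows = np.array([x[i:i + m] for i in range(n)])
        # Fraction of window pairs within tolerance r (Chebyshev distance).
        dists = np.max(np.abs(windows[:, None] - windows[None, :]), axis=2)
        return np.mean(np.log(np.mean(dists <= r, axis=1)))
    return phi(m) - phi(m + 1)

arc = np.sin(np.linspace(0, 6 * np.pi, 200)) + 0.1 * np.random.randn(200)
print(approx_entropy(arc))  # higher values = more locally unpredictable arc
```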
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper39.md b/content/papers/paper39.md
new file mode 100644
index 0000000..7ec7060
--- /dev/null
+++ b/content/papers/paper39.md
@@ -0,0 +1,56 @@
+
+
+
+
Epistemic Capture through Specialization in Post-World War II Parliamentary Debate
This study examines specialization in Dutch Lower House debates between 1945 and 1994, and how specialization translates into the phenomenon of ``epistemic capture'' in democratic politics. We combine topic modeling, network analysis, and community detection to complement lexical ``distant reading'' approaches to the history of political ideas with a network-based analysis that illuminates political-intellectual processes. We demonstrate how the breadth of political debate declines as its specialist depth increases. To study this transformation, we take a multi-level approach. At the (institutional) macro-level, we find an increase in the modularity of topic linkage networks, indicating growing specialization post-1960 linked to institutional reforms. At the (political) meso-level, we similarly observe specialization in node neighborhood stability, but also variation as a consequence of ideological and party-political change. Lastly, micro-level analysis reveals persistent thematic communities tied to increasingly stable groups of individuals, revealing how policy domains and politicians are captured in ossified specialisms. As such, this study provides new insights into the development of twentieth-century political debate and the emergent tensions between pluralism and specialism.
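
The macro-level measurement can be illustrated with networkx: build a topic linkage network and compute the modularity of its detected communities. The toy edges below are invented; the paper's networks are derived from topic models of the debates.

```python
# Illustrative modularity computation on a toy topic linkage network, where
# edge weights count how often two topics co-occur in the same debate.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.Graph()
G.add_weighted_edges_from([
    ("housing", "urban planning", 5),
    ("housing", "finance", 1),
    ("finance", "taxation", 6),
    ("taxation", "agriculture", 1),
    ("agriculture", "fisheries", 4),
])

communities = greedy_modularity_communities(G, weight="weight")
print(modularity(G, communities, weight="weight"))  # higher = more specialised debate
```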
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper42.md b/content/papers/paper42.md
new file mode 100644
index 0000000..199e212
--- /dev/null
+++ b/content/papers/paper42.md
@@ -0,0 +1,56 @@
+
+
+
+
Computational Paleography of Medieval Hebrew Scripts
+
+
+
(short paper)
+
Authors: Berat Kurar-Barakat, Daria Vasyutinsky-Shapira, Sharva Gogawale, Mohammad Suliman and Nachum Dershowitz
We present ongoing work as part of an international multidisciplinary project, called MiDRASH, on the computational analysis of medieval manuscripts. We focus here on clustering manuscripts written in Ashkenazi square script using a dataset of 206 pages from 59 manuscripts. Collaborating with expert paleographers, we identified ten critical features and trained a multi-label CNN, achieving high accuracy in feature prediction. This should make it possible to computationally predict the subclusters already known to paleographers and those yet to be discovered. We identified visible clusters using PCA and χ² feature selection. In future work, we aim to enhance feature extraction using deep learning algorithms and provide computational tools to ease paleographers' work. We plan to develop new methodologies for analyzing Hebrew scripts and refining our understanding of medieval Hebrew manuscripts.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper46.md b/content/papers/paper46.md
new file mode 100644
index 0000000..cf224b4
--- /dev/null
+++ b/content/papers/paper46.md
@@ -0,0 +1,56 @@
+
+
+
+
Models of Literary Evaluation and Web 2.0: An Annotation Experiment with Goodreads Reviews
In the context of the Web 2.0, user-generated reviews are becoming more and more prominent. The case of book reviews, often shared through digital social reading platforms such as Goodreads or Wattpad, is of particular interest, in that it offers scholars data on literary reception of unprecedented size and diversity. In this paper, we test whether the evaluative criteria employed in Goodreads reviews can be accommodated within the framework of traditional literary criticism, by combining literary theory and computational methods. Our model, based on the work of von Heydebrand and Winko, is first tested through the practice of heuristic annotation. The generated dataset is then used to train a Transformer-based classifier. Lastly, we compare the performance of the latter with that obtained by instructing a Large Language Model, namely GPT-4.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper49.md b/content/papers/paper49.md
new file mode 100644
index 0000000..5ce7e94
--- /dev/null
+++ b/content/papers/paper49.md
@@ -0,0 +1,56 @@
+
+
+
+
Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media
+
+
+
(long paper)
+
Authors: Ross Deans Kristensen-McLachlan, Rebecca M. M. Hicke, Márton Kardos and Mette Thunø
Does the People's Republic of China (PRC) interfere with European elections through ethnic Chinese diaspora media? This question forms the basis of an ongoing research project exploring how PRC narratives about European elections are represented in Chinese diaspora media, and thus the objectives of PRC news media manipulation. In order to study diaspora media efficiently and at scale, it is necessary to use techniques derived from quantitative text analysis, such as topic modelling. In this paper, we present a pipeline for studying information dynamics in Chinese media. Firstly, we present KeyNMF, a new approach to static and dynamic topic modelling using transformer-based contextual embedding models. We provide benchmark evaluations to demonstrate that our approach is competitive on a number of Chinese datasets and metrics. Secondly, we integrate KeyNMF with existing methods for describing information dynamics in complex systems. We apply this pipeline to data from five news sites, focusing on the period of time leading up to the 2024 European parliamentary elections. Our methods and results demonstrate the effectiveness of KeyNMF for studying information dynamics in Chinese media and lay groundwork for further work addressing the broader research questions.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper52.md b/content/papers/paper52.md
new file mode 100644
index 0000000..b081158
--- /dev/null
+++ b/content/papers/paper52.md
@@ -0,0 +1,56 @@
+
+
+
+
Extracting social connections from Finnish Karelian refugee interviews using LLMs
+
+
+
(long paper)
+
Authors: Joonatan Laato, Jenna Kanerva, John Loehr, Virpi Lummaa and Filip Ginter
We performed a zero-shot information extraction study on a historical collection of 89,339 brief Finnish-language interviews of refugee families relocated post-WWII from Finnish Eastern Karelia. Our research objective is two-fold. First, we aim to extract social organizations and hobbies from the free text of the interviews, separately for each family member. These can act as proxy variables indicating the degree of social integration of refugees in their new environment. Second, we aim to evaluate several alternative ways to approach this task, comparing a number of generative models and a supervised learning approach, to gain a broader insight into the relative merits of these different approaches and their applicability in similar studies. We find that the best generative model (GPT-4) is roughly on par with human performance, at an F-score of 88.8 %. Interestingly, the best open generative model (Llama-3-70B-Instruct) reaches almost the same performance, at an F-score of 87.7 %, demonstrating that open models are becoming a viable alternative for some practical tasks even on non-English data. Additionally, we test a supervised learning alternative, fine-tuning a Finnish BERT model (FinBERT) on GPT-4 generated training data. This method achieves an F-score of 84.1 % with as few as 6K interviews, rising to 86.3 % with 30K interviews. Such an approach is particularly appealing when computational resources are limited, or when there is a substantial mass of data to process.
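
A schematic of the zero-shot generative setup, assuming the OpenAI Python client; the prompt wording, output schema, and example interview are illustrative, not the study's exact protocol.

```python
# Illustrative zero-shot extraction call; prompt and schema are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

interview = "Father was a member of the Karjala choir and enjoyed fishing."
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Extract social organizations and hobbies for each family "
                    "member from the interview. Answer as JSON: "
                    '{"person": {"organizations": [...], "hobbies": [...]}}'},
        {"role": "user", "content": interview},
    ],
)
print(response.choices[0].message.content)
```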
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper55.md b/content/papers/paper55.md
new file mode 100644
index 0000000..71a7a3f
--- /dev/null
+++ b/content/papers/paper55.md
@@ -0,0 +1,56 @@
+
+
+
+
Visual Navigation of Digital Libraries: Retrieval and Classification of Images in the National Library of Norway’s Digitised Book Collection
+
+
+
(short paper)
+
Authors: Marie Roald, Magnus Breder Birkenes and Lars Johnsen
Digital tools for text analysis have long been essential for the searchability and accessibility of digitised library collections. Recent computer vision advances have introduced similar capabilities for visual materials, with deep learning-based embeddings showing promise for analysing visual heritage. Given that many books feature visuals in addition to text, taking advantage of these breakthroughs is critical to making library collections open and accessible. In this work, we present a proof-of-concept image search application for exploring images in the National Library of Norway’s pre-1900 books, comparing Vision Transformer (ViT), Contrastive Language-Image Pre-training (CLIP), and Sigmoid loss for Language-Image Pre-training (SigLIP) embeddings for image retrieval and classification. Our results show that the application performs well for exact image retrieval, with SigLIP embeddings slightly outperforming CLIP and ViT in both retrieval and classification tasks. Additionally, SigLIP-based image classification can aid in cleaning image datasets from a digitisation pipeline.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper57.md b/content/papers/paper57.md
new file mode 100644
index 0000000..be792ff
--- /dev/null
+++ b/content/papers/paper57.md
@@ -0,0 +1,56 @@
+
+
+
+
Admiration and Frustration: A Multidimensional Analysis of Fanfiction
+
+
+
(long paper)
+
Authors: Mia Jacobsen and Ross Deans Kristensen-McLachlan
Why do people write fanfiction? How, if at all, does fanfiction differ from the source material on which it is based? In this paper, we use quantitative text analysis to address these questions by investigating linguistic differences and similarities between fan-produced texts and their original sources. We analyze fanfiction based on Lord of the Rings, Harry Potter, and Percy Jackson and the Olympians. Working with a corpus of around 250,000 texts containing both fanfiction and sources, we draw on Biber's Multidimensional Analysis (Biber, 1988), scoring each text along six dimensions of functional variation. Our results identify both global and community-based preferences in the form and function of fanfiction. Crucially, fan-produced texts are found not to diverge from their source material in statistically meaningful ways, suggesting that fans mimic the writing style of the original author. Nevertheless, fans as a whole prefer stories with less focus on narrative and greater emphasis on character interactions than the source text. Our analysis supports the notion proposed by qualitative studies that fanfiction is motivated both by admiration for and frustration with the canon.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper59.md b/content/papers/paper59.md
new file mode 100644
index 0000000..9ee3e7c
--- /dev/null
+++ b/content/papers/paper59.md
@@ -0,0 +1,56 @@
+
+
+
+
Recognizing Non-named Spatial Entities in Literary Texts: A Novel Spatial Entities Classifier
+
+
+
(short paper)
+
Authors: Daniel Kababgi, Giulia Grisot, Federico Pennino and Berenike Herrmann
Predicting spatial representations in literature is a challenging task that requires advanced machine learning methods and manual annotations. In this paper, we present a study that leverages manual annotations and a BERT language model to automatically detect and recognise non-named spatial entities in a historical corpus of Swiss novels. The annotated data, consisting of Swiss narrative texts in German from the period of 1840 to 1950, was used to train the machine learning model and fine-tune a deep learning model specifically for literary German. The annotation process, facilitated by the use of Prodigy, enabled iterative improvement of the model’s predictions by selecting informative instances from the unlabelled data. Our evaluation metrics (F1 score) demonstrate the model’s ability to predict various categories of spatial entities in our corpus. This new method enables researchers to explore spatial representations in literary text, contributing both to digital humanities and literary studies. While our study shows promising results, we acknowledge challenges such as representativeness of the annotated data, biases in manual annotations, and domain-specific language. By addressing these limitations and discussing the implications of our findings, we provide a foundation for future research in sentiment and spatial analysis in literature. Our findings not only contribute to the understanding of literary narratives but also demonstrate the potential of automated spatial analysis in historical and literary research.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper6.md b/content/papers/paper6.md
new file mode 100644
index 0000000..fd5ec57
--- /dev/null
+++ b/content/papers/paper6.md
@@ -0,0 +1,56 @@
+
+
+
+
Quantitative Framework for Word-Color Association and Application to 20th Century Anglo-American Poetry
Color symbolism is considered a critical element in art and literature, yet determining the relationship between colors and words has remained largely subjective. This research presents a systematic methodology for quantifying the correlation between language and color. We utilize text-based image search, optical character recognition (OCR), and advanced image processing techniques to establish a connection between words and their corresponding color distributions in the CIELch color space. We generate a color dataset grounded in human cognition and apply it to analyze the literary works of poets associated with Imagism and the Black Arts Movement. This helps uncover the characteristic color patterns and symbolic meanings of these movements with enhanced objectivity and reproducibility in literature research. Our work has the potential to provide a powerful instrument for the systematic, quantitative examination of literary symbolism, filling gaps in prior analyses and facilitating novel investigations of thematic aspects using color.
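
The colour-space step can be sketched as follows: convert pixels of a word-associated image to CIELCh and summarise them as a chroma-weighted hue histogram. The file name is a placeholder and the pipeline details are assumptions, not the authors' implementation.

```python
# Hedged sketch: map an image retrieved for a word into CIELCh and build a
# chroma-weighted hue histogram as the word's colour profile.
import numpy as np
from skimage import io
from skimage.color import rgb2lab

rgb = io.imread("rose.jpg")[..., :3] / 255.0      # hypothetical word-image
lab = rgb2lab(rgb).reshape(-1, 3)
L, a, b = lab[:, 0], lab[:, 1], lab[:, 2]
C = np.hypot(a, b)                                # chroma
h = np.degrees(np.arctan2(b, a)) % 360            # hue angle in degrees

hist, _ = np.histogram(h, bins=36, range=(0, 360), weights=C)
print(hist / hist.sum())                          # normalised colour profile
```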
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper60.md b/content/papers/paper60.md
new file mode 100644
index 0000000..b3f9040
--- /dev/null
+++ b/content/papers/paper60.md
@@ -0,0 +1,56 @@
+
+
+
+
SCIENCE IS EXPLORATION: Computational Frontiers for Conceptual Metaphor Theory
+
+
+
(short paper)
+
Authors: Rebecca M. M. Hicke and Ross Deans Kristensen-McLachlan
Metaphors are everywhere. They appear extensively across all domains of natural language, from the most sophisticated poetry to seemingly dry academic prose. A significant body of research in the cognitive science of language argues for the existence of conceptual metaphors, the systematic structuring of one domain of experience in the language of another. Conceptual metaphors are not simply rhetorical flourishes but are crucial evidence of the role of analogical reasoning in human cognition. In this paper, we ask whether Large Language Models (LLMs) can accurately identify and explain the presence of such conceptual metaphors in natural language data. Using a novel prompting technique based on metaphor annotation guidelines, we demonstrate that LLMs are a promising tool for large-scale computational research on conceptual metaphors. Further, we show that LLMs are able to apply procedural guidelines designed for human annotators, displaying a surprising depth of linguistic knowledge.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper61.md b/content/papers/paper61.md
new file mode 100644
index 0000000..47d5f30
--- /dev/null
+++ b/content/papers/paper61.md
@@ -0,0 +1,56 @@
+
+
+
+
Bootstrap Distance Imposters: High Precision Authorship Verification with Improved Interpretability
This paper describes an update to the open-source Python implementation of the General Imposters method of authorship verification by Mike Kestemont et al. The new algorithm, called Bootstrap Distance Imposters (henceforth BDI), incorporates a key improvement introduced by Potha and Stamatatos, as well as a novel method of bootstrapping that has several attractive properties compared to the reference algorithm. First, we supply an updated version of the Kestemont et al. code (for Python 3.x) that incorporates the same basic improvements. Next, the two approaches are benchmarked using the problems from the multilingual PAN 2014 author identification task, as well as the more recent PAN 2021 task. Additionally, the interpretability advantages of BDI are showcased via real-world verification studies. When operating as a summary verifier, BDI tends to be more conservative in its positive attributions, particularly when applied to difficult problem sets like the PAN 2014 en_novels. In terms of raw performance, the BDI verifier outperforms all PAN 2014 entrants and appears slightly stronger than the improved Kestemont GI according to the PAN metrics for both the 2014 and 2021 problems, while also offering superior interpretability.
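
For orientation, the sketch below shows the shared skeleton of the General Imposters family on which BDI builds: repeatedly subsample features and count how often the questioned document falls closer to the candidate author than to any imposter. The vectors are random toy data, and BDI's actual distance measure and bootstrap differ in detail.

```python
# Schematic General Imposters verifier over toy feature vectors.
import numpy as np

rng = np.random.default_rng(0)

def gi_score(candidate, imposters, questioned, n_iter=1000, feat_frac=0.5):
    n_feats = questioned.shape[0]
    hits = 0
    for _ in range(n_iter):
        # Subsample a random fraction of the features each iteration.
        idx = rng.choice(n_feats, size=int(feat_frac * n_feats), replace=False)
        d_cand = np.linalg.norm(candidate[idx] - questioned[idx])
        d_imp = min(np.linalg.norm(imp[idx] - questioned[idx]) for imp in imposters)
        hits += d_cand < d_imp
    return hits / n_iter  # near 1.0 supports the candidate's authorship

candidate = rng.random(300)                  # mean feature vector of known works
imposters = [rng.random(300) for _ in range(10)]
questioned = candidate + 0.05 * rng.random(300)
print(gi_score(candidate, imposters, questioned))
```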
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper62.md b/content/papers/paper62.md
new file mode 100644
index 0000000..10657b8
--- /dev/null
+++ b/content/papers/paper62.md
@@ -0,0 +1,56 @@
+
+
+
+
Addressing Uncertainty According to the Annotator’s Expertise in Archaeological Data Collections: an Approach from Fuzzy Logic
+
+
+
(short paper)
+
Authors: Patricia Martin-Rodilla and Leticia Tobalina
Archaeological data allow us to synthetically represent the past of individuals and communities over time. This complex representation task requires an amalgamation of variables and makes vagueness intrinsic to the data. The study of vagueness as a dimension of archaeological data has become a dynamic focus of archaeologists' work in recent years, with theoretical and practical approaches, mainly based on fuzzy logic, for representing archaeological variables. Vagueness in archaeological data can arise for different reasons: non-existence of evidence, imprecision, errors, subjectivity, etc. Furthermore, the data are usually managed in groups, shared, or reused in subsequent investigations, so the traceability of the vagueness injected during these management phases is lost. In this paper we present ongoing work on modeling, within fuzzy formal theory, an explicit representation of the annotator's expertise (the annotator being the professional who enters archaeological data into a given system, assigning values to the defined variables), decoupled from the value attributed to each variable. First experiments with chronological and site-use variables show that making the annotator's expertise explicit in the fuzzy model preserves the traceability of the uncertainty injected into archaeological data when datasets are defined and managed by different people, and establishes a basis for implementing archaeological fuzzy decision-based systems.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper67.md b/content/papers/paper67.md
new file mode 100644
index 0000000..df4bb14
--- /dev/null
+++ b/content/papers/paper67.md
@@ -0,0 +1,56 @@
+
+
+
+
In the Context of Narrative, We Never Properly Defined the Concept of Valence
+
+
+
(long paper)
+
Authors: Peter Boot, Angel Daza, Carsten Schnober and Willem van Hage
Valence is a concept that is increasingly being used in the computational study of narrative texts. We discuss the history of the concept and show that the word has been interpreted in various ways. Then we look at a number of Dutch tools for measuring valence. We use them on sample fragments from a large collection of narrative texts and find only moderate correlations between the valences as established by the various tools. We discuss these differences and how to handle them. We argue that the root cause of the problem is that Computational Literary Studies never properly defined the concept of valence in a narrative context.
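
The comparison itself is straightforward to reproduce in outline: given fragment-level valence scores from two tools, compute their correlation. The scores below are invented for illustration.

```python
# Toy illustration: correlate valence scores that two hypothetical tools
# assign to the same narrative fragments.
from scipy.stats import pearsonr, spearmanr

tool_a = [0.3, -0.2, 0.8, 0.1, -0.5, 0.4]   # invented fragment-level valences
tool_b = [0.1, -0.4, 0.5, 0.3, -0.1, 0.6]

print(pearsonr(tool_a, tool_b))    # linear agreement
print(spearmanr(tool_a, tool_b))   # rank agreement
```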
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper70.md b/content/papers/paper70.md
new file mode 100644
index 0000000..b535ff8
--- /dev/null
+++ b/content/papers/paper70.md
@@ -0,0 +1,56 @@
+
+
+
+
Locating the Leading Edge of Cultural Change
+
+
+
(short paper)
+
Authors: Sarah Griebel, Becca Cohen, Lucian Li, Jiayu Liu, Jaihyun Park, Jana Perkins and Ted Underwood
Measures of textual similarity and divergence are increasingly used to study cultural change. But which measures align, in practice, with social evidence about change? We apply three different representations of text (topic models, document embeddings, and word-level perplexity) to three different corpora (literary studies, economics, and fiction). In every case, works by highly-cited authors and younger authors are textually ahead of the curve. We don't find clear evidence that one representation of text is to be preferred over the others. But alignment with social evidence is strongest when texts are represented through the top quartile of passages, suggesting that a text's impact may depend more on its most forward-looking moments than on sustaining a high level of innovation throughout.
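
One hedged way to operationalise ``textually ahead of the curve'' with document embeddings is sketched below: a text is precocious if it sits closer to the centroid of later texts than to that of earlier ones. The embeddings are random stand-ins, and the paper's actual measures differ.

```python
# Toy precocity measure over placeholder document embeddings.
import numpy as np

rng = np.random.default_rng(1)
emb = {year: rng.random((20, 64)) for year in range(1950, 1961)}  # toy corpus

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def precocity(vec, year, window=5):
    # Compare similarity to the centroid of future vs. past publications.
    past = np.mean([emb[y] for y in range(year - window, year)], axis=(0, 1))
    future = np.mean([emb[y] for y in range(year + 1, year + 1 + window)], axis=(0, 1))
    return cos(vec, future) - cos(vec, past)

print(precocity(emb[1955][0], 1955))  # > 0: resembles the future more than the past
```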
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper71.md b/content/papers/paper71.md
new file mode 100644
index 0000000..d06499c
--- /dev/null
+++ b/content/papers/paper71.md
@@ -0,0 +1,56 @@
+
+
+
+
Transformation of Composition and Gaze Interaction in Noli Me Tangere Depictions from 1300–1600
+
+
+
(short paper)
+
Authors: Pepe Ballesteros Zapata, Nina Arnold, Vappu Vilhelmiina Lukander, Ludovica Schaerf and Dario Negueruela del Castillo
This paper examines the development of figure composition and gaze dynamics between Mary Magdalene and Christ in Italian noli me tangere depictions from 1300 to 1600 in the context of the emergence of perspective painting. It combines a conceptual, interpretative approach concerning the tactility of the gaze with a compositional analysis. This preliminary study analyzes 51 iconographical images to understand how the gazes between Mary and Christ evolve from pre-perspective to perspective artworks. We estimate gaze direction solely from landmark points, following the assumption that gaze direction can be inferred from the overall face orientation. Additionally, we develop a metric to quantify the degree of visual interaction between the two protagonists. Our results indicate that Christ is consistently depicted gazing down towards Mary, while Mary displays a broader range of gaze directions. Before the introduction of perspective, the gaze of figures was often rendered solely through face orientation. However, with the advent of the High Renaissance, artists began to use complex gestures that separated head orientation from the line of sight.
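
As a rough illustration of landmark-based estimation (a hypothetical sketch, not the paper's pipeline), face yaw can be approximated from three 2D landmarks by measuring how far the nose tip deviates from the midpoint between the eyes:

```python
# Hypothetical sketch: approximate face yaw from 2D landmarks, following
# the assumption that gaze direction tracks overall face orientation.
import numpy as np

def approx_yaw(left_eye, right_eye, nose_tip):
    """Rough yaw proxy: nose deviation from the eye midpoint, normalised
    by inter-ocular distance. ~0 means frontal; the sign gives direction."""
    left_eye, right_eye, nose_tip = map(np.asarray,
                                        (left_eye, right_eye, nose_tip))
    eye_mid = (left_eye + right_eye) / 2.0
    interocular = np.linalg.norm(right_eye - left_eye)
    return float((nose_tip[0] - eye_mid[0]) / interocular)

# Illustrative coordinates for a head turned to the viewer's left:
print(approx_yaw((100, 120), (140, 120), (112, 140)))  # negative -> left
```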
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper73.md b/content/papers/paper73.md
new file mode 100644
index 0000000..c79b3e1
--- /dev/null
+++ b/content/papers/paper73.md
@@ -0,0 +1,56 @@
+
+
+
+
Page Embeddings: Extracting and Classifying Historical Documents with Generic Vector Representations
+
+
+
(short paper)
+
Authors: Carsten Schnober, Renate Smit, Manjusha Kuruppath, Kay Pepping, Leon van Wissen and Lodewijk Petram
We propose a neural network architecture designed to generate region and page embeddings for boundary detection and classification of documents within a large and heterogeneous historical archive. Our approach is versatile and can be applied to other tasks and datasets. This method enhances the accessibility of historical archives and promotes a more inclusive utilization of historical materials.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper74.md b/content/papers/paper74.md
new file mode 100644
index 0000000..812fb4b
--- /dev/null
+++ b/content/papers/paper74.md
@@ -0,0 +1,56 @@
+
+
+
+
Direct and Indirect Annotation with Generative AI: A Case Study into Finding Animals and Plants in Historical Text
+
+
+
(short paper)
+
Authors: Arjan van Dalfsen, Folgert Karsdorp, Ayoub Bagheri, Thirza van Engelen, Dieuwertje Mentink and Els Stronks
This study explores the use of generative AI (GenAI) for annotation in the humanities, comparing direct and indirect annotation approaches with human annotations. Direct annotation involves using GenAI to annotate the entire corpus, while indirect annotation uses GenAI to create training data for a specialized model. The research investigates zero-shot and few-shot methods for direct annotation, alongside an indirect approach incorporating active learning, few-shotting, and k-NN example retrieval. The task focuses on identifying words (also referred to as entities) related to plants and animals in Early Modern Dutch texts. Results show that indirect annotation outperforms zero-shot direct annotation in mimicking human annotations. However, with just a few examples, direct annotation catches up, achieving similar performance to indirect annotation. Analysis of confusion matrices reveals that GenAI annotators make similar types of mistakes, such as confusing parts and products or failing to identify entities, which are broader than those made by humans. Manual error analysis indicates that each annotation method (human, direct, and indirect) has some unique errors. Given the limited scale of this study, it is worthwhile to further explore the relative affordances of direct and indirect GenAI annotation methods.
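
One component of the indirect pipeline named above, k-NN example retrieval for few-shot prompting, might look like the following sketch (the toy Dutch sentences and the use of TF-IDF similarity are illustrative assumptions, not the study's setup):

```python
# Illustrative sketch (not the authors' code): retrieve the k most similar
# annotated sentences to build the in-context examples of a few-shot prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

annotated = [
    ("De vos liep door het bos.", "ANIMAL: vos"),
    ("Zij plukte een roos in de tuin.", "PLANT: roos"),
    ("De koe stond in de wei.", "ANIMAL: koe"),
]
texts = [t for t, _ in annotated]

vectorizer = TfidfVectorizer().fit(texts)
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(
    vectorizer.transform(texts))

def few_shot_prompt(query: str) -> str:
    # Pick the 2 nearest annotated sentences as in-context examples.
    _, idx = index.kneighbors(vectorizer.transform([query]))
    examples = "\n".join(f"{annotated[i][0]} -> {annotated[i][1]}"
                         for i in idx[0])
    return f"{examples}\n{query} ->"

print(few_shot_prompt("Een vos sprong over de haag."))
```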
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper75.md b/content/papers/paper75.md
new file mode 100644
index 0000000..f908a19
--- /dev/null
+++ b/content/papers/paper75.md
@@ -0,0 +1,56 @@
+
+
+
+
Combining Automatic Annotation Tools with Human Validation for the Semantic Enrichment of Cultural Heritage Metadata
+
+
+
(long paper)
+
Authors: Eirini Kaldeli, Alexandros Chortaras, Vassilis Lyberatos, Jason Liartis, Spyridon Kantarelis and Giorgos Stamou
The addition of controlled terms from linked open datasets and vocabularies to metadata can increase the discoverability and accessibility of digital collections. However, the task of semantic enrichment requires a lot of effort and resources that cultural heritage organizations often lack. State-of-the-art AI technologies can be employed to analyse textual metadata and match it with external semantic resources. Depending on the data characteristics and the objective of the enrichment, different approaches may need to be combined to achieve high-quality results. What is more, human inspection and validation of the automatic annotations should be an integral part of the overall enrichment methodology. In the current paper, we present a methodology and supporting digital platform, which combines a suite of automatic annotation tools with human validation for the enrichment of cultural heritage metadata within the European data space for cultural heritage. The methodology and platform have been applied and evaluated on a set of datasets on crafts heritage, leading to the publication of more than 133K enriched records to the Europeana platform. A statistical analysis of the achieved results is performed, which allows us to draw some interesting insights as to the appropriateness of annotation approaches in different contexts. The process also led to the creation of an openly available annotated dataset, which can be useful for the in-domain adaptation of ML-based enrichment tools.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper76.md b/content/papers/paper76.md
new file mode 100644
index 0000000..36fd920
--- /dev/null
+++ b/content/papers/paper76.md
@@ -0,0 +1,56 @@
+
+
+
+
Literary Canonicity and Algorithmic Fairness: The Effect of Author Gender on Classification Models
+
+
+
(long paper)
+
Authors: Ida Marie S. Lassen, Pascale Feldkamp, Yuri Bizzoni and Kristoffer Nielbo
This study examines gender biases in machine learning models that predict literary canonicity. Using algorithmic fairness metrics like equality of opportunity, equalised odds, and calibration within groups, we show that models violate the fairness metrics, especially by misclassifying non-canonical books by men as canonical. Feature importance analysis shows that text-intrinsic differences between books by men and women authors contribute to these biases. Men have historically dominated canonical literature, which may bias models towards associating men-authored writing styles with literary canonicity. Our study highlights how these biased models can lead to skewed interpretations of literary history and canonicity, potentially reinforcing and perpetuating existing gender disparities in our understanding of literature. This underscores the need to integrate algorithmic fairness in computational literary studies and digital humanities more broadly to foster equitable computational practices.
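
Equality of opportunity, the first fairness metric listed, reduces to comparing true-positive rates across groups; a minimal sketch with toy labels (not the study's data):

```python
# Minimal sketch of an equality-of-opportunity check: compare true-positive
# rates for the positive class (canonical = 1) across author-gender groups.
import numpy as np

def tpr(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    positives = y_true == 1
    return (y_pred[positives] == 1).mean()

# Toy labels, purely illustrative:
y_true_men,   y_pred_men   = [1, 1, 0, 0, 1], [1, 1, 1, 0, 0]
y_true_women, y_pred_women = [1, 1, 0, 0, 1], [1, 0, 0, 0, 0]

gap = tpr(y_true_men, y_pred_men) - tpr(y_true_women, y_pred_women)
print(f"TPR gap (men - women): {gap:.2f}")  # non-zero -> metric violated
```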
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper79.md b/content/papers/paper79.md
new file mode 100644
index 0000000..6089606
--- /dev/null
+++ b/content/papers/paper79.md
@@ -0,0 +1,56 @@
+
+
+
+
Exploration of Event Extraction Techniques in Late Medieval and Early Modern Administrative Records
While an increasing number of studies explore named entity recognition in historical corpora, the application of other information extraction tasks such as event extraction remains scarce. This study explores two accessible methods to facilitate the detection of events and the classification of entities into roles: rule-based systems and RNN-based machine learning techniques. We focus on a German-language corpus from the 15th-17th c. and on property purchases as the event type. We show that these relatively simple methods can retrieve useful information and discuss ideas to further enhance the results.
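
In the rule-based spirit described above, a single extraction rule can be as simple as a regular expression with named role slots; the pattern below is an invented toy, far simpler than real 15th-17th c. chancery German:

```python
# Illustrative rule-based sketch: one regex pattern for a property-purchase
# event with buyer/seller role slots. Vocabulary invented for illustration.
import re

PURCHASE = re.compile(
    r"(?P<buyer>\w+)\s+hat\s+(?:ein\s+haus|einen\s+hof)\s+"
    r"von\s+(?P<seller>\w+)\s+gekauft")

text = "Hans hat ein haus von Peter gekauft."
m = PURCHASE.search(text)
if m:
    print({"event": "property_purchase",
           "buyer": m.group("buyer"), "seller": m.group("seller")})
```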
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper82.md b/content/papers/paper82.md
new file mode 100644
index 0000000..87c2c52
--- /dev/null
+++ b/content/papers/paper82.md
@@ -0,0 +1,56 @@
+
+
+
+
Assessing Landscape Intervisibility and Prominence at Regional Scale
Visibility and intervisibility have been important aspects of spatial analysis in landscape archaeological studies, but remain hampered by computational intensity, small-scale study areas, edge effects, and bare-earth digital elevation models. This paper assesses intervisibility and prominence in a dataset of over 1000 burial mounds in the Middle Tundzha River watershed in Bulgaria. The aim is to obviate the pitfalls of regional assessments of visibility through vegetation simulation and Monte Carlo (MC) modelling, and to gauge when intervisibility and prominence truly mattered to past mound-builders.
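
At the core of any intervisibility assessment is a line-of-sight test between two points on a digital elevation model. The sketch below (pure NumPy on an invented grid) shows the bare-bones version that regional studies then extend with vegetation simulation, edge-effect handling, and Monte Carlo runs:

```python
# Hedged sketch: a bare-bones line-of-sight test on a DEM grid.
import numpy as np

def intervisible(dem, a, b, observer_height=1.6, samples=100):
    """True if a straight sight line from mound a to mound b (row, col)
    clears the terrain sampled between them."""
    (r0, c0), (r1, c1) = a, b
    z0 = dem[r0, c0] + observer_height
    z1 = dem[r1, c1] + observer_height
    for t in np.linspace(0, 1, samples)[1:-1]:
        r, c = r0 + t * (r1 - r0), c0 + t * (c1 - c0)
        terrain = dem[int(round(r)), int(round(c))]  # nearest-cell elevation
        sightline = z0 + t * (z1 - z0)
        if terrain > sightline:
            return False
    return True

dem = np.zeros((50, 50)); dem[25, 25] = 30.0  # a ridge blocking the middle
print(intervisible(dem, (10, 10), (40, 40)))  # False: the ridge intervenes
```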
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper86.md b/content/papers/paper86.md
new file mode 100644
index 0000000..3832f8a
--- /dev/null
+++ b/content/papers/paper86.md
@@ -0,0 +1,56 @@
+
+
+
+
Univariate Statistical Analysis of a Non-Canonical Literary Genre: Quantifying German-Language One-Act Plays (1740–1850)
+
+
+
(long paper)
+
Authors: Viktor J. Illmer, Dîlan Canan Çakir, Frank Fischer and Lilly Welz
This article explores the use of metadata to analyse German-language one-act plays from 1740 to 1850, addressing the need to expand beyond canonical texts in literary studies. Utilising the Database of German-Language One-Act Plays, we examine aspects such as the number of scenes and characters as well as the role of different original languages on which the translated plays in the corpus are based. We find that one-act plays exhibit strong genre signals that set them apart from multi-act plays of the time. Our metadata-driven approach provides a comprehensive and statistically grounded understanding of the genre, demonstrating the potential of digital methods to enhance genre studies and overcome traditional limitations in literary scholarship.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper9.md b/content/papers/paper9.md
new file mode 100644
index 0000000..397f0a4
--- /dev/null
+++ b/content/papers/paper9.md
@@ -0,0 +1,56 @@
+
+
+
+
Multilingual Stylometry: The Influence of Language and Corpus Composition on the Performance of Authorship Attribution Using Corpora from the European Literary Text Collection (ELTeC)
+
+
+
(long paper)
+
Authors: Christof Schoech, Julia Dudar, Evgeniia Fileva and Artjoms Šeļa
Stylometric authorship attribution is concerned with the task of assigning texts of unknown, pseudonymous or disputed authorship to their most likely author, often based on a comparison of the frequency of a selected set of features that represent the texts. The parameters of the analysis, such as feature selection and the choice of similarity measure or classification algorithm, have received significant attention in the past. Two additional key factors for the performance and reliability of stylometric methods, however, have so far received less attention, namely corpus composition and corpus language. As a first step, the aim of this study is to investigate the influence of language on the performance of stylometric authorship attribution. We address this question using four different corpora derived from the European Literary Text Collection (ELTeC). We use machine translation to obtain each corpus in the other three languages. We find that, as expected, the attribution accuracy varies between language-based corpora, and that translated corpora, on average, display a lower attribution accuracy compared to their counterparts in the original language. Overall, our study contributes to a better understanding of stylometric methods of authorship attribution.
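
For readers unfamiliar with the mechanics, a common frequency-based attribution baseline (Burrows' Delta, sketched here on toy texts; not necessarily the paper's exact configuration) represents texts by relative frequencies of the most frequent words, z-scores them, and attributes by smallest mean absolute difference:

```python
# Sketch of a Burrows' Delta-style baseline on toy texts.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the cat sat on the mat", "a cat and a dog"]
train_authors = ["A", "B"]
test_text = "the dog sat on the mat"

vec = CountVectorizer(max_features=50)
X = vec.fit_transform(train_texts + [test_text]).toarray().astype(float)
X /= X.sum(axis=1, keepdims=True)             # relative word frequencies
mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-9
Z = (X - mu) / sigma                          # z-scores per word

deltas = np.abs(Z[:-1] - Z[-1]).mean(axis=1)  # Delta to each training text
print(train_authors[int(deltas.argmin())])    # nearest author wins
```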
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper90.md b/content/papers/paper90.md
new file mode 100644
index 0000000..985bf8f
--- /dev/null
+++ b/content/papers/paper90.md
@@ -0,0 +1,56 @@
+
+
+
+
Animacy in German Folktales
+
+
+
(short paper)
+
Authors: Julian Häußler, Janis von Keitz and Evelyn Gius
This paper explores the phenomenon of animacy in prose using the example of German folktales. We present a manually annotated corpus of 19 German folktales from the Brothers Grimm collection and train a classifier on these annotations. Building on previous work in animacy detection, we evaluate the classifier’s performance and its application to a larger corpus. The findings highlight the complexity of animacy in literary texts, distinguishing it from named entity recognition and emphasizing the classifier’s potential for enhancing character recognition in narratives.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper92.md b/content/papers/paper92.md
new file mode 100644
index 0000000..8a79c8a
--- /dev/null
+++ b/content/papers/paper92.md
@@ -0,0 +1,56 @@
+
+
+
+
Domain Adaptation with Linked Encyclopedic Data: A Case Study for Historical German
This paper outlines a proposal for the use of knowledge graphs for historical German domain adaptation. From the EncycNet project, an encyclopedia-based knowledge graph from the early 20th century was borrowed to examine whether text-based domain adaptation using the source encyclopedia's text or graph-based adaptation produces a better domain-specific model. To evaluate the approach, a novel historical test dataset based on a second encyclopedia of the early 20th century was created. This dataset is categorized by knowledge type (factual, linguistic, lexical), with special attention paid to distinguishing simple and expert knowledge. The main finding is that, surprisingly, simple knowledge has the most potential for improvement, whereas expert knowledge lags behind. In this study, broad signals like simple definitions and word origin yielded the best results, while more specialized knowledge such as synonyms was not as effectively represented. A follow-up study focusing on simple contemporary lexical knowledge was carried out to control for historicity and text genre; its results confirm that language models can still be enhanced by incorporating simple lexical knowledge using the proposed workflow.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper93.md b/content/papers/paper93.md
new file mode 100644
index 0000000..04430fd
--- /dev/null
+++ b/content/papers/paper93.md
@@ -0,0 +1,56 @@
+
+
+
+
And then I saw it: Testing Hypotheses on Turning Points in a Corpus of UFO Sighting Reports
+
+
+
(short paper)
+
Authors: Jan Langenhorst, Robert C. Schuppe and Yannick Frommherz
As part of developing a Computational Narrative Understanding, modeling events within stories has recently received significant attention within the digital humanities community. Most of the current research aims at good performance when predicting events. By contrast, we explore a focused approach based on qualitative observations. We attempt to trace the role of structural elements – more specifically, temporal function words – that may be characteristic of a narrative's turning point. We draw on a corpus of UFO sighting reports in which authors employ a prototypical narrative structure that relies on a turning point at which the extraordinary intrudes upon the ordinary. Using binary logistic regression, we can identify structural properties which are indicative of turning points in our data, showcasing that a focus on detail can fruitfully complement NLP models in gaining a quantitatively informed understanding of narratives.
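
A minimal version of such a model (invented example sentences and word list; sklearn's LogisticRegression as one reasonable choice, not necessarily the authors' setup) regresses turning-point status on counts of temporal function words:

```python
# Sketch with invented data: does the count of temporal function words
# predict whether a sentence is a turning point?
import numpy as np
from sklearn.linear_model import LogisticRegression

TEMPORAL = {"suddenly", "then", "when", "moment"}

def features(sentence: str) -> list:
    tokens = sentence.lower().split()
    return [sum(t in TEMPORAL for t in tokens), float(len(tokens))]

sentences = ["I was driving home as usual", "Then suddenly I saw the light",
             "The road was empty and dark", "At that moment it hovered above"]
is_turning_point = [0, 1, 0, 1]

X = np.array([features(s) for s in sentences])
model = LogisticRegression().fit(X, is_turning_point)
print(model.coef_)  # expect a positive weight on the temporal-word count
```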
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper94.md b/content/papers/paper94.md
new file mode 100644
index 0000000..d1a0ea2
--- /dev/null
+++ b/content/papers/paper94.md
@@ -0,0 +1,56 @@
+
+
+
+
Revolution + Love: Measuring the Entanglements of Violence and Emotions in Post-1949 China
This paper examines the relationship between violent discourse and emotional intensity in the early revolutionary rhetoric of the People's Republic of China (PRC). Using two fine-tuned bert-base-chinese models—one for detecting violent content in texts and another for assessing their affective charge—we analyze over 185,000 articles published between 1956 and 1989 in the People's Liberation Army Daily (Jiefangjun Bao), the official journal of China's armed forces. We find a statistically significant correlation between violent discourse and emotional expression throughout the analyzed period. This strong alignment between violence and affect in official texts provides a valuable context for appreciating how other forms of writing, such as novels and poetry, can disentangle personal emotions from state power.
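
Once both models have scored every article, the reported relationship is a straightforward correlation; a sketch with synthetic stand-in scores:

```python
# Sketch with toy numbers: random stand-ins for the two models' per-article
# violence and affect scores, then a simple Pearson correlation.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
violence = rng.random(200)
affect = 0.7 * violence + 0.3 * rng.random(200)  # built-in association

r, p = pearsonr(violence, affect)
print(f"r = {r:.2f}, p = {p:.1e}")
```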
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper95.md b/content/papers/paper95.md
new file mode 100644
index 0000000..3363e66
--- /dev/null
+++ b/content/papers/paper95.md
@@ -0,0 +1,56 @@
+
+
+
+
Tracing the Development of the Virtual Particle Concept Using Semantic Change Detection
Virtual particles are peculiar objects. They figure prominently in much of theoretical and experimental research in elementary particle physics. But exactly what they are is far from obvious. In particular, to what extent they should be considered real remains a matter of controversy in philosophy of science. Their origin and development have also only recently come into the focus of scholarship in the history of science. In this study, we propose using the intriguing case of virtual particles to discuss the efficacy of Semantic Change Detection (SCD) based on contextualized word embeddings from a domain-adapted BERT model in studying specific scientific concepts. We find that the SCD metrics align well with qualitative research insights in the history and philosophy of science, as well as with the results obtained from Dependency Parsing to determine the frequency and connotations of the term virtual. Still, the metrics of SCD provide additional insights over and above the qualitative research and the Dependency Parsing. Among other things, the metrics suggest that the concept of the virtual particle became more stable after 1950 but at the same time also more polysemous.
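
One common SCD metric of this kind is the cosine distance between a target word's mean contextual embedding in two periods. The sketch below uses generic bert-base-uncased rather than the paper's domain-adapted model, and assumes the target word maps to a single vocabulary token:

```python
# Hedged sketch of one SCD metric: cosine distance between a word's mean
# contextual embedding in two (here, two-sentence) "periods".
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_target_embedding(sentences, target="virtual"):
    target_id = tok.convert_tokens_to_ids(target)  # assumes a single token
    vecs = []
    for s in sentences:
        enc = tok(s, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        for i, t in enumerate(enc["input_ids"][0].tolist()):
            if t == target_id:
                vecs.append(hidden[i])
    return torch.stack(vecs).mean(dim=0)

pre  = mean_target_embedding(["the virtual photon mediates the interaction"])
post = mean_target_embedding(["a virtual particle appears in the diagram"])
drift = 1 - torch.cosine_similarity(pre, post, dim=0).item()
print(f"semantic drift: {drift:.3f}")
```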
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper96.md b/content/papers/paper96.md
new file mode 100644
index 0000000..ca71e97
--- /dev/null
+++ b/content/papers/paper96.md
@@ -0,0 +1,56 @@
+
+
+
+
Remember to forget: A study on verbatim memorization of literature in Large Language Models
+
+
+
(long paper)
+
Authors: Xinhao Zhang, Olga Seminck and Pascal Amsili
We examine the extent to which English and French literature is memorized by freely accessible LLMs, using a name cloze inference task (which focuses on the model's ability to recall proper names from a book). We replicate the key findings of previous research conducted with OpenAI models, concluding that, overall, the degree of memorization is low. Factors that tend to enhance memorization include the absence of copyrights, belonging to the Fantasy or Science Fiction genres, and the work's popularity on the Internet. Delving deeper into the experimental setup using the open source model Olmo and its freely available corpus Dolma, we conducted a study on the evolution of memorization during the LLM’s training phase. Our findings suggest that excerpts of a book online can result in some level of memorization, even if the full text is not included in the training corpus. This observation leads us to conclude that the name cloze inference task is insufficient to definitively determine whether copyright violations have occurred during the training process of an LLM. Furthermore, we highlight certain limitations of the name cloze inference task, particularly the possibility that a model may recognize a book without memorizing its text verbatim. In a pilot experiment, we propose an alternative method that shows promise for producing more robust results.
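
Constructing a name cloze item is simple in principle: mask one proper name in a passage the model may have seen and test for exact recall. An illustrative sketch (the prompt wording is ours, not the authors'):

```python
# Illustrative construction of a name cloze item: mask one proper name in
# a (public-domain) passage and ask the model to fill it in.
passage = ("Mr. Bennet was among the earliest of those who waited on "
           "Mr. Bingley.")
name = "Bingley"

cloze = passage.replace(name, "[MASK]")
prompt = ("You have seen the following passage in your training data. "
          "What is the proper name that fills the [MASK]?\n\n" + cloze)
print(prompt)  # an exact-match answer of "Bingley" counts as recall
```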
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper97.md b/content/papers/paper97.md
new file mode 100644
index 0000000..0362a50
--- /dev/null
+++ b/content/papers/paper97.md
@@ -0,0 +1,56 @@
+
+
+
+
Latent Structures of Intertextuality in French Fiction
Intertextuality is a key concept in literary theory that challenges traditional notions of text, signification or authorship. It views texts as part of a vast intertextual network that is constantly evolving and being reconfigured. This paper argues that the field of computational literary studies is the ideal place to conduct a study of intertextuality, since we now have the ability to systematically compare texts with each other. Specifically, we present a study of a corpus of more than 12,000 French fictions from the 18th, 19th and early 20th centuries. We focus on evaluating the underlying roles of two literary notions, sub-genres and the literary canon, in the framing of textuality. The article attempts to operationalize intertextuality using state-of-the-art contextual language models to encode novels and capture features that go beyond simple lexical or thematic approaches. Our findings suggest that both sub-genres and canonicity play a significant role in shaping textual similarities within French fiction. These discoveries point to the importance of considering genre and canon as dynamic forces that influence the evolution and intertextual connections of literary works within specific historical contexts.
+
+
+
\ No newline at end of file
diff --git a/content/papers/paper98.md b/content/papers/paper98.md
new file mode 100644
index 0000000..2b267a5
--- /dev/null
+++ b/content/papers/paper98.md
@@ -0,0 +1,56 @@
+
+
+
+
Sentiment Below the Surface: Omissive and Evocative Strategies in Literature and Beyond
+
+
+
(long paper)
+
Authors: Pascale Feldkamp, Ea Lindhardt Overgaard, Kristoffer Nielbo and Yuri Bizzoni
As they represent one of the most complex forms of expression, literary texts continue to challenge Sentiment Analysis (SA) tools, often developed for other domains. At the same time, SA is becoming an increasingly central method in literary analysis itself, which raises the question of what challenges are inherent to literary SA. We address this question by probing units from a variety of literary fiction texts where humans and systems diverge in their valence scoring, seeking to relate such disagreements to semantic traits central to implicit sentiment evocation in literary theory. The contribution of this study is twofold. First, we present a corpus of valence-annotated fiction -- English and Danish language literary texts from the 19th and 20th centuries -- representing different genres. We then test whether sentences where humans and models disagree in sentiment annotation are characterized by specific semantic traits by looking at their distribution and correlation across four different corpora. We find that items where humans detected significant sentiment, but where models did not, consistently employ lower levels of arousal, dominance and interoception, and higher levels of concreteness. Furthermore, we find that the amount of human-model disagreement correlated with semantic aspects is linked to the interiority-exteriority continuum more than to direct sensory information. Finally, we show that this interaction of features linked to implicit sentiment varies across textual domains. Our findings confirm that sentiment evocation exploits a more diverse and subtle set of semantic channels than those observed through simple sentiment analysis.
+
+
+
\ No newline at end of file
diff --git a/content/programme.md b/content/programme.md
index d1cb8ce..bda6793 100644
--- a/content/programme.md
+++ b/content/programme.md
@@ -75,11 +75,24 @@ input:focus-visible + label {
padding: 30px 0;
border-top: 1px solid #ccc;
}
+.paper-entry {
+ font-size: 1.1em;
+ margin-bottom: 1.5em;
+ }
+.paper-title {
+ display: block;
+}
+.paper-authors {
+ margin-left: 1.25em; /* indent authors */
+ display: block;
+}
-Programme for the [pre-conference workshops](#parallel-workshops) on [Tuesday](#tuesday), 3rd December 2024, and the main conference days on [Wednesday](#wednesday), [Thursday](#thursday), and [Friday](#friday), 4th-6th December 2024.
+Programme for the [pre-conference workshops](#parallel-workshops) on [Tuesday](#tuesday), 3rd December 2024, and the main conference days on [Wednesday](#wednesday), [Thursday](#thursday), and [Friday](#friday), 4th-6th December 2024.
-*Please note that all times are in Central European Time (CET)*
+You can also get an overview of all accepted papers [here](/papers).
+
+*Note: Changes may occur in the programme. Please check regularly for the latest information.*
@@ -102,6 +115,7 @@ Programme for the [pre-conference workshops](#parallel-workshops) on [Tuesday](#
Tuesday, December 3, 2024 (Pre-conference workshops)
diff --git a/docs/index.xml b/docs/index.xml
index de74a6c..ed7fbb2 100644
--- a/docs/index.xml
+++ b/docs/index.xml
@@ -78,6 +78,405 @@
http://2024.computational-humanities-research/cfp/In the arts and humanities, the use of computational, statistical, and mathematical approaches has considerably increased in recent years. This research is characterized by the use of formal methods and the construction of explicit, computational models. This includes quantitative, statistical approaches, but also more generally computational methods for processing and analyzing data, as well as theoretical reflections on these approaches. Despite the undeniable growth of this research area, many scholars still struggle to find suitable research-oriented venues to present and publish computational work that does not lose sight of traditional modes of inquiry in the arts and humanities.
+
+
+ http://2024.computational-humanities-research/papers/paper1/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper1/
+ Explainable Search and Discovery of Visual Cultural Heritage Collections with Multimodal Large Language Models (long paper)
Authors: Taylor Arnold
Presented in Session 5B: Search & Discovery
Paper: Download PDF
Abstract Many cultural institutions have made large digitized visual collections available online, often under permissible re-use licences. Creating interfaces for exploring and searching these collections is difficult, particularly in the absence of granular metadata. In this paper, we introduce a method for using state-of-the-art multimodal large language models (LLMs) to enable an open-ended, explainable search and discovery interface for visual collections.
+
+
+
+ http://2024.computational-humanities-research/papers/paper102/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper102/
+ Greatest Hits Versus Deep Cuts: Exploring Variety in Setlists Across Artists and Musical Genres (long paper)
Authors: Edward Abel and Andrew Goddard
Presented in Session 8B: Popular Media
Paper: Download PDF
Abstract Live music concert analysis provides an opportunity to explore cultural and historical trends. The art of set-list construction, of which songs to play, has many considerations for an artist, and the notion of how much variety different artists play is an interesting topic.
+
+
+
+ http://2024.computational-humanities-research/papers/paper104/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper104/
+ Textual Transmission without Borders: Multiple Multilingual Alignment and Stemmatology of the 'Lancelot en prose' (long paper)
Authors: Lucence Ing, Matthias Gille Levenson and Jean-Baptiste Camps
Presented in Session 6B: Multilingualism & Translation Studies
Paper: Download PDF
Abstract This study focuses on the problem of multilingual medieval text alignment, which presents specific challenges, due to the absence of modern punctuation in the texts and the non-standard forms of medieval languages. In order to perform the alignment of several witnesses from the multilingual tradition of the prose Lancelot, we first develop an automatic text segmenter based on BERT and then align the produced segments using Bertalign.
+
+
+
+ http://2024.computational-humanities-research/papers/paper106/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper106/
+ Patterns of Quality: Comparing Reader Reception Across Fanfiction and Published Literature (long paper)
Authors: Mia Jacobsen, Pascale Moreira, Kristoffer Nielbo and Yuri Bizzoni
Presented in Session 3A: Literary Canon & Reception
Paper: Download PDF
Abstract Recent work on the textual features linked to literary quality has primarily focused on commercially published literature, such as canonical or best-selling novels, that are systematically filtered by editorial and market mechanisms. However, the biggest repositories of fiction texts currently in existence are free fanfiction websites, where fans post fictional stories about their favorite characters for the pleasure of writing and engaging with others.
+
+
+
+ http://2024.computational-humanities-research/papers/paper110/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper110/
+ Steps Towards Mining Manuscript Images for Untranscribed Texts: A Case Study from the Syriac Collection at the Vatican Library (long paper)
Authors: Luigi Bambaci, George Kiraz, Christine M. Roughan, Daniel Stökl Ben Ezra and Matthieu Freyder
Presented in Session 4B: Automatic Text Recognition
Paper: Download PDF
Abstract Digital libraries and databases of texts are invaluable resources for researchers, yet their reliance on printed editions can lead to significant gaps and potentially exclude works without printed reproductions.
+
+
+
+ http://2024.computational-humanities-research/papers/paper119/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper119/
+ On Classification with Large Language Models in Cultural Analytics (long paper)
Authors: David Bamman, Kent Chang, Li Lucy and Naitian Zhou
Presented in Session 4A: Large Language Models
Paper: Download PDF
Abstract In this work, we survey the way in which classification is used as a sensemaking practice in cultural analytics, and assess where large language models can fit into this landscape. We identify ten tasks supported by publicly available datasets on which we empirically assess the performance of LLMs compared to traditional supervised methods, and explore the ways in which LLMs can be employed for sensemaking goals beyond mere accuracy.
+
+
+
+ http://2024.computational-humanities-research/papers/paper121/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper121/
+ Promises from an Inferential Approach in Classical Latin Authorship Attribution (short paper)
Authors: Giulio Tani Raffaelli
Presented in Session 3B: Stylometry
Paper: Download PDF
Abstract Applying stylometry to Authorship Attribution requires distilling the elements of an author's style sufficient to recognise their mark in anonymous documents. Often, this is accomplished by contrasting the frequency of selected features in the authors' works. A recent approach, CP2D, uses innovation processes to infer the author's identity, accounting for their propensity to introduce new elements.
+
+
+
+ http://2024.computational-humanities-research/papers/paper122/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper122/
+ A Preliminary Analysis of ChatGPT's Poetic Style (long paper)
Authors: Melanie Walsh, Anna Preus and Elizabeth Gronski
Presented in Session 4A: Large Language Models
Paper: Download PDF
Abstract Generating poetry has become a popular application of LLMs, perhaps especially of OpenAI's widely-used chatbot ChatGPT. What kind of poet is ChatGPT? Does ChatGPT have its own poetic style? Can it successfully produce poems in different styles? To answer these questions, we prompt the GPT-3.
+
+
+
+ http://2024.computational-humanities-research/papers/paper123/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper123/
+ Deciphering Still Life Artworks with Linked Open Data (short paper)
Authors: Bruno Sartini
Presented in Session 1A: Visual Arts and Art History
Paper: Download PDF
Abstract The still life genre is a good example of how even the simplest elements depicted in an artwork can be carriers of deeper, symbolic meanings that influence the overall artistic interpretation of it. In this paper, we present an ongoing study on the use of linked open data (LOD) to quantitatively analyze the symbolic meanings of still life paintings.
+
+
+
+ http://2024.computational-humanities-research/papers/paper124/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper124/
+ Once More, With Feeling: Measuring Emotion of Acting Performances in Contemporary American Film (long paper)
Authors: Naitian Zhou and David Bamman
Presented in Session 7B: Measuring Emotion & Sentiment
Paper: Download PDF
Abstract Narrative film is a composition of writing, cinematography, editing, and performance. While much computational work has focused on the writing or visual style in film, we conduct in this paper a computational exploration of acting performance. Applying speech emotion recognition models and a variationist sociolinguistic analytical framework to a corpus of popular, contemporary American film, we find narrative structure, diachronic shifts, and genre- and dialogue-based constraints located in spoken performances.
+
+
+
+ http://2024.computational-humanities-research/papers/paper128/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper128/
+ Automatic Translation Alignment Pipeline for Multilingual Digital Editions of Literary Works (short paper)
Authors: Maria Levchenko
Presented in Session 6B: Multilingualism & Translation Studies
Paper: Download PDF
Abstract This paper investigates the application of translation alignment algorithms in the creation of a Multilingual Digital Edition (MDE) of Alessandro Manzoni's Italian novel I promessi sposi ("The Betrothed"), with translations in eight languages (English, Spanish, French, German, Dutch, Polish, Russian and Chinese) from the 19th and 20th centuries.
+
+
+
+ http://2024.computational-humanities-research/papers/paper13/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper13/
+ Beyond the Register: Demographic Modeling of Arrest Patterns in 1879-1880 Brussels (long paper)
Authors: Folgert Karsdorp, Mike Kestemont and Margo De Koster
Presented in Session 7A: Social Patterns
Paper: Download PDF
Abstract Unseen species models from ecology have recently been applied to censored historical cultural datasets to estimate unobserved populations. We extend this approach to historical criminology, analyzing the police registers of Brussels' Amigo prison (1879-1880) using the Generalized Chao estimator.
+
+
+
+ http://2024.computational-humanities-research/papers/paper130/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper130/
+ Characterizing the Subversion of Social Relationships on Television (long paper)
Authors: Kent Chang, Anna Ho and David Bamman
Presented in Session 1B: Classification & Information Extraction
Paper: Download PDF
Abstract Television is often seen as a site for subcultural identification and subversive fantasy, including in queer cultures. How might we measure subversion, or the degree to which the depiction of social relationship between a dyad (e.g. two characters who are colleagues) deviates from its typical representation on TV?
+
+
+
+ http://2024.computational-humanities-research/papers/paper132/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper132/
Treating Games as Plays? Computational Approaches to the Detection of Scenes of Game Dialogs (short paper)
Authors: Martin Schlenk, Thomas Efer and Manuel Burghardt
Presented in Session 8B: Popular Media
Paper: Download PDF
Abstract Digital games are a complex multimodal phenomenon that is examined in a variety of ways by the highly interdisciplinary field of game studies. In this article, we focus on the structural aspect of the diegetic language of games and examine the extent to which established methods of computational drama analysis can also be successfully applied to digital games.
+
+
+
+ http://2024.computational-humanities-research/papers/paper135/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper135/
+ Early Modern Book Catalogues and Multilingualism: Identifying Multilingual Texts and Translations Using Titles (long paper)
Authors: Yann Ciarán Ryan and Margherita Fantoli
Presented in Session 6B: Multilingualism & Translation Studies
Paper: Download PDF
Abstract With this paper we aim to assess whether Early Modern book titles can be exploited to track two aspects of multilingualism in book publishing: publications featuring multiple languages and the distinction between editions of works in their original language and in translation.
+
+
+
+ http://2024.computational-humanities-research/papers/paper137/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper137/
+ On the Unity of Literary Change: The Development of Emotions in German Poetry, Prose, and Drama between 1850 and 1920 as a Test Case (long paper)
Authors: Leonard Konle, Merten Kröncke, Fotis Jannidis and Simone Winko
Presented in Session 8A: Cultural Dynamics
Paper: Download PDF
Abstract In this study, we use the development of emotions in German-language poetry, drama, and prose from 1850 to 1920 to informally test three hypotheses about literature: (1) Literature is a unified field, and therefore genres develop similarly at the same time.
+
+
+
+ http://2024.computational-humanities-research/papers/paper141/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper141/
+ Computational Segmentation of Wayang Kulit Video Recordings using a Cross-Attention Temporal Model (short paper)
Authors: Shawn Hong Wei Liew and Miguel Escobar Varela
Presented in Session 1A: Visual Arts and Art History
Paper: Download PDF
Abstract We report preliminary findings on a novel approach to automatically segment Javanese wayang kulit (traditional leather puppet) performances using computational methods. We focus on identifying comic interludes, which have been the subject of scholarly debate regarding their increasing duration.
+
+
+
+ http://2024.computational-humanities-research/papers/paper15/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper15/
+ Abbreviation Application: A Stylochronometric Study of Abbreviations in the Oeuvre of Herne’s Speculum Scribe (short paper)
Authors: Caroline Vandyck and Mike Kestemont
Presented in Session 3B: Stylometry
Paper: Download PDF
Abstract This research examines the Carthusian monastery of Herne, a major cultural hotspot during the Middle Ages. Between 1350 and 1400, the monks residing in Herne produced an impressive 46 production units, with 40 of them written in the Middle Dutch vernacular.
+
+
+
+ http://2024.computational-humanities-research/papers/paper17/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper17/
+ Integrating Visual and Textual Inputs for Searching Large-Scale Map Collections with CLIP (long paper)
Authors: James Mahowald and Benjamin Lee
Presented in Session 5B: Search & Discovery
Paper: Download PDF
Abstract Despite the prevalence and historical importance of maps in digital collections, current methods of navigating and exploring map collections are largely restricted to catalog records and structured metadata. In this paper, we explore the potential for interactively searching large-scale map collections using natural language inputs ("maps with sea monsters"), visual inputs (i.
+
+
+
+ http://2024.computational-humanities-research/papers/paper18/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper18/
+ Quantifying Linguistic and Cultural Change in China, 1900-1950 (long paper)
Authors: Spencer Stewart
Presented in Session 5A: Linguistic Change
Paper: Download PDF
Abstract This paper presents a quantitative approach to studying linguistic and cultural change in China during the first half of the twentieth century, a period that remains understudied in computational humanities research. The dramatic changes in Chinese language and culture during this time call for greater reflection on the tools and methods used for text analysis.
+
+
+
+ http://2024.computational-humanities-research/papers/paper19/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper19/
+ Literary Time Travel: Distinguishing Past and Contemporary Worlds in Danish and Norwegian Fiction (long paper)
Authors: Jens Bjerring-Hansen, Ali Al-Laith, Daniel Hershcovich, Alexander Conroy and Sebastian Ørtoft Rasmussen
Presented in Session 2A: Literature
Paper: Download PDF
Abstract The classification of historical and contemporary novels is a nuanced task that has traditionally relied on expert literary analysis. This paper introduces a novel dataset comprising Danish and Norwegian novels from the last 30 years of the 19 th century, annotated by literary scholars to distinguish between historical and contemporary works.
+
+
+
+ http://2024.computational-humanities-research/papers/paper20/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper20/
+ Viability of Zero-Shot Classification and Search of Historical Photos (long paper)
Authors: Erika Maksimova, Mari-Anna Meimer, Mari Piirsalu and Priit Järv
Presented in Session 1A: Visual Arts and Art History
Paper: Download PDF
Abstract Multimodal neural networks are models that learn concepts in multiple modalities. The models can perform tasks like zero-shot classification: associating images with textual labels without specific training. This promises both easier and more flexible use of digital photo archives, e.
+
+
+
+ http://2024.computational-humanities-research/papers/paper21/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper21/
+ The Birth of French Orthography: A Computational Analysis of French Spelling Systems in Diachrony (long paper)
Authors: Simon Gabay and Thibault Clérice
Presented in Session 5A: Linguistic Change
Paper: Download PDF
Abstract The 17th c. is crucial for the French language, as it sees the creation of a strict orthographic norm that largely persists to this day. Despite its significance, the history of spelling systems remains, however, an overlooked area in French linguistics for two reasons.
+
+
+
+ http://2024.computational-humanities-research/papers/paper30/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper30/
+ Does Context Matter? Enhancing Handwritten Text Recognition with Metadata in Historical Manuscripts (long paper)
Authors: Benjamin Kiessling and Thibault Clérice
Presented in Session 4B: Automatic Text Recognition
Paper: Download PDF
Abstract The digitization of historical manuscripts has significantly advanced in recent decades, yet many documents remain as images without machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into text, facilitating large-scale analysis of historical collections.
+
+
+
+ http://2024.computational-humanities-research/papers/paper35/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper35/
+ Enhancing Arabic Maghribi Handwritten Text Recognition with RASAM 2: A Comprehensive Dataset and Benchmarking (long paper)
Authors: Chahan Vidal-Gorène, Clément Salah, Noëmie Lucas, Aliénor Decours-Perez and Antoine Perrier
Presented in Session 4B: Automatic Text Recognition
Paper: Download PDF
Abstract Recent advancements in handwritten text recognition (HTR) for historical documents have demonstrated high performance on cursive Arabic scripts, achieving accuracy comparable to Latin scripts. The initial RASAM dataset, focused on three Arabic Maghribi manuscripts, facilitated rapid coverage of new documents via fine-tuning.
+
+
+
+ http://2024.computational-humanities-research/papers/paper36/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper36/
+ Global Coherence, Local Uncertainty: Towards a Theoretical Framework for Assessing Literary Quality (short paper)
Authors: Yuri Bizzoni, Pascale Moreira and Kristoffer Nielbo
Presented in Session 2A: Literature
Paper: Download PDF
Abstract A theoretical framework for evaluating literary quality through analyzing narrative structures using simplified narrative representations in the form of story arcs is presented. This framework proposes two complementary models: the first employs Approximate Entropy to measure local unpredictability, while the second utilizes fractal analysis to assess global coherence.
+
+
+
+ http://2024.computational-humanities-research/papers/paper39/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper39/
+ Epistemic Capture through Specialization in Post-World War II Parliamentary Debate (long paper)
Authors: Ruben Ros and Melvin Wevers
Presented in Session 7A: Social Patterns
Paper: Download PDF
Abstract This study examines specialization in Dutch Lower House debates between 1945 and 1994. We study how specialization translates into the phenomenon of "epistemic capture" in democratic politics. We combine topic modeling, network analysis and community detection to complement lexical "distant reading" approaches to the history of political ideas with a network-based analysis that illuminates political-intellectual processes.
+
+
+
+ http://2024.computational-humanities-research/papers/paper42/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper42/
+ Computational Paleography of Medieval Hebrew Scripts (short paper)
Authors: Berat Kurar-Barakat, Daria Vasyutinsky-Shapira, Sharva Gogawale, Mohammad Suliman and Nachum Dershowitz
Presented in Session 6B: Multilingualism & Translation Studies
Paper: Download PDF
Abstract We present ongoing work as part of an international multidisciplinary project, called MiDRASH, on the computational analysis of medieval manuscripts. We focus here on clustering manuscripts written in Ashkenazi square script using a dataset of 206 pages from 59 manuscripts.
+
+
+
+ http://2024.computational-humanities-research/papers/paper46/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper46/
+ Models of Literary Evaluation and Web 2.0: An Annotation Experiment with Goodreads Reviews (long paper)
Authors: Simone Rebora and Gabriele Vezzani
Presented in Session 6A: Annotation
Paper: Download PDF
Abstract In the context of Web 2.0, user-generated reviews are becoming more and more prominent. The particular case of book reviews, often shared through digital social reading platforms such as Goodreads or Wattpad, is of particular interest, in that it offers scholars data regarding literary reception of unprecedented size and diversity.
+
+
+
+ http://2024.computational-humanities-research/papers/paper49/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper49/
+ Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media (long paper)
Authors: Ross Deans Kristensen-McLachlan, Rebecca M. M. Hicke, Márton Kardos and Mette Thunø
Presented in Session 8A: Cultural Dynamics
Paper: Download PDF
Abstract Does the People's Republic of China (PRC) interfere with European elections through ethnic Chinese diaspora media? This question forms the basis of an ongoing research project exploring how PRC narratives about European elections are represented in Chinese diaspora media, and thus the objectives of PRC news media manipulation.
+
+
+
+ http://2024.computational-humanities-research/papers/paper52/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper52/
+ Extracting social connections from Finnish Karelian refugee interviews using LLMs (long paper)
Authors: Joonatan Laato, Jenna Kanerva, John Loehr, Virpi Lummaa and Filip Ginter
Presented in Session 1B: Classification & Information Extraction
Paper: Download PDF
Abstract We performed a zero-shot information extraction study on a historical collection of 89,339 brief Finnish-language interviews of refugee families relocated post-WWII from Finnish Eastern Karelia. Our research objective is two-fold. First, we aim to extract social organizations and hobbies from the free text of the interviews, separately for each family member.
+
+
+
+ http://2024.computational-humanities-research/papers/paper55/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper55/
+ Visual Navigation of Digital Libraries: Retrieval and Classification of Images in the National Library of Norway’s Digitised Book Collection (short paper)
Authors: Marie Roald, Magnus Breder Birkenes and Lars Johnsen
Presented in Session 5B: Search & Discovery
Paper: Download PDF
Abstract Digital tools for text analysis have long been essential for the searchability and accessibility of digitised library collections. Recent computer vision advances have introduced similar capabilities for visual materials, with deep learning-based embeddings showing promise for analysing visual heritage.
+
+
+
+ http://2024.computational-humanities-research/papers/paper57/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper57/
+ Admiration and Frustration: A Multidimensional Analysis of Fanfiction (long paper)
Authors: Mia Jacobsen and Ross Deans Kristensen-McLachlan
Presented in Session 8B: Popular Media
Paper: Download PDF
Abstract Why do people write fanfiction? How, if at all, does fanfiction differ from the source material on which it is based? In this paper, we use quantitative text analysis to address these questions by investigating linguistic differences and similarities between fan-produced texts and their original sources.
+
+
+
+ http://2024.computational-humanities-research/papers/paper59/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper59/
+ Recognizing Non-named Spatial Entities in Literary Texts: A Novel Spatial Entities Classifier (short paper)
Authors: Daniel Kababgi, Giulia Grisot, Federico Pennino and Berenike Herrmann
Presented in Session 2A: Literature
Paper: Download PDF
Abstract Predicting spatial representations in literature is a challenging task that requires advanced machine learning methods and manual annotations. In this paper, we present a study that leverages manual annotations and a BERT language model to automatically detect and recognise non-named spatial entities in a historical corpus of Swiss novels.
+
+
+
+ http://2024.computational-humanities-research/papers/paper6/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper6/
+ Quantitative Framework for Word-Color Association and Application to 20th Century Anglo-American Poetry (long paper)
Authors: Sungpil Wang and Juyong Park
Presented in Session 2B: Semantic Analysis
Paper: Download PDF
Abstract Color symbolism is considered a critical element in art and literature, yet determining the relationship between colors and words has remained largely subjective. This research presents a systematic methodology for quantifying the correlation between language and color. We utilize text-based image search, optical character recognition (OCR), and advanced image processing techniques to establish a connection between words and their corresponding color distributions in the CIELch color space.
+
+
+
+ http://2024.computational-humanities-research/papers/paper60/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper60/
+ SCIENCE IS EXPLORATION: Computational Frontiers for Conceptual Metaphor Theory (short paper)
Authors: Rebecca M. M. Hicke and Ross Deans Kristensen-McLachlan
Presented in Session 5A: Linguistic Change
Paper: Download PDF
Abstract Metaphors are everywhere. They appear extensively across all domains of natural language, from the most sophisticated poetry to seemingly dry academic prose. A significant body of research in the cognitive science of language argues for the existence of conceptual metaphors, the systematic structuring of one domain of experience in the language of another.
+
+
+
+ http://2024.computational-humanities-research/papers/paper61/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper61/
+ Bootstrap Distance Imposters: High Precision Authorship Verification with Improved Interpretability (long paper)
Authors: Ben Nagy
Presented in Session 3B: Stylometry
Paper: Download PDF
Abstract This paper describes an update to the open-source Python implementation of the General Imposters method of authorship verification by Mike Kestemont et al. The new algorithm, called Bootstrap Distance Imposters (henceforth BDI), incorporates a key improvement introduced by Potha and Stamatatos, as well as introducing a novel method of bootstrapping that has several attractive properties when compared to the reference algorithm.
+
+
+
+ http://2024.computational-humanities-research/papers/paper62/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper62/
+ Addressing Uncertainty According to the Annotator’s Expertise in Archaeological Data Collections: an Approach from Fuzzy Logic (short paper)
Authors: Patricia Martin-Rodilla and Leticia Tobalina
Presented in Session 6A: Annotation
Paper: Download PDF
Abstract Archaeological data allow us to synthetically represent the past of individuals and communities over time. This complex representation task requires an amalgamation of variables and makes vagueness intrinsic to the data. The study of vagueness as a dimension of archaeological data has become a dynamic focus of archaeologists' work in recent years, producing theoretical and practical approaches, mainly based on fuzzy logic, for the representation of archaeological variables.
+
+
+
+ http://2024.computational-humanities-research/papers/paper67/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper67/
+ In the Context of Narrative, We Never Properly Defined the Concept of Valence (long paper)
Authors: Peter Boot, Angel Daza, Carsten Schnober and Willem van Hage
Presented in Session 7B: Measuring Emotion & Sentiment
Paper: Download PDF
Abstract Valence is a concept that is increasingly being used in the computational study of narrative texts. We discuss the history of the concept and show that the word has been interpreted in various ways.
+
+
+
+ http://2024.computational-humanities-research/papers/paper70/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper70/
+ Locating the Leading Edge of Cultural Change (short paper)
Authors: Sarah Griebel, Becca Cohen, Lucian Li, Jiayu Liu, Jaihyun Park, Jana Perkins and Ted Underwood
Presented in Session 8A: Cultural Dynamics
Paper: Download PDF
Abstract Measures of textual similarity and divergence are increasingly used to study cultural change. But which measures align, in practice, with social evidence about change? We apply three different representations of text (topic models, document embeddings, and word-level perplexity) to three different corpora (literary studies, economics, and fiction).
+
+
+
+ http://2024.computational-humanities-research/papers/paper71/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper71/
+ Transformation of Composition and Gaze Interaction in Noli Me Tangere Depictions from 1300–1600 (short paper)
Authors: Pepe Ballesteros Zapata, Nina Arnold, Vappu Vilhelmiina Lukander, Ludovica Schaerf and Dario Negueruela del Castillo
Presented in Session 1A: Visual Arts and Art History
Paper: Download PDF
Abstract This paper examines the development of figure composition and gaze dynamics between Mary Magdalene and Christ in Italian noli me tangere depictions from 1300 to 1600 in the context of the emergence of perspective painting.
+
+
+
+ http://2024.computational-humanities-research/papers/paper73/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper73/
+ Page Embeddings: Extracting and Classifying Historical Documents with Generic Vector Representations (short paper)
Authors: Carsten Schnober, Renate Smit, Manjusha Kuruppath, Kay Pepping, Leon van Wissen and Lodewijk Petram
Presented in Session 1B: Classification & Information Extraction
Paper: Download PDF
Abstract We propose a neural network architecture designed to generate region and page embeddings for boundary detection and classification of documents within a large and heterogeneous historical archive. Our approach is versatile and can be applied to other tasks and datasets.
+
+
+
+ http://2024.computational-humanities-research/papers/paper74/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper74/
+ Direct and Indirect Annotation with Generative AI: A Case Study into Finding Animals and Plants in Historical Text (short paper)
Authors: Arjan van Dalfsen, Folgert Karsdorp, Ayoub Bagheri, Thirza van Engelen, Dieuwertje Mentink and Els Stronks
Presented in Session 6A: Annotation
Paper: Download PDF
Abstract This study explores the use of generative AI (GenAI) for annotation in the humanities, comparing direct and indirect annotation approaches with human annotations. Direct annotation involves using GenAI to annotate the entire corpus, while indirect annotation uses GenAI to create training data for a specialized model.
+
+
+
+ http://2024.computational-humanities-research/papers/paper75/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper75/
+ Combining Automatic Annotation Tools with Human Validation for the Semantic Enrichment of Cultural Heritage Metadata (long paper)
Authors: Eirini Kaldeli, Alexandros Chortaras, Vassilis Lyberatos, Jason Liartis, Spyridon Kantarelis and Giorgos Stamou
Presented in Session 6A: Annotation
Paper: Download PDF
Abstract The addition of controlled terms from linked open datasets and vocabularies to metadata can increase the discoverability and accessibility of digital collections. However, the task of semantic enrichment requires a lot of effort and resources that cultural heritage organizations often lack.
+
+
+
+ http://2024.computational-humanities-research/papers/paper76/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper76/
+ Literary Canonicity and Algorithmic Fairness: The Effect of Author Gender on Classification Models (long paper)
Authors: Ida Marie S. Lassen, Pascale Feldkamp, Yuri Bizzoni and Kristoffer Nielbo
Presented in Session 3A: Literary Canon & Reception
Paper: Download PDF
Abstract This study examines gender biases in machine learning models that predict literary canonicity. Using algorithmic fairness metrics like equality of opportunity, equalised odds, and calibration within groups, we show that models violate the fairness metrics, especially by misclassifying non-canonical books by men as canonical.
+
+
+
+ http://2024.computational-humanities-research/papers/paper79/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper79/
+ Exploration of Event Extraction Techniques in Late Medieval and Early Modern Administrative Records (short paper)
Authors: Ismail Prada Ziegler
Presented in Session 1B: Classification & Information Extraction
Paper: Download PDF
Abstract While an increasing amount of studies exploring named entity recognition in historical corpora are published, application of other information extraction tasks such as event extraction remains scarce. This study explores two accessible methods to facilitate the detection of events and the classification of entities into roles: rule-based systems and RNN-based machine learning techniques.
+
+
+
+ http://2024.computational-humanities-research/papers/paper82/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper82/
+ Assessing Landscape Intervisibility and Prominence at Regional Scale (short paper)
Authors: Adela Sobotkova
Presented in Session 1A: Visual Arts and Art History
Paper: Download PDF
Abstract Visibility and intervisibility have been important aspects of spatial analysis in landscape archaeological studies, but remain hampered by computational intensity, small-scale study area, edge effects, and bare-earth digital elevation models. This paper assesses intervisibility and prominence in a dataset of over 1000 burial mounds in the Middle Tundzha River watershed in Bulgaria.
+
+
+
+ http://2024.computational-humanities-research/papers/paper86/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper86/
+ Univariate Statistical Analysis of a Non-Canonical Literary Genre: Quantifying German-Language One-Act Plays (1740–1850) (long paper)
Authors: Viktor J. Illmer, Dîlan Canan Çakir, Frank Fischer and Lilly Welz
Presented in Session 3A: Literary Canon & Reception
Paper: Download PDF
Abstract This article explores the use of metadata to analyse German-language one-act plays from 1740 to 1850, addressing the need to expand beyond canonical texts in literary studies. Utilising the Database of German-Language One-Act Plays, we examine aspects such as the number of scenes and characters as well as the role of different original languages on which the translated plays in the corpus are based.
+
+
+
+ http://2024.computational-humanities-research/papers/paper9/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper9/
+ Multilingual Stylometry: The influence of language and corpus composition on the performance of authorship attribution using corpora from the European Literary Text Collection (ELTeC) (long paper)
Authors: Christof Schoech, Julia Dudar, Evgeniia Fileva and Artjoms Šeļa
Presented in Session 3B: Stylometry
Paper: Download PDF
Abstract Stylometric authorship attribution is concerned with the task of assigning texts of unknown, pseudonymous or disputed authorship to their most likely author, often based on a comparison of the frequency of a selected set of features that represent the texts.
+
+
+
+ http://2024.computational-humanities-research/papers/paper90/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper90/
+ Animacy in German Folktales (short paper)
Authors: Julian Häußler, Janis von Keitz and Evelyn Gius
Presented in Session 2A: Literature
Paper: Download PDF
Abstract This paper explores the phenomenon of animacy in prose by the example of German folktales. We present a manually annotated corpus of 19 German folktales from the Brothers Grimm collection and train a classifier on these annotations. Building on previous work in animacy detection, we evaluate the classifier’s performance and its application to a larger corpus.
+
+
+
+ http://2024.computational-humanities-research/papers/paper92/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper92/
+ Domain Adaptation with Linked Encyclopedic Data: A Case Study for Historical German (long paper)
Authors: Thora Hagen
Presented in Session 2B: Semantic Analysis
Paper: Download PDF
Abstract This paper outlines a proposal for the use of knowledge graphs for historical German domain adaptation. From the EncycNet project, the encyclopedia-based knowledge graph from the early 20th century was borrowed to examine whether text-based domain adaptation using the source encyclopedia's text or graph-based adaptation produces a better domain-specific model.
+
+
+
+ http://2024.computational-humanities-research/papers/paper93/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper93/
+ And then I saw it: Testing Hypotheses on Turning Points in a Corpus of UFO Sighting Reports (short paper)
Authors: Jan Langenhorst, Robert C. Schuppe and Yannick Frommherz
Presented in Session 7A: Social Patterns
Paper: Download PDF
Abstract As part of developing a Computational Narrative Understanding, modeling events within stories has recently received significant attention within the digital humanities community. Most of the current research aims at good performance when predicting events.
+
+
+
+ http://2024.computational-humanities-research/papers/paper94/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper94/
+ Revolution + Love: Measuring the Entanglements of Violence and Emotions in Post-1949 China (short paper)
Authors: Maciej Kurzynski
Presented in Session 7A: Social Patterns
Paper: Download PDF
Abstract This paper examines the relationship between violent discourse and emotional intensity in the early revolutionary rhetoric of the People's Republic of China (PRC). Using two fine-tuned bert-base-chinese models—one for detecting violent content in texts and another for assessing their affective charge—we analyze over 185,000 articles published between 1956 and 1989 in the People's Liberation Army Daily (Jiefangjun Bao), the official journal of China's armed forces.
+
+
+
+ http://2024.computational-humanities-research/papers/paper95/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper95/
+ Tracing the Development of the Virtual Particle Concept Using Semantic Change Detection (long paper)
Authors: Michael Zichert and Adrian Wüthrich
Presented in Session 2B: Semantic Analysis
Paper: Download PDF
Abstract Virtual particles are peculiar objects. They figure prominently in much of theoretical and experimental research in elementary particle physics. But exactly what they are is far from obvious. In particular, to what extent they should be considered real remains a matter of controversy in philosophy of science.
+
+
+
+ http://2024.computational-humanities-research/papers/paper96/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper96/
+ Remember to forget: A study on verbatim memorization of literature in Large Language Models (long paper)
Authors: Xinhao Zhang, Olga Seminck and Pascal Amsili
Presented in Session 4A: Large Language Models
Paper: Download PDF
Abstract We examine the extent to which English and French literature is memorized by freely accessible LLMs, using a name cloze inference task (which focuses on the model's ability to recall proper names from a book).
+
+
+
+ http://2024.computational-humanities-research/papers/paper97/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper97/
+ Latent Structures of Intertextuality in French Fiction (short paper)
Authors: Jean Barré
Presented in Session 2A: Literature
Paper: Download PDF
Abstract Intertextuality is a key concept in literary theory that challenges traditional notions of text, signification or authorship. It views texts as part of a vast intertextual network that is constantly evolving and being reconfigured. This paper argues that the field of computational literary studies is the ideal place to conduct a study of intertextuality, since we now have the ability to systematically compare texts with each other.
+
+
+
+ http://2024.computational-humanities-research/papers/paper98/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/paper98/
+ Sentiment Below the Surface: Omissive and Evocative Strategies in Literature and Beyond (long paper)
Authors: Pascale Feldkamp, Ea Lindhardt Overgaard, Kristoffer Nielbo and Yuri Bizzoni
Presented in Session 7B: Measuring Emotion & Sentiment
Paper: Download PDF
Abstract As they represent one of the most complex forms of expression, literary texts continue to challenge Sentiment Analysis (SA) tools, often developed for other domains. At the same time, SA is becoming an increasingly central method in literary analysis itself, which raises the question of what challenges are inherent to literary SA.
+ About
http://2024.computational-humanities-research/about/
@@ -85,6 +484,13 @@
http://2024.computational-humanities-research/about/ The Computational Humanities Research (CHR) community is an international and interdisciplinary community that supports researchers with an interest in computational approaches to the humanities. Learn more about the people behind CHR2024 and the organisation of CHR: People → Community → Archive →
+
+ Accepted Papers
+ http://2024.computational-humanities-research/papers/
+ Mon, 01 Jan 0001 00:00:00 +0000
+ http://2024.computational-humanities-research/papers/
+ Session 1A: Visual Arts and Art History
Viability of Zero-Shot Classification and Search of Historical Photos, by Erika Maksimova, Mari-Anna Meimer, Mari Piirsalu and Priit Järv
Transformation of Composition and Gaze Interaction in Noli Me Tangere Depictions from 1300–1600, by Pepe Ballesteros Zapata, Nina Arnold, Vappu Vilhelmiina Lukander, Ludovica Schaerf and Dario Negueruela del Castillo
Deciphering Still Life Artworks with Linked Open Data, by Bruno Sartini
Computational Segmentation of Wayang Kulit Video Recordings using a Cross-Attention Temporal Model, by Shawn Hong Wei Liew and Miguel Escobar Varela
+ Accommodation in Aarhus
http://2024.computational-humanities-research/venue/accomodation-in-aarhus/
@@ -104,7 +510,7 @@
http://2024.computational-humanities-research/programme/
Mon, 01 Jan 0001 00:00:00 +0000http://2024.computational-humanities-research/programme/
- Programme for the pre-conference workshops on Tuesday, 3rd December 2024, and the main conference days on Wednesday, Thursday, and Friday, 4th-6th December 2024.
Please note that all times are in Central European Time (CET)
Tuesday Wednesday Thursday Friday Tuesday, December 3, 2024 (Pre-conference workshops) 09:00-12:30: Workshop sessions 12:30-13:30: Lunch 13:30-17:00: Workshop sessions Parallel Workshops Digital Methods for Mythological Research dm4myth aims to bring together researchers from various disciplines who are interested in studying myths with digital tools and methods.
+ Programme for the pre-conference workshops on Tuesday, 3rd December 2024, and the main conference days on Wednesday, Thursday, and Friday, 4th-6th December 2024.
You can also get an overview of all accepted papers here.
Note: Changes may occur in the programme. Please check regularly for the latest information.
Tuesday Wednesday Thursday Friday Tuesday, December 3, 2024 (Pre-conference workshops) Note: Times are in Central European Time (CET)
09:00-12:30: Workshop sessions 12:30-13:30: Lunch 13:30-17:00: Workshop sessions Parallel Workshops Digital Methods for Mythological Research dm4myth aims to bring together researchers from various disciplines who are interested in studying myths with digital tools and methods.
Conference Dinner at Restaurant Havnær
diff --git a/docs/papers/index.html b/docs/papers/index.html
new file mode 100644
index 0000000..5ffa7e3
--- /dev/null
+++ b/docs/papers/index.html
@@ -0,0 +1,223 @@
+
+
+
+
+ Computational Humanities Research 2024
+
+
+
+
+
+
+
+
+
+
+
Many cultural institutions have made large digitized visual collections available online, often under permissible re-use licences. Creating interfaces for exploring and searching these collections is difficult, particularly in the absence of granular metadata. In this paper, we introduce a method for using state-of-the-art multimodal large language models (LLMs) to enable an open-ended, explainable search and discovery interface for visual collections. We show how our approach can create novel clustering and recommendation systems that avoid common pitfalls of methods based directly on visual embeddings. Of particular interest is the ability to offer concrete textual explanations of each recommendation without the need to preselect the features of interest. Together, these features can create a digital interface that is more open-ended and flexible while also being better suited to addressing privacy and ethical concerns. Through a case study using a collection of documentary photographs, we provide several metrics showing the efficacy and possibilities of our approach.
Live music concert analysis provides an opportunity to explore cultural and historical trends. The art of set-list construction, of which songs to play, has many considerations for an artist, and the notion of how much variety different artists play is an interesting topic. Online communities provide rich crowd-sourced encyclopaedic data repositories of live concert set-list data, facilitating the potential for quantitative analysis of live music concerts. In this paper, we explore data acquisition and processing of musical artists' tour histories and propose an approach to analyse and explore the notion of variety, at individual tour level, at artist career level, and for comparisons between a corpus of artists from different musical genres. We propose notions of a shelf and a tail as a means to help explore tour variety and explore how they can be utilised to help define a single metric of variety at tour level, and artist level. Our analysis highlights the wide diversity among artists in terms of their inclinations toward variety, whilst correlation analysis demonstrates how our measure of variety remains robust across differing artist attributes, such as the number of tours and show lengths.
This study focuses on the problem of multilingual medieval text alignment, which presents specific challenges, due to the absence of modern punctuation in the texts and the non-standard forms of medieval languages. In order to perform the alignment of several witnesses from the multilingual tradition of the prose Lancelot, we first develop an automatic text segmenter based on BERT and then align the produced segments using Bertalign. This alignment is then used to produce stemmatological hypotheses, using phylogenetic methods. The aligned sequences are clustered independently by two human annotators and a clustering algorithm (DBScan), and the resulting variant tables submitted to maximum parsimony analysis, in order to produce trees. The trees are then compared and discussed in light of philological knowledge. Results tend to show that automatically clustered sequences can provide results comparable to those of human annotation.
Recent work on the textual features linked to literary quality has primarily focused on commercially published literature, such as canonical or best-selling novels, that are systematically filtered by editorial and market mechanisms. However, the biggest repositories of fiction texts currently in existence are free fanfiction websites, where fans post fictional stories about their favorite characters for the pleasure of writing and engaging with others. This makes them a particularly interesting domain to study the patterns of perceived quality "in the wild", where text-reader relations are less filtered. Moreover, since fanfiction is a community-built domain with its own conventions, comparing it to published literature can more generally provide insights into the reception and perceived quality of published literature itself. Taking a novel approach to the study of fanfiction, we observe whether three textual features associated with perceived literary quality in published texts are also relevant in the context of fanfiction. Using different reception proxies, we find that despite the differences of fanfiction from published literature, some "patterns of quality" associated with positive reception appear to hold similar effects in both of these contexts of literary production.
Digital libraries and databases of texts are invaluable resources for researchers, yet their reliance on printed editions can lead to significant gaps and potentially exclude works without printed reproductions. The Simtho database of Syriac serves as a pertinent example: it is derived primarily from OCR of scholarly editions, but how representative are these of the language's extensive literary tradition, transmitted and preserved in manuscript form for centuries? Taking the Simtho database and a selection of the Vatican Library's Syriac manuscript collection as a case study, we propose a pipeline that aligns a corpus of e-texts with a set of digitised manuscript images, in order to ascertain the presence or absence of texts between the e-text and manuscript corpora and thus contribute to their enrichment. We delve into the complexities of this task, evaluating both effective tools for alignment and approaches to detect factors that can contribute to alignment failures. This case study is intended as a first step towards foundational methodologies applicable to larger-scale manuscript processing efforts.
In this work, we survey the way in which classification is used as a sensemaking practice in cultural analytics, and assess where large language models can fit into this landscape. We identify ten tasks supported by publicly available datasets on which we empirically assess the performance of LLMs compared to traditional supervised methods, and explore the ways in which LLMs can be employed for sensemaking goals beyond mere accuracy. We find that prompt-based LLMs are competitive with traditional supervised models for established tasks, but perform less well on de novo tasks. In addition, LLMs can assist sensemaking by acting as an intermediary input to formal theory testing.
Applying stylometry to Authorship Attribution requires distilling the elements of an author's style sufficient to recognise their mark in anonymous documents. Often, this is accomplished by contrasting the frequency of selected features in the authors' works. A recent approach, CP2D, uses innovation processes to infer the author's identity, accounting for their propensity to introduce new elements. In this paper, we apply CP2D to a corpus of Classical Latin texts to test its effectiveness in a new context and explore the additional insight it can offer the scholar. We demonstrate its effectiveness on this corpus and show how, moving beyond maximum likelihood, we can visualise stylistic relationships and gather additional information on the relationships among documents.
Generating poetry has become a popular application of LLMs, perhaps especially of OpenAI's widely-used chatbot ChatGPT. What kind of poet is ChatGPT? Does ChatGPT have its own poetic style? Can it successfully produce poems in different styles? To answer these questions, we prompt the GPT-3.5 and GPT-4 models to generate English-language poems in 24 different poetic forms and styles, about 40 different subjects, and in response to 3 different writing prompt templates. We then analyze the resulting 5.7k poems, comparing them to a sample of 3.7k poems from the Poetry Foundation and the Academy of American Poets. We find that the GPT models, especially GPT-4, can successfully produce poems in a range of both common and uncommon English-language forms in superficial yet noteworthy ways, such as by producing poems of appropriate lengths for sonnets (14 lines), villanelles (19 lines), and sestinas (39 lines). But the GPT models also exhibit their own distinct stylistic tendencies, both within and outside of these specific forms. Our results show that GPT poetry is much more constrained and uniform than human poetry, showing a strong penchant for rhyme, quatrains (4-line stanzas), iambic meter, first-person plural perspectives (we, us, our), and specific vocabulary like "heart," "embrace," "echo," and "whisper."
The still life genre is a good example of how even the simplest elements depicted in an artwork can be carriers of deeper, symbolic meanings that influence the overall artistic interpretation of it. In this paper, we present an ongoing study on the use of linked open data (LOD) to quantitatively analyze the symbolic meanings of still life paintings. In particular, we propose two different experiments based on (i) the theory of the art historian Bergström, and (ii) the impact of the Floriography movement in still life. To do so, we extract and combine data from Wikidata, HyperReal, IICONGRAPH, and the ODOR dataset. This work shows promising results about the use of LOD for art-historical quantitative research, as we are able to confirm Bergström's theory and to pinpoint outliers in the Floriography context that can be the objects of specific, qualitative studies. We conclude the paper by reflecting on the current limitations surrounding art-historical data.
Narrative film is a composition of writing, cinematography, editing, and performance. While much computational work has focused on the writing or visual style in film, in this paper we conduct a computational exploration of acting performance. Applying speech emotion recognition models and a variationist sociolinguistic analytical framework to a corpus of popular, contemporary American film, we find narrative structure, diachronic shifts, and genre- and dialogue-based constraints located in spoken performances.
This paper investigates the application of translation alignment algorithms in the creation of a Multilingual Digital Edition (MDE) of Alessandro Manzoni's Italian novel I promessi sposi ("The Betrothed"), with translations in eight languages (English, Spanish, French, German, Dutch, Polish, Russian and Chinese) from the 19th and 20th centuries. We identify key requirements for the MDE to improve both the reader experience and support for translation studies. Our research highlights the limitations of current state-of-the-art algorithms when applied to the translation of literary texts and outlines an automated pipeline for MDE creation. This pipeline transforms raw texts into web-based, side-by-side representations of original and translated texts with different rendering options. In addition, we propose new metrics for evaluating the alignment of literary translations and suggest visualization techniques for future analysis.
Unseen species models from ecology have recently been applied to censored historical cultural datasets to estimate unobserved populations. We extend this approach to historical criminology, analyzing the police registers of Brussels' Amigo prison (1879-1880) using the Generalized Chao estimator. Our study aims to quantify the "dark number" of unarrested perpetrators and model demographic biases in policing efforts. We investigate how factors such as age, gender, and origin influence arrest vulnerability. While all examined covariates contribute positively to our model, their small effect sizes limit the model's predictive performance. Our findings largely align with prior historical scholarship but suggest that demographic factors alone may insufficiently explain arrest patterns. The Generalized Chao estimator modestly improves population size estimates compared to simpler models. However, our results indicate that more refined models or additional data may be necessary for robust estimates in historical criminological studies. This work contributes to the growing field of computational methods in humanities research and offers insights into the challenges of quantifying hidden populations in historical datasets.
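For intuition, here is a minimal sketch of the classical Chao1 estimator underlying the Generalized Chao approach named above; the paper's covariate-aware estimator is more elaborate, and the toy "arrest register" below is invented.

```python
from collections import Counter

def chao1(observations):
    """Lower-bound estimate of total population size from capture counts."""
    counts = Counter(observations)
    s_obs = len(counts)                             # distinct individuals seen
    f1 = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # seen exactly twice
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2            # bias-corrected fallback
    return s_obs + f1 ** 2 / (2 * f2)

# toy arrest register: each entry is one arrest of a perpetrator ID
print(chao1(["a", "a", "b", "c", "c", "d", "e"]))   # 5 observed, ~7.25 estimated
```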
Television is often seen as a site for subcultural identification and subversive fantasy, including in queer cultures. How might we measure subversion, or the degree to which the depiction of social relationship between a dyad (e.g. two characters who are colleagues) deviates from its typical representation on TV? To explore this question, we introduce the task of stereotypic relationship extraction. Built on cognitive stylistics, linguistic anthropology, and dialogue relation extraction, in this paper, we attempt to model the cognitive process of stereotyping TV characters in dialogic interactions: given a dyad, we want to predict: what social relationship do the speakers exhibit through their words? Subversion is then characterized by the discrepancy between the distribution of the model's predictions and the ground truth labels. To demonstrate the usefulness of this task and gesture at a methodological intervention, we enclose four case studies to characterize the representation of queer relationalities in the Big Bang Theory, Frasier, and Gilmore Girls as we explore the suspicious and reparative modes of reading with our computational methods.
Digital games are a complex multimodal phenomenon that is examined in a variety of ways by the highly interdisciplinary field of game studies. In this article, we focus on the structural aspect of the diegetic language of games and examine the extent to which established methods of computational drama analysis can also be successfully applied to digital games. Initial experiments show that both games and drama texts have an inventory of characters that drive the plot forward. In dramas, this plot is usually subdivided into individual acts and scenes. In games, however, such systematic segmentation is the exception rather than the rule, or if it is present, it is implemented very differently in different games. In this paper, we therefore focus on exploring alternative ways of making scene-like structures in game dialogs identifiable with the help of computers. These experiments open up exciting perspectives, raising the question of whether computer-aided methods of scene recognition, inspired by media such as games and films, can also be applied to classical dramas in order to fundamentally re-examine their historical-editorial scene classification.
With this paper we aim to assess whether Early Modern book titles can be exploited to track two aspects of multilingualism in book publishing: publications featuring multiple languages and the distinction between editions of works in their original language and in translation. To this end, we leverage the manually annotated language information available in two book catalogs: the Collectio Academica Antiqua, recording publications of scholars of the Old University of Leuven (1425-1797) and a subset of the Eighteenth Century Collections Online, namely publications of Ancient Greek and Latin works. We evaluate three different approaches: we train a simple tf-idf based support vector classifier, we fine-tune a multilingual transformer model (BERT) and we use a few-shot approach with a pre-trained sentence transformer model. In order to get a better understanding of the results, we make use of SHAP, a library for explaining the output of any machine learning model. We conclude that while the few-shot prediction is not currently usable for this task, the tf-idf approach and BERT fine-tuning are comparable and both usable. BERT shows better results for the task of identifying translations and when generalizing across different datasets.
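As a rough illustration of the tf-idf baseline described above, here is a scikit-learn sketch; the titles and labels are invented stand-ins for the annotated catalog data, and the SHAP explanation step is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# invented stand-ins for annotated catalog titles
titles = [
    "Institutiones linguae graecae, in usum studiosae iuventutis",
    "The Iliad of Homer, translated into English blank verse",
    "Opera omnia, graece et latine, ad optimorum librorum fidem",
    "A new translation of Virgil's Aeneid into English verse",
]
labels = ["original", "translation", "original", "translation"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
    LinearSVC(),
)
model.fit(titles, labels)
print(model.predict(["Homeri Ilias, graece et latine"]))
```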
In this study, we use the development of emotions in German-language poetry, drama, and prose from 1850 to 1920 to informally test three hypotheses about literature: (1) Literature is a unified field, and therefore genres develop similarly at the same time. (2) The development of literature is led by one genre while the others follow. (3) The three main genres have very different developments without any relation to each other. We look at the development of emotions in these genres in general, and then at more fine-grained levels: polarity, six groups of emotions, and the group of love emotions. In the end, our data cannot confirm any of these hypotheses, but do show a closer relationship between poetry and prose, while drama shows a very distinct development. Only in some specific cases, such as the representation of lust and of love, can we see a closer relationship between the genres in general.
We report preliminary findings on a novel approach to automatically segment Javanese wayang kulit (traditional leather puppet) performances using computational methods. We focus on identifying comic interludes, which have been the subject of scholarly debate regarding their increasing duration. Our study employs action segmentation techniques from a Cross-Attention Temporal Model, adapting methods from computer vision to the unique challenges of wayang kulit videos. We manually labelled 100 video recordings of performances to create a dataset for training and testing our model. These videos, which are typically 7 hours long, were sampled from our comprehensive dataset of 12,638 videos uploaded to a video platform between 03 Jun 2012 and 30 Dec 2023. The resulting algorithm achieves an accuracy of 89.06 % in distinguishing between comic interludes and regular performance segments, with F1-scores of 96.53 %, 95.91 %, and 92.47 % at overlapping thresholds of 10 %, 25 %, and 50 % respectively. This work demonstrates the potential of computational approaches in analyzing traditional performing arts and other video material, offering new tools for quantitative studies of audiovisual cultural phenomena, and provides a foundation for future empirical research on the evolution of wayang kulit performances.
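For readers unfamiliar with the overlap-threshold F1 scores quoted above, here is a sketch of segmental F1 in the spirit of the action-segmentation literature; the segment tuples are invented.

```python
def iou(a, b):
    """Intersection-over-union of two (label, start, end) segments."""
    inter = max(0, min(a[2], b[2]) - max(a[1], b[1]))
    union = (a[2] - a[1]) + (b[2] - b[1]) - inter
    return inter / union if union else 0.0

def segmental_f1(truth, pred, threshold=0.25):
    matched, tp = set(), 0
    for p in pred:  # a prediction is a hit if it overlaps an unmatched
        for i, t in enumerate(truth):  # same-label ground-truth segment
            if i not in matched and p[0] == t[0] and iou(p, t) >= threshold:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = [("comic", 0, 120), ("play", 120, 600)]    # minutes, invented
pred = [("comic", 10, 130), ("play", 130, 600)]
print(segmental_f1(truth, pred, threshold=0.5))    # 1.0 at this threshold
```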
This research examines the Carthusian monastery of Herne, a major cultural hotspot during the Middle Ages. Between 1350 and 1400, the monks residing in Herne produced an impressive 46 production units, with 40 of them written in the Middle Dutch vernacular. Focusing on the monastery's most productive scribe, known as the Speculum Scribe, this case study employs methods from the field of scribal modelling to achieve two main objectives: first, to evaluate the potential for chronologically ordering the Speculum Scribe’s works based on his use of abbreviations, and second, to investigate whether there was a convergence in scribal practices, such as the use of abbreviations, among the scribes living in Herne. Although a complete chronological order of the Speculum Scribe's works could not be determined, we were able to establish his first work. Furthermore, the findings show evidence that cautiously supports the second goal, suggesting that the scribes in Herne indeed converged in their scribal habits by learning from each other.
Despite the prevalence and historical importance of maps in digital collections, current methods of navigating and exploring map collections are largely restricted to catalog records and structured metadata. In this paper, we explore the potential for interactively searching large-scale map collections using natural language inputs ("maps with sea monsters"), visual inputs (i.e., reverse image search), and multimodal inputs (an example map + "more grayscale"). As a case study, we adopt 562,842 images of maps publicly accessible via the Library of Congress's API. To accomplish this, we use the multimodal Contrastive Language-Image Pre-training (CLIP) machine learning model to generate embeddings for these maps, and we develop code to implement exploratory search capabilities with these input strategies. We present results for example searches created in consultation with staff in the Library of Congress's Geography and Map Division and describe the strengths, weaknesses, and possibilities for these search queries. Moreover, we introduce a fine-tuning dataset of 10,504 map-caption pairs, along with an architecture for fine-tuning a CLIP model on this dataset. To facilitate re-use, we provide all of our code in documented, interactive Jupyter notebooks and place all code into the public domain. Lastly, we discuss the opportunities and challenges for applying these approaches across both digitized and born-digital collections held by galleries, libraries, archives, and museums.
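A minimal sketch of this kind of natural-language map search, assuming a locally downloaded image folder and an off-the-shelf CLIP checkpoint via sentence-transformers; the folder path and model name are assumptions, not the paper's exact pipeline.

```python
from pathlib import Path
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")            # off-the-shelf CLIP

image_paths = sorted(Path("maps/").glob("*.jpg"))       # hypothetical folder
image_embs = model.encode([Image.open(p) for p in image_paths],
                          convert_to_tensor=True, normalize_embeddings=True)

query_embs = model.encode(["maps with sea monsters"],
                          convert_to_tensor=True, normalize_embeddings=True)

# rank the collection by cosine similarity to the text query
for hit in util.semantic_search(query_embs, image_embs, top_k=5)[0]:
    print(image_paths[hit["corpus_id"]], round(hit["score"], 3))
```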
This paper presents a quantitative approach to studying linguistic and cultural change in China during the first half of the twentieth century, a period that remains understudied in computational humanities research. The dramatic changes in Chinese language and culture during this time call for greater reflection on the tools and methods used for text analysis. This preliminary study offers a framework for analyzing Chinese texts from the late nineteenth and twentieth centuries, demonstrating how established methods such as word counts and word embeddings can provide new historical insights into the complex negotiations between Western modernity and Chinese cultural discourse.
The classification of historical and contemporary novels is a nuanced task that has traditionally relied on expert literary analysis. This paper introduces a novel dataset comprising Danish and Norwegian novels from the last 30 years of the 19th century, annotated by literary scholars to distinguish between historical and contemporary works. While this manual classification is time-consuming and subjective, our approach leverages pre-trained language models to streamline and potentially standardize this process. We evaluate their effectiveness in automating this classification by examining their performance on titles and the first few sentences of each novel. After fine-tuning, the models show good performance but fail to fully capture the nuanced understanding exhibited by literary scholars. This research underscores the potential and limitations of NLP in literary genre classification and suggests avenues for further improvement, such as incorporating more sophisticated model architectures or hybrid methods that blend machine learning with expert knowledge. Our findings contribute to the broader field of computational humanities by highlighting the challenges and opportunities in automating literary analysis.
Multimodal neural networks are models that learn concepts in multiple modalities. The models can perform tasks like zero-shot classification: associating images with textual labels without specific training. This promises both easier and more flexible use of digital photo archives, e.g. annotating and searching. We investigate whether existing multimodal models can perform these tasks when the data differs from typical computer vision training sets: historical photos from a cultural context outside the English-speaking world.
The 17th century is crucial for the French language, as it sees the creation of a strict orthographic norm that largely persists to this day. Despite its significance, the history of spelling systems however remains an overlooked area in French linguistics, for two reasons. On the one hand, spelling is made up of micro-changes, which requires a quantitative approach; on the other hand, no corpus is available, due to the interventions of editors in almost all the texts already available. In this paper, we therefore propose a new corpus allowing such a study, as well as the extraction and analysis tools necessary for our research. By comparing the text extracted with OCR and a version automatically aligned with contemporary French spelling, we extract the variant zones, we categorise these variants, and we observe their frequency to study the (ortho)graphic change during the 17th century.
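A toy sketch of the variant-zone idea: align an original-spelling line with its modernised counterpart and keep the mismatching spans. The example line and the use of difflib are illustrative, not the paper's tooling.

```python
from difflib import SequenceMatcher

old = "ie sçay bien que les effects en sont estranges"   # invented example line
new = "je sais bien que les effets en sont étranges"     # modernised spelling

a, b = old.split(), new.split()
for op, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
    if op != "equal":  # a variant zone
        print(" ".join(a[i1:i2]), "->", " ".join(b[j1:j2]))
```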
The digitization of historical manuscripts has significantly advanced in recent decades, yet many documents remain as images without machine-readable text. Handwritten Text Recognition (HTR) has emerged as a crucial tool for converting these images into text, facilitating large-scale analysis of historical collections. In 2024, the CATMuS Medieval dataset was released, featuring extensive diachronic coverage and a variety of languages and script types. Previous research indicated that model performance degraded on the best manuscripts over time as more data was incorporated, likely due to over-generalization. This paper investigates the impact of incorporating contextual metadata in training HTR models using the CATMuS Medieval dataset to mitigate this effect. Our experiments compare the performance of various model architectures, focusing on Conformer models with and without contextual inputs, as well as Conformer models trained with auxiliary classification tasks. Results indicate that Conformer models utilizing semantic contextual tokens (Century, Script, Language) outperform baseline models, particularly on challenging manuscripts. The study underscores the importance of metadata in enhancing model accuracy and robustness across diverse historical texts.
Recent advancements in handwritten text recognition (HTR) for historical documents have demonstrated high performance on cursive Arabic scripts, achieving accuracy comparable to Latin scripts. The initial RASAM dataset, focused on three Arabic Maghribi manuscripts, facilitated rapid coverage of new documents via fine-tuning. However, HTR application for Arabic scripts remains constrained due to the vast diversity in spellings, ambiguities, and languages. To overcome these challenges, we present RASAM 2, an extended dataset with 3,750 lines from 15 manuscripts in the BULAC library, showcasing various hands, layouts, and texts in Arabic Maghribi script. RASAM 2 aims to establish a new benchmark for HTR model training for both Maghribi and Oriental scripts, covering text recognition and layout analysis. Preliminary experiments using a word-based CRNN approach indicate significant model versatility, with a nearly 40 % reduction in Character Error Rate (CER) across new in-domain and out-of-domain manuscripts.
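For reference, the Character Error Rate quoted above is the Levenshtein edit distance between prediction and reference, normalised by reference length; a self-contained sketch (the example strings are invented):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between strings, normalised by reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))                    # dynamic-programming row
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = reference[i - 1] != hypothesis[j - 1]
            curr[j] = min(prev[j] + 1,           # deletion
                          curr[j - 1] + 1,       # insertion
                          prev[j - 1] + cost)    # substitution or match
        prev = curr
    return prev[n] / m if m else 0.0

print(cer("bismillah", "bismilah"))  # one deletion over nine characters
```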
We present a theoretical framework for evaluating literary quality by analyzing narrative structure through simplified narrative representations in the form of story arcs. This framework proposes two complementary models: the first employs Approximate Entropy to measure local unpredictability, while the second utilizes fractal analysis to assess global coherence. When applied to a substantial corpus of 9,089 novels, the findings indicate that narratives characterized by high literary quality, as indicated by reader ratings, exhibit a balance of local unpredictability and global coherence. This dual approach provides a formal and empirical basis for assessing literary quality and emphasizes the importance of considering intrinsic properties and reader perception in literary studies.
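A compact sketch of Approximate Entropy, the local-unpredictability measure named above, applied to a synthetic sentiment arc; the parameters m and r are conventional ApEn defaults, not the paper's settings.

```python
import numpy as np

def approx_entropy(series, m=2, r=0.2):
    """Approximate Entropy: higher values mean a less predictable series."""
    x = np.asarray(series, dtype=float)
    tol = r * x.std()                           # tolerance scaled to the series

    def phi(m):
        win = np.array([x[i:i + m] for i in range(len(x) - m + 1)])
        # Chebyshev distance between every pair of length-m windows
        dist = np.max(np.abs(win[:, None] - win[None, :]), axis=2)
        return np.log((dist <= tol).mean(axis=1)).mean()

    return phi(m) - phi(m + 1)

# synthetic "story arc": smooth wave plus a little noise
arc = np.sin(np.linspace(0, 8 * np.pi, 200))
arc += 0.1 * np.random.default_rng(0).normal(size=200)
print(approx_entropy(arc))                      # lower = more regular arc
```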
This study examines specialization in Dutch Lower House debates between 1945 and 1994. We study how specialization translates into the phenomenon of "epistemic capture" in democratic politics. We combine topic modeling, network analysis and community detection to complement lexical "distant reading" approaches to the history of political ideas with a network-based analysis that illuminates political-intellectual processes. We demonstrate how the breadth of political debate declines as its specialist depth increases. To study this transformation, we take a multi-level approach. At the (institutional) macro-level, the modularity of topic linkage networks increases, indicating growing specialization post-1960, linked to institutional reforms. At the (political) meso-level, we similarly observe specialization in node neighborhood stability, but also variation as the consequence of ideological and party political change. Lastly, micro-level analysis reveals persistent thematic communities tied to increasingly stable groups of individuals, revealing how policy domains and politicians are captured in ossified specialisms. As such, this study provides new insights into the development of twentieth-century political debate and emergent tensions between pluralism and specialism.
We present ongoing work as part of an international multidisciplinary project, called MiDRASH, on the computational analysis of medieval manuscripts. We focus here on clustering manuscripts written in Ashkenazi square script using a dataset of 206 pages from 59 manuscripts. Collaborating with expert paleographers, we identified ten critical features and trained a multi-label CNN, achieving high accuracy in feature prediction. This should make it possible to computationally predict the subclusters already known to paleographers and those yet to be discovered. We identified visible clusters using PCA and χ² feature selection. In future work, we aim to enhance feature extraction using deep learning algorithms and provide computational tools to ease paleographers' work. We plan to develop new methodologies for analyzing Hebrew scripts and refining our understanding of medieval Hebrew manuscripts.
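A sketch of the feature-analysis step described above, using scikit-learn; the feature matrix is a random stand-in for the CNN-predicted paleographic features, and the subcluster labels are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((206, 10))           # 206 pages x 10 predicted features (toy)
y = rng.integers(0, 3, size=206)    # hypothetical subcluster labels

selector = SelectKBest(chi2, k=5).fit(X, y)    # chi2 requires non-negative X
print("most discriminative features:", selector.get_support(indices=True))

coords = PCA(n_components=2).fit_transform(X)  # 2-D view for cluster plots
print(coords[:3])
```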
In the context of the Web 2.0, user-generated reviews are becoming more and more prominent. The case of book reviews, often shared through digital social reading platforms such as Goodreads or Wattpad, is of particular interest, in that it offers scholars data regarding literary reception of unprecedented size and diversity. In this paper, we test whether the evaluative criteria employed in Goodreads reviews can be included in the framework of traditional literary criticism, by combining literary theory and computational methods. Our model, based on the work of von Heydebrand and Winko, is first tested through the practice of heuristic annotation. The generated dataset is then used to train a Transformer-based classifier. Lastly, we compare the performance of the latter with that obtained by instructing a Large Language Model, namely GPT-4.
Does the People's Republic of China (PRC) interfere with European elections through ethnic Chinese diaspora media? This question forms the basis of an ongoing research project exploring how PRC narratives about European elections are represented in Chinese diaspora media, and thus the objectives of PRC news media manipulation. In order to study diaspora media efficiently and at scale, it is necessary to use techniques derived from quantitative text analysis, such as topic modelling. In this paper, we present a pipeline for studying information dynamics in Chinese media. Firstly, we present KeyNMF, a new approach to static and dynamic topic modelling using transformer-based contextual embedding models. We provide benchmark evaluations to demonstrate that our approach is competitive on a number of Chinese datasets and metrics. Secondly, we integrate KeyNMF with existing methods for describing information dynamics in complex systems. We apply this pipeline to data from five news sites, focusing on the period of time leading up to the 2024 European parliamentary elections. Our methods and results demonstrate the effectiveness of KeyNMF for studying information dynamics in Chinese media and lay groundwork for further work addressing the broader research questions.
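A rough sketch of the KeyNMF idea as described, not the authors' implementation: contextual embeddings score each document's candidate terms, and the resulting document-keyword matrix is factorised with NMF. The encoder name and toy documents are assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

# toy documents standing in for news articles
docs = [
    "The election campaign dominated the evening news broadcast.",
    "Voters discussed the parliamentary election and campaign promises.",
    "The recipe uses fresh noodles, soy sauce and spring onions.",
    "A simple noodle soup recipe with ginger and soy sauce.",
]

vectorizer = CountVectorizer()
present = vectorizer.fit_transform(docs).toarray() > 0   # word occurrence mask
vocab = vectorizer.get_feature_names_out()

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # placeholder encoder
doc_embs = encoder.encode(docs, normalize_embeddings=True)
word_embs = encoder.encode(list(vocab), normalize_embeddings=True)

# document-keyword matrix: non-negative doc/word similarity,
# masked to each document's own vocabulary
K = np.clip(doc_embs @ word_embs.T, 0, None) * present

H = NMF(n_components=2, init="nndsvd").fit(K).components_
for topic in H:                                          # top terms per topic
    print([vocab[i] for i in topic.argsort()[-4:][::-1]])
```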
We performed a zero-shot information extraction study on a historical collection of 89,339 brief Finnish-language interviews of refugee families relocated post-WWII from Finnish Eastern Karelia. Our research objective is two-fold. First, we aim to extract social organizations and hobbies from the free text of the interviews, separately for each family member. These can act as a proxy variable indicating the degree of social integration of refugees in their new environment. Second, we aim to evaluate several alternative ways to approach this task, comparing a number of generative models and a supervised learning approach, to gain a broader insight into the relative merits of these different approaches and their applicability in similar studies. We find that the best generative model (GPT-4) is roughly on par with human performance, at an F-score of 88.8 %. Interestingly, the best open generative model (Llama-3-70B-Instruct) reaches almost the same performance, at 87.7 % F-score, demonstrating that open models are becoming a viable alternative for some practical tasks even on non-English data. Additionally, we test a supervised learning alternative, where we fine-tune a Finnish BERT model (FinBERT) using GPT-4 generated training data. By this method, we achieved an F-score of 84.1 % with as few as 6K interviews, rising to 86.3 % with 30K interviews. Such an approach would be particularly appealing in cases where the computational resources are limited, or there is a substantial mass of data to process.
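A hedged sketch of the zero-shot generative setup: one interview in, one JSON record of organizations and hobbies out. The prompt wording and model identifier are illustrative, not the study's exact configuration.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# prompt wording is an invented approximation, not the study's prompt
PROMPT = (
    "From the following interview about a resettled family, list for each "
    "family member their social organizations and hobbies as a JSON array of "
    "objects with keys 'name', 'organizations' and 'hobbies'. "
    "Answer with JSON only.\n\nInterview:\n"
)

def extract(interview_text: str) -> list:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT + interview_text}],
        temperature=0,
    )
    # assumes the model returns bare JSON, as instructed
    return json.loads(response.choices[0].message.content)
```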
Digital tools for text analysis have long been essential for the searchability and accessibility of digitised library collections. Recent computer vision advances have introduced similar capabilities for visual materials, with deep learning-based embeddings showing promise for analysing visual heritage. Given that many books feature visuals in addition to text, taking advantage of these breakthroughs is critical to making library collections open and accessible. In this work, we present a proof-of-concept image search application for exploring images in the National Library of Norway’s pre-1900 books, comparing Vision Transformer (ViT), Contrastive Language-Image Pre-training (CLIP), and Sigmoid loss for Language-Image Pre-training (SigLIP) embeddings for image retrieval and classification. Our results show that the application performs well for exact image retrieval, with SigLIP embeddings slightly outperforming CLIP and ViT in both retrieval and classification tasks. Additionally, SigLIP-based image classification can aid in cleaning image datasets from a digitisation pipeline.
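A sketch of the embedding-based classification step useful for dataset cleaning: a light classifier trained on top of frozen image embeddings (whether ViT, CLIP, or SigLIP). The embeddings and labels below are random stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 768))       # stand-in for frozen image embeddings
labels = rng.integers(0, 2, size=1000)    # hypothetical: 0 = text page, 1 = illustration

X_tr, X_te, y_tr, y_te = train_test_split(embs, labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```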
Why do people write fanfiction? How, if at all, does fanfiction differ from the source material on which it is based? In this paper, we use quantitative text analysis to address these questions by investigating linguistic differences and similarities between fan-produced texts and their original sources. We analyze fanfiction based on Lord of the Rings, Harry Potter, and Percy Jackson and the Olympians. Working with a corpus of around 250,000 texts containing both fanfiction and sources, we draw on Biber's Multidimensional Analysis (Biber, 1988), scoring each text along six dimensions of functional variation. Our results identify both global and community-based preferences in the form and function of fanfiction. Crucially, fan-produced texts are found not to diverge from their source material in statistically meaningful ways, suggesting that fans mimic the writing style of the original author. Nevertheless, fans as a whole prefer stories with less focus on narrative and greater emphasis on character interactions than the source text. Our analysis supports the notion proposed by qualitative studies that fanfiction is motivated both by admiration for and frustration with the canon.
Predicting spatial representations in literature is a challenging task that requires advanced machine learning methods and manual annotations. In this paper, we present a study that leverages manual annotations and a BERT language model to automatically detect and recognise non-named spatial entities in a historical corpus of Swiss novels. The annotated data, consisting of Swiss narrative texts in German from the period of 1840 to 1950, was used to train the machine learning model and fine-tune a deep learning model specifically for literary German. The annotation process, facilitated by the use of Prodigy, enabled iterative improvement of the model’s predictions by selecting informative instances from the unlabelled data. Our evaluation metrics (F1 score) demonstrate the model’s ability to predict various categories of spatial entities in our corpus. This new method enables researchers to explore spatial representations in literary text, contributing both to digital humanities and literary studies. While our study shows promising results, we acknowledge challenges such as representativeness of the annotated data, biases in manual annotations, and domain-specific language. By addressing these limitations and discussing the implications of our findings, we provide a foundation for future research in sentiment and spatial analysis in literature. Our findings not only contribute to the understanding of literary narratives but also demonstrate the potential of automated spatial analysis in historical and literary research.
Color symbolism is considered a critical element in art and literature, yet determining the relationship between colors and words has remained largely subjective. This research presents a systematic methodology for quantifying the correlation between language and color. We utilize text-based image search, optical character recognition (OCR), and advanced image processing techniques to establish a connection between words and their corresponding color distributions in the CIELch color space. We generate a color dataset based on human cognition, and apply it for analysis of the literary works of poets associated with Imagism and Black Arts Movements. This helps uncover the characteristic color patterns and symbolic meanings of the movements with enhanced objectivity and reproducibility in literature research. Our work has the potential to provide a powerful instrument for a systematic, quantitative examination of literary symbolism, filling in the gaps in prior analyses and facilitating novel investigations of thematic aspects using color.
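A sketch of the colour-space step: converting an RGB value to CIELAB with scikit-image and deriving the cylindrical LCh coordinates (lightness, chroma, hue) used for comparing word-colour distributions; the sample pixel is arbitrary.

```python
import numpy as np
from skimage.color import rgb2lab

rgb = np.array([[[0.8, 0.1, 0.2]]])          # one pixel, RGB in [0, 1]
L, a, b = rgb2lab(rgb)[0, 0]                 # CIELAB coordinates
chroma = np.hypot(a, b)                      # C = sqrt(a^2 + b^2)
hue = np.degrees(np.arctan2(b, a)) % 360     # h = atan2(b, a), in degrees
print(f"L={L:.1f} C={chroma:.1f} h={hue:.1f}")
```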
Metaphors are everywhere. They appear extensively across all domains of natural language, from the most sophisticated poetry to seemingly dry academic prose. A significant body of research in the cognitive science of language argues for the existence of conceptual metaphors, the systematic structuring of one domain of experience in the language of another. Conceptual metaphors are not simply rhetorical flourishes but are crucial evidence of the role of analogical reasoning in human cognition. In this paper, we ask whether Large Language Models (LLMs) can accurately identify and explain the presence of such conceptual metaphors in natural language data. Using a novel prompting technique based on metaphor annotation guidelines, we demonstrate that LLMs are a promising tool for large-scale computational research on conceptual metaphors. Further, we show that LLMs are able to apply procedural guidelines designed for human annotators, displaying a surprising depth of linguistic knowledge.
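Illustrative only: one way to turn a MIP-style human annotation procedure into an LLM prompt, in the spirit of the guideline-based prompting described above; the paper's actual guidelines and wording are not reproduced here.

```python
# MIP-style procedure (after the Pragglejaz Group), compressed into a prompt
GUIDELINES = """You are annotating metaphor. For each content word:
1. Establish its contextual meaning in this sentence.
2. Check whether it has a more basic, concrete meaning in other contexts.
3. If the contextual meaning contrasts with that basic meaning but can be
   understood in comparison with it, mark the word as metaphor-related.
Return a list of (word, metaphorical: yes/no, one-line justification)."""

def build_prompt(sentence: str) -> str:
    return f"{GUIDELINES}\n\nSentence: {sentence}"

print(build_prompt("She attacked every weak point in my argument."))
```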
This paper describes an update to the open-source Python implementation of the General Imposters method of authorship verification by Mike Kestemont et al. The new algorithm, called Bootstrap Distance Imposters (henceforth BDI), incorporates a key improvement introduced by Potha and Stamatatos, as well as introducing a novel method of bootstrapping that has several attractive properties when compared to the reference algorithm. Initially, we supply an updated version of the Kestemont et al. code (for Python 3.x) which incorporates the same basic improvements. Next, the two approaches are benchmarked using the problems from the multi-lingual PAN 2014 author identification task, as well as the more recent PAN 2021 task. Additionally, the interpretability advantages of BDI are showcased via real-world verification studies. When operating as a summary verifier, BDI tends to be more conservative in its positive attributions, particularly when applied to difficult problem sets like the PAN 2014 en_novels. In terms of raw performance, the BDI verifier outperforms all PAN 2014 entrants and appears slightly stronger than the improved Kestemont GI according to the PAN metrics for both the 2014 and 2021 problems, while also offering superior interpretability.
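A schematic of the imposters bootstrap, in the spirit of the General Imposters family rather than Nagy's BDI code: repeatedly subsample features and imposters, and record how often the disputed document falls closer to the candidate than to any imposter. The feature vectors and distance metric are illustrative assumptions.

```python
import numpy as np

def imposters_score(test, candidate, imposters,
                    n_iter=500, feat_frac=0.5, imp_frac=0.5, seed=0):
    """Fraction of bootstrap draws in which the disputed document is closer
    to the candidate author than to every sampled imposter."""
    rng = np.random.default_rng(seed)
    n_feats = len(test)
    hits = 0
    for _ in range(n_iter):
        f = rng.choice(n_feats, size=int(feat_frac * n_feats), replace=False)
        idx = rng.choice(len(imposters),
                         size=max(1, int(imp_frac * len(imposters))),
                         replace=False)
        d_cand = np.linalg.norm(test[f] - candidate[f])
        d_imp = min(np.linalg.norm(test[f] - imposters[i][f]) for i in idx)
        hits += d_cand < d_imp
    return hits / n_iter  # scores near 1.0 support a same-author verdict

rng = np.random.default_rng(1)
candidate = rng.random(300)                    # candidate author profile (toy)
disputed = candidate + 0.05 * rng.random(300)  # similar disputed document
distractors = rng.random((20, 300))            # imposter author profiles
print(imposters_score(disputed, candidate, distractors))
```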
Archaeological data allow us to synthetically represent the past of individuals and communities over time. This complex representation task requires an amalgamation of variables and makes vagueness intrinsic to the data. The study of vagueness as an archaeological data dimension has become a dynamic focus of archaeologists' work in recent years, presenting theoretical and practical approaches for the representation, mainly with fuzzy logic, of archaeological variables. Vagueness in archaeological data can occur for different reasons: non-existence of evidence, imprecision, errors, subjectivity, etc. Furthermore, the data is usually managed in groups, shared or recovered for subsequent investigations, so the traceability of the vagueness injected during these management phases is lost. In this paper we present ongoing work on modelling, under fuzzy formal theory, the explicit representation of the annotator's expertise (the annotator being the professional who introduces archaeological data into a given system, assigning values to the defined variables), decoupled from the value attributed to each variable. The first experiments with chronological and use variables of the sites show how making the annotator's expertise explicit in the fuzzy model preserves the traceability of the uncertainty injected into archaeological data when datasets are defined and managed by different people, and establishes a basis for implementing archaeological fuzzy decision-based systems.
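A toy sketch of the decoupling idea: store the fuzzy value of an archaeological variable separately from the annotator's expertise, and combine them only when a query needs a single confidence-weighted degree. The field names and the combination rule are illustrative assumptions, not the paper's model.

```python
from dataclasses import dataclass

@dataclass
class FuzzyAnnotation:
    variable: str        # e.g. "chronology: late Roman"
    membership: float    # degree in [0, 1] assigned to the variable's value
    expertise: float     # annotator expertise in [0, 1], kept traceable

    def weighted_degree(self) -> float:
        # one simple combination rule: scale membership by expertise
        return self.membership * self.expertise

a = FuzzyAnnotation("chronology: late Roman", membership=0.8, expertise=0.6)
b = FuzzyAnnotation("chronology: late Roman", membership=0.7, expertise=0.95)
print(a.weighted_degree(), b.weighted_degree())  # same variable, traceable sources
```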
Valence is a concept that is increasingly used in the computational study of narrative texts. We discuss the history of the concept and show that the word has been interpreted in various ways. We then examine a number of Dutch tools for measuring valence, apply them to sample fragments from a large collection of narrative texts, and find only moderate correlations between the valences established by the various tools. We discuss these differences and how to handle them. We argue that the root cause of the problem is that Computational Literary Studies has never properly defined the concept of valence in a narrative context.
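The comparison itself is straightforward to reproduce in outline; with per-fragment scores from each tool in hand (the numbers below are invented), inter-tool agreement is just a correlation matrix:

```python
# Sketch: pairwise agreement between valence tools on the same fragments.
import pandas as pd

scores = pd.DataFrame({
    "tool_a": [0.8, -0.2, 0.1, 0.5, -0.6],
    "tool_b": [0.6,  0.1, 0.0, 0.7, -0.1],
    "tool_c": [0.9, -0.4, 0.3, 0.2, -0.5],
})
print(scores.corr(method="pearson"))
```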
Measures of textual similarity and divergence are increasingly used to study cultural change. But which measures align, in practice, with social evidence about change? We apply three different representations of text (topic models, document embeddings, and word-level perplexity) to three different corpora (literary studies, economics, and fiction). In every case, works by highly cited authors and younger authors are textually ahead of the curve. We do not find clear evidence that one representation of text is to be preferred over the others. But alignment with social evidence is strongest when texts are represented through their top quartile of passages, suggesting that a text's impact may depend more on its most forward-looking moments than on sustaining a high level of innovation throughout.
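One plausible way to operationalise "ahead of the curve" (our reading, not necessarily the authors' exact formula) is to score a document by how much more it resembles later texts than earlier ones:

```python
# Sketch of a precocity score over document vectors (toy data).
import numpy as np

def mean_cos(v, mat):
    mat, v = np.asarray(mat, float), np.asarray(v, float)
    return float(np.mean(mat @ v / (np.linalg.norm(mat, axis=1)
                                    * np.linalg.norm(v))))

def precocity(doc_vec, past_vecs, future_vecs):
    """Positive when a document resembles its future more than its past."""
    return mean_cos(doc_vec, future_vecs) - mean_cos(doc_vec, past_vecs)

rng = np.random.default_rng(0)
print(precocity(rng.random(50), rng.random((20, 50)), rng.random((20, 50))))
```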
This paper examines the development of figure composition and gaze dynamics between Mary Magdalene and Christ in Italian noli me tangere depictions from 1300 to 1600 in the context of the emergence of perspective painting. It combines a conceptual, interpretative approach concerning the tactility of the gaze with a compositional analysis. This preliminary study analyzes 51 iconographical images to understand how the gazes between Mary and Christ evolve from pre-perspective to perspective artworks. We estimate gaze direction solely from landmark points, following the assumption that gaze direction can be estimated from overall face orientation. Additionally, we develop a metric to quantify the degree of visual interaction between the two protagonists. Our results indicate that Christ is consistently depicted gazing down towards Mary, while Mary displays a broader range of gaze directions. Before the introduction of perspective, the gaze of figures was often rendered solely through face orientation. However, with the advent of the High Renaissance, artists began to use complex gestures that separated head orientation from the line of sight.
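The geometry can be sketched in a few lines (the actual landmark scheme and metric are the authors'; this toy version is ours): face orientation is approximated from the eye midpoint and nose tip, and interaction is scored by how directly each gaze vector points at the other figure.

```python
# Toy 2-D gaze geometry from landmark points.
import numpy as np

def gaze_vector(left_eye, right_eye, nose):
    """Crude in-plane orientation: from the eye midpoint towards the nose."""
    mid = (np.asarray(left_eye, float) + np.asarray(right_eye, float)) / 2
    v = np.asarray(nose, float) - mid
    return v / np.linalg.norm(v)

def interaction_score(center_a, gaze_a, center_b, gaze_b):
    """1.0 = perfect mutual gaze; negative = both figures look away."""
    to_b = (center_b - center_a) / np.linalg.norm(center_b - center_a)
    return float((gaze_a @ to_b + gaze_b @ (-to_b)) / 2)

g_a = gaze_vector((0, 1), (2, 1), (1, 2))      # toy (x, y) landmarks
g_b = gaze_vector((10, 3), (12, 3), (11, 2))
print(interaction_score(np.array([1.0, 1.0]), g_a,
                        np.array([11.0, 3.0]), g_b))
```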
We propose a neural network architecture designed to generate region and page embeddings for boundary detection and classification of documents within a large and heterogeneous historical archive. Our approach is versatile and can be applied to other tasks and datasets. This method enhances the accessibility of historical archives and promotes a more inclusive utilization of historical materials.
This study explores the use of generative AI (GenAI) for annotation in the humanities, comparing direct and indirect annotation approaches with human annotations. Direct annotation uses GenAI to annotate the entire corpus, while indirect annotation uses GenAI to create training data for a specialized model. The research investigates zero-shot and few-shot methods for direct annotation, alongside an indirect approach incorporating active learning, few-shotting, and k-NN example retrieval. The task focuses on identifying words (also referred to as entities) related to plants and animals in Early Modern Dutch texts. Results show that indirect annotation outperforms zero-shot direct annotation in mimicking human annotations. However, with just a few examples, direct annotation catches up, achieving performance similar to indirect annotation. Analysis of confusion matrices reveals that GenAI annotators make similar types of mistakes, such as confusing parts and products or failing to identify entities; these error types are broader than those made by humans. Manual error analysis indicates that each annotation method (human, direct, and indirect) has some unique errors. Given the limited scale of this study, it is worthwhile to further explore the relative affordances of direct and indirect GenAI annotation methods.
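The k-NN retrieval step, at least, is easy to picture. A minimal sketch (the embedding model and the Dutch toy pool are our assumptions) retrieves the most similar annotated sentences to include as few-shot examples:

```python
# Sketch: retrieve the k nearest annotated examples for a new sentence.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def knn_examples(query, pool_texts, pool_labels, k=3):
    embs = model.encode(pool_texts + [query], normalize_embeddings=True)
    pool, q = embs[:-1], embs[-1]
    top = np.argsort(pool @ q)[::-1][:k]
    return [(pool_texts[i], pool_labels[i]) for i in top]

pool = ["De wolf verslond het lam.", "Het schip voer uit de haven.",
        "De eik stond in het veld.", "De koning sprak tot het volk."]
labels = [["wolf", "lam"], [], ["eik"], []]
for text, lab in knn_examples("De vos liep door het bos.", pool, labels, k=2):
    print(text, lab)
```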
The addition of controlled terms from linked open datasets and vocabularies to metadata can increase the discoverability and accessibility of digital collections. However, the task of semantic enrichment requires substantial effort and resources that cultural heritage organizations often lack. State-of-the-art AI technologies can be employed to analyse textual metadata and match it with external semantic resources. Depending on the data characteristics and the objective of the enrichment, different approaches may need to be combined to achieve high-quality results. What is more, human inspection and validation of the automatic annotations should be an integral part of the overall enrichment methodology. In the current paper, we present a methodology and supporting digital platform, which combine a suite of automatic annotation tools with human validation for the enrichment of cultural heritage metadata within the European data space for cultural heritage. The methodology and platform have been applied and evaluated on a set of datasets on crafts heritage, leading to the publication of more than 133K enriched records to the Europeana platform. A statistical analysis of the achieved results allows us to draw insights into the appropriateness of different annotation approaches in different contexts. The process also led to the creation of an openly available annotated dataset, which can be useful for the in-domain adaptation of ML-based enrichment tools.
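In outline, one of the simpler matching strategies might look like the sketch below (model, vocabulary, and threshold are illustrative stand-ins, not the platform's actual tools): embed both the metadata text and the vocabulary terms, then propose terms above a similarity threshold for human validation.

```python
# Sketch: suggest vocabulary terms for a metadata record by similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
vocab = ["weaving", "pottery", "woodworking", "glassblowing"]
vocab_emb = model.encode(vocab, convert_to_tensor=True)

def suggest_terms(metadata_text, threshold=0.4):
    q = model.encode(metadata_text, convert_to_tensor=True)
    sims = util.cos_sim(q, vocab_emb)[0]
    return [(vocab[i], float(s)) for i, s in enumerate(sims) if s >= threshold]

print(suggest_terms("Hand-loom textile production in 19th-century Crete"))
```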
This study examines gender biases in machine learning models that predict literary canonicity. Using algorithmic fairness metrics like equality of opportunity, equalised odds, and calibration within groups, we show that models violate the fairness metrics, especially by misclassifying non-canonical books by men as canonical. Feature importance analysis shows that text-intrinsic differences between books by men and women authors contribute to these biases. Men have historically dominated canonical literature, which may bias models towards associating men-authored writing styles with literary canonicity. Our study highlights how these biased models can lead to skewed interpretations of literary history and canonicity, potentially reinforcing and perpetuating existing gender disparities in our understanding of literature. This underscores the need to integrate algorithmic fairness in computational literary studies and digital humanities more broadly to foster equitable computational practices.
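For readers unfamiliar with the metrics, a minimal sketch (toy data, hypothetical group labels) of the two gap computations:

```python
# Equality of opportunity = TPR gap between groups;
# equalised odds additionally compares FPRs.
import numpy as np

def rates(y_true, y_pred):
    tpr = y_pred[y_true == 1].mean() if (y_true == 1).any() else np.nan
    fpr = y_pred[y_true == 0].mean() if (y_true == 0).any() else np.nan
    return tpr, fpr

def fairness_gaps(y_true, y_pred, group):
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr_m, fpr_m = rates(y_true[group == "men"], y_pred[group == "men"])
    tpr_w, fpr_w = rates(y_true[group == "women"], y_pred[group == "women"])
    return {"tpr_gap": tpr_m - tpr_w, "fpr_gap": fpr_m - fpr_w}

y_true = [1, 0, 1, 0, 1, 0]                     # canonical or not
y_pred = [1, 1, 0, 0, 1, 0]                     # model prediction
group = ["men", "men", "women", "women", "men", "women"]
print(fairness_gaps(y_true, y_pred, group))
```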
While an increasing number of studies exploring named entity recognition in historical corpora are being published, applications of other information extraction tasks, such as event extraction, remain scarce. This study explores two accessible methods to facilitate the detection of events and the classification of entities into roles: rule-based systems and RNN-based machine learning techniques. We focus on a German-language corpus from the 15th to the 17th century, with property purchases as the event type. We show that these relatively simple methods can retrieve useful information, and we discuss ideas to further enhance the results.
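As a flavour of the rule-based side (the pattern, roles, and example are invented for illustration; the paper's rules for early modern German will be far richer):

```python
# Toy rule: detect a purchase event and assign buyer/seller/object roles.
import re

PATTERN = re.compile(
    r"(?P<buyer>\w+)\s+kauft\s+von\s+(?P<seller>\w+)\s+(?P<object>ein\s+\w+)"
)

def extract_purchase(sentence: str):
    m = PATTERN.search(sentence)
    return m.groupdict() if m else None

print(extract_purchase("Hans kauft von Greta ein Haus"))
# {'buyer': 'Hans', 'seller': 'Greta', 'object': 'ein Haus'}
```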
Visibility and intervisibility have been important aspects of spatial analysis in landscape archaeological studies, but remain hampered by computational intensity, small study areas, edge effects, and bare-earth digital elevation models. This paper assesses intervisibility and prominence in a dataset of over 1000 burial mounds in the Middle Tundzha River watershed in Bulgaria. The aim is to obviate the pitfalls of regional visibility assessment through vegetation simulation and Monte Carlo (MC) modelling, and to gauge when intervisibility and prominence truly mattered to past mound-builders.
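A toy version of the Monte Carlo idea (grid, canopy parameters, and sightline test all illustrative) checks a straight line of sight over a bare-earth DEM to which random vegetation heights are added on each run:

```python
# Toy line-of-sight with Monte Carlo vegetation on a DEM grid.
import numpy as np

def line_of_sight(dem, a, b, observer_h=1.6):
    (r0, c0), (r1, c1) = a, b
    n = max(abs(r1 - r0), abs(c1 - c0)) + 1
    rows = np.linspace(r0, r1, n).round().astype(int)
    cols = np.linspace(c0, c1, n).round().astype(int)
    profile = dem[rows, cols]
    sight = np.linspace(profile[0] + observer_h, profile[-1] + observer_h, n)
    return bool(np.all(profile[1:-1] <= sight[1:-1]))

def mc_intervisibility(dem, a, b, runs=100, veg_mean=8.0, veg_sd=2.0):
    rng = np.random.default_rng(42)
    hits = 0
    for _ in range(runs):
        surface = dem + rng.normal(veg_mean, veg_sd, dem.shape).clip(min=0)
        surface[a], surface[b] = dem[a], dem[b]   # mounds themselves stay bare
        hits += line_of_sight(surface, a, b)
    return hits / runs   # share of runs in which the pair is intervisible

dem = np.zeros((50, 50))
dem[20:30, 20:30] = 5.0                            # a low ridge in between
print(mc_intervisibility(dem, (5, 5), (45, 45)))
```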
This article explores the use of metadata to analyse German-language one-act plays from 1740 to 1850, addressing the need to expand beyond canonical texts in literary studies. Utilising the Database of German-Language One-Act Plays, we examine aspects such as the number of scenes and characters, as well as the original languages on which the translated plays in the corpus are based. We find that one-act plays exhibit strong genre signals that set them apart from multi-act plays of the time. Our metadata-driven approach provides a comprehensive and statistically grounded understanding of the genre, demonstrating the potential of digital methods to enhance genre studies and overcome traditional limitations in literary scholarship.
Multilingual Stylometry: The influence of language and corpus composition on the performance of authorship attribution using corpora from the European Literary Text Collection (ELTeC)
(long paper)
Authors: Christof Schoech, Julia Dudar, Evgeniia Fileva and Artjoms Šeļa
Stylometric authorship attribution is concerned with the task of assigning texts of unknown, pseudonymous or disputed authorship to their most likely author, often based on a comparison of the frequencies of a selected set of features that represent the texts. The parameters of the analysis, such as feature selection and the choice of similarity measure or classification algorithm, have received significant attention in the past. Two additional key factors for the performance and reliability of stylometric methods, however, have so far received less attention: corpus composition and corpus language. As a first step, the aim of this study is to investigate the influence of language on the performance of stylometric authorship attribution. We address this question using four different corpora derived from the European Literary Text Collection (ELTeC). We use machine translation to obtain each corpus in the other three languages. We find that, as expected, attribution accuracy varies between language-based corpora, and that translated corpora, on average, display lower attribution accuracy compared to their counterparts in the original language. Overall, our study contributes to a better understanding of stylometric methods of authorship attribution.
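For orientation, a compact version of the kind of most-frequent-word pipeline such experiments presuppose (a plain Burrows' Delta; the authors' actual feature counts and classifier settings may differ):

```python
# Sketch: Burrows' Delta attribution over most-frequent-word z-scores.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def delta_attribution(train_texts, train_authors, test_text, mfw=500):
    vec = CountVectorizer(max_features=mfw)
    X = vec.fit_transform(train_texts + [test_text]).toarray().astype(float)
    X /= X.sum(axis=1, keepdims=True)                  # relative frequencies
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)  # z-scores
    dists = np.abs(X[:-1] - X[-1]).mean(axis=1)        # Delta per train text
    return train_authors[int(np.argmin(dists))]

train_texts = ["the cat sat on the mat and the dog slept",
               "a dog and a cat ran to the garden gate",
               "we walked along the shore as the sun set"]
print(delta_attribution(train_texts, ["A", "A", "B"],
                        "the cat and the dog sat in the garden", mfw=20))
```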
This paper explores the phenomenon of animacy in prose by the example of German folktales. We present a manually annotated corpus of 19 German folktales from the Brothers Grimm collection and train a classifier on these annotations. Building on previous work in animacy detection, we evaluate the classifier’s performance and its application to a larger corpus. The findings highlight the complexity of animacy in literary texts, distinguishing it from named entity recognition and emphasizing the classifier’s potential for enhancing character recognition in narratives.
This paper outlines a proposal for the use of knowledge graphs for historical German domain adaptation. From the EncycNet project, an encyclopedia-based knowledge graph from the early 20th century was borrowed to examine whether text-based domain adaptation using the source encyclopedia's text, or graph-based adaptation, produces a better domain-specific model. To evaluate the approach, a novel historical test dataset based on a second encyclopedia of the early 20th century was created. This dataset is categorized by knowledge type (factual, linguistic, lexical), with special attention paid to distinguishing simple from expert knowledge. The main finding is that, surprisingly, simple knowledge has the most potential for improvement, whereas expert knowledge lags behind. In this study, broad signals like simple definitions and word origin yielded the best results, while more specialized knowledge such as synonyms was not as effectively represented. A follow-up study focusing on simple contemporary lexical knowledge was carried out to control for historicity and text genre; its results confirm that language models can still be enhanced by incorporating simple lexical knowledge using the proposed workflow.
As part of developing a Computational Narrative Understanding, modeling events within stories has recently received significant attention within the digital humanities community. Most current research aims at good performance when predicting events. By contrast, we explore a focused approach based on qualitative observations. We attempt to trace the role of structural elements, more specifically temporal function words, that may be characteristic of a narrative's turning point. We draw on a corpus of UFO sighting reports in which authors employ a prototypical narrative structure that relies on a turning point at which the extraordinary intrudes on the ordinary. Using binary logistic regression, we can identify structural properties which are indicative of turning points in our data, showcasing that a focus on detail can fruitfully complement NLP models in gaining a quantitatively informed understanding of narratives.
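A minimal sketch of that regression (word list, sentences, and labels invented for illustration):

```python
# Binary logistic regression from temporal function-word counts.
from sklearn.linear_model import LogisticRegression

TEMPORAL = ["suddenly", "then", "when", "while", "at that moment"]

def features(sentence):
    s = sentence.lower()
    return [s.count(w) for w in TEMPORAL]

sentences = [
    "We were driving home when suddenly a bright light appeared.",
    "Then, at that moment, the object shot straight up.",
    "The sky was clear and the road was quiet.",
    "We talked about the harvest for most of the evening.",
]
labels = [1, 1, 0, 0]                       # turning point or not

clf = LogisticRegression().fit([features(s) for s in sentences], labels)
print(dict(zip(TEMPORAL, clf.coef_[0])))    # sign and size per cue word
```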
This paper examines the relationship between violent discourse and emotional intensity in the early revolutionary rhetoric of the People's Republic of China (PRC). Using two fine-tuned bert-base-chinese models, one for detecting violent content in texts and another for assessing their affective charge, we analyze over 185,000 articles published between 1956 and 1989 in the People's Liberation Army Daily (Jiefangjun Bao), the official journal of China's armed forces. We find a statistically significant correlation between violent discourse and emotional expression throughout the analyzed period. This strong alignment between violence and affect in official texts provides a valuable context for appreciating how other forms of writing, such as novels and poetry, can disentangle personal emotions from state power.
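The correlation step is simple to reproduce once each article carries two scores. In the sketch below the scores are invented stand-ins for the outputs of the two fine-tuned classifiers (which could be applied via `transformers.pipeline("text-classification", model=<checkpoint>)`):

```python
# Correlating per-article violence and emotion scores (toy numbers).
from scipy.stats import pearsonr

violence_scores = [0.91, 0.15, 0.72, 0.33, 0.88, 0.12]
emotion_scores  = [0.85, 0.22, 0.64, 0.41, 0.93, 0.18]

r, p = pearsonr(violence_scores, emotion_scores)
print(f"Pearson r = {r:.3f}, p = {p:.4f}")
```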
Virtual particles are peculiar objects. They figure prominently in much of theoretical and experimental research in elementary particle physics. But exactly what they are is far from obvious. In particular, to what extent they should be considered real remains a matter of controversy in the philosophy of science. Their origin and development, too, have only recently come into the focus of scholarship in the history of science. In this study, we propose using the intriguing case of virtual particles to discuss the efficacy of Semantic Change Detection (SCD) based on contextualized word embeddings from a domain-adapted BERT model in studying specific scientific concepts. We find that the SCD metrics align well with qualitative research insights in the history and philosophy of science, as well as with the results obtained from Dependency Parsing to determine the frequency and connotations of the term virtual. Still, the metrics of SCD provide additional insights over and above the qualitative research and the Dependency Parsing. Among other things, the metrics suggest that the concept of the virtual particle became more stable after 1950 but at the same time also more polysemous.
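One widely used SCD metric is easy to sketch (here with an off-the-shelf `bert-base-uncased` standing in for the authors' domain-adapted model, and two toy sentences standing in for period corpora): average the contextualised vectors of the target word in each period and measure the cosine distance between the averages.

```python
# Sketch: drift of "virtual" between two periods via averaged BERT vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_target_vector(sentences, target="virtual"):
    t_id = tok.convert_tokens_to_ids(target)
    vecs = []
    for s in sentences:
        enc = tok(s, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        for i, wid in enumerate(enc["input_ids"][0]):
            if wid == t_id:
                vecs.append(hidden[i])
    return torch.stack(vecs).mean(dim=0)

before = mean_target_vector(["The virtual photon mediates the interaction."])
after = mean_target_vector(["A virtual particle borrows energy briefly."])
print(1 - torch.cosine_similarity(before, after, dim=0).item())
```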
We examine the extent to which English and French literature is memorized by freely accessible LLMs, using a name cloze inference task (which focuses on the model's ability to recall proper names from a book). We replicate the key findings of previous research conducted with OpenAI models, concluding that, overall, the degree of memorization is low. Factors that tend to enhance memorization include the absence of copyrights, belonging to the Fantasy or Science Fiction genres, and the work's popularity on the Internet. Delving deeper into the experimental setup using the open source model Olmo and its freely available corpus Dolma, we conducted a study on the evolution of memorization during the LLM’s training phase. Our findings suggest that excerpts of a book online can result in some level of memorization, even if the full text is not included in the training corpus. This observation leads us to conclude that the name cloze inference task is insufficient to definitively determine whether copyright violations have occurred during the training process of an LLM. Furthermore, we highlight certain limitations of the name cloze inference task, particularly the possibility that a model may recognize a book without memorizing its text verbatim. In a pilot experiment, we propose an alternative method that shows promise for producing more robust results.
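A name-cloze item is quick to sketch (passage, masking convention, and model are illustrative; the original task's prompt design may differ):

```python
# Sketch: one name-cloze probe against a chat model.
from openai import OpenAI

def name_cloze(passage: str, model: str = "gpt-4o") -> str:
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    prompt = ("Exactly one character name in the passage below has been "
              "replaced by [MASK]. Reply with that name only.\n\n" + passage)
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

print(name_cloze("'You forget,' said [MASK], 'that I am a consulting "
                 "detective, the only one in the world.'"))
```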
Intertextuality is a key concept in literary theory that challenges traditional notions of text, signification or authorship. It views texts as part of a vast intertextual network that is constantly evolving and being reconfigured. This paper argues that the field of computational literary studies is the ideal place to conduct a study of intertextuality, since we now have the ability to systematically compare texts with each other. Specifically, we present work on a corpus of more than 12,000 works of French fiction from the 18th, 19th and early 20th centuries. We focus on evaluating the underlying roles of two literary notions, subgenres and the literary canon, in the framing of textuality. The article attempts to operationalize intertextuality using state-of-the-art contextual language models to encode novels and capture features that go beyond simple lexical or thematic approaches. Our findings suggest that both subgenres and canonicity play a significant role in shaping textual similarities within French fiction. These discoveries point to the importance of considering genre and canon as dynamic forces that influence the evolution and intertextual connections of literary works within specific historical contexts.
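A simple within-versus-between comparison conveys the logic (vectors and group names are toy stand-ins for the novel embeddings): if subgenre shapes textual similarity, novels should on average be closer to their own subgenre than to others.

```python
# Sketch: within-group vs between-group mean cosine similarity.
import numpy as np

def mean_sim(A, B=None):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    if B is None:                        # within-group: skip self-pairs
        S = A @ A.T
        return float(S[np.triu_indices_from(S, k=1)].mean())
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A @ B.T).mean())

rng = np.random.default_rng(0)
gothic = rng.random((30, 256))           # stand-ins for novel embeddings
sentimental = rng.random((30, 256))
print(mean_sim(gothic) - mean_sim(gothic, sentimental))
```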
As they represent one of the most complex forms of expression, literary texts continue to challenge Sentiment Analysis (SA) tools, which are often developed for other domains. At the same time, SA is becoming an increasingly central method in literary analysis itself, which raises the question of what challenges are inherent to literary SA. We address this question by probing units from a variety of literary fiction texts where humans and systems diverge in their valence scoring, seeking to relate such disagreements to semantic traits central to implicit sentiment evocation in literary theory. The contribution of this study is twofold. First, we present a corpus of valence-annotated fiction (English- and Danish-language literary texts from the 19th and 20th centuries) representing different genres. We then test whether sentences where humans and models disagree in sentiment annotation are characterized by specific semantic traits by looking at their distribution and correlation across four different corpora. We find that items where humans detected significant sentiment, but where models did not, consistently employ lower levels of arousal, dominance and interoception, and higher levels of concreteness. Furthermore, we find that the semantic aspects correlated with human-model disagreement are linked to the interiority-exteriority continuum more than to direct sensory information. Finally, we show that this interaction of features linked to implicit sentiment varies across textual domains. Our findings confirm that sentiment evocation exploits a more diverse and subtle set of semantic channels than those observed through simple sentiment analysis.
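The disagreement analysis can be outlined in a few lines (all values invented; a real study would use published psycholinguistic norms): compute per-sentence human-model divergence and correlate it with the semantic traits.

```python
# Sketch: correlating human-model valence disagreement with norms.
import pandas as pd

df = pd.DataFrame({
    "human_valence": [0.7, -0.3, 0.1, 0.9, -0.5],
    "model_valence": [0.1, -0.2, 0.0, 0.8, -0.5],
    "concreteness":  [4.2, 2.1, 3.0, 4.8, 1.9],
    "arousal":       [3.1, 4.0, 2.5, 2.0, 4.4],
})
df["disagreement"] = (df.human_valence - df.model_valence).abs()
print(df[["concreteness", "arousal"]]
      .corrwith(df["disagreement"], method="spearman"))
```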