
Commit

conclusions: add limitation about self-enhancement bias
miltondp committed May 23, 2024
1 parent 017cd9f commit 5bd238d
Showing 1 changed file with 5 additions and 4 deletions.
9 changes: 5 additions & 4 deletions content/05.conclusions.md
@@ -13,15 +13,16 @@ We found that most paragraphs were enhanced, while in some cases the model remov
The AI model also highlighted certain paragraphs that were difficult to revise, which could pose challenges for human readers as well.


We designed section-specific prompts to guide the revision of text using GPT-3.
Surprisingly, in one Methods section, the model detected an error, overlooked by the human authors, in a reference to a symbol in an equation.
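The section-specific prompting described above can be sketched as follows; the template texts and the `build_revision_prompt` helper are hypothetical illustrations, not the Manubot AI Editor's actual prompts.

```python
# Sketch of section-specific prompt selection: one template per manuscript
# section, with a generic fallback. The templates are illustrative only.
SECTION_PROMPTS = {
    "abstract": (
        "Revise the following abstract for clarity and conciseness, "
        "keeping the background information about the research problem:"
    ),
    "methods": (
        "Revise the following Methods paragraph, preserving all equation "
        "symbols and cross-references exactly:"
    ),
    "default": "Revise the following paragraph for clarity and conciseness:",
}

def build_revision_prompt(section: str, paragraph: str) -> str:
    """Pick the template for `section` (falling back to a generic one)
    and append the paragraph to be revised."""
    template = SECTION_PROMPTS.get(section, SECTION_PROMPTS["default"])
    return f"{template}\n\n{paragraph}"
```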
Our approach has some limitations.
We found that revising abstracts was more challenging for the model, as revisions often removed background information about the research problem.
There are opportunities to improve the AI-based revisions, such as further refining prompts using few-shot learning [@doi:10.1145/3386252], or fine-tuning the model using an additional corpus of academic writing focused on particularly challenging sections.
Fine-tuning using preprint-publication pairs [@doi:10.1371/journal.pbio.3001470] may help to identify sections or phrases likely to be changed during peer review.
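The few-shot refinement mentioned above could look roughly like this; the message layout and example pairs are assumptions for illustration, and real example pairs could, as the text suggests, come from preprint-publication pairs.

```python
# Sketch of a few-shot revision prompt: prepend (original, revised)
# example pairs before the paragraph to revise. The wording and the
# placeholder examples are hypothetical, not the workflow's actual prompt.
def build_few_shot_prompt(examples, paragraph):
    parts = ["Revise the paragraph in the style shown by the examples.\n"]
    for original, revised in examples:
        parts.append(f"Original: {original}\nRevised: {revised}\n")
    parts.append(f"Original: {paragraph}\nRevised:")
    return "\n".join(parts)
```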
Our approach processed each paragraph of the text but lacked a contextual thread between queries, which mainly affected the Results and Methods sections.
Using chatbots that retain context could enable the revision of individual paragraphs while considering previously processed text.
We plan to update our workflow to support this strategy.
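The context-retaining strategy can be sketched as below, assuming a standard chat-message format; `call_model` is an injected callable (for example, a wrapper around a chat-completion API), and this sketch itself makes no network calls.

```python
# Sketch of revising paragraphs with a contextual thread: each revised
# paragraph is appended to the chat history, so later queries are made
# with the previously processed text as conversational context.
def revise_with_context(paragraphs, call_model):
    messages = [
        {"role": "system", "content": "You revise scientific manuscripts."}
    ]
    revised = []
    for paragraph in paragraphs:
        messages.append({"role": "user", "content": f"Revise:\n{paragraph}"})
        revision = call_model(messages)  # injected model wrapper (assumption)
        messages.append({"role": "assistant", "content": revision})
        revised.append(revision)
    return revised
```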
Regarding the choice of LLM, open and semi-open models such as BLOOM [@arxiv:2211.05100], Meta's Llama 2 [@arxiv:2307.09288], and Mistral 7B [@arxiv:2310.06825] are growing in popularity and capability, but they lack an API as user-friendly as OpenAI's.
We used the LLM-as-a-Judge method to automatically assess the quality of revisions; this method has limitations, such as self-enhancement bias, where LLMs tend to favor text they generated themselves.
Although our approach revises human-generated text rather than generating answers from scratch, we addressed this potential issue by using two LLM judges, GPT-3.5 and GPT-4, which have shown limited self-enhancement bias and high alignment with human preferences [@arxiv:2306.05685].
We also found that the automated assessments in this study were consistent with our human evaluations.
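A pairwise LLM-as-a-Judge query can be sketched as follows; the verdict format is an assumption rather than the exact setup used in the study, and running the comparison in both orderings is one common way to partially counter position bias.

```python
# Sketch of pairwise LLM-as-a-Judge prompts. The same pair is presented
# in both orders, so position bias can be detected by comparing verdicts.
# The prompt wording is hypothetical, not the study's actual prompt.
def build_judge_prompts(original, revision):
    def prompt(text_a, text_b):
        return (
            "Which paragraph communicates its ideas more clearly and "
            "concisely? Answer with 'A' or 'B'.\n\n"
            f"Paragraph A:\n{text_a}\n\n"
            f"Paragraph B:\n{text_b}"
        )
    # one query per ordering of the two texts
    return [prompt(original, revision), prompt(revision, original)]
```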
Despite these limitations, we found that the models captured the main ideas and often produced revisions that communicated the intended meaning more clearly and concisely.
While our study focused on OpenAI's GPT-3 and GPT-3.5 Turbo for revisions, the Manubot AI Editor is designed to support future models.

