conclusions: improve text about limitation about self-enhancement bias
miltondp committed May 23, 2024
1 parent 5bd238d commit bc99738
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions content/05.conclusions.md
@@ -21,8 +21,10 @@
 Our approach processed each paragraph of the text but lacked a contextual thread across the entire document.
 Using chatbots that retain context could enable the revision of individual paragraphs while considering previously processed text.
 We plan to update our workflow to support this strategy.
 Regarding the LLM used, open and semi-open models, such as BLOOM [@arxiv:2211.05100], Meta's Llama 2 [@arxiv:2307.09288], and Mistral 7B [@arxiv:2310.06825], are growing in popularity and capabilities, but they lack the user-friendly OpenAI API.
-We used the LLM-as-a-Judge method to automatically assess the quality of revisions, which has limitations such as the self-enhancement bias where LLMs tend to favor text generated by them.
-Although our approach is based on revising human-generated text (and not generating answers from scratch), we used two LLM judges to address this potential issue: GPT 3.5 and GPT 4, which showed limited self-enhancement bias and high alignment with human preferences [@arxiv:2306.05685], and we found in this study that the automated assessments were consistent with our human evaluations.
+We used the LLM-as-a-Judge method to automatically assess the quality of revisions, which has limitations such as the self-enhancement bias where LLMs tend to favor text generated by themselves.
+Although our approach is based on revising human-generated text (rather than generating answers from scratch), we used two LLM judges, GPT-3.5 and GPT-4, to address this potential issue.
+These two models have shown limited self-enhancement bias and high alignment with human preferences [@arxiv:2306.05685].
+In this study, we found that the automated assessments were consistent with our human evaluations.
 Despite these limitations, we found that models captured the main ideas and generated a revision that often communicated the intended meaning more clearly and concisely.
 While our study focused on OpenAI's GPT-3 and GPT-3.5 Turbo for revisions, the Manubot AI Editor is prepared to support future models.
