- Introduction
- System Architecture
- Context Builder
- Dialog Generator
- Narrative Planner
- Implementation Roadmap
- References
Introduction
Over the past few years, language models have shown incredible promise in their ability to generate syntactically coherent sentences. However, their generative capabilities are limited by relatively short attention spans and an inability to maintain the complex relationships between characters, objects, and the overall story arc, which leads to a lack of coherence in the stories these models write. Despite this shortcoming, it may be possible to generate much more compelling, coherent narratives by using multiple transformers trained for specific purposes and adding some structure around them. In this document, I describe a more structured, pipeline-based approach that merges the strengths of these statistical models with human knowledge about the script-writing process.
System Architecture
The core of the system I am proposing consists of three main components:
- a context builder that uses the past story sequence and emotional arc to produce text sequences that can better condition the generation of the transformers,
- a dialog generator that uses six distinct transformers to generate dialog, and
- a narrative planner that uses a transformer to generate screen direction/action to intersperse with the dialog and help drive the plot forward.
The context builder plays a critical role in ensuring that the dialog generator and narrative planner make use of relevant details that have already been generated as part of the script when determining what to write next. It gathers past details based on a narrative frame that matches the current emotional goals of the story. Each narrative frame has associated questions that are used to fill the frame's slots; once filled, the frame produces the text used to condition the transformers.
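To make the intended dataflow concrete, here is a rough Python sketch of one possible generation loop; the ContextBuilder, DialogGenerator, and NarrativePlanner interfaces and method names are placeholders I am assuming for illustration, not a settled API.

# Hypothetical control loop wiring the three components together; the class
# and method names are placeholders, not a settled API.

def generate_scene(context_builder, dialog_generator, narrative_planner,
                   emotional_arc, first_dialog_entry, scene_description=None):
    """Walk the emotional arc, alternating dialog and screen direction."""
    script = [scene_description] if scene_description else []
    script.append(first_dialog_entry)

    for current_emotion, next_emotion in zip(emotional_arc, emotional_arc[1:]):
        # Build conditioning text from the story so far and the emotional goals.
        conditioning = context_builder.build(script, current_emotion, next_emotion)

        # Generate the next call-and-response pair of utterances.
        utterances = dialog_generator.generate(conditioning, emotion=next_emotion)
        script.extend(utterances)

        # Intersperse screen direction / action to help drive the plot forward.
        script.append(narrative_planner.generate(conditioning, utterances))

    return "\n".join(script)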
Further details for each component are provided below.
System Initialization
- Emotional arc of the story
  - A list of discrete emotions alternating between how Character X and Character Y are intended to feel or respond at that point in the narrative.
- Set of narrative frames
- First dialog entry
- Screen direction / action or scene description (optional)
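As a hypothetical illustration, these initialization inputs could be captured in a small structure like the following; the class and field names are mine, not a fixed schema.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SceneInitialization:
    """Hypothetical container for the initialization inputs listed above."""
    # Discrete emotions alternating between Character X and Character Y,
    # e.g. ["anger", "surprise", "anger", "fear", ...].
    emotional_arc: List[str]
    # The hand-made narrative frames described under the Context Builder below.
    narrative_frames: List[dict]
    # First dialog entry, e.g. 'Character_X: "I don't like what you've done."'
    first_dialog_entry: str
    # Optional opening screen direction / action or scene description.
    scene_description: Optional[str] = None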
Context Builder
This component is responsible for producing text with which to condition the transformers in the Dialog Generator and Narrative Planner.
Input
- Screen direction / action
- The current and next emotion in the emotional arc
Output
- Text with which to condition later generation
  - i.e., screen direction / action, dialog
Mechanism
Pick a narrative frame from the set of hand-made narrative frames that matches the expected input and output emotions for the next narrative event and dialog. For each of the event attributes in the frame, use a span-retrieval transformer to find the associated property in the story text so far. Each frame contains:
- Input emotion (the emotion of Character X)
- Output emotion (how Character Y is intended to react/respond)
- List of event attributes with associated questions for retrieval from the script text
Example frame (Character_X about to attack Character_Y):
{
  "character_x_emotion": {anger},
  "character_y_emotion": {surprise, anger, fear},
  "event_attributes": {
    "Character_X is holding <object>": "What is Character_X holding?",
    "Character_X is <position/location>": "Where is Character_X?",
    "Character_Y is <position/location>": "Where is Character_Y?"
  }
}
From this, you would create a short paragraph (using the event attributes, Character_X's emotion, and the last two pairs of dialog) to condition the next bit of generation. For example, the frame above would be converted to the following via simple slot-filling generation:
Text for Conditioning:
Character_X is angry. Character_X is holding a knife. Character_X is standing up. Character_Y is in front of Character_X.
Character_X: "I don't like what you've done, Character_Y."
Character_Y: "Well, I don't really like you, Character_X."
We can then predict the next bit of dialog, or the next bit of action/scene context using the Dialog Generator and Narrative Planner components.
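Below is a minimal sketch of this mechanism, assuming the frame is stored as a Python dict mirroring the example above and that span retrieval uses an extractive QA pipeline with a SQuAD 2.0 fine-tuned RoBERTa checkpoint; the checkpoint name, the build_conditioning_text helper, and the placeholder-stripping heuristic are assumptions for illustration.

from transformers import pipeline

# Extractive QA model for span retrieval; a SQuAD 2.0 fine-tuned RoBERTa
# checkpoint (this particular model name is an assumption).
span_retriever = pipeline("question-answering",
                          model="deepset/roberta-large-squad2")

# The example frame above, written as a Python dict.
attack_frame = {
    "character_x_emotion": {"anger"},
    "character_y_emotion": {"surprise", "anger", "fear"},
    "event_attributes": {
        "Character_X is holding <object>": "What is Character_X holding?",
        "Character_X is <position/location>": "Where is Character_X?",
        "Character_Y is <position/location>": "Where is Character_Y?",
    },
}

def build_conditioning_text(frame, story_text, current_emotion, last_dialog_pairs):
    """Fill the frame's slots from the script so far and emit conditioning text."""
    lines = [f"Character_X is {current_emotion}."]
    for template, question in frame["event_attributes"].items():
        answer = span_retriever(question=question, context=story_text)
        if answer["answer"]:  # skip slots the retriever could not fill
            # Simple slot filling: drop the <placeholder> and append the span.
            prefix = template.split("<")[0].rstrip()
            lines.append(f"{prefix} {answer['answer']}.")
    # Append the last two pairs of dialog verbatim.
    return " ".join(lines) + "\n" + "\n".join(last_dialog_pairs)

On the running example, this should roughly reproduce the conditioning paragraph shown above, with the last two dialog lines appended.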
While transformers are good at producing syntactically correct sentences, they often struggle to maintain coherence across long spans of text. They can quickly veer off on tangents unrelated to the narrative spine of the scene and introduce new objects or characters while forgetting others. Not only is this detrimental to the causal coherence of the story and confusing for the human reader, it also precludes the transformer from foreshadowing later events, since it will "forget" what it has already written.
Dialog Generator
Building off the work done by the students in CS 338 this quarter, this component consists of six transformers, each trained on a set of dialogs coded with one of the six discrete emotions in the EmotionLines dataset [1].
Input
- Screen direction / action
- The last utterance from the other character
- The emotion of the utterance to generate
Output
- The next utterances (call and response) of the characters
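At generation time, the component could look something like the sketch below, assuming six fine-tuned GPT-2-small checkpoints saved under per-emotion paths; the paths, label set, and prompt format are assumptions. Producing the call-and-response pair would amount to calling the function twice with the appropriate emotions.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Placeholder label set; the exact six labels depend on the EmotionLines coding.
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

# One fine-tuned GPT-2-small checkpoint per emotion (paths are assumptions),
# all sharing the stock GPT-2 tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
emotion_models = {
    emotion: GPT2LMHeadModel.from_pretrained(f"checkpoints/dialog-{emotion}")
    for emotion in EMOTIONS
}

def generate_utterance(screen_direction, last_utterance, emotion,
                       max_new_tokens=40):
    """Generate the next utterance with the model fine-tuned for `emotion`."""
    model = emotion_models[emotion]
    prompt = f"{screen_direction}\n{last_utterance}\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs,
                                do_sample=True,
                                top_k=50,
                                max_new_tokens=max_new_tokens,
                                pad_token_id=tokenizer.eos_token_id)
    # Strip the prompt tokens and return only the newly generated text.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)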
Narrative Planner
This component consists of a generative transformer (e.g. GPT-2 [2]) to help drive the narrative forward by generating additional scene setup and action to go along with the dialog. This model should be fine-tuned on datasets such as RocStories [3] or GLUCOSE [4] to help it generate causally coherent and meaningful actions for the script.
Input
- Screen direction / action
- The last two character utterances
Output
- Screen direction / action
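One way the fine-tuning step might look, sketched with the HuggingFace Trainer API over a plain-text file of RocStories-style stories; the file path, hyperparameters, and the flattening of the dataset to one story per line are assumptions rather than settled choices. Generation from the resulting model would mirror the Dialog Generator sketch above.

from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, TextDataset, Trainer,
                          TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")

# RocStories (or GLUCOSE) flattened to one story per line; the path is a
# placeholder for wherever the students stage the data.
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="data/rocstories_train.txt",
                            block_size=128)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints/narrative-planner",
                           num_train_epochs=3,
                           per_device_train_batch_size=2),
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm=False),
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("checkpoints/narrative-planner")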
Implementation Roadmap
The first phase consists of the initial work to build a skeleton for the system, including class definitions and functions that make it easy to access the components needed for generation, prediction, and retrieval. The skeleton should have dummy dataflows set up and be well commented so the students know where to put their code once they have built the models and supporting code that carry out the desired functionality. Ultimately, I think we want the focus of the undergrads to be on model building and frame generation, not on the setup of the system skeleton.
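To make the intent concrete, the skeleton could look something like the following: stub classes with dummy return values and TODO comments marking where the students' models slot in (all names here are placeholders, not a settled design).

class ContextBuilder:
    """Builds conditioning text from the script so far and the emotional arc."""

    def __init__(self, narrative_frames, span_retriever=None):
        self.narrative_frames = narrative_frames
        self.span_retriever = span_retriever  # TODO: RoBERTa QA model goes here

    def build(self, script, current_emotion, next_emotion):
        # TODO (CS 338): frame selection, span retrieval, and slot filling.
        return "DUMMY CONDITIONING TEXT"


class DialogGenerator:
    """Generates the next utterance(s) with emotion-specific transformers."""

    def __init__(self, emotion_models=None):
        self.emotion_models = emotion_models or {}  # TODO: six fine-tuned GPT-2s

    def generate(self, conditioning, emotion):
        # TODO (CS 338): condition the per-emotion model and decode utterances.
        return ['Character_X: "DUMMY LINE"', 'Character_Y: "DUMMY REPLY"']


class NarrativePlanner:
    """Generates screen direction / action to intersperse with the dialog."""

    def __init__(self, planner_model=None):
        self.planner_model = planner_model  # TODO: GPT-2 on RocStories/GLUCOSE

    def generate(self, conditioning, utterances):
        # TODO (CS 338): generate causally coherent action for the scene.
        return "DUMMY SCREEN DIRECTION"

With dummy returns like these, the full pipeline can be exercised end to end before any model is trained.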
The work in the second phase can be done in parallel and would be where the undergrads in CS 338 can and should be heavily involved:
- Build the Narrative Planner
  - Fine-tune the narrative planner transformer
    - GPT-2-large trained on RocStories / GLUCOSE
- Build the Dialog Generator
  - Fine-tune the emotion transformers
    - 6 GPT-2-small models trained on EmotionLines (see the preprocessing sketch below)
- Build the Context Builder
  - Fine-tune the span-retrieval transformer
    - RoBERTa-large [5] trained on SQuAD 2.0 [6]
- Build the set of narrative frames
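For the EmotionLines item, a small preprocessing sketch such as the one below could produce one training file per emotion; the JSON field names, file layout, and six-label set are assumptions about how the dataset has been exported locally. Each per-emotion file would then be used with the same Trainer recipe sketched for the Narrative Planner, saving one GPT-2-small checkpoint per emotion.

import json

# Placeholder label set; the exact six labels depend on how the
# EmotionLines annotations are mapped.
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

# Assumes EmotionLines has been exported locally as JSON records with
# "utterance" and "emotion" fields; adjust to the actual release format.
with open("data/emotionlines.json") as f:
    records = json.load(f)

for emotion in EMOTIONS:
    utterances = [r["utterance"] for r in records if r["emotion"] == emotion]
    with open(f"data/emotionlines_{emotion}.txt", "w") as out:
        out.write("\n".join(utterances))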
A third phase consists of work that could be done in the future, assuming the overall approach proposed above works out. This would include exploring other decoding techniques (e.g. nucleus sampling [7]), automatically creating narrative frames using commonsense knowledge graphs [8], and devising better methods for deciding which narrative frame should be used next.
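As one example of such a decoding experiment, nucleus (top-p) sampling is available directly through the HuggingFace generate API; the base model and the p threshold below are placeholders.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # or any fine-tuned checkpoint

inputs = tokenizer("Character_X is angry. Character_X is holding a knife.",
                   return_tensors="pt")
# Nucleus (top-p) sampling [7]: sample only from the smallest set of tokens
# whose cumulative probability exceeds p (top_k=0 disables top-k filtering).
output_ids = model.generate(**inputs,
                            do_sample=True,
                            top_p=0.92,
                            top_k=0,
                            max_new_tokens=40,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))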