Review of a lecture 2: A Diversity-Promoting Objective Function for Neural Conversation Models
Idx | Contents |
---|---|
Topic | We suggest that the traditional objective function, i.e., the likelihood of output (response) given input (message), is unsuited to response generation tasks. Instead we propose using Maximum Mutual Information (MMI) as the objective function in neural models. |
Dataset | Twitter Conversation Triple Dataset, OpenSubtitles dataset |
Github | Torch implementation |
Conclusion | We show that use of MMI results in a clear decrease in the proportion of generic response sequences, generating correspondingly more varied and interesting outputs. |
Analysis 1
- In part at least, this behavior can be ascribed to the relative frequency of generic responses like "I don't know" in conversational datasets, in contrast with the relative sparsity of more contentful alternative responses.
- Intuitively, it seems desirable to take into account not only the dependency of responses on messages, but also the inverse: the likelihood that a message would be provided for a given response.
- MMI objective (derivation below): $\hat{T} = \arg\max_T \left\{ \log \frac{p(S,T)}{p(S)\,p(T)} \right\} = \arg\max_T \left\{ \log p(T|S) - \log p(T) \right\}$
  - the paper applies this criterion only at decoding (test) time, not as a training objective
    - reason 1: directly optimizing MMI during training is nontrivial
    - reason 2: it is time-consuming
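
For completeness, the equality in the MMI objective above is just Bayes' rule applied to $p(S,T)$:

$\log \frac{p(S,T)}{p(S)\,p(T)} = \log \frac{p(T|S)\,p(S)}{p(S)\,p(T)} = \log p(T|S) - \log p(T)$
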
- MMI-antiLM: $\log p(T|S) - \lambda \log p(T)$
  - limitation: ungrammatical responses
    - the anti-LM term penalizes not only high-frequency, generic responses, but also fluent ones, and thus can lead to ungrammatical outputs
  - solution: replace the language model $p(T)$ with $U(T)$
  - $U(T) = \prod_{k=1}^{N_T} p(t_k \mid t_1, \ldots, t_{k-1}) \cdot g(k)$, where $g(k) = 1$ if $k \leq \gamma$ and $g(k) = 0$ otherwise
    - intuition: as decoding proceeds, a seq2seq decoder relies less on the source and more on what it has already generated, so the first few words largely determine the rest of the decoded sentence; penalizing only those first $\gamma$ tokens removes generic openings without hurting the fluency of later tokens
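
A minimal sketch of how this anti-LM scoring could look at decoding time, assuming per-token log-probabilities from the seq2seq model and from a separate language model are already available. The function name, argument names, and default values are illustrative, not taken from the paper's Torch implementation:

```python
from typing import List

def mmi_antilm_score(seq2seq_logprobs: List[float],
                     lm_logprobs: List[float],
                     lam: float = 0.5,
                     gamma: int = 5) -> float:
    """Score a candidate response T with log p(T|S) - lambda * log U(T).

    seq2seq_logprobs[k] : log p(t_{k+1} | S, t_1..t_k)  from the seq2seq model
    lm_logprobs[k]      : log p(t_{k+1} | t_1..t_k)     from the language model
    U(T) keeps the language-model penalty only for the first `gamma` tokens
    (g(k) = 1 for k <= gamma, 0 afterwards).
    """
    log_p_t_given_s = sum(seq2seq_logprobs)
    # log U(T): sum of LM log-probabilities over the first gamma tokens only
    log_u_t = sum(lm_logprobs[:gamma])
    return log_p_t_given_s - lam * log_u_t
```

A beam-search decoder would rank hypotheses by this penalized score instead of plain $\log p(T|S)$; tokens beyond the first $\gamma$ keep their full seq2seq probability, which is why the anti-LM term no longer breaks fluency.
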
- MMI-bidi: $(1-\lambda)\,\log p(T|S) + \lambda\,\log p(S|T)$
  - limitation: direct decoding is intractable
    - reason: $\log p(S|T)$ can only be computed after the complete target $T$ has been generated, so it cannot guide token-by-token decoding
  - solution: first generate N-best lists with $\log p(T|S)$, then rerank them using $\log p(S|T)$ (see the reranking sketch below)
- the length of the generated sequence also matters, so a length reward is added to the final score
  - $\mathrm{Score}(T) = \log p(T|S) - \lambda \log U(T) + \gamma N_T$, where $N_T$ is the number of tokens in $T$
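
A minimal sketch of the N-best reranking step, assuming each hypothesis already carries the forward score $\log p(T|S)$, the backward score $\log p(S|T)$, and its length. Folding the length reward $\gamma N_T$ into the bidi score this way is an assumption made here for illustration, and the names are made up rather than taken from the paper's code:

```python
from typing import List, NamedTuple

class Hypothesis(NamedTuple):
    text: str
    log_p_t_given_s: float  # forward score log p(T|S) from beam search
    log_p_s_given_t: float  # backward score log p(S|T), computed after generation
    length: int             # number of tokens N_T

def rerank_nbest(nbest: List[Hypothesis],
                 lam: float = 0.5,
                 gamma: float = 0.1) -> List[Hypothesis]:
    """Rerank an N-best list generated with log p(T|S).

    score(T) = (1 - lam) * log p(T|S) + lam * log p(S|T) + gamma * N_T
    The length reward counteracts the bias toward short, generic responses.
    """
    def score(h: Hypothesis) -> float:
        return ((1 - lam) * h.log_p_t_given_s
                + lam * h.log_p_s_given_t
                + gamma * h.length)

    return sorted(nbest, key=score, reverse=True)
```
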
- hyperparameters $\gamma$ and $\lambda$ are optimized by MERT on N-best lists generated by beam search (simplified sketch below)
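
The paper tunes $\gamma$ and $\lambda$ with MERT (minimum error rate training) against an automatic metric on N-best lists. As a rough, simplified stand-in rather than the paper's MERT procedure, a grid search that reranks held-out N-best lists and keeps the setting with the best corpus-level metric (e.g. BLEU) captures the same idea; the `metric` callable and the grid values below are placeholders, and the code reuses `Hypothesis` and `rerank_nbest` from the sketch above:

```python
from itertools import product
from typing import Callable, List

def tune_lambda_gamma(nbest_lists: List[List["Hypothesis"]],
                      references: List[str],
                      metric: Callable[[List[str], List[str]], float],
                      lams=(0.1, 0.25, 0.5, 0.75),
                      gammas=(0.0, 0.05, 0.1, 0.2)):
    """Brute-force stand-in for MERT: pick the (lam, gamma) pair whose
    reranked 1-best outputs score highest under `metric` on a dev set.
    Assumes Hypothesis and rerank_nbest from the sketch above are in scope.
    """
    best = (None, None, float("-inf"))
    for lam, gamma in product(lams, gammas):
        # take the 1-best response of each reranked N-best list
        outputs = [rerank_nbest(nbest, lam, gamma)[0].text for nbest in nbest_lists]
        score = metric(outputs, references)
        if score > best[2]:
            best = (lam, gamma, score)
    return best
```
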