Best practices for modeling topics in dialogs #1745

lpietrobon · 2024-01-12T19:28:11Z


I have a dataset representing a long dialog among several characters, and I'd like to extract the topics being discussed. I was about to embed sentences, cluster them, and describe the clusters... and then I found out about BERTopic: thank you for doing all the hard work!

I am trying to understand how best to represent my dialog in BERTopic. The dialog I am dealing with is "chatty": I have the sender, timestamp, and text of each turn/message, but most messages are rather short and often rely on context (i.e., previous messages or "reply-to" messages) for their meaning.

Question: have others tried to model dialog/chat datasets? Is there a place where I can read up on best practices for this kind of dataset?

A few challenges I've encountered so far in bringing context to individual messages:

1. I tried building documents with a sliding window over the last N messages. The output changes as a function of N, both in the number of clusters and in the content/representation of each cluster, so this seems an important parameter to set.
   a. How can I go about finding a good value for N? (I mean beyond the "vibes check" of trying a few and seeing what happens.)
   b. I could also create several BERTopic models for different values of N and then merge them. Does this sound like a good idea? Is there a principled way of telling whether it works better than using a single value of N?
2. Some relevant context can be explicitly reconstructed: participants sometimes explicitly reference the earlier text they are replying to, so I can easily reconstruct the graph of dependencies when this occurs. What would be a good way to leverage this information as context for the topic modeling?
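
To make the two ways of adding context concrete, here is a minimal sketch of how the documents could be built before handing them to BERTopic. It is only an illustration under assumptions: the message layout (dicts with `id`, `text`, and `reply_to` keys), the function names `sliding_window_docs` and `reply_context_docs`, and the `max_depth` cutoff are all hypothetical, not part of BERTopic or of the actual dataset.

```python
def sliding_window_docs(messages, n):
    """Build one document per message: that message plus up to n-1 preceding ones.

    `messages` is assumed to be in chronological order (hypothetical layout).
    """
    texts = [m["text"] for m in messages]
    docs = []
    for i in range(len(texts)):
        start = max(0, i - n + 1)  # clamp the window at the start of the dialog
        docs.append(" ".join(texts[start:i + 1]))
    return docs


def reply_context_docs(messages, max_depth=2):
    """Build one document per message: that message prefixed by its reply-to ancestors.

    Follows the hypothetical `reply_to` field upward for at most `max_depth` hops,
    which also guards against accidental cycles in the reply graph.
    """
    by_id = {m["id"]: m for m in messages}
    docs = []
    for m in messages:
        chain = [m["text"]]
        parent_id = m.get("reply_to")
        depth = 0
        while parent_id in by_id and depth < max_depth:
            parent = by_id[parent_id]
            chain.append(parent["text"])
            parent_id = parent.get("reply_to")
            depth += 1
        # Oldest ancestor first, the message itself last.
        docs.append(" ".join(reversed(chain)))
    return docs


# Toy dialog with two interleaved threads (made-up example data).
messages = [
    {"id": 1, "text": "Anyone tried the new trail?", "reply_to": None},
    {"id": 2, "text": "Lunch at noon?", "reply_to": None},
    {"id": 3, "text": "Yes, it was muddy.", "reply_to": 1},
    {"id": 4, "text": "Sure.", "reply_to": 2},
]

window_docs = sliding_window_docs(messages, n=2)
reply_docs = reply_context_docs(messages)
```

On this toy dialog the sliding window mixes the two threads (message 3 gets paired with "Lunch at noon?"), while the reply-graph version pairs each reply with the message it answers. Either list of strings can then be passed as the documents to `BERTopic().fit_transform(docs)`.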

Replies: 0 comments