Document-topic distribution #2

Open
devanshrj opened this issue May 9, 2023 · 2 comments

@devanshrj

Hi, thank you for this awesome work!

I would like to use KDTM to generate topics and document-topic distribution on a corpus containing 4.4M tweets (each tweet can be considered a document). Can you let me know how I can obtain the document-topic distribution? The closest method I can find to this is save_document_representations(), but I am not sure if it's the same thing.

Also, my dataset does not have any labels, so I wanted to know if labels are a part of the training process or if they are optional.

Thanks in advance!

@ahoho
Owner

ahoho commented May 9, 2023

Thanks for your interest!

To your first question, that function will get document-topic distributions, but it's just a single sample. For a later paper, we modified the function to sample multiple times and take the mean (if I recall correctly, there's no analytical mean for a logistic-normal). You can see the modified code in this branch. In fact, if my commit history is to be trusted, you can view the exact changes here.
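
As a rough illustration of the sample-and-average idea (a minimal sketch, not the code from the repository or the branch mentioned above): assume the encoder produces a mean `mu` and standard deviation `sigma` for a diagonal Gaussian over topic logits, and that the document-topic vector is the softmax of a draw from it. The names `logistic_normal_mean`, `mu`, `sigma`, and `n_samples` are hypothetical and chosen only for this example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def logistic_normal_mean(mu, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[softmax(z)] for z ~ N(mu, diag(sigma^2)).

    Assumption: mu and sigma are per-topic values for a single document.
    """
    rng = random.Random(seed)
    acc = [0.0] * len(mu)
    for _ in range(n_samples):
        # One draw from the Gaussian, mapped onto the simplex.
        z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        theta = softmax(z)
        for i, t in enumerate(theta):
            acc[i] += t
    # Averaging the samples approximates the mean, which has no closed form.
    return [a / n_samples for a in acc]


# Hypothetical encoder output for one document with three topics.
mu = [1.0, 0.0, -1.0]
sigma = [0.5, 0.5, 0.5]
theta_mean = logistic_normal_mean(mu, sigma)
```

A single sample corresponds to `n_samples=1`; increasing it trades compute for a less noisy estimate, which may matter at the scale of millions of documents.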

Labels (as well as covariates) are optional and all reported results are unsupervised.

Not that you asked, but you should also note that we realized the NPMI implementation in this repo (ported from the original Scholar paper) is nonstandard, and I believe we calculate it during training. You should prefer implementations from Gensim, OCTIS, Palmetto, or us. Of course, the best bet is to forgo automated metrics altogether 😉

@ahoho
Owner

ahoho commented May 9, 2023

Another thing you didn't ask: we've found that MALLET works surprisingly well with tweets, in case you haven't tried it already and are looking for a good baseline.
