-
I'm using online topic modeling with River and update the topic model with 1k documents per batch. I've noticed instances where BERTopic assigns a document to a different topic even though an extremely similar document and topic already exist. Below are some examples where different news headlines about Presley's death were put into different topics. I'm not sure how to approach this and would greatly appreciate any guidance on how to steer the model better. Huge fan of BERTopic!

| Topic Name |
| --- |
| 73_daughter elvis_daughter elvis presley_abuse_lord |
| 87_arrest lisa marie_arrest lisa_cardiac arrest lisa_marie presley died |
| 61_relationships_passion_introverts_raider |
| 45_crying_slut_tears_im crying |
| 51_sanha_dogs_rampal ji maharaj_sant |
| 70_parker_wes christian_singer daughter elvis_baby shark |
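For reference, here is a minimal sketch of the kind of online setup described above, following the River-wrapper pattern from BERTopic's online topic modeling documentation. The wrapper name, the DBSTREAM choice, the parameter values, and the sample batches are all illustrative assumptions, not the actual code from the question.

```python
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer
from river import cluster, stream
from sklearn.decomposition import IncrementalPCA


class RiverWrapper:
    """Wrap a River clustering model so BERTopic can call partial_fit on it."""

    def __init__(self, model):
        self.model = model

    def partial_fit(self, embeddings):
        # Learn each reduced embedding one by one, then predict its cluster.
        labels = []
        for embedding, _ in stream.iter_array(embeddings):
            self.model.learn_one(embedding)
            labels.append(self.model.predict_one(embedding))
        self.labels_ = labels
        return self


# Illustrative stand-in for the real 1k-document batches.
doc_batches = [
    ["Lisa Marie Presley died after a cardiac arrest.",
     "Elvis Presley's daughter has died at 54.",
     "Singer Lisa Marie Presley, Elvis's only child, has died."],
    ["Stock markets closed higher on Friday.",
     "The central bank left interest rates unchanged.",
     "Tech shares led the market rally."],
]

cluster_model = RiverWrapper(cluster.DBSTREAM())
topic_model = BERTopic(
    # n_components must not exceed the batch size; real 1k-document
    # batches allow larger values.
    umap_model=IncrementalPCA(n_components=2),
    hdbscan_model=cluster_model,
    vectorizer_model=OnlineCountVectorizer(),
)

for batch in doc_batches:
    topic_model.partial_fit(batch)

# The underlying River model can be inspected after each batch,
# e.g. DBSTREAM's cluster centers.
print(cluster_model.model.centers)
```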
-
We are facing similar problems... Remark: our experiment was done with 9K documents, each being about one page of text.

First, we observed that BERTopic is not reproducible: sending a document several times brings back different results. Searching on this, we found documentation/discussions explaining that UMAP includes random behaviour in its functioning. We fixed this by enforcing all vectors to be processed one by one (especially in the training process) with an identical random sequence init (seed). We got a system where, when sending a document already used in the training process, we properly get it back with a cosine similarity of 1 and a Euclidean distance of 0.

We then observed that BERTopic is hyper-sensitive to the input content: when changing a single word in a document already used for the training, the nearest document (by both cosine and Euclidean distance) becomes a document with a lot of differences (perhaps 1/3 of the document is different). Reading somewhere that HDBSCAN does some approximations in its calculations, we then tried to replace UMAP+HDBSCAN with PCA+KMeans... same hyper-sensitivity.

This is really not the behaviour we expected from such a tool: how can BERTopic group similar documents together if its vectorisation is hyper-sensitive to very small changes? Bug?
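For what it's worth, a minimal sketch of the seeding fix described above, assuming the standard approach from the BERTopic FAQ: pass a fixed random_state to UMAP before handing it to BERTopic. The parameter values are illustrative; note that fixing the seed disables UMAP's parallelism, so training becomes slower.

```python
from bertopic import BERTopic
from umap import UMAP

# Fixing random_state makes the dimensionality reduction deterministic,
# so repeated runs on the same data give the same topics.
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42,
)
topic_model = BERTopic(umap_model=umap_model)
```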
-
@vantubbe In part, it depends on the sub-models you used to perform the online topic modeling, such as the embedding model, dimensionality reduction, clustering, etc. All of those can greatly influence how the topics are clustered. Perhaps the embedding model is trained to focus on a specific part of the text and less on the context, or perhaps the dimensionality reduction algorithm needs more or less …

It is difficult to say without actually seeing the code and knowing which sub-models are being used. Having said that, it might be worthwhile to check out some of the parameter tunings here. Also, you could use the .clusters attribute in the River algorithm to check out some of the clusters …

@EtienneAb3d I might be mistaken here, but based on the sub-models that you mention, we are not talking about online topic modeling, right?
You can find a bit more about UMAP and this process in the FAQ here. You can also find a link there to the UMAP documentation, where this is discussed in a bit more detail.
It is difficult to say without seeing the actual code, since the hyperparameters of the sub-models can influence this greatly, but it may also depend on the chosen embedding model and the number of topics that are being generated. For example, a word embedding model might place more emphasis on single words than a sentence-transformer model would. Moreover, if you generate many topics, which you can control with k-Means or …
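As a rough illustration of that last point, here is a minimal sketch of controlling the number of topics directly by swapping k-Means in for HDBSCAN; the n_clusters value is an arbitrary assumption.

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans

# Any model exposing fit/predict can replace HDBSCAN; k-Means pins the
# number of topics to n_clusters (and produces no outlier topic).
cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
```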
It may not necessarily be that the vectorization (at least if we are talking about the BoW step) is hyper-sensitive; the process before that may be the culprit here. The topic representation step is mostly influenced by how the documents are brought together, and to a lesser extent by changing a single word. My guess would be that there is much to gain in the steps before the BoW step. Having said that, could you share some of your code illustrating this issue?
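One quick way to check where the sensitivity comes from, as a hedged sketch (the model name and texts are illustrative): compare the raw embeddings of two documents that differ by a single word. If their similarity is close to 1, the embedding step is fine and the instability lies in the reduction and clustering steps that follow.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

doc_a = "Lisa Marie Presley died after a cardiac arrest at her home."
doc_b = "Lisa Marie Presley passed after a cardiac arrest at her home."

# Embed both documents and compare them before any reduction/clustering.
emb = model.encode([doc_a, doc_b])
print(cosine_similarity([emb[0]], [emb[1]]))  # expected: close to 1.0
```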