Hello, following up on my previous query about stopwords: my very large dataset was cleaned of URL patterns before being used for modeling, like so:
This is similar to the example you provided in your Colab notebook on Trump tweets. The model produced 50 topics, 23 of which contain 'https' or 'http' as part of the topic label. After producing the model, I rechecked the cleaned data and could not find any rows matching the pattern 'http'. I'm perplexed as to how this could happen. Should I add 'http' and 'https' to the stop words and retrain completely? P.S. I am using the
Replies: 1 comment 2 replies
If the words 'https' and 'http' do not appear in the documents on which you trained the model, then they cannot end up in the topic representations. That suggests something is going wrong in preprocessing and that the documents you passed to the model still contain these words. I would therefore advise checking the input data to make sure these words are not in it. Could you also share all the code for training your model? Perhaps we can find something there.
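As a starting point for that check, here is a minimal sketch that scans the documents for leftover URL fragments. It assumes the training input is a plain Python list of strings; the function name `find_docs_with_terms` and the sample `docs` list are hypothetical and only for illustration. The search is case-insensitive because a case-sensitive lookup for 'http' would miss variants such as 'HTTPS'.

```python
import re


def find_docs_with_terms(docs, terms=("http", "https")):
    """Return (index, document) pairs whose text contains any of the terms."""
    # Case-insensitive, so "HTTPS://..." is caught as well as "http://..."
    pattern = re.compile("|".join(re.escape(t) for t in terms), flags=re.IGNORECASE)
    return [
        (i, doc)
        for i, doc in enumerate(docs)
        if isinstance(doc, str) and pattern.search(doc)
    ]


# Hypothetical stand-in for the exact list that was passed to the model
docs = [
    "great rally tonight",
    "read more at HTTPS://t.co/abc123",
    "link in bio http ://example.com",
]

hits = find_docs_with_terms(docs)
print(f"{len(hits)} of {len(docs)} documents still contain a URL fragment")
for i, doc in hits:
    print(i, repr(doc))
```

It is worth running this on the very object handed to the model rather than on the cleaned DataFrame column, since the two can drift apart if the cleaning step returned a copy that was never assigned back.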