Can topic labels contain non-words? #930

salderma · 2023-01-15T03:35:43Z

Hello,

Following from my previous query about the stopwords...

My very large dataset has been cleaned of URL patterns, prior to use in the modeling, like so:

df['cleantext'] = df.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)

This is similar to the example in your Colab notebook on Trump tweets. The model produced 50 topics, 23 of which contain 'https' or 'http' in the topic label. After fitting the model, I rechecked the cleaned data and could not find any rows matching 'http'. I'm perplexed as to how this could happen.

Should I add 'http' and 'https' to the stop words and completely retrain?

P.S. I am using the partial_fit() technique (with MiniBatchKMeans and OnlineCountVectorizer) to batch 500,000 documents at a time, since my dataset contains over 20M docs.

Answered by MaartenGr

MaartenGr · 2023-01-16T07:12:03Z
Maintainer

If the words 'https' and 'http' are not found in the documents on which you trained the model, then they cannot end up in the topic representations. That means something is probably going wrong in preprocessing, and the documents you pass to the model still contain these words. I would advise checking the input data to make sure these words are not present.

Could you also share all code for training your model? Perhaps we can find something there.
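A minimal sketch of the check suggested above, assuming the documents are available as a plain list of strings. The helper name `find_docs_with_tokens` and the sample documents are hypothetical and only for illustration. The important part is to run the check on the exact list that is passed to the model, not on a different column or copy of the data.

```python
import re

def find_docs_with_tokens(docs, tokens=("http", "https")):
    """Return (index, document) pairs whose text contains any token as a whole word."""
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, tokens)) + r")\b",
        re.IGNORECASE,
    )
    return [(i, doc) for i, doc in enumerate(docs) if pattern.search(doc)]

# Hypothetical sample: the second document holds a URL that was split
# into separate tokens instead of being removed.
docs = [
    "great rally tonight",
    "read more https t co abc",
    "thank you ohio",
]

hits = find_docs_with_tokens(docs)
print(hits)
```

If this returns any rows, the tokens are reaching the model through the input documents, and the preprocessing is the place to look.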

2 replies

salderma Jan 16, 2023
Author

After some research, I found the issue in the regexes used to clean the text.

I had followed/copied the methods used in your Trump Tweets Dynamic Topics Colab notebook:

import re
import pandas as pd
from datetime import datetime

# Load data
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')

# Filter
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()

but I was writing the output to a new column instead of back into the original column. Because each step still read from the original text column, only the last regex substitution took effect. I typically write a single function that performs all the cleaning regexes and use it with df.apply(), but your code looked a little more trim and concise, so I tried it without thinking much about it. 🤦‍♂️
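To make the pitfall concrete, here is a small sketch on a single string rather than a DataFrame. The sample tweet and the function name `clean_text` are made up for illustration. The three steps mirror the notebook's regexes. The "buggy" variant re-reads the raw text at every step, which is what happens when each step reads `row.text` but writes to a different column.

```python
import re

def clean_text(text):
    """Apply every cleaning step in sequence, each on the previous step's output."""
    # Drop URLs and lowercase.
    text = re.sub(r"http\S+", "", text).lower()
    # Drop @mentions.
    text = " ".join(w for w in text.split() if not w.startswith("@"))
    # Replace runs of non-letters with a space and normalize whitespace.
    text = " ".join(re.sub(r"[^a-zA-Z]+", " ", text).split())
    return text

# Hypothetical sample tweet.
raw = "Big news @foxnews! Read: https://t.co/abc123"

# The pitfall: each step reads the raw text again, so the results of the
# first two steps are overwritten and only the last substitution survives.
buggy = re.sub(r"http\S+", "", raw).lower()
buggy = " ".join(w for w in raw.split() if not w.startswith("@"))
buggy = " ".join(re.sub(r"[^a-zA-Z]+", " ", raw).split())

print(buggy)            # the URL is split into tokens, including "https"
print(clean_text(raw))  # the URL and the mention are removed
```

With the column names from the original post, a function like this could be applied in one pass, for example `df["cleantext"] = df["text"].apply(clean_text)`, so there is no intermediate column to mix up.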

MaartenGr Jan 17, 2023
Maintainer

Great to hear that you found out what was happening! If you run into any other issues, feel free to reach out.
