Questions on Topic Modeling #1
```r
speech_td %>%
  anti_join(stop_words, by = c(term = "word"))
```

You should get a data frame with the stop words removed. You can directly look at
A differential number of speeches per candidate isn't a bad thing. In your graphs, present the bars as the percentage of each candidate's speech allocated to each emotion, rather than as raw frequency counts. This will normalize for the total number of words in each candidate's corpus and let you compare relative affective content between candidates. You could use n-grams for topic modeling (not for sentiment analysis, at least not in any easy manner that can be done before the project is due), especially if key phrases or slogans are used repeatedly (#MAGA).
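A minimal sketch of that normalization, assuming a data frame of emotion counts per candidate (the `speech_sentiment` object and its `author`/`sentiment`/`n` columns are hypothetical names, not from the project):

```r
library(dplyr)
library(ggplot2)

# Hypothetical input: one row per candidate/emotion with raw counts.
speech_sentiment <- tibble::tribble(
  ~author,   ~sentiment, ~n,
  "Trump",   "anger",    40,
  "Trump",   "joy",      60,
  "Sanders", "anger",    10,
  "Sanders", "joy",      30
)

# Convert raw counts to within-candidate percentages.
sentiment_pct <- speech_sentiment %>%
  group_by(author) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup()

# Bars now show shares of each candidate's emotion words,
# so corpora of different sizes are directly comparable.
ggplot(sentiment_pct, aes(sentiment, pct, fill = author)) +
  geom_col(position = "dodge") +
  labs(y = "Percent of candidate's emotion words")
```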
I'll look through the n-gram literature in a minute.
But when I go to filter the document to make sure these words are no longer there, I get this error message:
I am not sure what is going on. Also, I am trying to get an average of this count: essentially, the number of times there is a pause for chanting or applauding, divided by the number of speeches given:
Sorry if this is vague; I can try to be more specific.
Also, is the seed variable just a random number generator, or is it something I am supposed to calculate?
The seed is basically a random number generator. Set it once at the beginning of the script. What exactly is the code you are using to merge? For the last question, I think this code will work (it does in my head at least):

```r
speech_corpus %>%
  group_by(author, docnumber) %>%
  filter(word == "applause" | word == "cheers") %>%
  count() %>%
  group_by(author) %>%
  mutate(n_per_speech = n / n())
```
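If the goal is a single average per candidate (pauses divided by number of speeches), a variant that collapses to one row per author might look like this; the column names are taken from the thread, but the toy `speech_corpus` data here is invented for illustration:

```r
library(dplyr)

# Toy stand-in for the tokenized corpus: one row per word.
speech_corpus <- tibble::tibble(
  author    = c("Trump", "Trump", "Trump", "Sanders"),
  docnumber = c(1, 1, 2, 1),
  word      = c("applause", "cheers", "applause", "applause")
)

# Average number of applause/cheer pauses per speech, per candidate.
speech_corpus %>%
  filter(word %in% c("applause", "cheers")) %>%
  group_by(author) %>%
  summarize(pauses     = n(),
            n_speeches = n_distinct(docnumber),
            per_speech = pauses / n_speeches)
```

Note this only counts speeches that contain at least one pause; dividing by a speech count taken from the full corpus would include silent speeches in the denominator as well.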
Thank you for the assistance.
However, I keep getting the contractions in the Sanders result. Is it a problem with the apostrophes? I even tried adding spaces before and after the words to see if that would make a difference, but it didn't. Also, I am working on cleaning up my web scraping process, and I just can't seem to get it to work; I think at this point I have just been staring at it too long. Here is the code:
Also, I just wanted to say thank you sooo much for all your assistance today. You really have taught me a lot, and I am truly appreciative of the continued guidance.
On the first issue: if you stick to topic modeling or predicting the candidate based on their text, contractions are not a problem. If you want to do sentiment analysis, check out the
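One common workaround (an assumption on my part, not necessarily the fix discussed here) is simply to drop contraction tokens before joining against a sentiment lexicon, since most lexicons have no entries for them. Scraped web text often uses the typographic apostrophe `’` rather than `'`, which may be why adding spaces around the words made no difference:

```r
library(dplyr)
library(stringr)

tokens <- tibble::tibble(word = c("don't", "great", "it’s", "terrible"))

# Drop any token containing either style of apostrophe
# before joining against a sentiment lexicon.
tokens %>%
  filter(!str_detect(word, "['’]"))
```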
```r
mutate(trump_text_url = str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x))
```

This is your problem. Also, you're making it inefficient by creating a separate vector for

```r
library(tidyverse)
library(rvest)
library(stringr)
library(tidytext)

get_Trump_speeches <- function(x){
  # build the URL for this speech id
  trump_text_url <- str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)

  # speech text: one element per paragraph
  df1 <- read_html(trump_text_url) %>%
    html_nodes("p") %>%
    html_text()

  # date of the speech
  df2 <- read_html(trump_text_url) %>%
    html_node(".docdate") %>%
    html_text()

  # one row per paragraph, with author/date metadata attached
  speech <- data_frame(text = df1) %>%
    mutate(author = "Trump",
           parnumber = row_number(),
           date = df2) %>%
    separate(date, into = c("date2", "year"), sep = ",") %>%
    separate(date2, into = c("month", "day"), sep = " ")

  # tokenize to one row per word
  speech <- unnest_tokens(speech, word, text, token = "words")
  return(speech)
}

# does it work for a single speech?
get_Trump_speeches(119182)

# okay, let's do it for all speeches
## store ids as a numeric vector because they are numbers
x <- c(119182, 119181, 119188, 119187, 119186, 119185, 119184, 119183,
       119174, 119172, 119180, 119173, 119170, 119169, 119168, 119167,
       119166, 119179, 119202, 119201, 119200, 119203, 119191, 119189,
       119192, 119207, 119208, 119209, 119190, 119206, 119206, 119193,
       119205, 119178, 119204, 119194, 119195, 119177, 119197, 119199,
       119198, 119196, 119176, 119175, 119165, 119503, 117935, 117791,
       117815, 117790, 117775, 117813, 116597)

# now let's use map_df to iterate over all of them and create an id variable
speeches <- map_df(x, get_Trump_speeches, .id = "docnumber")
```
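The `.id = "docnumber"` argument is what creates the per-speech identifier. A self-contained sketch with a toy function in place of the live scraper (so it runs offline; `fake_speech` and its columns are invented for illustration):

```r
library(purrr)

# Stand-in for get_Trump_speeches(): returns a few "words" per id.
fake_speech <- function(id) {
  tibble::tibble(word = paste0("word", 1:2), pid = id)
}

# .id = "docnumber" adds a column recording which element of the
# input vector each row came from ("1", "2", ...), exactly as in
# the real map_df() call above.
speeches <- map_df(c(119182, 119181), fake_speech, .id = "docnumber")
```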
Hey Dr. Soltoff. I committed and pushed an R Markdown titled Rough Draft. It is all the code on my project up to this point. I looked through it and took notes on how to improve it and where to make things clearer. If you have a chance, I would greatly appreciate feedback on it. I am not sure if you can look at it simply in the repo or if I need to make a pull request. Thanks.
Is there a way to include multiple position arguments, such as jitter and dodge? Also, is there a way to have the position dodge determined by a particular variable?
Also, I was wondering if you could help me figure out the function with which I would mutate the data in order to get the percentage of sentiment at a given time. I am not sure if that makes sense; I'll try to figure out what it is I am trying to look at: the sentiment from each author according to each month, as it changes over time. I'll play around with this, and if you have any ideas or suggestions, I'm all ears.
I'm not sure why you would want to use dodge and jitter on the same layer. As for the second question: what is the denominator for the percentage? Right now you have summarized it by adding all the sentiment scores, so the negative and positive values cancel out. In order to create a percentage, you need a numerator and a denominator. The numerator would be the aggregated sentiment score, but I don't know conceptually what the denominator should be.
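One defensible choice (a suggestion of mine, not from the thread) is to use all sentiment-bearing words as the denominator, giving the share of positive words per author per month. The `scored` data frame and its columns here are hypothetical:

```r
library(dplyr)

# Hypothetical counts of positive/negative words per author-month.
scored <- tibble::tibble(
  author    = c("Trump", "Trump", "Sanders", "Sanders"),
  month     = c("Jan",   "Jan",   "Jan",     "Jan"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(30, 10, 20, 20)
)

# Numerator: positive words; denominator: all sentiment-bearing words.
scored %>%
  group_by(author, month) %>%
  summarize(pct_positive = sum(n[sentiment == "positive"]) / sum(n),
            .groups = "drop")
```

Unlike a summed score, this stays between 0 and 1, so it can be plotted over time and compared across candidates.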
I am having trouble rendering the site. I thought everything was in order: I have the markdowns and the YAML files I took from the tutorial, yet I get this error:
I was able to knit my markdowns earlier as well, but now this error pops up too.
This happens after running
This happened after I tried to update my tabs in the YAML file.
Copy and paste the YAML content you tried to add here.
You need to add spaces between
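For instance, YAML parsing fails without a space after each colon (`text:"Home"` is an error; `text: "Home"` is valid). A minimal navbar sketch in the R Markdown site style; the tab names and files here are placeholders, not the actual ones from this project:

```yaml
navbar:
  title: "My Project"
  left:
    - text: "Home"
      href: index.html
    - text: "Analysis"
      href: analysis.html
```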
Alright, I updated that, and I get the same error message when running
Make sure the YAML file is saved as
Duh, I need to save it. Do you know which plot it is, so I can take a look at it while this thing is rendering?
Nope; because your chunks are so large, there are multiple plots in each one. Break it into smaller chunks, and then when you try to knit the document it will tell you which chunk caused the error.
Okie doke, will do.
Not sure if this is the correct place to post a question, but here it goes:
I was reviewing the topic modeling code we went over in class while trying to figure out how to write the code for my final project.
What is the control argument doing, exactly? The help file just says it controls the parameters.
Conceptually, my data is already in a tidy text format. Is it redundant for me to run it through the cast_dtm function, or do I need cast_dtm in order to pass a document-term matrix to the LDA function?
So, this chunk of code:
It organizes the terms with the highest beta within each topic, and this is how we determine the topics, by looking at the words with the highest beta?
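For what it's worth, the pieces fit together roughly like this (a sketch assuming the tidytext + topicmodels workflow; `word_counts` and its `document`/`word`/`n` columns are assumed names, not from the class code). `cast_dtm()` is not redundant: `LDA()` expects a `DocumentTermMatrix`, not a tidy data frame. And `control = list(seed = ...)` only pins the random initialization so the fitted topics are reproducible from run to run:

```r
library(tidytext)
library(topicmodels)
library(dplyr)

# tidy counts -> DocumentTermMatrix: required conversion for LDA().
speech_dtm <- word_counts %>%          # columns: document, word, n (assumed)
  cast_dtm(document, word, n)

# Fit the model; the seed fixes the RNG for reproducible topics.
speech_lda <- LDA(speech_dtm, k = 4, control = list(seed = 1234))

# Back to tidy form: one row per topic-term pair, where beta is the
# per-topic probability of the term. The highest-beta terms are what
# we read to interpret and label each topic.
top_terms <- tidy(speech_lda, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  arrange(topic, desc(beta))
```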