Questions on Topic Modeling #1
```r
speech_td %>%
  anti_join(stop_words, by = c(term = "word"))
```

You should get a data frame with the stop words removed. You can directly look at
A differential number of speeches per candidate isn't a bad thing. In your graphs, present the bars as the percentage of each candidate's speech allocated to each emotion, rather than as raw frequency counts. This will normalize for the total number of words in each candidate's corpus and let you compare relative affective content between candidates. You could use n-grams for topic modeling (not for sentiment analysis, at least not in any easy manner that can be done before the project is due), especially if key phrases or slogans are used repeatedly (#MAGA).
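A minimal sketch of that normalization, assuming a data frame of emotion counts per candidate (the `speech_sentiment` object and its `author`/`sentiment`/`n` columns are hypothetical names, not from the project):

```r
library(dplyr)
library(ggplot2)

# Hypothetical input: one row per candidate/emotion with raw counts.
speech_sentiment <- tibble::tribble(
  ~author,   ~sentiment, ~n,
  "Trump",   "anger",    40,
  "Trump",   "joy",      60,
  "Sanders", "anger",    10,
  "Sanders", "joy",      30
)

# Convert raw counts to within-candidate percentages.
sentiment_pct <- speech_sentiment %>%
  group_by(author) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup()

# Bars now show shares of each candidate's emotion words,
# so corpora of different sizes are directly comparable.
ggplot(sentiment_pct, aes(sentiment, pct, fill = author)) +
  geom_col(position = "dodge") +
  labs(y = "Percent of candidate's emotion words")
```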
I'll look through the n-gram literature in a minute.
But when I go to filter the document to make sure these words are no longer there, I get this error message:
I am not sure what is going on. Also, I am trying to get an average of this count: essentially, the number of times there is a pause for chanting or applauding, divided by the number of speeches given:
Sorry if this is vague; I can try to be more specific.
Also, is the seed variable just a random number generator, or is it something I am supposed to calculate?
The seed is basically a random number generator. Set it once at the beginning of the script. What exactly is the code you are using to merge? For the last question, I think this code will work (it does in my head at least):

```r
speech_corpus %>%
  group_by(author, docnumber) %>%
  filter(word == "applause" | word == "cheers") %>%
  count() %>%
  group_by(author) %>%
  mutate(n_per_speech = n / n())
```
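If the goal is a single average per candidate (pauses divided by number of speeches), a variant that collapses to one row per author might look like this; the column names are taken from the thread, but the toy `speech_corpus` data here is invented for illustration:

```r
library(dplyr)

# Toy stand-in for the tokenized corpus: one row per word.
speech_corpus <- tibble::tibble(
  author    = c("Trump", "Trump", "Trump", "Sanders"),
  docnumber = c(1, 1, 2, 1),
  word      = c("applause", "cheers", "applause", "applause")
)

# Average number of applause/cheer pauses per speech, per candidate.
speech_corpus %>%
  filter(word %in% c("applause", "cheers")) %>%
  group_by(author) %>%
  summarize(pauses     = n(),
            n_speeches = n_distinct(docnumber),
            per_speech = pauses / n_speeches)
```

Note this only counts speeches that contain at least one pause; dividing by a speech count taken from the full corpus would include silent speeches in the denominator as well.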
Thank you for the assistance.
However, I keep getting the contractions in the Sanders result. Is it a problem with the apostrophes? I even tried adding spaces before and after the words to see if that would make a difference, but it didn't. Also, I am working on cleaning up my web scraping process, and I just can't seem to get it to work; I think at this point I have just been staring at it too long. Here is the code:
Also, I just wanted to say thank you sooo much for all your assistance today. You really have taught me a lot, and I am truly appreciative of the continued guidance.
On the first issue: if you stick to topic modeling or predicting the candidate based on their text, contractions are not a problem. If you want to do sentiment analysis, check out the
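One common workaround (an assumption on my part, not necessarily the fix discussed here) is simply to drop contraction tokens before joining against a sentiment lexicon, since most lexicons have no entries for them. Scraped web text often uses the typographic apostrophe `’` rather than `'`, which may be why adding spaces around the words made no difference:

```r
library(dplyr)
library(stringr)

tokens <- tibble::tibble(word = c("don't", "great", "it’s", "terrible"))

# Drop any token containing either style of apostrophe
# before joining against a sentiment lexicon.
tokens %>%
  filter(!str_detect(word, "['’]"))
```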
```r
mutate(trump_text_url = str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x))
```

This is your problem. Also, you're making it inefficient by creating a separate vector for

```r
library(tidyverse)
library(rvest)
library(stringr)
library(tidytext)

get_Trump_speeches <- function(x){
  # build the URL for this speech id
  trump_text_url <- str_c("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)

  # speech text: one element per paragraph
  df1 <- read_html(trump_text_url) %>%
    html_nodes("p") %>%
    html_text()

  # date of the speech
  df2 <- read_html(trump_text_url) %>%
    html_node(".docdate") %>%
    html_text()

  # one row per paragraph, with author/date metadata attached
  speech <- data_frame(text = df1) %>%
    mutate(author = "Trump",
           parnumber = row_number(),
           date = df2) %>%
    separate(date, into = c("date2", "year"), sep = ",") %>%
    separate(date2, into = c("month", "day"), sep = " ")

  # tokenize to one row per word
  speech <- unnest_tokens(speech, word, text, token = "words")
  return(speech)
}

# does it work for a single speech?
get_Trump_speeches(119182)

# okay, let's do it for all speeches
## store ids as a numeric vector because they are numbers
x <- c(119182, 119181, 119188, 119187, 119186, 119185, 119184, 119183,
       119174, 119172, 119180, 119173, 119170, 119169, 119168, 119167,
       119166, 119179, 119202, 119201, 119200, 119203, 119191, 119189,
       119192, 119207, 119208, 119209, 119190, 119206, 119206, 119193,
       119205, 119178, 119204, 119194, 119195, 119177, 119197, 119199,
       119198, 119196, 119176, 119175, 119165, 119503, 117935, 117791,
       117815, 117790, 117775, 117813, 116597)

# now let's use map_df to iterate over all of them and create an id variable
speeches <- map_df(x, get_Trump_speeches, .id = "docnumber")
```
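The `.id = "docnumber"` argument is what creates the per-speech identifier. A self-contained sketch with a toy function in place of the live scraper (so it runs offline; `fake_speech` and its columns are invented for illustration):

```r
library(purrr)

# Stand-in for get_Trump_speeches(): returns a few "words" per id.
fake_speech <- function(id) {
  tibble::tibble(word = paste0("word", 1:2), pid = id)
}

# .id = "docnumber" adds a column recording which element of the
# input vector each row came from ("1", "2", ...), exactly as in
# the real map_df() call above.
speeches <- map_df(c(119182, 119181), fake_speech, .id = "docnumber")
```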
Hey Dr. Soltoff. I committed and pushed an R Markdown titled Rough Draft. It is all the code on my project up to this point. I looked through it and took notes on how to improve it and where to make things clearer. If you have a chance, I would greatly appreciate feedback on it. I am not sure if you can look at it simply in the repo or if I need to make a pull request. Thanks.
Is there a way to include multiple position arguments, such as jitter and dodge? Also, is there a way to have the position dodge determined by a particular variable?
Also, I was wondering if you could help me figure out the function with which I would mutate the data in order to get the percentage of sentiment at a given time. I am not sure if that makes sense; I'll try to figure out what it is I am trying to look at: the sentiment from each author according to each month, as it changes over time. I'll play around with this, and if you have any ideas or suggestions, I'm all ears.
I'm not sure why you would want to use dodge and jitter on the same layer. As for the second question: what is the denominator for the percentage? Right now you have summarized it by adding all the sentiment scores, so the negative and positive values cancel out. In order to create a percentage, you need a numerator and a denominator. The numerator would be the aggregated sentiment score, but I don't know conceptually what the denominator should be.
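One defensible choice (a suggestion of mine, not from the thread) is to use all sentiment-bearing words as the denominator, giving the share of positive words per author per month. The `scored` data frame and its columns here are hypothetical:

```r
library(dplyr)

# Hypothetical counts of positive/negative words per author-month.
scored <- tibble::tibble(
  author    = c("Trump", "Trump", "Sanders", "Sanders"),
  month     = c("Jan",   "Jan",   "Jan",     "Jan"),
  sentiment = c("positive", "negative", "positive", "negative"),
  n         = c(30, 10, 20, 20)
)

# Numerator: positive words; denominator: all sentiment-bearing words.
scored %>%
  group_by(author, month) %>%
  summarize(pct_positive = sum(n[sentiment == "positive"]) / sum(n),
            .groups = "drop")
```

Unlike a summed score, this stays between 0 and 1, so it can be plotted over time and compared across candidates.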
I am having trouble rendering the site. I thought everything was in order: I have the markdowns and the YAML files I took from the tutorial, yet I get this error:
I was able to knit my markdowns earlier as well, but now this error pops up too.
This happens after running
This happened after I tried to update my tabs in the YAML file.
Copy and paste the YAML content you tried to add here.
You need to add spaces between
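For instance, YAML parsing fails without a space after each colon (`text:"Home"` is an error; `text: "Home"` is valid). A minimal navbar sketch in the R Markdown site style; the tab names and files here are placeholders, not the actual ones from this project:

```yaml
navbar:
  title: "My Project"
  left:
    - text: "Home"
      href: index.html
    - text: "Analysis"
      href: analysis.html
```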
Alright, I updated that, and I get the same error message when running
Make sure the YAML file is saved as
Duh, I need to save it. Do you know which plot it is, so I can take a look at it while this thing is rendering?
Nope; because your chunks are so large, there are multiple plots in each one. Break it into smaller chunks, and then when you try to knit the document it will tell you which chunk caused the error.
Okie doke, will do.
Not sure if this is the correct place to post a question, but here it goes:
I was reviewing the topic modeling code we went over in class while trying to figure out how to write the code for my final project.
What is the control argument doing, exactly? The help file just says it controls the parameters.
Conceptually, my data is already in a tidy text format. Is it redundant for me to run it through the cast_dtm function, or do I need cast_dtm in order to pass a document-term matrix to the LDA function?
So, this chunk of code:
It organizes the terms with the highest beta within each topic, and this is how we determine the topics, by looking at the words with the highest beta?
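For what it's worth, the pieces fit together roughly like this (a sketch assuming the tidytext + topicmodels workflow; `word_counts` and its `document`/`word`/`n` columns are assumed names, not from the class code). `cast_dtm()` is not redundant: `LDA()` expects a `DocumentTermMatrix`, not a tidy data frame. And `control = list(seed = ...)` only pins the random initialization so the fitted topics are reproducible from run to run:

```r
library(tidytext)
library(topicmodels)
library(dplyr)

# tidy counts -> DocumentTermMatrix: required conversion for LDA().
speech_dtm <- word_counts %>%          # columns: document, word, n (assumed)
  cast_dtm(document, word, n)

# Fit the model; the seed fixes the RNG for reproducible topics.
speech_lda <- LDA(speech_dtm, k = 4, control = list(seed = 1234))

# Back to tidy form: one row per topic-term pair, where beta is the
# per-topic probability of the term. The highest-beta terms are what
# we read to interpret and label each topic.
top_terms <- tidy(speech_lda, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  arrange(topic, desc(beta))
```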