This document sketches a tentative development outline of a prospective R package 🚀

- 📁 You can clone it from here.
- 📈 The flowchart is in the `flowchart` folder (.mm format, edit it with Docear).
- 📜 The statistical resources are in `references`.
- 🔍 The original functions are in `base-functions`.
- 🔥 Our work is in `seaglass-functions`.
- 🌟 Evaluate documentation badges (for later): here.
- 🚸 We need to read up on the difference between S3 and S4 classes in R (example).
- 💡 You can push commits from your device as a collaborator, or from the VM using the terminal window within RStudio. I configured SSH access (should work):

```sh
git commit -am "vm 11/01/2021" # check branching
git remote add origin [email protected]:jtmart/seaglass.git
```
☑️ Here is a technical outline of the five intended functions and one test dataset (Modi's Mann Ki Baat addresses, because the corpus is not too voluminous).
- 📦 What? `worduse` generates a new dfm from an original dfm. It computes how likely it is for words in a sub-corpus to be overrepresented or underrepresented when compared to the entire corpus.
- 💎 Why? `worduse` is a simple yet powerful function to identify the lexicon specific to a speaker, an institution, a period, etc.
- 👽 Alt name: pebble
- ⚡ Requires: `quanteda` and (probably) `dplyr`
- 🔮 Usage of `worduse`:

```r
output <- worduse(adfm,
                  method = c("score", "prop", "chi2", "exact", "lr", "pmi"),
                  output = c("dfm", "data.frame", "matrix", "tripletlist", "lda",
                             "tm", "stm", "austin", "topicmodels"),
                  # sort = c("descending", "ascending", "none"),
                  correction = c("default", "yates", "williams", "none"),
                  na.rm = TRUE,
                  group = NULL,
                  parallel = TRUE,
                  verbose = TRUE,
                  target = 1L)
```
- 🎨 Arguments of `worduse`:
  - adfm: a dfm containing the words (tokens/features) to be examined for overrepresentation/underrepresentation (we call that representation; in quanteda they call it keyness).
  - method: association measure to be used for computing representation (more below). `score` is the default if nothing is specified.
  - sort: sorts the scored features in descending order of the measure, in ascending order, or leaves them in the original feature order. `descending` is the default if nothing is specified.
  - correction: if `"default"`, the Yates correction is applied to `"chi2"`, the Williams correction is applied to `"lr"`, and no correction is applied to the `"score"`, `"frequency"`, `"exact"` and `"pmi"` measures. Specifying a value other than the default overrides these defaults, for instance to apply the Williams correction to the chi2 measure. Specifying a correction for the `"exact"` and `"pmi"` measures has no effect and produces a warning.
  - na.rm: logical; if `TRUE`, missing values (including NaN) are omitted from the calculations. `TRUE` is the default if nothing is specified.
  - group: calls the `dfm_group` function described here. `NULL` is the default.
  - parallel: logical; if `TRUE`, all cores are used for computation. `TRUE` is the default if nothing is specified.
  - verbose: logical; provides some sense of the progress of the computation.
  - target: the document index (numeric, character or logical) identifying the document forming the "target" for computing keyness; all other documents' feature frequencies will be combined for use as a reference.
- 📏 Description of the `method` parameters: calculate the representation of a word being present f times in a sub-corpus of t words, given that it appears a total of F times in a whole corpus of T words. The function calculates representation in six different ways.
  - Default is `score`, i.e. the specificity (or association, or surprise) score. 👉 See Leon pp. 80- and the TXM manual pp. 95- for the statistical model (other relevant material is in French only). For other kinds of calculations see Tribble and Bondi and J93. A rough sketch of this calculation is drafted after this list.
    - The existing R family of functions is called `specificities`, in the package `textometry`, available here and here.
    - To open the functions in RStudio:

```r
library(textometry)
View(specificities.distribution.plot)  # for an overview of the calculation
View(specificities.probabilities)      # for an overview of the implementation in a dfm (see below)
# ❗ the object "lexicaltable" is a data frame produced by the software TXM: words are in rows
# and variables in columns. What we want is a transposed version with variables in rows and
# words in columns. The data type should be dfm.
```
  - The first alternative is `prop` (relative frequencies, aka proportions); see the sketch after this list.
    - The idea is to reuse the function `adorn_percentages` of the package janitor, available here, and apply it to a dfm object. No need to create an additional dependency: we could take the function and rewrite our own.
    - Outputs are going to be very small numbers: should we implement a multiplier?
  - Other alternatives can be taken from quanteda's function `textstat_keyness`, available here. ✅ Idem for the arguments `correction` and `target`: we can replicate that function.
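Below is a rough, non-authoritative sketch of the `score` (specificity) measure, assuming the hypergeometric model used by `textometry::specificities`; the function name and the signed log10 convention are placeholders rather than the final implementation:

```r
# f = occurrences of the word in the sub-corpus, t = sub-corpus size,
# F = occurrences in the whole corpus, T = whole corpus size (mirroring the notation above)
specificity_score <- function(f, t, F, T) {
  # probability of seeing f or more occurrences under the hypergeometric model
  p_over  <- stats::phyper(f - 1, F, T - F, t, lower.tail = FALSE)
  # probability of seeing f or fewer occurrences
  p_under <- stats::phyper(f, F, T - F, t, lower.tail = TRUE)
  # signed log10 score: positive = overrepresented, negative = underrepresented
  if (p_over < p_under) -log10(p_over) else log10(p_under)
}

# example: a word seen 30 times in a 10,000-word sub-corpus and 100 times in a 1,000,000-word corpus
specificity_score(f = 30, t = 10000, F = 100, T = 1e6)
```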
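And a minimal sketch of the `prop` route that stays inside quanteda (so no janitor dependency); `dfm_weight()` is quanteda's own relative-frequency weighting, and the multiplier only illustrates the open question above:

```r
library(quanteda)
adfm  <- dfm(tokens(data_corpus_inaugural))    # stand-in input dfm
props <- dfm_weight(adfm, scheme = "prop")     # relative frequency of each feature per document
props_per_10k <- props * 10000                 # optional multiplier to avoid very small numbers
```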
- 📐 Description of the `output` parameters:
  - Default is `dfm`, which means that no additional computation is needed.
  - Alternative parameters will be covered by calling the `convert` function of quanteda, documented here. We need to pipe the computation like this: `original-dfm %>% new-dfm %>% converted-dfm` (ideally without having to call the `dplyr`/`tidyverse` package). A sketch of this step follows below.
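A minimal sketch of that conversion step, assuming the new dfm has already been computed; `convert()` is quanteda's converter, and R's native `|>` pipe (R >= 4.1) avoids the dplyr/tidyverse dependency:

```r
library(quanteda)
newdfm <- dfm(tokens(data_corpus_inaugural[1:3]))   # stand-in for the dfm computed by worduse
out <- newdfm |> convert(to = "data.frame")
# other targets: "matrix", "tripletlist", "tm", "stm", "topicmodels", "austin", "lda"
```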
- 📦 What? `wordcontext` returns a list of category words associated with another list of category words (a dictionary) in a dfm.
- 💎 Why? `wordcontext` identifies words related to a dictionary based on their distance or association. It acts as a contextualizer of generic seed words proxying a particular theme, emotion, ideology, psychological state, etc.
- 👽 Alt name: seastar
- ⚡ Requires: `quanteda` and (again, probably) `dplyr`
- 🔮 Usage of `wordcontext`:

```r
output <- wordcontext(adfm, adictionary,
                      method = c("txm", "mi", "dice", "ll", "context.xxx"),
                      sort = c("descending", "ascending", "none"),
                      threshold = NULL,  # or an integer
                      window = NULL,     # or an integer
                      na.rm = TRUE,
                      parallel = TRUE,
                      verbose = TRUE)
```
- 🎨 Arguments of `wordcontext`:
  - adfm: ibid. (see supra).
  - adictionary: a nested list built using the `dictionary` function of quanteda (here); a sketch is drafted after this list. 💥 A few caveats here:
    - For testing purposes, we should make this function work with character vectors first. Then we can generalize.
    - We have a quasi-working python script (thanks to Nirav) to turn a CSV into a YML dictionary. We could consider enabling the integration of the raw CSV (for later).
  - method: significance measurement method used to extract semantic links between words. `txm` is the default if nothing is specified (even though it will only be developed after the others).
  - sort: ibid. (see supra).
  - threshold: a cut-off point defined by the user. `NULL` (no cut-off) is the default if nothing is specified.
  - window: how many words from both sides (left and right) of a word are checked for cooccurrence. `NULL` (no window) is the default if nothing is specified. 🚩 We need an integer here when using the `context.xxx` parameter of the `method` argument. We might have to stick to `NULL` for the other parameters.
  - na.rm: ibid. (see supra).
  - parallel: ibid. (see supra).
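A minimal sketch of what `adictionary` could look like, built with quanteda's `dictionary()`; the category names and seed words are made-up examples (the character-vector case we would support first is just a single category):

```r
library(quanteda)
adictionary <- dictionary(list(
  certainty   = c("definitely", "absolute*", "always"),  # glob patterns are allowed
  uncertainty = c("maybe", "perhaps", "doubt*")
))
adictionary
```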
- 📏 Description of the `method` parameters: in order to not only count joint occurrences, we have to determine their significance. Different significance measures can be used.
  - For `mi`, `dice` and `ll` we can implement this tutorial. I believe it is the easiest way, so we can start with that (a toy sketch of one such measure follows this list).
  - For `context.xxx` we can implement this tutorial, but instead of nesting the function `textstat_keyness` we are going to nest our fresh `worduse` one 🥳! `xxx` has to be one of the `method` parameters of the function `worduse`. I copy-paste an example taken from a recent work I did (don't pay attention to object names):

```r
# remove stopwords, then punctuation and symbols
nostoptokssubworkcorpus <- tokens_select(ngramstokssubworkcorpus, pattern = stopwords('en'),
                                         selection = 'remove', case_insensitive = TRUE)
notpuncnostoptokssubworkcorpus <- nostoptokssubworkcorpus %>%
  tokens_remove('[\\p{P}\\p{S}]', valuetype = 'regex', padding = TRUE)

# seed words proxying the theme of interest
v2.1 <- c("abnormal", "abnormality", "absolute", "absolutely", "accept", "acceptance", "accepted",
  "accepting", "accepts", "accountability", "accurate", "accurately", "acknowledge", "acknowledge",
  "activate", "activate", "actually", "adjust", "adjusting", "adjusting", "admit", "admits",
  "admitted", "admitting", "affect", "affected", "affecting", "affects", "afterthought",
  "afterthoughts", "against", "aggravate", "aggravates", "aggravating", "aggravated", "ain't",
  "aint", "all", "allot", "almost", "allow", "allows", "allowing", "allowed", "alot", "alternative",
  "alternatives", "although", "altogether", "always", "ambiguous", "ambiguity", "ambiguity",
  "analysis", "analyses", "analytical", "analytic", "answer", "answers", "any", "anybody", "anyhow",
  "anyone", "anyones", "anything", "anytime", "anywhere", "apart", "apparent", "apparently",
  "appear", "appeared", "appearing", "appears", "appreciate", "appreciated", "apprehensive",
  "approximate", "approximated", "approximation", "approximatively", "arbitrary", "aren't", "arent",
  "assume", "assumes", "assure", "assures", "assurance", "attention", "attentive", "attentionate",
  "attribute", "attributes", "aware", "awareness", "barely", "based", "basis", "bc", "became",
  "because", "become", "becomes", "becoming", "belief", "beliefs", "believe", "believed",
  "believes", "believing", "besides", "bet", "bets", "betting", "blatant", "blatantely", "blur",
  "blurred", "bosses", "but", "can't", "cannot", "cant", "category", "categories", "cause",
  "caused", "causes", "causing", "certain", "chance", "chances", "change", "change", "changed",
  "changes", "changing", "choice", "choices", "choose", "chooses", "clarify", "clarified",
  "clarification", "clear", "clearly", "closure", "clue", "coherent", "coherence", "commit",
  "commited", "commitment", "commitments", "commits", "committed", "committing", "compel",
  "compels", "compelling", "complete", "completed", "completely", "completes", "complex",
  "complexity", "compliance", "compliant", "complicate", "complicated", "complicates",
  "complicating", "complication", "complications", "complied", "complies", "comply", "complies",
  "complying", "comprehsive", "comprehend", "comprehending", "concentrate", "concentrating",
  "concentrated", "conclude", "concluded", "concluding", "conclusion", "conclusions", "concluded",
  "conclusive", "confess", "confessed", "confession", "confidence", "confident", "confidently",
  "confuse", "confused", "confuses", "confusing", "confusion", "confusions", "conscious",
  "consciously", "consequence", "consequences", "consider", "consideration", "considered",
  "considering", "considers", "contemplate", "contemplating", "contingent", "control", "convince",
  "convinces", "convinced", "convincing", "correct", "correction", "corrections", "correlate",
  "correlates", "correlation", "cos", "could", "could've", "couldn't", "couldnt", "couldve", "coz",
  "create", "created", "creates", "creating", "creation", "creations", "creative", "creativity",
  "curious", "curiosity", "curiosly", "cuz", "deceive", "deceives", "deceiving", "decide",
  "decides", "deciding", "decided", "decides", "deciding", "decision", "decisions", "decisive",
  "deduction", "deductive", "deductions", "deductively", "define", "defined", "defines", "defining",
  "definite", "definitely", "definition", "definitive", "definitively", "depend", "depended",
  "depending", "depends", "desire", "desires", "desirable", "despite", "determination", "determine",
  "determined", "determines", "determining", "diagnose", "diagnoses", "diagnosed", "diagnosis",
  "didn't", "didnt", "differ", "differed", "difference", "differences", "different", "differential",
  "differentiation", "differentiated", "differently", "differing", "differs", "directly", "discern",
  "discerns", "discerning", "disclose", "disclosed", "disclosing", "discover", "discovers",
  "discovering", "disillusion", "disillusions", "disorient", "disorients", "disorienting",
  "dissimilar", "distinct", "distinctive", "distinguish", "distinguishes", "distinguishing",
  "distract", "distracted", "distraction", "doubt", "doubts", "doubting", "dreams", "dubious")

# keep (resp. drop) the tokens falling within a 10-word window around the seed words
toksv2.1 <- tokens_keep(notpuncnostoptokssubworkcorpus, phrase(v2.1), window = 10, valuetype = "fixed")
toksnov2.1 <- tokens_remove(notpuncnostoptokssubworkcorpus, phrase(v2.1), window = 10, valuetype = "fixed")
dfmattoksv2.1 <- dfm(toksv2.1)
# head(toksnov2.1)
dfmattoksv2.1perpm <- dfm_group(dfmattoksv2.1, groups = "loc")
dfmattoksnov2.1 <- dfm(toksnov2.1)
dfmattoksnov2.1perpm <- dfm_group(dfmattoksnov2.1, groups = "loc")

# keyness of the in-window vocabulary against the out-of-window vocabulary
tstatkeyv2.1 <- textstat_keyness(rbind(dfmattoksv2.1perpm, dfmattoksnov2.1perpm),
                                 seq_len(ndoc(dfmattoksv2.1perpm)))
tstatkeyv2.1subset <- tstatkeyv2.1[tstatkeyv2.1$n_target > 10, ]
# head(tstatkeyv2.1subset, 1000)

# drop the seed words themselves from the result
bon <- c()
bon2 <- tstatkeyv2.1subset
for (j in 1:length(v2.1)) {
  bon <- c(bon, which(tstatkeyv2.1subset[, 1] == v2.1[j]))
}
bon <- sort(bon, decreasing = TRUE)
bon2 <- bon2[-c(bon), ]
```
  - For `txm` we have to somehow implement the code flow of point 1 with the `ll` parameter (details here), except that instead of the default log base we have to take a log base 10, as explained here. Additional hints here and maybe around here.
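A toy, hedged sketch of one such significance measure (the Dice coefficient) computed from a quanteda feature co-occurrence matrix; `mi` and `ll` would follow the same pattern with different formulas, and the word pair is only an example:

```r
library(quanteda)
toks  <- tokens_tolower(tokens(data_corpus_inaugural, remove_punct = TRUE))
coocs <- fcm(toks, context = "window", window = 5, tri = FALSE)  # co-occurrence counts
freqs <- colSums(dfm(toks))                                      # marginal frequencies

dice <- function(word_a, word_b) {
  2 * coocs[word_a, word_b] / (freqs[[word_a]] + freqs[[word_b]])
}
dice("people", "government")
```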
- 📦 What? `wordclusters` generates latent semantic contexts, and outputs from a `corpus` (not a `dfm` this time) two sets of contributions: contributions of words to topics, and contributions of topics to sub-corpuses.
- 💎 Why? `wordclusters` helps identify the lexicon specific to a speaker, an institution, a period, etc.
- 👽 Alt name: atoll
- ⚡ Requires: `quanteda`, `stm`... for visualisation we additionally need `dplyr`, `ggplot2`, `ca` (alternatives to `stm` are `topicmodels` and `cluster`)
- 🔮 Usage of `wordclusters`:

```r
output <- wordclusters(acorpus,
                       method = c("chd", "lda"),
                       k = FALSE,             # or an integer
                       segmentation = FALSE,  # or an integer
                       minfreq = NULL,        # a proportion in [0, 1]; default to be decided
                       maxfreq = NULL,        # a proportion in [0, 1]; default to be decided
                       cleaning = c("ocr", "numbers", "punctuation", "symbols", "twitter",
                                    "url", "hyphens", "docvars", "stem", "clean",
                                    "verbose", "stopwords", "tolower"),  # see NB1
                       no.words = 20,         # an integer
                       sort.gamma = c("descending", "ascending", "none"),
                       plot = NULL,           # or c("ca", "dendrogram")
                       plot.param = 1,        # an integer in 1:3
                       window = NULL,         # or an integer
                       parallel = TRUE,
                       na.rm = TRUE,
                       verbose = TRUE)
```
- NB1: an upcoming tutorial for this cluster should mention cleaning procedures for the corpus ahead of calling the function. Examples can be found here, here, here, here (I have also done some cleaning and can provide my method). Otherwise, we can be ambitious and integrate the cleaning into the `cleaning` argument. The options will be wrappers around various parameters of four quanteda functions: `tokens`, `tokens_select`, `dfm` and `dfm_trim`. Here is an example:

```r
# Preprocess the text
# Create tokens
token <- tokens(
  mycorpus,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_twitter = TRUE,
  remove_url = TRUE,
  remove_hyphens = TRUE,
  include_docvars = TRUE
)
# Clean tokens created by OCR
token_ungd <- tokens_select(
  token,
  c("[\\d-]", "[[:punct:]]", "^.{1,2}$"),
  selection = "remove",
  valuetype = "regex",
  verbose = TRUE
)
mydfm <- dfm(token_ungd,
             tolower = TRUE,
             stem = TRUE,
             remove = stopwords("english"))
# Filter words that appear in less than 7.5% or more than 90% of documents
mydfm.trim <- dfm_trim(
  mydfm,
  min_docfreq = 0.075,  # min 7.5%
  max_docfreq = 0.90,   # max 90%
  docfreq_type = "prop"
)
```
- NB2: We want to be able to generate two forms of graphical output right after running `wordclusters`. We can deal with that at the end, when the rest is ready. They will look like this, or like a combination of `beta` and `gamma` visualisations (more on that later).
- NB3: We will start by implementing the `lda` parameter in `method` because I think it is the easiest one.
- 🎨 Arguments of `wordclusters`:
  - acorpus: the corpus is passed some cleaning arguments (if we take that route) and is segmented (only for the `method` parameter `chd`) before being converted into a dfm with the quanteda function `dfm`. All the other arguments are then passed the dfm.
  - method: clustering measurement method. `chd` is the default if nothing is specified (even though it will only be developed after the `lda` parameter).
  - k: the number of clusters to be generated. `likelihood` will take the number of topics with the highest held-out likelihood and `residuals` will take the number of topics with the lowest residuals, as explained here (a sketch with `stm::searchK` follows this list). The `silhouette` and `gap_statistics` methods will take the number of topics with the highest average silhouette width (implemented here). An alternative for `gap_statistics` is here. The default should be `silhouette` or `gap_statistics` because they are in base R.
  - segmentation: this option applies to the `chd` method only. For the latter, the default is 40 words, as indicated here.
  - minfreq and maxfreq: they filter out words that have a very high frequency (banality) or a very low frequency (insignificance) in each document of the dfm. It is based on the quanteda function `dfm_trim`, available here. 🛑 We have to do some tests to understand what the defaults should be. My feeling is to use not-too-conservative bounds, but only if the rest is computationally efficient.
  - cleaning: see supra.
  - no.words: how many words per topic to generate. I am of the opinion that the default number should be low, i.e. 20.
  - sort.beta: sort the highest or lowest contributing words to a topic. Default is `descending`.
  - sort.gamma: sort the highest or lowest contributing topics to a document. Default is `descending`.
  - na.rm: ibid. (see supra).
  - parallel: ibid. (see supra).
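A hedged sketch of how `k` could be picked automatically with `stm::searchK()`, which reports held-out likelihood and residuals for candidate numbers of topics; the corpus and the candidate values of K are placeholders:

```r
library(quanteda)
library(stm)
adfm      <- dfm(tokens(data_corpus_inaugural, remove_punct = TRUE))
stm_input <- convert(adfm, to = "stm")
kcheck    <- searchK(stm_input$documents, stm_input$vocab, K = c(5, 10, 15, 20))
kcheck$results
# "likelihood": keep the K with the highest held-out likelihood
# "residuals":  keep the K with the lowest residuals
```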
- 📏 Description of the `method` parameters:
  - For `lda` we can implement this tutorial. I believe it is the easiest way, so we can start with that (a sketch with `stm` follows this list). Additional info here and here. Other methods we will probably not use: here, here and here.
For
chd
we can implement in part this technique but make it work with a dfm out of the box. A hint of the visualisation route with theca
package is described below:# Prerequisite 1: open the folder Rscripts and open the file CHD.R # Prerequisite 2: open the folder DF and check the outputs source("/usr/share/iramuteq/Rscripts/CHD.R") source("/usr/share/iramuteq/Rscripts/chdtxt.R") source("/usr/share/iramuteq/Rscripts/anacor.R") source("/usr/share/iramuteq/Rscripts/Rgraph.R") nbt <- 9 library(irlba) svd.method <- 'irlba' libsvdc.path <- NULL mode.patate = FALSE library(Matrix) data1 <- readMM("/data/ir/imports/b/bm_1/bmk_1/pmga12_alphanumerical_mini_compact_noempty_alceste_5/TableUc1.csv") data1 <- as(data1, "dgCMatrix") row.names(data1) <- 1:nrow(data1) chd1<-CHD(data1, x = nbt, mode.patate = mode.patate, svd.method = svd.method, libsvdc.path = libsvdc.path) #lecture des uce listuce1<-read.csv2("/data/ir/imports/b/bm_1/bmk_1/pmga12_alphanumerical_mini_compact_noempty_alceste_5/listeUCE1.csv") rm(data1) classif_mode <- 1 mincl <- 2 uceout <- "/data/ir/imports/b/bm_1/bmk_1/pmga12_alphanumerical_mini_compact_noempty_alceste_5/uce.csv" if (classif_mode == 0) { chd.result <- Rchdtxt(uceout, chd1, chd2 = chd2, mincl = mincl,classif_mode = classif_mode, nbt = nbt) } else { chd.result <- Rchdtxt(uceout, chd1, chd2 = chd1, mincl = mincl,classif_mode = classif_mode, nbt = nbt) } n1 <- chd.result$n1 classeuce1 <- chd.result$cuce1 classes<-n1[,ncol(n1)] write.csv2(n1, file="/data/ir/imports/b/bm_1/bmk_1/pmga12_alphanumerical_mini_compact_noempty_alceste_5/n1.csv") rm(n1) tree.tot1 <- make_tree_tot(chd1) # open_file_graph("/data/ir/imports/b/bm_1/bmk_1/pmga12_alphanumerical_mini_compact_noempty_alceste_5/arbre_1.png", widt = 600, height=400) # plot(tree.tot1$tree.cl) # dev.off() tree.cut1 <- make_dendro_cut_tuple(tree.tot1$dendro_tuple, chd.result$coord_ok, classeuce1, 1, nbt) save(tree.cut1, file="/data/ir/imports/b/bm_1/bmk_1/pmga12_alphanumerical_mini_compact_noempty_alceste_5/dendrogramme.RData") open_file_graph("/data/ir/imports/b/bm_1/bmk_1/pmga12_alphanumerical_mini_compact_noempty_alceste_5/dendro1.png", width = 600, height=400) plot.dendropr(tree.cut1$tree.cl,classes, histo=TRUE) open_file_graph("/data/ir/imports/b/bm_1/bmk_1/pmga12_alphanumerical_mini_compact_noempty_alceste_5/arbre_1.png", width = 600, height=400) plot(tree.cut1$dendro_tot_cl) dev.off() #save.image(file="/data/ir/imports/b/bm_1/bmk_1/pmga12_alphanumerical_mini_compact_noempty_alceste_5/RData.RData")
- 📦 What? `cosinedocs` determines how close two documents are to each other in lexical, semantic and stylistic senses.
- 💎 Why? `cosinedocs` applies the Python Doc2Vec vectorised approach and implements it in R. A pre-implementation has already been developed by Srija Anand.
- 👽 Alt name: remora
- ⚡ Requires: `reticulate` and `dplyr` (R), and a few Python libraries (see below)
- 🔮 Usage of `cosinedocs`:

```r
output <- cosinedocs(adataframe,
                     groupbywhat = NULL,
                     columnname,
                     mostsimilar = 2,       # default is 2
                     outputdir = "/path/nameofdir",
                     max_epochs = 20,       # default is 20
                     vec_size = 250,        # default is 250
                     alpha = 0.025,         # default is 0.025
                     min_alpha = 0.00025,   # default is 0.00025
                     min_count = 1,         # default is 1
                     dm = 1)                # default is 1
```
- 🎨 Arguments of `cosinedocs`:
  - adataframe: a data frame in which the text is in one column.
  - groupbywhat: calls the `group_by` function of `dplyr` and merges rows accordingly. Default is `NULL`.
  - mostsimilar: tells the python script how many similar documents to print for each queried document.
  - outputdir: tells where to store the output models of Doc2Vec.
  - the other arguments are training parameters of the `doc2vec` implementation with `gensim` in Python (a sketch of the `reticulate` glue follows this list).
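A minimal sketch of how the Python steps below could be driven from R through reticulate; the virtualenv name and the script file `doc2vec_train.py` are hypothetical placeholders for wherever we store the gensim code:

```r
library(reticulate)
use_virtualenv("seaglass-venv", required = FALSE)  # hypothetical environment holding gensim/nltk
py_run_file("doc2vec_train.py")                    # runs the training script shown below
```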
- 📏 Description of the preliminary `R` part:
  - Calls the `group_by` function and stores the output in a temporary object: `tempdataframe <- adataframe %>% group_by(columnname)`
  - Exports the `tempdataframe` to a temporary directory:

```r
dir.create("/path/nameofdir")
tempdataframe <- tempdataframe %>% select(columnname, text)  # "text" is the column where the text is
setwd("/path/nameofdir")
# write one .txt file per row, named after the grouping column
invisible(lapply(1:nrow(tempdataframe), function(i)
  write.table(tempdataframe[i, 2], file = paste0(tempdataframe[i, 1], ".txt"),
              row.names = FALSE, col.names = FALSE, quote = FALSE)))
rm(tempdataframe)
```
  - Trains the model in Python through reticulate:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
import os.path
import multiprocessing

nltk.download('punkt')
directory = r'/path/nameofdir'
docLabels = []
docLabels = [f for f in os.listdir(directory)]
# print(docLabels)
data = []
for doc in docLabels:
    words = open(directory + '//' + doc, encoding='utf-8').read()
    words = words.strip()
    words = words.split()
    # print(words)
    tags = [doc]
    data.append(TaggedDocument(words=words, tags=tags))
# print(data[0])
# encoding='cp1252'
# tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]

cores = multiprocessing.cpu_count()
print(cores)

max_epochs = 20
vec_size = 250
alpha = 0.025

model = Doc2Vec(vector_size=vec_size, workers=cores, alpha=alpha,
                min_alpha=0.00025, min_count=1, dm=1)
model.build_vocab(data)
# train_documents, total_examples=len(train_documents), epochs=30
model.train(data, total_examples=model.corpus_count, epochs=max_epochs)  # =model.iter
# # decrease the learning rate
# model.alpha -= 0.0002
# # fix the learning rate, no decay
# model.min_alpha = model.alpha
model.save("/path/nameofdir2/d2v.model")
print("Model Saved")
```
  - Queries the model in Python through reticulate (inferring a vector for each document and printing the most similar ones):

```python
from gensim.models.doc2vec import Doc2Vec
from nltk.tokenize import word_tokenize
import nltk
import os

nltk.download('punkt')
model = Doc2Vec.load("/path/nameofdir2/d2v.model")
# to find the vector of a document which is not in training data
# test_data = word_tokenize("Six days ago, my colleagues and I sat on the chairs of high office in the Government of India.".lower())
directory = r'/path/nameofdir'
docLabels = []
docLabels = [f for f in os.listdir(directory)]
for doc in docLabels:
    test_data = open(directory + '//' + doc, encoding='utf-8').read()
    print(doc)
    test_data = test_data.strip()
    test_data = test_data.split()
    test_data = list(test_data)
    v1 = model.infer_vector(test_data)
    similar_doc = model.docvecs.most_similar([v1])
    # print(similar_doc[0], similar_doc[1])
    print(similar_doc[0:10])  # will print the 10 most similar per locutor
```
  - Removes the temp files: `unlink("/path/nameofdir", recursive = TRUE, force = TRUE)`
- 📦 What? `wordexplorer` is a simple shiny explorer for browsing the context of a word, phrase or regular expression. It offers the possibility to examine, in a split screen, two different search queries in two different corpuses. This is particularly useful for translated corpuses.
- 💎 Why? `wordexplorer` takes words, glob or regular expressions and displays their left and right context.
- 👽 Alt name: scuba
- ⚡ Requires: `shiny`, `quanteda`
- 🔮 Usage of `wordexplorer`:

```r
wordexplorer:::wordexplorer(kwic(acorpus,           # we could simplify it here
                                 pattern = "love",  # an example
                                 window = 3,
                                 valuetype = c("glob", "regex", "fixed"),
                                 case_insensitive = TRUE),
                            rowname = NULL,              # e.g. c("minister", "covidtw")
                            acorpustranslated,
                            tr.pattern = NULL,           # e.g. "प्यार"
                            tr.window = NULL,            # e.g. 3
                            tr.valuetype = NULL,         # e.g. c("glob", "regex", "fixed")
                            tr.case_insensitive = NULL,  # e.g. TRUE
                            tr.rowname = NULL)           # e.g. c("minister", "covidtw")
# these are examples
```
- 🎨 Arguments of `wordexplorer`:
  - `rowname` is passed first; it calls the `filter` function from `dplyr`: `acorpustemporary <- acorpus %>% filter(type == "minister" | type == "covidtw")`
  - All the other arguments are based on the `kwic` function of quanteda; they are explained here and here.
  - When the shiny window is exited, the temporary object is removed: `rm(acorpustemporary)`.
  - If any of the `tr.xxx` parameters are other than `NULL`, the shiny window opens in split screen (left/right). The `tr.xxx` parameters are similarly based on the `kwic` function of quanteda. A minimal sketch of the app skeleton follows this list.
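A minimal sketch of the app skeleton, assuming a quanteda corpus object named `acorpus`; the real `wordexplorer` would add the `rowname` filtering and the split-screen `tr.*` inputs:

```r
library(shiny)
library(quanteda)

ui <- fluidPage(
  textInput("pattern", "Search pattern", value = "love"),
  numericInput("window", "Window", value = 3, min = 1),
  tableOutput("hits")
)

server <- function(input, output, session) {
  output$hits <- renderTable({
    as.data.frame(kwic(tokens(acorpus), pattern = input$pattern, window = input$window))
  })
}

# shinyApp(ui, server)   # run once acorpus exists in the session
```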
- 🌓 Tutorials to do parallel computing in R: probably the best, here, here, here (a toy sketch follows below).
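As a toy illustration of the `parallel = TRUE` idea with base R only (mclapply forks on Unix-alikes; on Windows `mc.cores` must stay at 1); the per-document computation here is a dummy stand-in:

```r
library(parallel)
ncores <- max(1L, detectCores() - 1L)
# dummy per-document computation standing in for the real scoring step
doc_totals <- mclapply(1:10, function(i) sum(rnorm(1e5)), mc.cores = ncores)
```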
- 🌗 Tutorials to build a Shiny app: probably the simplest, here, here.