Steven Moran and Alena Witzlack-Makarevich (21 June, 2024)
- Introduction
- Getting the data
- Explore the data
  - select() what you need
  - mutate() character into factor
- Get languages
  - join() some tables
- Work with one language
- Stack some Germanic languages
- Explore another set of languages
  - join() predicate and data tables
library(tidyverse)
library(readxl)
library(readr)
library(forcats)
library(cowplot)
library(knitr)
library(kableExtra)
From the BivalTyp website:
BivalTyp is a typological database of bivalent verbs and their encoding frames. As of 2024, the database presents data for 129 languages, mainly spoken in Northern Eurasia. The database is based on a questionnaire containing 130 predicates given in context. Language-particular encoding frames are identified based on the devices (such as cases, adpositions, and verbal indices) involved in encoding two predefined arguments of each predicate (e.g. ‘Peter’ and ‘the dog’ in ‘Peter is afraid of the dog’). In each language, one class of verbs is identified as transitive. The goal of the project is to explore the ways in which bivalent verbs can be split between the transitive and different intransitive valency classes.
The data from BivalTyp are available in a GitHub repository created and maintained by Dmitry Nikolayev: https://github.com/macleginn/bivaltyp
First, let’s figure out where the raw data are.
Here we find a number of CSV tabular data files. What is in these data files? Let’s explore them.
OK, now how do we load the data into R/RStudio?
One way is to download them to your computer.
Open RStudio. But then you need to make sure that the files are:
- Either in the same folder as the Rmd file you are working on (which means you’ve set the working directory to where your Rmd file is):
valency <- read_tsv('data_for_download.csv')
- Or you load the file via a path from the working directory to where the data file is, e.g.:
valency <- read_tsv('data/data_for_download.csv')
- If the data are available online, e.g., in a GitHub repository (or elsewhere), you can load the data directly into R/RStudio with the url() function and a URL:
valency <- read_tsv(url('https://raw.githubusercontent.com/macleginn/bivaltyp/master/data/data_for_download.csv'))
But be careful – it has to be the URL to the raw data, not the webpage! Click on the “Raw” button in GitHub to get the URL in the browser window.
Another issue to remember – although the file is labeled CSV for “comma-separated values”, the actual file is separated by tabs. Linguists often use TSV (i.e., tab-separated values) for display purposes.
In R/RStudio you can specify the delimiter, e.g., that it is tab instead of comma. Comma is the default standard, but in the wild you will come across many different characters as delimiters.
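For instance, the more general readr function read_delim() lets you declare the delimiter explicitly. A minimal sketch: the tab-separated call below is equivalent to read_tsv(), while the semicolon-separated file is a purely hypothetical illustration:
# Equivalent to read_tsv(): declare the tab character as the delimiter
valency <- read_delim('data/data_for_download.csv', delim = '\t')
# A hypothetical semicolon-separated file would be read like this:
# other_data <- read_delim('data/other_data.csv', delim = ';')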
Finally, notice that the file we started with is called data_for_download.csv, which is not a very informative label for an R object. We pick a more telling name, e.g. valency.
Every time you read in a data set, it’s a good idea to have a look at it from a few angles to make sure that there are no major issues with both the import and the data themselves before you move on to doing some real statistics.
Let’s start by looking at the variables with the function str() (it stands for “structure”):
str(valency)
## spc_tbl_ [16,770 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ language_no : num [1:16770] 60 60 60 60 60 60 60 60 60 60 ...
## $ predicate_no : num [1:16770] 1 2 3 4 5 6 7 8 9 10 ...
## $ verb : chr [1:16770] "χ’ə" "*" "[ĉ-]ŝa" "w-š’tə" ...
## $ X : chr [1:16770] "IO" "*" "ABS" "ERG" ...
## $ Y : chr [1:16770] "ABS" "*" "MAL" "ABS" ...
## $ locus : chr [1:16770] "X" "*" "Y" "TR" ...
## $ valency_pattern : chr [1:16770] "IO_ABS" NA "ABS_MAL" "TR" ...
## $ sentence : chr [1:16770] "á-č̣’ḳʷən j-qá jə́-χ’-əj-ṭ" "*" "á-č̣’ḳʷən a-lá d-a-ĉ-ŝ-ə́j-ṭ" "á-č̣’ḳʷən-ĉa-kʷa a-háqʷ-kʷa j-á-wə-r-š’t-əj-ṭ" ...
## $ glosses_en : chr [1:16770] "DEF-boy 3SG.M.IO-head 3SG.M.IO-ache-PRS-DCL" "*" "DEF-boy DEF-dog 3SG.H.ABS-3SG.N.IO-MAL-be_afraid-PRS-DCL" "DEF-boy-PL.H-PL DEF-stone-PL 3PL.ABS-3N.IO-LOC-3PL.ERG-throw-PRS-DCL" ...
## $ back_translation_en : chr [1:16770] "‘The boy has a headache.’" "*" "‘The boy is afraid of the dog.’" "‘The boys are throwing the stones.’" ...
## $ comms : chr [1:16770] NA "No satisfactory translation has been obtained." NA NA ...
## $ glosses_ru : chr [1:16770] "DEF-парень 3SG.M.IO-голова 3SG.M.IO-болеть-PRS-DCL" "*" "DEF-парень DEF-собака 3SG.H.ABS-3SG.N.IO-MAL-бояться-PRS-DCL" "DEF-парень-PL.H-PL DEF-камень-PL 3PL.ABS-3N.IO-LOC-3PL.ERG-бросить-PRS-DCL" ...
## $ back_translation_ru : chr [1:16770] "‘У парня болит голова.’" "*" "‘Парень боится собаки.’" "‘Мальчики бросают камни.’" ...
## $ verb_original_orthography : chr [1:16770] "хьы" "*" "[чв-]шва" "ау-щты" ...
## $ sentence_original_orthography: chr [1:16770] "А-чIкIвын й-хъА йЫ-хь-и-тI" "*" "А-чIкIвын а-лА д-а-чв-шв-И-тI" "А-чIкIвын-чва-ква а-хIАхъв-ква й-А-уы-р-щт-и-тI" ...
## - attr(*, "spec")=
## .. cols(
## .. language_no = col_double(),
## .. predicate_no = col_double(),
## .. verb = col_character(),
## .. X = col_character(),
## .. Y = col_character(),
## .. locus = col_character(),
## .. valency_pattern = col_character(),
## .. sentence = col_character(),
## .. glosses_en = col_character(),
## .. back_translation_en = col_character(),
## .. comms = col_character(),
## .. glosses_ru = col_character(),
## .. back_translation_ru = col_character(),
## .. verb_original_orthography = col_character(),
## .. sentence_original_orthography = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
Another way to get some sense of the data is to use the function head():
head(valency)
## # A tibble: 6 × 15
## language_no predicate_no verb X Y locus valency_pattern sentence
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 60 1 χ’ə IO ABS X IO_ABS á-č̣’ḳʷən j-…
## 2 60 2 * * * * <NA> *
## 3 60 3 [ĉ-]ŝa ABS MAL Y ABS_MAL á-č̣’ḳʷən a-…
## 4 60 4 w-š’tə ERG ABS TR TR á-č̣’ḳʷən-ĉa…
## 5 60 5 [z-]qa BEN ABS X BEN_ABS wəẑə́ zaréma…
## 6 60 6 apš ABS IO Y ABS_IO wará s-aχš’…
## # ℹ 7 more variables: glosses_en <chr>, back_translation_en <chr>, comms <chr>,
## # glosses_ru <chr>, back_translation_ru <chr>,
## # verb_original_orthography <chr>, sentence_original_orthography <chr>
Could you guess what the “sibling” function tail() does?
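You can check your guess directly; here with the optional argument n to show only the last three rows:
tail(valency, n = 3)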
Interpret the output of str() and head():
- How many variables are in the dataset?
- Of which type are these variables?
- How many rows does it contain?
- Which variables are you likely to use for statistical analysis, and which ones will you probably not need?
(When a dataset has many variables it might be more convenient to work with a smaller version of it containing just what you need.)
The dataset valency contains a lot of textual data (examples and their translations), which won’t be used for any statistical analysis. Let’s use the tidyverse function select() to pick only the variables you might need for further analysis: you just list them one after another as arguments of the function select().
Here we overwrite the object valency. (Do not panic: you always have access to the original dataset in case you need it.)
valency <- valency %>% select(language_no, predicate_no, X, Y, locus, valency_pattern)
It’s a good idea to check with the familiar functions whether you got what you wanted. Notice the optional argument n = 3 added to the function head(). What does it do? Verify your intuition by changing the value to, e.g., n = 5.
head(valency, n = 3)
## # A tibble: 3 × 6
## language_no predicate_no X Y locus valency_pattern
## <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 60 1 IO ABS X IO_ABS
## 2 60 2 * * * <NA>
## 3 60 3 ABS MAL Y ABS_MAL
To get a first idea of how much of everything is in the dataset, the function summary() is quite useful:
summary(valency)
## language_no predicate_no X Y
## Min. : 1 Min. : 1.0 Length:16770 Length:16770
## 1st Qu.: 33 1st Qu.: 33.0 Class :character Class :character
## Median : 65 Median : 65.5 Mode :character Mode :character
## Mean : 65 Mean : 65.5
## 3rd Qu.: 97 3rd Qu.: 98.0
## Max. :129 Max. :130.0
## locus valency_pattern
## Length:16770 Length:16770
## Class :character Class :character
## Mode :character Mode :character
##
##
##
As it turns out, summary() on character data (our X, Y, locus, and valency_pattern) does not yield much of use. Why is this the case?
In R, there is an important distinction between the two data types for storing textual data (i.e. data with words): character and factor.
We use character for textual data which do not represent categories/classes, e.g. example sentences and their glosses in the original large valency dataset. Textual data which represent classes or categories should be stored as factors in R and not as character data type.
For instance, in a patients dataset, patient names would be stored as characters, as we don’t really care how many Johns and Roses there are and whether they are more frequent than Lees and Monicas. On the other hand, male and female are categories from a list which includes these and other possibilities, and which we would like to be able to count for statistical analysis; for this reason we treat them as factors in R.
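A minimal sketch of the difference, with made-up toy values: summary() on a character vector only reports its length, class, and mode, while the same values stored as a factor are counted per category.
sex <- c("male", "female", "female", "male", "female")  # made-up toy values
summary(sex)          # character vector: reports only length, class, and mode
summary(factor(sex))  # factor: reports counts per category (female: 3, male: 2)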
Different functions to import data into R have different default specifications: some generously treat all words as factors, others conservatively treat all words as characters. What does the function read_tsv() (or read_csv()) do? It assumes that all words are character, and then it is up to you to declare which ones should be factors. Let’s declare that X, Y, locus, and valency_pattern are actually factors.
First, let’s mutate the variable X into a factor:
valency <- valency %>% mutate(X = factor(X))
Compare the now-factor X to the still-character variable Y.
summary(valency)
## language_no predicate_no X Y
## Min. : 1 Min. : 1.0 NOM :7578 Length:16770
## 1st Qu.: 33 1st Qu.: 33.0 SBJ :3549 Class :character
## Median : 65 Median : 65.5 ERG :1877 Mode :character
## Mean : 65 Mean : 65.5 * :1397
## 3rd Qu.: 97 3rd Qu.: 98.0 DAT : 794
## Max. :129 Max. :130.0 ABS : 486
## (Other):1089
## locus valency_pattern
## Length:16770 Length:16770
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
Let’s change (mutate) the other character variables into factors and have a look at their summary:
valency <- valency %>% mutate(Y = factor(Y), locus = factor(locus), valency_pattern = factor(valency_pattern))
summary(valency)
## language_no predicate_no X Y locus
## Min. : 1 Min. : 1.0 NOM :7578 ACC :2766 * :1397
## 1st Qu.: 33 1st Qu.: 33.0 SBJ :3549 DO :1807 TR:7082
## Median : 65 Median : 65.5 ERG :1877 NOM :1725 X :1298
## Mean : 65 Mean : 65.5 * :1397 * :1397 XY: 344
## 3rd Qu.: 97 3rd Qu.: 98.0 DAT : 794 DAT :1075 Y :6649
## Max. :129 Max. :130.0 ABS : 486 ABS : 823
## (Other):1089 (Other):7177
## valency_pattern
## TR :7080
## NOM_DAT: 755
## DAT_NOM: 540
## NOM_INS: 270
## NOM_ABL: 265
## (Other):6463
## NA's :1397
It’s getting more interesting, but at this stage you probably realize that we have no idea what languages and what predicates we are dealing with in this table. These details are part of two separate datasets, and before we embark on any serious exploration, we need to join them.
The only information on languages in our valency dataset is some kind of ID under language_no, but what is language number 1?
Let’s get the dataset with the details on the languages. We use the same function to read the dataset and process it a bit, following the same procedure as above.
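The reading step might look like the sketch below. Note that the file name languages.csv, its location in the same GitHub data directory, and its tab delimiter are all assumptions:
languages <- read_tsv(url('https://raw.githubusercontent.com/macleginn/bivaltyp/master/data/languages.csv'))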
What do we get?
head(languages)
## # A tibble: 6 × 23
## language_no language_ru language language_external expert_ru consultant_ru
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 русский Russian Russian Сергей С… <NA>
## 2 2 арабский литер… Arabic_… Standard Arabic Рамазан … <NA>
## 3 3 гуарани Guarani… Paraguayan Guara… Дмитрий … <NA>
## 4 4 эстонский Estonian Estonian Мерит Ни… <NA>
## 5 5 цахурский Tsakhur Tsakhur Жиль Отье Ахмед Давудов
## 6 6 тувинский Tuvinian Tuvinian Софья Ал… <NA>
## # ℹ 17 more variables: expert <chr>, consultant <chr>,
## # data_collection_year <chr>, initial_release_date <chr>,
## # last_release_date <chr>, macroarea <chr>, `family (WALS)` <chr>,
## # `genus (WALS)` <chr>, latitude <dbl>, longitude <dbl>,
## # number_nominal_cases <dbl>, source_of_information <chr>,
## # contact_language <chr>, glottocode <chr>, comment_on_glottocode <chr>,
## # WO_WALS <chr>, WO_comment <chr>
There is definitely too much stuff here that we are unlikely to use; let’s select() only some variables:
languages <- languages %>% select(language_no, language, macroarea, glottocode, latitude, longitude)
We want to join our two data frames languages and valency. They share one variable, language_no, which we will use to perform the join.
There are various options for joining data frames. The choice matters when the data frames do not have fully overlapping sets of observations. This is not the case in our dataset, so the various functions will yield the same result. But in case you are curious (see the sketch after this list):
- left_join() keeps all observations in the data frame x,
- right_join() keeps all observations in the data frame y,
- full_join() keeps all observations in both x and y.
Let’s take left_join() and check our enhanced data frame:
valency <- left_join(languages, valency, by = "language_no")
head(valency)
## # A tibble: 6 × 11
## language_no language macroarea glottocode latitude longitude predicate_no
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Russian Europe russ1263 56 38 1
## 2 1 Russian Europe russ1263 56 38 2
## 3 1 Russian Europe russ1263 56 38 3
## 4 1 Russian Europe russ1263 56 38 4
## 5 1 Russian Europe russ1263 56 38 5
## 6 1 Russian Europe russ1263 56 38 6
## # ℹ 4 more variables: X <fct>, Y <fct>, locus <fct>, valency_pattern <fct>
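As a quick sanity check (one option among many), you can count how many distinct languages ended up in the joined data frame; given the 129 languages mentioned on the BivalTyp website, we would expect a number close to 129, assuming the language names are unique:
n_distinct(valency$language)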
Now that we have some idea about the languages in our dataset, we can explore them a bit further.
Let’s look at some valency details for a language or two. filter() allows you to access only those data which fulfill the specified conditions (e.g. == means it has to exactly match "English"):
valency %>% filter(language == "English") %>% count(locus)
## # A tibble: 3 × 2
## locus n
## <fct> <int>
## 1 * 3
## 2 TR 81
## 3 Y 46
Is English any different from German?
valency %>% filter(language == "German") %>% count(locus)
## # A tibble: 4 × 2
## locus n
## <fct> <int>
## 1 * 1
## 2 TR 71
## 3 X 5
## 4 Y 53
It is! What about some other Germanic languages?
Here, a picture (a plot) might do a better job. A stacked barplot is one option.
valency %>% filter(language %in% c("English", "German", "Icelandic", "Dutch")) %>%
ggplot(aes(x = language, fill = locus)) +
geom_bar()
Interpret the plot:
- Which language is the most diverse in terms of valency patterns (types)?
- Which language is the most transitive (with respect to the 130 meanings in the sample)?
- Which language is the least transitive?
Select another set of languages and compare the distributions of their valency patterns.
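One possible approach, reusing languages we saw in the languages table above; position = "fill" is an optional variation on the plot above that shows proportions instead of raw counts, which makes the languages easier to compare:
valency %>% filter(language %in% c("Russian", "Estonian", "Tsakhur", "Tuvinian")) %>%
  ggplot(aes(x = language, fill = locus)) +
  geom_bar(position = "fill")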