-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
59 lines (48 loc) · 1.73 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
---
output:
md_document:
variant: gfm
bibliography: refs.bib
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
# dblinkR: An R interface for dblink
## Overview
`dblinkR` is an R interface for
[`dblink`](https://github.com/cleanzr/dblink)—an Apache Spark package for
performing unsupervised entity resolution.
It implements a generative Bayesian model for entity resolution called `blink`
[@steorts_entity_2015], with extensions proposed in [@marchant_d-blink_2021].
Unlike many entity resolution methods, `dblink` approximates the full posterior
distribution over the _linkage structure_.
This facilitates propagation of uncertainty to post-entity resolution
analysis, and provides a framework for answering probabilistic queries about
entity membership.
## Installation
`dblinkR` is not currently available on CRAN.
The latest development version can be installed from source using `devtools`
as follows:
```{r, eval=FALSE}
library(devtools)
install_github("ngmarchant/dblinkR")
```
### Dependencies
`dblinkR` depends heavily on the `sparklyr` R interface for Apache Spark.
Please refer to the `sparklyr` [website](https://spark.rstudio.com/) for
information about connecting to a Spark deployment.
`dblinkR` currently supports Spark releases in the 2.3.x series and 2.4.x
series. Spark releases prior to 2.3.x are not supported.
## Example
The [RLdata500 vignette](vignettes/RLdata500.Rmd) demonstrates how to
use `dblinkR` to perform entity resolution for a small synthetic
data set. This example is small enough to run on a laptop (Spark
cluster not required).
## Licence
GPL-3
## References