Skip to content

Commit

Permalink
Fixed plots in vignettes
Browse files Browse the repository at this point in the history
  • Loading branch information
stephaniereinders committed Dec 9, 2024
1 parent d70ff85 commit 5640860
Show file tree
Hide file tree
Showing 4 changed files with 256 additions and 30 deletions.
10 changes: 1 addition & 9 deletions R/plots.R
Original file line number Diff line number Diff line change
Expand Up @@ -115,16 +115,8 @@ plot_scores <- function(scores, obs_score = NULL, n_bins = 50) {
angle = 90,
vjust = -1,
hjust = 0.5
) + # add text
ggplot2::labs(title = "The observed similarity score compared to reference similarity scores", x = "Score", y = "Rate")
} else {
p <- p +
ggplot2::labs(title = "Reference similarity scores", x = "Score", y = "Rate")
)
}
p <- p +
ggplot2::theme(legend.position = "bottom",
legend.text = ggplot2::element_text(size = 6),
legend.title = ggplot2::element_text(size = 8))

return(p)
}
Expand Down
Original file line number Diff line number Diff line change
@@ -1,26 +1,31 @@
---
title: "SLRs from Random Forests Tutorial"
output: rmarkdown::html_vignette
title: "An Introduction to the SLR Model"
output:
rmarkdown::html_vignette:
toc: true # Enable Table of Contents
toc_depth: 2 # Set the depth of the TOC (adjust as needed)
number_sections: true
css: "styles.css"
vignette: >
%\VignetteIndexEntry{slrs-from-random-forests-tutorial}
%\VignetteIndexEntry{an-introduction-to-the-slr-model}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
```{r knitr, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
comment = "#>",
fig.width = 7,
fig.height = 5
)
```

```{r setup}
library(handwriterRF)
```
## Introduction

The `handwriterRF` package implements the statistical method described by Madeline Johnson and Danica Ommen (2021) (<doi:10.1002/sam.11566>). This tutorial summarizes the method introduced in the paper and explains how to use `handwriterRF` to compare handwriting samples. The method employs a random forest to produce a score-based likelihood ratio (SLR), quantifying the strength of evidence that two handwritten documents were written by the same writer or different writers.

## The Data
## Handwriting Data

We use handwriting samples from the [CSAFE Handwriting Database](https://data.csafe.iastate.edu/HandwritingDatabase/) and the [CVL Handwriting Database](https://cvl.tuwien.ac.at/research/cvl-databases/an-off-line-database-for-writer-retrieval-writer-identification-and-word-spotting/). These databases contain paragraph-length handwriting samples. We randomly selected two Wizard of Oz prompts and two London Letter prompts from CSAFE writers, and four prompts from CVL writers. These samples were randomly split into three sets: training, validation, and testing.

Expand All @@ -34,16 +39,21 @@ We estimated the writer profiles for all handwriting samples using the `get_writ
4. Calculate the proportion of graphs assigned to each cluster with `get_cluster_fill_rates()`. The cluster fill rates serve as an estimate of the writer profile for the sample.

The `train` data frame contains the estimated writer profiles for train set. Let's visualize the writer profiles for two writers from `train`:
```{r}

```{r profiles-facet}
library(handwriterRF)
wps <- train %>% dplyr::filter(writer == "w0004" | writer == "w0015")
plot_writer_profiles(wps, color_by = "writer", facets = "writer")
```

Each writer has four documents in `train`. We see that for each writer, the profiles are not exactly the same, but many of the spikes and valleys occur in the same clusters. We can plot all the writer profiles on the same axes to better compare the two writers.

```{r}
```{r profiles-combined}
plot_writer_profiles(wps, color_by = "writer")
```

In this plot the spikes and valleys are not all aligned. In cluster 37, writer w0004 has a small spike while w0015 has a valley. In cluster 27, writer w0015 has a taller spike that writer w0004. Intuitively, we see similarities and differences between the writer profiles in the plot. But we employ a statistical method to formally evaluate the similarities between writer profiles.

## Constucting Reference Similarity Scores with a Random Forest
Expand All @@ -60,15 +70,15 @@ Next, we calculate the distances between each pair of writer profiles in `valida

To visualize the reference similarity scores, we plot the rates of scores assigned to different bins (rather than frequencies), as the data often contains far more "different writer" pairs than "same writer" pairs. This gives us a clearer view of the distribution of reference scores.

```{r, dpi=300}
```{r ref-scores}
plot_scores(ref_scores)
```

## Compare Two Test Handwriting Samples

The `test` data frame contains writer profiles from writers not in `train` or `validation`. Let’s compare two writer profiles in the `test` set using the trained random forest and reference similarity scores. We’ll use the first two samples from writer `w0005` as an example. First, we plot the writer profiles:

```{r}
```{r example-profiles}
test_samples <- test[test$writer == "w0005",][1:2,]
plot_writer_profiles(test_samples)
Expand All @@ -78,14 +88,14 @@ plot_writer_profiles(test_samples)

We can compute the similarity score between these two test samples using the `compare_writer_profiles()` function. This score is derived using the same procedure as for the validation set: we calculate the distance between the two profiles, then compute the proportion of random forest decision trees that predict "same writer."

```{r}
```{r example-score}
score <- compare_writer_profiles(test_samples)
score
```

Let's visually see how the similarity score `r score$score` compares to our same writer and different writers similarity scores.

```{r}
```{r plot-score}
plot_scores(ref_scores, obs_score = score$score)
```

Expand All @@ -99,7 +109,7 @@ $P_2: \text{the handwriting samples were written by different writers}$

The SLR is the ratio of the likelihood of observing the similarity score under $P_1$ to the likelihood under $P_2$. To calculate the SLR, we use `compare_writer_profiles()` with the `score_only = FALSE` argument. This function applies kernel density estimation to fit probability density functions (PDFs) to the reference scores for same writer and different writer pairs. The SLR is the ratio of the height of the same writer PDF at the observed similarity score to the height of the different writer PDF at the same score. An SLR greater than 1 suggests the samples were likely written by the same writer, while an SLR less than 1 suggests the samples were likely written by different writers.

```{r}
```{r slr}
slr <- compare_writer_profiles(test_samples, score_only = FALSE)
slr
```
217 changes: 217 additions & 0 deletions vignettes/styles.css
Original file line number Diff line number Diff line change
@@ -0,0 +1,217 @@
@import url('https://fonts.googleapis.com/css2?family=Montserrat:ital,wght@0,100..900;1,100..900&display=swap');

:root {
--blue: #003A70;
--lt-blue: #6BA4B8;
--purple: #981E97;
--lt-purple: #B24DB2;
--dk-gray: #343a40;
--md-gray: #545859;
--lt-gray: #ddd;
--orange: #F68D2E;
}

/* Increase the navbar logo size */
.navbar-brand img {
max-height: 50px; /* Increase the max width as needed */
width: auto; /* Maintain aspect ratio */
vertical-align: middle;
}

.navbar {
background-color: white !important;
padding-left: 10px;
padding-bottom: 0px;
margin-bottom: 0px;
height: 70px;
display: flex;
align-items: flex-end;
}

.navbar > .container,
.navbar > .container-fluid,
.navbar > .container-sm,
.navbar > .container-md,
.navbar > .container-lg,
.navbar > .container-xl,
.navbar > .container-xxl {
display: flex;
flex-wrap: inherit;
justify-content: space-between;
align-items: flex-end;
}

/* Inactive tab color */
.navbar-nav .nav-link {
color: var(--md-gray); /* Color for inactive links */
padding: 0;
}

/* Active tab color */
.navbar-nav .nav-link.active, .navbar-nav .nav-link.show {
color: var(--dk-gray); /* Color for active links */
font-weight: bold;
}

/* Hover tab color */
.navbar-nav .nav-link:hover {
color: var(--dk-gray); /* Color on hover */
font-weight: bold;
}

hr {
margin: 1rem 0;
color: var(--md-gray);
border: 0;
border-top: 1px solid;
opacity: 0.25;
}

h6, .h6, h5, .h5, h4, .h4, h3, .h3, h2, .h2, h1, .h1 {
margin-top: 0;
margin-bottom: 0.5rem;
font-weight: 400;
line-height: 1.2;
color: var(--purple);
}

h1, .h1 {
font-size: calc(1.75rem + 1.5vw);
font-weight: 100;
}
@media (min-width: 1200px) {
h1, .h1 {
font-size: 3rem;
}
}

h2, .h2 {
font-size: calc(1rem + 0.6vw);
font-weight: 700;
}
@media (min-width: 1200px) {
h2, .h2 {
font-size: 2rem;
}
}

h3, .h3 {
font-size: calc(1rem + 0.6vw);
font-weight: 500;
}
@media (min-width: 1200px) {
h3, .h3 {
font-size: 1.75rem;
}
}

h4, .h4 {
font-size: calc(1.275rem + 0.3vw);
}
@media (min-width: 1200px) {
h4, .h4 {
font-size: 1.5rem;
}
}

h5, .h5 {
font-size: 1.25rem;
}

h6, .h6 {
font-size: 1rem;
}

/* Table of contents in sidebar */
.sidebar nav[role="doc-toc"] ul > li > a:hover,
.sidebar nav[role="doc-toc"] ul > li > ul > li > a:hover {
color: var(--purple) !important;
font-weight: bold;
}
.sidebar nav[role="doc-toc"] ul > li > a.focus,
.sidebar nav[role="doc-toc"] ul > li > ul > li > a.focus,
.sidebar nav[role="doc-toc"] ul > li > a.active,
.sidebar nav[role="doc-toc"] ul > li > ul > li > a.active {
color: var(--purple) !important;
font-weight: bold;
border-left-width: 1px;
border-left-style: solid;
border-left-color: var(--purple);
}

/* Unvisited link */
a:link {
color: var(--purple); /* Change to your desired color */
}

/* Visited link */
a:visited {
color: var(--lt-purple); /* Change to your desired color */
font-weight: normal;
}

/* Mouse over link */
a:hover {
color: var(--lt-purple); /* Change to your desired color */
font-weight: normal;
}

/* Selected link */
a:active {
color: var(--lt-purple); /* Change to your desired color */
font-weight: normal;
}

/* Body */
body {
font-family: 'Montserrat', sans-serif;
font-weight: 500;
font-size: 16px;
color: var(--md-gray);
}


*, ::after, ::before {
box-sizing: border-box;
}

/* Scenario Cards */
.twocards {
display: flex;
flex-direction: row;
}

.card {
border: 1px solid #dadada;
margin: 15px;
box-shadow: 4px 4px 8px 0 rgba(0, 0, 0, 0.2);
transition: 0.2s;
width: 50%;
display: flex;
flex-direction: column;
}

.card h3 {
margin-left: 14px;
margin-right: 14px;
margin-bottom: 4px;
margin-top: 24px;
}

.card:hover {
box-shadow: 8px 8px 16px 0 rgba(0, 0, 0, 0.2);
}

.card .container {
padding: 2px 14px;
flex: 1;
display: flex;
flex-direction: column;
}

.card p {
margin-left: 14px;
margin-right: 14px;
margin-bottom: 4px;
margin-top: 0;
}
Original file line number Diff line number Diff line change
@@ -1,17 +1,24 @@
---
title: "Make New Random Forests and Reference Scores"
output: rmarkdown::html_vignette
title: "Training an SLR Model"
output:
rmarkdown::html_vignette:
toc: true # Enable Table of Contents
toc_depth: 2 # Set the depth of the TOC (adjust as needed)
number_sections: true
css: "styles.css"
vignette: >
%\VignetteIndexEntry{train-model}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
comment = "#>",
fig.width = 7,
fig.height = 5
)
```

HandwriterRF has a pre-trained random forest and set of reference similarity scores that you may use with the functions `compare_documents()` and `compare_writer_profiles()`. This tutorial shows you how to train your own random forest and create your own set of reference scores.
Expand Down

0 comments on commit 5640860

Please sign in to comment.