This repository has been archived by the owner on Sep 26, 2019. It is now read-only.

Introduction

Sascha T. Ishikawa edited this page Jul 9, 2015 · 5 revisions

Scribe 2.0 is a framework for partial-text (metadata) transcription that breaks a complex transcription task into simpler, self-contained sub-tasks. This division of labor into atomic tasks is reinforced by giving volunteers canonical, independent workflows to mark, transcribe, and verify documents, and volunteers can step through these workflows in any order. Although Scribe 2.0 was not originally intended as a full-transcription framework, we believe it can be adapted for that purpose.

The Mark-Transcribe-Verify Framework

The front-end first downloads a document for transcription. The document arrives as a subject: a JSON object containing a URI to a uniquely identifiable media file (in this case, an image of the document to be transcribed), along with other data such as how many users have already transcribed the document and any additional metadata.
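
As a rough illustration of the subject described above, the following Python sketch builds a subject and round-trips it through JSON. All field names (`id`, `workflow`, `location`, `classification_count`, `meta_data`) and the example URI are assumptions made for this sketch, not Scribe's actual schema.

```python
import json

# Hypothetical subject; every field name and value here is an assumption
# chosen for illustration, not the schema used by Scribe itself.
subject = {
    "id": "subject-001",
    "workflow": "mark",
    "location": {"standard": "https://example.org/images/page-001.jpg"},
    "classification_count": 3,  # how many users have already worked on it
    "meta_data": {"collection": "example-collection"},
}

# Serialize and parse again, mimicking a subject received over HTTP.
payload = json.dumps(subject)
received = json.loads(payload)

# The image URI is what a front-end would load for display.
image_uri = received["location"]["standard"]
print(image_uri)
```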

The process of transcribing an entire document begins with the “mark” workflow. Given a document, a volunteer is asked to choose from a set of tools to mark specific pre-defined regions in need of transcription. Once a volunteer indicates that a mark is complete, by clicking the “done” button, it is submitted to the server as a classification. A classification is a record of a volunteer’s response(s) to a given subject’s workflow and contains important information used for the final transcription, A/B splits, etc. This information includes the subject ID, timestamp(s), A/B split designations, and a hash of annotations with subject-specific responses such as mark locations and dimensions, transcription text, or verification responses.
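
The contents of a “mark” classification listed above might be assembled roughly as follows. This is only a sketch: the function name `build_mark_classification` and every key in the returned dictionary are hypothetical, chosen to mirror the fields named in the paragraph (subject ID, timestamps, A/B split designation, and annotations).

```python
from datetime import datetime, timezone

def build_mark_classification(subject_id, tool, x, y, width, height, ab_split=None):
    """Assemble a hypothetical classification for one completed mark."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "subject_id": subject_id,
        "workflow": "mark",
        "started_at": now,   # a real client would record these separately
        "finished_at": now,
        "ab_split": ab_split,
        # Annotations hold the workflow-specific response: here, the
        # location and dimensions of the marked region.
        "annotations": {
            "tool": tool,
            "x": x,
            "y": y,
            "width": width,
            "height": height,
        },
    }

classification = build_mark_classification(
    "subject-001", "rectangle", x=120, y=340, width=400, height=60, ab_split="A"
)
print(classification["annotations"])
```

For a “transcribe” or “verify” classification, the annotations would instead carry transcription text or a verification response.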

Once a “mark” classification is received, it is processed [1] by the server and saved to a database, and a secondary “transcribe” subject is generated. The “transcribe” subject references its parent subject and contains, among other information, a URI to the same image as the subject it was generated from. However, it belongs to a separate “transcribe” workflow, which asks users to focus on the subregion of the image that corresponds to the previously generated mark (made either by the same user or by someone else). Each workflow fetches its own subjects and can therefore be accessed independently, with a separate UI. In other words, the unit of work is atomic, which allows different stages of transcription to be completed not only by volunteers but also by automated approaches. For example, in some instances the “mark” workflow may be replaced with a computer vision algorithm that detects regions of text to be transcribed. In such cases, volunteers would interact only with a “transcribe” (and perhaps “verify”) workflow.
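
The idea that each workflow fetches only its own subjects can be sketched with a simple filter over an in-memory list. The `fetch_subjects` helper, the `store` list, and the field names are assumptions for illustration; a real deployment would query the server's database instead.

```python
def fetch_subjects(store, workflow, limit=5):
    """Return up to `limit` subjects that belong to a single workflow."""
    matching = [s for s in store if s["workflow"] == workflow]
    return matching[:limit]

# Hypothetical subject store: one "mark" subject and two "transcribe"
# subjects that were generated from marks made on it.
store = [
    {"id": "s1", "workflow": "mark"},
    {"id": "s2", "workflow": "transcribe", "parent_subject_id": "s1"},
    {"id": "s3", "workflow": "transcribe", "parent_subject_id": "s1"},
]

mark_queue = fetch_subjects(store, "mark")
transcribe_queue = fetch_subjects(store, "transcribe")
print([s["id"] for s in mark_queue], [s["id"] for s in transcribe_queue])
```

Because the “transcribe” queue depends only on which subjects exist, it does not matter whether the marks came from volunteers or from an automated detector.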

From the front-end perspective, the workflows operate independently of each other: each is routed to its own URL hash, requests its own subjects, uses its own tools, and creates its own classifications. However, they are linked through the back-end. A “mark” subject produces a “mark” classification, which generates a secondary “transcribe” subject, which in turn produces a “transcribe” classification. Depending on the needs of a project, each “mark” or “transcribe” classification may also produce additional “verify” subjects that can be presented to volunteers as a last verification step before a final transcription is produced. It follows that every subsequent workflow depends on the “mark” workflow: without any marks, there is nothing to transcribe or verify.
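
The back-end linkage between workflows can be summarized as a small decision function. This is a sketch under stated assumptions: `subjects_to_generate` and the `verify_enabled` flag are hypothetical names standing in for whatever per-project configuration decides when “verify” subjects are created.

```python
def subjects_to_generate(classification_workflow, verify_enabled=False):
    """List the workflows that should receive a new subject after a
    classification of the given workflow arrives (hypothetical rules)."""
    generated = []
    # A "mark" classification always yields a "transcribe" subject.
    if classification_workflow == "mark":
        generated.append("transcribe")
    # Projects may optionally add a "verify" step after either stage.
    if verify_enabled and classification_workflow in ("mark", "transcribe"):
        generated.append("verify")
    return generated

print(subjects_to_generate("mark"))
print(subjects_to_generate("transcribe", verify_enabled=True))
```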

The separation of concerns between the “mark” and “transcribe” (and “verify”) workflows is an important aspect of Scribe 2.0 for several reasons. First and foremost, it removes the need to store state between workflows, which simplifies the front-end logic. An all-encompassing UI component would have to track a user’s progress through multiple workflows, with each step potentially reflecting another intermediate state. Making a single component responsible for each workflow reduces the number of states any one component must manage, which simplifies the design of each component and promotes a more modular and readable codebase.

Independent workflows offer one more advantage in the front-end: provided enough marks have been produced, volunteers are free to work on any workflow, in any order, or to step through all of them in sequence.

[1]: Marks are sent to the server as classifications and denormalized into “transcribe” subjects. During denormalization, some fields from the “mark” classification are copied to the “transcribe” subject being generated. The most important of these are the “mark” subject’s ID, the annotation data that specifies the mark’s location and/or dimensions, and the type of data the mark represents.
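
The denormalization step in note [1] could look roughly like the sketch below. The function name `denormalize_mark` and all field names (`parent_subject_id`, `region`, `data_type`, and so on) are assumptions made for illustration; only the set of copied information (parent subject ID, mark location and dimensions, and the type of data marked) comes from the note.

```python
import uuid

def denormalize_mark(classification, parent_subject):
    """Build a hypothetical "transcribe" subject from a "mark" classification."""
    annotation = classification["annotations"]
    return {
        "id": str(uuid.uuid4()),
        "workflow": "transcribe",
        # Reference back to the subject the mark was made on.
        "parent_subject_id": parent_subject["id"],
        # Same image as the parent; only the region of interest differs.
        "location": parent_subject["location"],
        "region": {key: annotation[key] for key in ("x", "y", "width", "height")},
        # The kind of data the mark represents, e.g. a date or a name.
        "data_type": annotation["data_type"],
    }

parent = {"id": "s1", "location": {"standard": "https://example.org/images/page-001.jpg"}}
mark = {
    "subject_id": "s1",
    "annotations": {"x": 120, "y": 340, "width": 400, "height": 60, "data_type": "date"},
}
transcribe_subject = denormalize_mark(mark, parent)
print(transcribe_subject["region"], transcribe_subject["data_type"])
```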