Data Model & Tools Config
A first-pass distillation of the Scribe data model as shared by the Ruby and JS code.
The project is the top-level element. It defines project properties, site pages, and workflows. There SHOULD be only one project. It contains the following fields:
- description: Text
- producer: String
- title: String
- workflows: Array of WORKFLOWs
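As a rough illustration, the project fields above might be held in a plain object like the one below. The values and the checkProject helper are hypothetical; only the four field names come from the list above.

```javascript
// Hypothetical project definition; only the field names come from the list above.
const exampleProject = {
  title: "Example Project",
  producer: "Example Library",
  description: "Transcribe digitized ledgers.",
  workflows: [], // WORKFLOW objects would go here
};

// Hypothetical helper: returns a list of problems, empty if the shape looks right.
function checkProject(project) {
  const problems = [];
  for (const field of ["title", "producer", "description"]) {
    if (typeof project[field] !== "string") problems.push(`${field} must be a string`);
  }
  if (!Array.isArray(project.workflows)) problems.push("workflows must be an array");
  return problems;
}

console.log(checkProject(exampleProject)); // prints []
```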
Because tools are namespaced by workflow key ('mark','transcribe'), there will be only two workflows possible without hacking. [Verify] Workflow properties include:
- key: String - Unique id for this workflow (e.g. 'mark', 'transcribe')
- label: String - Friendly name for workflow (e.g. "Mark Stuff!")
- first_task: TASK id
- tasks: Hash mapping task ids to TASKs [Or is it an array, as in project/example_project/workflows/mark.rb ?]
- enables_workflows: Hash mapping WORKFLOW ids to sub-hashes that include the following keys:
  - denormalized_fields: Array of fields to copy into generated subjects. By default, the id of the SUBJECT acted upon (and task type?) is also copied. [Verify]
  - [Anything else? What other config required for subject generation?]
The enables_workflows object gives, for each workflow key, a simple specification for transitioning submitted annotations from the current workflow to the workflow identified by that key.
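A minimal sketch of how such a specification might drive subject generation is shown here. The generateSubjects function and the workflow_key and parent_subject_id field names are assumptions made for illustration; the source only states that denormalized_fields are copied into generated subjects along with the id of the subject acted upon.

```javascript
// Hypothetical "mark" workflow that enables the "transcribe" workflow.
const markWorkflow = {
  key: "mark",
  label: "Mark Stuff!",
  first_task: "0",
  tasks: {},
  enables_workflows: {
    transcribe: { denormalized_fields: ["x", "y"] },
  },
};

// For each enabled workflow, build a subject stub that carries the parent
// subject id plus the configured fields copied from the annotation.
// workflow_key and parent_subject_id are assumed names, not Scribe's.
function generateSubjects(workflow, parentSubjectId, annotation) {
  return Object.entries(workflow.enables_workflows).map(([targetKey, spec]) => {
    const subject = { workflow_key: targetKey, parent_subject_id: parentSubjectId };
    for (const field of spec.denormalized_fields || []) {
      if (field in annotation) subject[field] = annotation[field];
    }
    return subject;
  });
}

const generated = generateSubjects(markWorkflow, "subject-1", { x: 120, y: 48, note: "ignored" });
console.log(generated);
```

Fields not named in denormalized_fields (note, in this example) are left behind, which keeps generated subjects small.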
Task properties include:
- key: String - Workflow-unique alphanumeric (e.g. '0', '1', 'mark_one')
- type: Enum "text", "integer", "date" - Seems to include these basic data types but also things like "drawing", "textBlock". It seems type could refer to the data type desired (e.g. string, date, int, float, bounding box, point, polygon), but in current practice it may be used to hint at the tool (e.g. pick_page_type, textBlock).
- directions: Text - Friendly prompt given to user (e.g. "Categorize the type of document")
- field_name: String - Another alphanumeric workflow-unique id like 'key' above? [Verify]
- tool: String - Identifies tool to show. Note that there are distinct tools for mark and transcribe workflows. [This is sometimes given as tools]
- options: Hash - Specify arbitrary tool options. See "Tools"
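Since it is unclear whether a workflow's tasks are stored as a hash or an array, a lookup helper could accept both. The sketch below is hypothetical: findTask is not part of Scribe, and matching array entries on key is an assumption.

```javascript
// Look up a task by id whether `tasks` is a hash keyed by task id or an
// array of task objects. Matching array entries on `key` is an assumption.
function findTask(workflow, taskId) {
  const tasks = workflow.tasks;
  if (Array.isArray(tasks)) {
    return tasks.find((task) => task.key === taskId) || null;
  }
  return (tasks && tasks[taskId]) || null;
}

// Hypothetical workflow using the hash form.
const sampleWorkflow = {
  key: "mark",
  first_task: "pick_page_type",
  tasks: {
    pick_page_type: {
      key: "0",
      tool: "pick_one",
      directions: "Categorize the type of document",
      options: {},
    },
  },
};

const firstTask = findTask(sampleWorkflow, sampleWorkflow.first_task);
console.log(firstTask.tool); // prints pick_one
```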
Classification fields include:
- location: String - [Not sure what this is]
- annotations: Object - [Not sure why this is an Object, what its composition is]
- started_at: Date - Submitted by JS when created
- finished_at: Date - Submitted by JS when created
- user_agent: String - Submitted by JS when created
- subject: SUBJECT
- workflow: WORKFLOW
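To make the shape concrete, here is a sketch of how the JS side might assemble these fields before submitting them. The buildClassification function, its argument names, and the use of ids for subject and workflow are all assumptions; location is omitted because its meaning is not documented.

```javascript
// Assemble a payload from the fields listed above. Referring to the subject
// and workflow by id/key, and the argument names, are assumptions.
function buildClassification({ subjectId, workflowKey, annotations, startedAt, userAgent }) {
  return {
    annotations,
    started_at: startedAt.toISOString(),
    finished_at: new Date().toISOString(), // stamped at submission time
    user_agent: userAgent,
    subject: subjectId,
    workflow: workflowKey,
  };
}

const payload = buildClassification({
  subjectId: "subject-1",
  workflowKey: "transcribe",
  annotations: { value: "12 March 1915" },
  startedAt: new Date("2015-02-27T12:00:00Z"),
  userAgent: "ExampleBrowser/1.0",
});
console.log(payload.started_at); // prints 2015-02-27T12:00:00.000Z
```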
A "primary" subject represents a single image. A "secondary" subject represents an annotation made on a primary subject. A "tertiary" subject annotates a secondary subject. Fields include:
- name: String
- location: Hash mapping identifiers to URLs (e.g. 'standard' => 'http://...')
- type: Enum "root", ... [What other values?]
- meta_data: Hash - Includes width, height, and other arbitrary known data imported from subject CSVs that might be useful when transcribing, like subject type or date
- thumbnail: String - Thumbnail URL
- file_path: String - Path/URL to original asset [Verify]
- classification_count: Int - Denormalized count of subjects that are annotations on this subject.
- retire_count: Int - Number of annotations required before this subject is retired. Inherited from Group.
- state: Enum "active", "done" - Subjects marked done are not served to any workflow.
- random_no: Float (0...1) - Generated on-save.
- order: Int - Order within Group? [Verify]
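The interaction between classification_count, retire_count, and state could be sketched as follows. Both functions are hypothetical, and using >= as the retirement threshold is an assumption; the source only says that retire_count is the number of annotations required before retirement and that done subjects are not served.

```javascript
// Decide a subject's state from its counts. The >= comparison is an
// assumption about when retirement happens.
function nextState(subject) {
  return subject.classification_count >= subject.retire_count ? "done" : "active";
}

// Only active subjects are served to a workflow.
function servable(subjects) {
  return subjects.filter((subject) => subject.state === "active");
}

const subjects = [
  { name: "page-1", state: "active", classification_count: 3, retire_count: 3 },
  { name: "page-2", state: "active", classification_count: 1, retire_count: 3 },
];
const updated = subjects.map((s) => ({ ...s, state: nextState(s) }));
console.log(servable(updated).map((s) => s.name)); // prints [ 'page-2' ]
```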
Groups organize subjects into related sets.
Note that although group membership and metadata may be helpful to the project maintainer and transcriber (if group metadata is displayed with member subjects), annotations are applied to subjects exclusively. It's not currently possible to annotate a group.
Although groups principally serve the purpose of collecting several intellectually distinct items under a single familiar topic or entity (e.g. grouping by ship in Old Weather), groups may be exploited in the future to enable multi-image document annotation. Many items - like theater playbills - span multiple pages, and it's sometimes necessary to see a page in the context of its neighbors to adequately transcribe it. Theater playbill cast lists sometimes span multiple pages, for example; page two of such a list may be confusing without being able to see the list heading on the preceding page. [Verify]
Group fields include:
- name: String - As given by groups CSV
- description: Text - As given by groups CSV
- cover_image_url: String - URL of group representative image
- external_url: String - URL of another representation of this object (e.g. Wikipedia), if available
- meta_data: Hash - Includes arbitrary known data imported from group CSVs that might be useful to display in the transcription interface
Tools are pluggable, configurable widgets that perform a single, simple task related to identifying an area of the subject ("marking"), adding data to a subject ("transcribing"), or moving the user from one tool to the next ("core").
[Note that this gets pretty speculative. This is an attempt to formalize the conventions I see arising in the code.]
Certain tools (e.g. 'pick_one') are "core tools", meaning they may appear in either the Transcribe or Mark workflows. Core tools are defined internally. [Maybe better placed in components/core-tools or something?] (If a tool isn't a core tool, it can be found in either components/mark or components/transcribe depending on the workflow in which it appears.)
Pick One is a simple tool that presents two or more optional tasks. This is currently defined as a hash mapping keys (e.g. "history_sheet", "attestation") to sub-hashes with a single next_task key that indicates the task. It's unclear what label is used for the option. [Proposed revision below]
- options: Array of hashes with the following properties:
  - label: Friendly label of option (e.g. "This looks like a Casualty Form...", "This looks like an attestation..")
  - task: Key of TASK to jump to if user clicks this option
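Under the proposed revision, resolving the next task becomes a simple array lookup. The sketch below is hypothetical: the nextTaskKey function is invented for illustration, and the pairing of labels with task keys is assumed from the examples given above.

```javascript
// Hypothetical Pick One tool options using the proposed array form.
const pickOneOptions = {
  options: [
    { label: "This looks like a Casualty Form...", task: "history_sheet" },
    { label: "This looks like an attestation..", task: "attestation" },
  ],
};

// Return the key of the task to jump to for the chosen option index,
// or null if the index is out of range.
function nextTaskKey(toolOptions, chosenIndex) {
  const option = toolOptions.options[chosenIndex];
  return option ? option.task : null;
}

console.log(nextTaskKey(pickOneOptions, 1)); // prints attestation
```

Keeping label and task together in each entry removes the ambiguity about which label belongs to which option.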
Like Pick One, but allows the user to return to the picker after completing a given option so that other tools can be used. Use this whenever several tools may be relevant to a document and are not mutually exclusive, so that more than one may be applied.
Marking tools include various methods for identifying specific points and areas of images. They're defined in components/mark.
All marking tools generate the following common metadata fields (in addition to tool-specific metadata noted below):
- x: Integer - Pixel coordinate within parent subject
- y: Integer - Pixel coordinate within parent subject
All marking tools accept the following options (in addition to tool specific options noted below):
- fill_color: String - CSS color. (Default "rgba(0,0,0,0.30)")
- stroke_color: String - CSS color. (Default "#fff")
- stroke_width: Integer - Pixel stroke width (Default 3)
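One way a marking tool might combine these shared defaults with per-task options is sketched below. The resolveMarkOptions function and the MARK_DEFAULTS name are hypothetical; the default values are the ones listed above.

```javascript
// Defaults taken from the list above.
const MARK_DEFAULTS = {
  fill_color: "rgba(0,0,0,0.30)",
  stroke_color: "#fff",
  stroke_width: 3,
};

// Task-level options override the shared defaults; unknown keys pass
// through so tool-specific options are preserved.
function resolveMarkOptions(toolOptions = {}) {
  return { ...MARK_DEFAULTS, ...toolOptions };
}

const resolved = resolveMarkOptions({ stroke_color: "#f00", radius: 40 });
console.log(resolved);
```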
Special Options:
- radius: Integer - Pixel radius (Default 40)
Document-wide rectangular selector suited to identifying rows of horizontal text that span the width of the document.
Extra metadata generated:
- yUpper: Integer
- yLower: Integer
Options:
- min_height: Integer (or float percentage of subject)
- max_height: Integer (or float percentage of subject)
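Because min_height and max_height may be given either in pixels or as a share of the subject, a tool would need to normalize them before use. The sketch below makes several assumptions that the source does not confirm: values strictly between 0 and 1 are treated as fractions of subject height, anything else as pixels, yUpper is the smaller coordinate, and an out-of-range mark is corrected by moving yLower.

```javascript
// Interpret a height option. Values strictly between 0 and 1 are treated as
// a fraction of subject height; other numbers are treated as pixels.
function resolveHeight(value, subjectHeight, fallback) {
  if (value === undefined || value === null) return fallback;
  return value > 0 && value < 1 ? Math.round(value * subjectHeight) : value;
}

// Clamp a row mark so its height respects min_height and max_height.
// Keeping yUpper fixed and moving yLower is an arbitrary choice.
function clampRow(mark, options, subjectHeight) {
  const min = resolveHeight(options.min_height, subjectHeight, 0);
  const max = resolveHeight(options.max_height, subjectHeight, subjectHeight);
  const height = Math.min(Math.max(mark.yLower - mark.yUpper, min), max);
  return { ...mark, yLower: mark.yUpper + height };
}

const row = clampRow(
  { x: 0, y: 100, yUpper: 100, yLower: 105 },
  { min_height: 20, max_height: 0.1 },
  1000
);
console.log(row.yLower); // prints 120
```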
[I'm not sure what this is.]
Extra metadata generated:
- yUpper: Integer
- yLower: Integer
Options:
- min_height: Integer (or float percentage of subject)
- max_height: Integer (or float percentage of subject)
Transcribe tools are widgets suitable for gathering specific types of data with configurable constraints.
At the time of writing, this tool presents multiple forms - one for each transcribe task. Each form contains a single field styled based on the task type property.
Probably the simplest tool, the text tool presents a single text input. The tool can be augmented with options below.
Options:
- limit: Integer - Character limit.
- suggest: Either an array of strings (e.g. ["cat","dog","other"]) or a URL returning auto-complete suggestions for current entry (e.g. "http://example.com/terms/suggest?term=%%TERM%%")
- multiline: Boolean - Indicates whether or not value is expected to have line breaks. Note that sufficiently large values of limit imply use of a textarea regardless.
- match: String - Regex defining valid strings (e.g. "^[a-z]+$")
For example,

    transcribe_workflow = {
      "key": "transcribe",
      "label": "Transcribe Content",
      "first_task": "journal_entry",
      "tasks": {
        "journal_entry": {
          "key": "0",
          "tool": "text-tool",
          "tool_options": {
            "limit": null,
            "multiline": true
          },
          "instruction": "Drag a mark around a block of text."
        }
      },
      "enables_workflows": {}
      // "project": ...
    }
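The text tool options could be enforced with a small validator along these lines. This is a sketch, not Scribe's implementation: validateText is a hypothetical name, and rejecting line breaks when multiline is off is an assumption (the source only says multiline indicates whether line breaks are expected). The suggest option is not covered.

```javascript
// Check a transcribed value against text tool options (limit, multiline, match).
// Returns a list of problems; an empty list means the value passes.
function validateText(value, options = {}) {
  const problems = [];
  if (options.limit != null && value.length > options.limit) {
    problems.push(`longer than ${options.limit} characters`);
  }
  if (!options.multiline && /[\r\n]/.test(value)) {
    problems.push("line breaks not allowed");
  }
  if (options.match && !new RegExp(options.match).test(value)) {
    problems.push(`does not match ${options.match}`);
  }
  return problems;
}

console.log(validateText("abc", { limit: 10, match: "^[a-z]+$" })); // prints []
console.log(validateText("abc1", { limit: 3, match: "^[a-z]+$" }).length); // prints 2
```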
An extension of the Text Tool (perhaps using the match option to restrict characters, e.g. "^-?\d+([,.]\d+)?$"). Options:
- minimum: Integer/Float
- maximum: Integer/Float
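Combining the suggested pattern with the minimum and maximum options might look like the sketch below. The parseNumber function is hypothetical, and treating a comma as a decimal separator is inferred from the pattern rather than confirmed by the source.

```javascript
// Pattern quoted in the text above: optional sign, digits, optional decimal part.
const NUMBER_PATTERN = /^-?\d+([,.]\d+)?$/;

// Parse and range-check a numeric entry. Returns the number, or null if the
// entry is malformed or out of range.
function parseNumber(value, options = {}) {
  if (!NUMBER_PATTERN.test(value)) return null;
  const number = Number(value.replace(",", "."));
  if (options.minimum != null && number < options.minimum) return null;
  if (options.maximum != null && number > options.maximum) return null;
  return number;
}

console.log(parseNumber("3,5", { minimum: 0, maximum: 10 })); // prints 3.5
console.log(parseNumber("42", { maximum: 10 })); // prints null
```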
Useful when expected values are few. Presents valid options as a list, optionally with a manual entry. Options include:
- options: Array of string values
- allow_other: Boolean - If true, final option will be "Other..." and will allow manual entry
- multiselect: Boolean - If true, multiple values can be selected.
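How these three options interact can be sketched as a validator. The validateSelection function is hypothetical, and it assumes the submitted value is always an array of strings, which the source does not specify.

```javascript
// Validate a selection against Select Tool options. `selected` is assumed
// to be an array of strings, even when only one value is allowed.
function validateSelection(selected, toolOptions) {
  const { options = [], allow_other = false, multiselect = false } = toolOptions;
  if (selected.length === 0) return false;
  if (!multiselect && selected.length > 1) return false;
  // A value outside the list is accepted only when manual entry is allowed.
  return selected.every((value) => options.includes(value) || allow_other);
}

const species = { options: ["cat", "dog"], allow_other: false, multiselect: true };
console.log(validateSelection(["cat", "dog"], species)); // prints true
console.log(validateSelection(["bird"], species)); // prints false
```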
Really just a shorthand for the Select Tool with options = ["Yes", "No"]
A date (and date range) picker that supports approximate dates and pre-1970 dates. Options:
- minimum: String - ISO 8601 date string establishing the earliest allowed date (e.g. "-30000101" for roughly 3000 BCE; strictly, ISO 8601 year -3000 is 3001 BCE because year 0000 is 1 BCE)
- maximum: String - ISO 8601 date string establishing the latest allowed date (e.g. "20150227")
- range: Boolean - If true, a date range may be selected
- allow_approximate: Boolean - If true, user may check a box to indicate date is approximate.
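Range checks against minimum and maximum can avoid the JS Date object, which is awkward for BCE dates, by comparing integer keys. This is a sketch under stated assumptions: dates use the basic "YYYYMMDD" form shown in the examples with an optional sign, the bounds are well formed, and month and day values are not checked for calendar validity. The dateKey and dateAllowed names are hypothetical.

```javascript
// Convert a basic-format ISO 8601 date ("YYYYMMDD", optionally signed) into
// a sortable integer. Returns null if the string does not match.
function dateKey(iso) {
  const match = /^([+-]?)(\d{4,})(\d{2})(\d{2})$/.exec(iso);
  if (!match) return null;
  const year = Number(match[2]) * (match[1] === "-" ? -1 : 1);
  // Month and day are added unsigned so later dates sort higher in any year.
  return year * 10000 + Number(match[3]) * 100 + Number(match[4]);
}

// Check a date string against the tool's minimum and maximum options.
function dateAllowed(iso, options = {}) {
  const key = dateKey(iso);
  if (key === null) return false;
  if (options.minimum && key < dateKey(options.minimum)) return false;
  if (options.maximum && key > dateKey(options.maximum)) return false;
  return true;
}

const dateOptions = { minimum: "-30000101", maximum: "20150227" };
console.log(dateAllowed("19150312", dateOptions)); // prints true
console.log(dateAllowed("20150228", dateOptions)); // prints false
```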
A composite tool is composed of two or more basic tools. It presents those tools side by side for cases where the mark being considered contains multiple distinct pieces of data that are confusing to consider in isolation. Options include:
- tools: Array of tool configurations from those listed above.