Data Model & Tools Config
A first-pass distillation of the Scribe data model as shared by Ruby and JS.
The project is the top-level element. It defines project properties, site pages, and workflows. There SHOULD be only one project. It contains the following fields:
- `description`: Text
- `producer`: String
- `title`: String
- `workflows`: Array of WORKFLOWs
Supported workflows are Mark, Transcribe, and Verify. Their common properties are:
- `key`: String - (Established as key of the task in the tasks hash in the workflow json.) Unique alphanumeric key for this workflow. Must be one of 'mark', 'transcribe', 'verify' for now.
- `label`: String - Friendly name for workflow (e.g. "Mark Stuff!")
- `first_task`: TASK key of first task to invoke. If the subject_type of the subject loaded matches a task key in the current workflow, that task will be loaded first instead.
- `tasks`: Hash mapping task keys to TASKs
- `generates_subjects`: Bool - Indicates that some submitted classifications may generate secondary/tertiary subjects.
- `generates_subjects_for`: String - Name of next workflow to associate with generated subjects, e.g. 'transcribe', 'verify'
- `generates_subjects_after`: Int - Number of classifications a generated subject must represent before it's activated for the next workflow. In the Transcribe and Verify workflows, upon activating a generated subject, the parent subject is retired (retire_limit ignored).
- `generates_subjects_max`: Int - Max number of classifications submitted into a generated subject before we mark its status 'contentious'
- `retire_limit`: Int - Threshold for retiring the subject operated on in a given workflow. Really only relevant to the Mark workflow, where retire_limit is the number of times we require someone to say "There is nothing left to mark on this doc". In the Transcribe and Verify workflows, retire_limit is ignored in favor of generates_subjects_after.
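The `first_task` rule above can be sketched in a few lines of JavaScript. This is only an illustration, not the project's actual code: `pickFirstTask` is a hypothetical helper, and it assumes the "subject_type of the subject" is stored in the subject's `type` field.

```javascript
// Hypothetical helper: decide which task key to load first for a subject.
// Assumption: the subject's type lives in `subject.type`.
function pickFirstTask(workflow, subject) {
  const tasks = workflow.tasks || {};
  const subjectType = subject && subject.type;
  // A task keyed by the subject's type takes precedence over first_task.
  if (subjectType && Object.prototype.hasOwnProperty.call(tasks, subjectType)) {
    return subjectType;
  }
  return workflow.first_task;
}

// Made-up workflow used only for illustration.
const exampleWorkflow = {
  key: "transcribe",
  first_task: "transcribe_name",
  tasks: { transcribe_name: {}, em_record_date: {} },
};

console.log(pickFirstTask(exampleWorkflow, { type: "em_record_date" })); // em_record_date
console.log(pickFirstTask(exampleWorkflow, { type: "root" })); // transcribe_name
```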
Each TASK in a workflow's tasks hash has the following fields:
- `key`: String - (Established as key of the task in the tasks hash in the workflow json.) Workflow-unique alphanumeric (e.g. '0', '1', 'mark_one')
- `tool`: Enum "pickOneMarkOne", "pointTool", "rectangleTool", "pickOne", "textTool", "numberTool", "dateTool", "compositeTool", "verifyTool"
- `instruction`: Text - Friendly prompt given to user, which contextualizes task (e.g. "How many penguins are there?", "What color is this penguin?", "Choose the type of document")
- `help`: Filename of html file in project config to load into a little slide-out/modal. [to be better defined]
- `generated_subject_type`: String - Unique string identifying the type of subject generated. This must be unique across all tasks in the workflow. The value must match a task key in the destination workflow.
- `tool_config`: Hash - Specify arbitrary tool options. See "Tools"
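The constraint that `generated_subject_type` must match a task key in the destination workflow lends itself to a small config check. The sketch below is an assumption-laden illustration: `findUnmatchedSubjectTypes` is a hypothetical name, and it takes the destination workflow to be the one named by `generates_subjects_for`.

```javascript
// Hypothetical config check: report tasks whose generated_subject_type has
// no matching task key in the workflow named by generates_subjects_for.
function findUnmatchedSubjectTypes(workflows) {
  const byKey = {};
  for (const wf of workflows) byKey[wf.key] = wf;

  const problems = [];
  for (const wf of workflows) {
    const dest = byKey[wf.generates_subjects_for];
    const destTasks = dest ? dest.tasks || {} : {};
    for (const [taskKey, task] of Object.entries(wf.tasks || {})) {
      const type = task.generated_subject_type;
      if (!type) continue; // this task does not generate subjects
      if (!Object.prototype.hasOwnProperty.call(destTasks, type)) {
        problems.push(`${wf.key}.${taskKey}: no task "${type}" in "${wf.generates_subjects_for}"`);
      }
    }
  }
  return problems;
}

// Made-up workflows used only for illustration.
const sampleWorkflows = [
  {
    key: "mark",
    generates_subjects_for: "transcribe",
    tasks: {
      identify: { generated_subject_type: "em_name" },
      other: { generated_subject_type: "em_missing" },
    },
  },
  { key: "transcribe", tasks: { em_name: {} } },
];

console.log(findUnmatchedSubjectTypes(sampleWorkflows));
```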
A classification is a single statement a user makes about a subject in response to a single task. Drawing tasks create classifications that represent a single polygon. (Note many drawing tools are configured with repeat=true to generate multiple classifications.) PickOne tasks produce classifications that store the option chosen (e.g. 'yes', 'no'). Transcription tasks create classifications with the text entered.
- `started_at`: Date - Submitted by JS when created
- `finished_at`: Date - Submitted by JS when created
- `user_agent`: String - Submitted by JS when created
- `subject`: SUBJECT
- `workflow`: WORKFLOW
- `annotation`: Hash - Hash of data collected by task.
A note on the `data` field: The `data` should always be a Hash, even if the tool that generates it produces a single, scalar value. In those cases, the `data` should look like `{value: '...'}`.
Marking tool (e.g. pickOneMarkOne) example `data`:
    {x: .., y: .., width: .., height: ..}
PickOne tool example `data`:
    {value: 'yes'}
Transcription tool example `data`:
    {value: 'Bond St'}
Or, if a compositeTool is used, the keys of each tool in the `tools` configuration option should be used as the keys for the collected values:
    {first_name: 'Charlie', last_name: 'Brown'}
VerificationTool `data` should store the chosen annotation value, so it will look identical to transcription `data`:
    {value: 'Bond Street'}
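One way to enforce the "always a Hash" rule is to normalize tool output before it is stored. The following is a minimal sketch; `toDataHash` is a hypothetical helper, not part of the Scribe code base.

```javascript
// Hypothetical helper: wrap scalar tool output as {value: ...} and pass
// hashes (e.g. marking or composite tool output) through unchanged.
function toDataHash(raw) {
  const isPlainHash = raw !== null && typeof raw === "object" && !Array.isArray(raw);
  return isPlainHash ? raw : { value: raw };
}

console.log(toDataHash("Bond St")); // { value: 'Bond St' }
console.log(toDataHash({ x: 10, y: 20, width: 300, height: 40 })); // unchanged
```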
A "primary" subject represents a single image. A "secondary" subject represents an annotation made on a primary subject. A "tertiary" subject annotates a secondary subject. Fields include:
- `name`: String - For 'root' subjects. Optional name for subject provided by CSV import.
- `location`: Hash mapping identifiers to URLs:
  - `standard`: URL of standard image deriv
  - `thumbnail`: URL of thumbnail
  - `spec`: Hash specifying a region of the image. (Appears in some secondary subjects.) May contain:
    - `x`: Integer - Pixel coordinate in 'standard' deriv
    - `y`: Integer - Pixel coordinate in 'standard' deriv
    - `width`: Integer - Pixel width of region (if applicable)
    - `height`: Integer - Pixel height of region (if applicable)
- `type`: String - Default for primary subjects is "root". Secondary subject types are determined by the subject_type configured in the task.
- `meta_data`: Hash - Includes width, height and other arbitrary data known about root subjects, imported from subject CSVs, that might be useful when transcribing (e.g. subject type, date).
- `annotation`: Hash - The classification data. For generated secondary/tertiary subjects, this hash should be copied from the `data` property of the classification(s) that generated it.
- `classification_count`: Int - Denormalized count of classifications of this subject.
- `retire_count`: Int - Default 0. Number of times a user has said "There's nothing left to mark". Only relevant in Mark.
- `status`: Enum "active", "done", "contentious" - Only "active" subjects are served; subjects marked "done" are not served to any workflow.
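The interplay of `retire_count`, `retire_limit`, and `status` in the Mark workflow might look roughly like the sketch below. It rests on an assumption the text does not state outright: that retiring a subject means setting its `status` to "done". The function name `recordNothingLeftToMark` is hypothetical.

```javascript
// Hypothetical sketch: register one "There's nothing left to mark" response.
// Assumption: a retired subject gets status "done".
function recordNothingLeftToMark(subject, workflow) {
  // retire_count may be null on freshly imported subjects; treat it as 0.
  const retireCount = (subject.retire_count || 0) + 1;
  const retired = workflow.key === "mark" && retireCount >= workflow.retire_limit;
  return {
    ...subject,
    retire_count: retireCount,
    status: retired ? "done" : subject.status,
  };
}

// Made-up values used only for illustration.
const markWorkflow = { key: "mark", retire_limit: 2 };
let page = { name: "Page 1", retire_count: null, status: "active" };
page = recordNothingLeftToMark(page, markWorkflow);
console.log(page.status); // active
page = recordNothingLeftToMark(page, markWorkflow);
console.log(page.status); // done
```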
Example root subject:
    {
      "name": "Page 1",
      "state" : "active",
      "type" : "root",
      "file_path" : "http://demo.zooniverse.org/whaling-logs-test-data/KWM_51_JPGs/logbookofphillip00unse_0241.jpg",
      "location" : {
        "standard" : "http://demo.zooniverse.org/whaling-logs-test-data/KWM_51_JPGs/logbookofphillip00unse_0241.jpg",
        "thumbnail" : "http://demo.zooniverse.org/whaling-logs-test-data/KWM_51_JPGs/logbookofphillip00unse_0241.jpg"
      },
      "retire_count" : null,
      "meta_data" : {
        "order" : "241",
        "width" : "2048",
        "height" : "3380",
        "capture_location" : "n/a",
        "date" : "n/a",
        "set_key" : "n/a"
      },
      "workflow_id" : ObjectId("55477dbb7061751603000000"),
      "subject_set_id" : ObjectId("55477dbf70617516036f0200"),
      "random_no" : 0.6645554681836298,
      "updated_at" : ISODate("2015-05-04T14:10:08.327Z"),
      "created_at" : ISODate("2015-05-04T14:10:08.325Z"),
      "classification_count" : 0
    }
A Subject always belongs to a single SubjectSet. Multi-page documents are represented by multiple subjects associated by a single subject-set. Fields include:
- `name`: String
- `subjects`: Many SUBJECTs
Groups organize subject-sets into related collections.
Note that although group membership and metadata may be helpful to the project maintainer and transcriber (if group metadata is displayed with member subjects) annotations are applied to subjects exclusively. It's not currently possible to annotate a group.
Group fields include:
- `name`: String - As given by groups CSV
- `description`: Text - As given by groups CSV
- `cover_image_url`: String - URL of group representative image
- `external_url`: String - URL of another representation of this object (e.g. wikipedia) if avail
- `meta_data`: Hash - Includes arbitrary known data imported from group CSVs that might be useful to display in transcription interface
- `selection_method`: Enum "linear", "random" - Indicates method for selecting subject-sets from this group for marking, whether linearly or randomly.
- `subject_sets`: Many SUBJECTSETs
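As an illustration of how `selection_method` could drive which subject-set is served next, here is a minimal sketch. The helper name `selectSubjectSet`, the `cursor` argument (a count of sets already served), and the injectable `random` function are all assumptions made for this example.

```javascript
// Hypothetical sketch: pick the next subject-set from a group.
// `cursor` is an assumed counter of sets already served (used for "linear").
// `random` is injectable so the behaviour can be tested deterministically.
function selectSubjectSet(group, cursor = 0, random = Math.random) {
  const sets = group.subject_sets || [];
  if (sets.length === 0) return undefined;
  if (group.selection_method === "random") {
    return sets[Math.floor(random() * sets.length)];
  }
  // "linear": walk the sets in order, wrapping around at the end.
  return sets[cursor % sets.length];
}

// Made-up group used only for illustration.
const logbooks = {
  name: "Logbooks",
  selection_method: "linear",
  subject_sets: [{ name: "Book A" }, { name: "Book B" }, { name: "Book C" }],
};
console.log(selectSubjectSet(logbooks, 4).name); // Book B
```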
Tools are pluggable, configurable widgets that perform a single, simple task related to identifying an area of the subject ("marking"), adding data to a subject ("transcribing"), or moving the user from one tool to the next ("core").
Tools are specified in a task config via `tool`, and tool-specific configuration is specified via `tool_config`. For example:
    ...
    tasks: {
      "determine_has_records": {
        "tool": "pickOne",
        "tool_config": {
          "options": {
            "yes": {
              "label": "Yes",
              "next_task": "identify_records"
            },
            "no": {
              "label": "No"
            }
          }
        }
      }
    }
Certain tools (e.g. 'pickOne') are "core tools", meaning they can appear in any workflow. Core tools are defined in components/core-tools. (If a tool isn't a core tool, it can be found in either components/mark or components/transcribe depending on the workflow in which it appears.)
Pick One is a simple tool that presents two or more options, each of which may lead to a different next task. Supported configuration options:
- `options`: Hash mapping keys to hashes with following properties:
  - `label`: String - Friendly label of option (e.g. "This looks like a Casualty Form...", "This looks like an attestation..")
  - `next_task`: String - Key of TASK to jump to if user clicks this option.
The classification generated by a pickOne task contains the following data fields:
- `value`: String - The key of the option chosen.
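To show how the stored `value` ties back to `next_task`, the sketch below looks up the chosen option in a pickOne task's `tool_config`. `nextTaskKey` is a hypothetical helper; returning `null` when an option has no `next_task` is an assumption, since the text does not say what happens in that case.

```javascript
// Hypothetical helper: find the task to jump to after a pickOne choice.
// Returns null when the chosen option defines no next_task (assumed behaviour).
function nextTaskKey(task, data) {
  const options = (task.tool_config && task.tool_config.options) || {};
  const chosen = options[data.value];
  return chosen && chosen.next_task ? chosen.next_task : null;
}

const determineHasRecords = {
  tool: "pickOne",
  tool_config: {
    options: {
      yes: { label: "Yes", next_task: "identify_records" },
      no: { label: "No" },
    },
  },
};

console.log(nextTaskKey(determineHasRecords, { value: "yes" })); // identify_records
console.log(nextTaskKey(determineHasRecords, { value: "no" })); // null
```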
Pick Many is similar to Pick One, but allows the user to select multiple options before continuing, all of which are stored in a single generated classification. Supported configuration options:
- `options`: Hash mapping keys to hashes with following properties:
  - `label`: String - Friendly label of option (e.g. "This looks like a Casualty Form...", "This looks like an attestation..")
Marking tools include various methods for identifying specific points and areas of images. They're defined in components/mark/tools.
All marking tools accept the following config params (in addition to tool specific params noted below):
- `fill_color`: String - CSS color. (Default "rgba(0,0,0,0.30)")
- `stroke_color`: String - CSS color. (Default "#fff")
- `stroke_width`: Integer - Pixel stroke width (Default 3)
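The documented defaults can be applied by merging them under whatever the project config supplies. This is a minimal sketch; `MARK_DEFAULTS` and `withMarkDefaults` are hypothetical names, and only the three default values come from the list above.

```javascript
// Defaults as documented for all marking tools.
const MARK_DEFAULTS = {
  fill_color: "rgba(0,0,0,0.30)",
  stroke_color: "#fff",
  stroke_width: 3,
};

// Hypothetical helper: values given in the config override the defaults.
function withMarkDefaults(toolConfig = {}) {
  return { ...MARK_DEFAULTS, ...toolConfig };
}

console.log(withMarkDefaults({ stroke_width: 5 }));
```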
All marking tools generate the following classification data (in addition to tool-specific data noted below):
- `x`: Integer - Pixel coordinate within parent subject
- `y`: Integer - Pixel coordinate within parent subject
PickOneMarkOne is the sole marking tool. It produces a menu of "marking types" in the right column, which are associated with user-supplied labels.
Tool-specific config options include:
- `options`: Array of hashes defining the kind of marking types that can be made.

Each hash passed to `options` should define a marking type using the following properties:
- `type`: The marking type. Must be one of "pointTool", "rectangleTool", "textRowTool"
- `label`: The label to display, which the user clicks on to activate the marking type.
- `color`: The color of the displayed mark (?)
The supported marking types and their optional (proposed) additional config params are described below:
`pointTool`: A simple point on the document. Optional config:
- `radius`: Integer - Pixel radius (Default 40)
`rectangleTool`: Rectangular selector for identifying arbitrary rectangular regions of a document.
Tool-specific config options include:
- `min_height`: Integer (or float percentage of subject)
- `max_height`: Integer (or float percentage of subject)

Tool-specific classification data generated by rectangleTool:
- `width`: Integer - Width of region
- `height`: Integer - Height of region
`textRowTool`: Document-wide rectangular selector suited to identifying rows of horizontal text that span the width of the document.
Tool-specific config options include:
- `min_height`: Integer (or float percentage of subject)
- `max_height`: Integer (or float percentage of subject)

Tool-specific classification data generated by textRowTool:
- `yUpper`: Integer
- `yLower`: Integer
Example pickOneMarkOne config:
    "identify_records": {
      "tool": "pickOneMarkOne",
      "instruction": "Pick a field and mark it with the corresponding marking tool.",
      "tool_config": {
        "options": [
          { "type": "rectangleTool",
            "label": "Blocky region of the doc",
            "color": "green",
            "max_height": 0.6
          },
          { "type": "rectangleTool",
            "label": "Row of text",
            "color": "blue"
          }
        ]
      }
    }
Transcribe tools are widgets suitable for gathering typed data with configurable constraints.
Probably the simplest transcription tool, the text tool presents a single text input. The tool can be augmented with options below.
Options:
- `limit`: Integer - Character limit.
- `suggest`: Indicates the input should autocomplete. `suggest` supports the following possible values:
  - An array of literal strings (e.g. ["cat","dog","other"])
  - A URL returning auto-complete suggestions for current entry (e.g. "http://example.com/terms/suggest?term=%%TERM%%")
  - The phrase "common", which indicates the most commonly typed values for the current input will be suggested.
- `multiline`: Boolean - Indicates whether or not value is expected to have line-breaks. Note that sufficiently large values of `limit` imply use of a textarea regardless.
- `match`: String - Regex defining valid strings (e.g. "^[a-z]+$")
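The `limit` and `match` options could be enforced with a check along the following lines. This is a sketch under assumptions: `validateText` is a hypothetical helper, and the error messages are invented for illustration.

```javascript
// Hypothetical validator for text tool input. Returns a list of problems;
// an empty list means the value satisfies the configured constraints.
function validateText(value, config = {}) {
  const errors = [];
  if (Number.isInteger(config.limit) && value.length > config.limit) {
    errors.push(`longer than ${config.limit} characters`);
  }
  // `match` is a regex source string, as in the option description above.
  if (config.match && !new RegExp(config.match).test(value)) {
    errors.push(`does not match ${config.match}`);
  }
  return errors;
}

console.log(validateText("smith", { limit: 100, match: "^[a-z]+$" })); // []
console.log(validateText("Smith", { limit: 100, match: "^[a-z]+$" }).length); // 1
```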
Configuration example:
    ...
    tasks: {
      "transcribe_mortgager_name": {
        "tool": "textTool",
        "tool_config": {
          "limit": 100
        }
      }
    }
    ...
The number tool is an extension of the Text Tool (perhaps using the `match` option to restrict characters, e.g. "^-?\d+([,.]\d+)?$"). Supported config options:
- `minimum`: Integer/Float
- `maximum`: Integer/Float
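A number entry could be checked against the suggested pattern and the `minimum`/`maximum` options roughly as follows. `parseNumberEntry` is a hypothetical helper, and treating a comma as a decimal separator is an assumption read off the pattern.

```javascript
// Pattern quoted in the tool description above.
const NUMBER_PATTERN = /^-?\d+([,.]\d+)?$/;

// Hypothetical helper: parse typed text and apply minimum/maximum.
// Assumption: "," is accepted as a decimal separator, like ".".
function parseNumberEntry(text, config = {}) {
  if (!NUMBER_PATTERN.test(text)) return { ok: false, reason: "not a number" };
  const value = Number(text.replace(",", "."));
  if (config.minimum !== undefined && value < config.minimum) {
    return { ok: false, reason: "below minimum" };
  }
  if (config.maximum !== undefined && value > config.maximum) {
    return { ok: false, reason: "above maximum" };
  }
  return { ok: true, value };
}

console.log(parseNumberEntry("1,5", { minimum: 0, maximum: 10 })); // { ok: true, value: 1.5 }
console.log(parseNumberEntry("42", { maximum: 10 })); // { ok: false, reason: 'above maximum' }
```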
A date (and date range) picker that supports approximate dates and pre-1970 dates. Supported config options:
- `minimum`: String - ISO 8601 date string establishing oldest allowed date (e.g. "-30000101" for 3000 BCE)
- `maximum`: String - ISO 8601 date string establishing maximum allowed date (e.g. "20150227")
- `range`: Boolean - If true, a date range may be selected
- `allow_approximate`: Boolean - If true, user may check a box to indicate date is approximate.
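Because JavaScript's `Date` handling of very old years is awkward, one possible way to compare dates in the compact form used above ("20150227", "-30000101") is to turn each into a sortable integer. The helpers `dateKey` and `isDateAllowed` are hypothetical, and the sketch only understands the signed `YYYYMMDD` form shown in the examples.

```javascript
// Hypothetical helper: convert a signed YYYYMMDD string into an integer that
// sorts chronologically (signed year * 10000 + month * 100 + day).
function dateKey(iso) {
  const m = /^([+-]?)(\d{4,})(\d{2})(\d{2})$/.exec(iso);
  if (!m) throw new Error(`Unsupported date string: ${iso}`);
  const year = Number(m[2]) * (m[1] === "-" ? -1 : 1);
  return year * 10000 + Number(m[3]) * 100 + Number(m[4]);
}

// Hypothetical check of a date against the tool's minimum/maximum options.
function isDateAllowed(iso, config = {}) {
  const key = dateKey(iso);
  if (config.minimum && key < dateKey(config.minimum)) return false;
  if (config.maximum && key > dateKey(config.maximum)) return false;
  return true;
}

const dateBounds = { minimum: "-30000101", maximum: "20150227" };
console.log(isDateAllowed("18500615", dateBounds)); // true
console.log(isDateAllowed("20160101", dateBounds)); // false
```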
A composite tool is a tool composed of two or more basic tools. A composite tool presents multiple tools side by side for cases where the mark being considered contains multiple distinct data that are confusing to consider in isolation. Config options include:
- `tools`: Hash mapping keys to hashes defining what tools to compose. Each hash should include:
  - `tool`: Key of tool
  - `tool_config`: Hash of tool specific config options (refer to tool specific config options above)
Note that composite tool classifications are special in that they are a hash of the classifications generated by each of their constituent tools. For example, if a composite tool is configured like this:
    "em_transcribe_valuation": {
      "tool": "compositeTool",
      "tool_config": {
        "tools": {
          "em_valuation_date": {
            "tool": "dateTool",
            "tool_config": {},
            "label": "Record Date"
          },
          "em_valuation_amount": {
            "tool": "textTool",
            "tool_config": {},
            "label": "Amount"
          }
        }
      },
      "instruction": "Enter any dated property valuations that were recorded"
    }
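Given the configuration above, the resulting classification data would presumably be keyed by `em_valuation_date` and `em_valuation_amount`. The sketch below illustrates that shape; `collectCompositeData` is a hypothetical helper and the entered values are made up.

```javascript
// Hypothetical helper: build composite classification data whose keys are the
// keys of the configured sub-tools.
function collectCompositeData(task, valuesByTool) {
  const tools = (task.tool_config && task.tool_config.tools) || {};
  const data = {};
  for (const key of Object.keys(tools)) {
    data[key] = valuesByTool[key];
  }
  return data;
}

const emTranscribeValuation = {
  tool: "compositeTool",
  tool_config: {
    tools: {
      em_valuation_date: { tool: "dateTool", tool_config: {}, label: "Record Date" },
      em_valuation_amount: { tool: "textTool", tool_config: {}, label: "Amount" },
    },
  },
};

// Made-up user input, for illustration only.
const entered = { em_valuation_date: "18430501", em_valuation_amount: "1200", unrelated: "ignored" };
console.log(collectCompositeData(emTranscribeValuation, entered));
// { em_valuation_date: '18430501', em_valuation_amount: '1200' }
```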
Props
- `workflow`

Members
- `subject_set_viewer`: SubjectSetViewer
- `subject_sets`: Array of SubjectSets

Props
- `subject_set`: SubjectSet

Members
- `subject_viewer`: SubjectViewer

Props
- `workflow`

Members
- `subject_viewer`: SubjectViewer
- `subjects`: Array of Subjects

Props
- `subject`: A primary/secondary subject
- `tool`: A (transcription) tool to render overlaid on the viewer
- `annotation`: The classification annotation
Members