Add annotation lists (WIP, do not merge) #62

Open
wants to merge 9 commits into
base: main
4 changes: 4 additions & 0 deletions Gemfile
@@ -3,9 +3,13 @@
source 'https://rubygems.org'
gemspec

gem 'addressable'

# dev/test utilities
gem 'bundle-audit', require: false
gem 'byebug', require: false
gem 'diane', require: false
gem 'nokogiri', require: false
gem 'rubocop', require: false
gem 'simplecov', '0.17.1', require: false
gem 'yard', require: false
29 changes: 26 additions & 3 deletions README.md
@@ -105,7 +105,9 @@ collections:
metadata:
source: 'objects.csv' # path to the metadata file, must be within '_data'
images:
source 'source_images/objects' # path to the directory of source images, must be within '_data'
source: 'source_images/objects' # path to the directory of source images, must be within '_data'
annotations:
source: 'annotations/<collection>' # path to the directory of annotations for a collection within `_data`

# wax search index settings
lunr_index:
@@ -122,8 +124,9 @@ lunr_index:
```

The above example includes a single collection `objects` that comprises:
1. a CSV `metadata:source` file (`objects.csv`), and
2. a `images:source` directory of image and pdf files.
1. a CSV `metadata:source` file (`objects.csv`),
2. an `images:source` directory of image and pdf files, and
3. an `annotations:source` directory of annotation source files.

For more information on configuring Jekyll collections for __wax_tasks__, check out the [minicomp/wax wiki](https://minicomp.github.io/wiki/#/wax/) and <https://jekyllrb.com/docs/collections/>.

@@ -162,6 +165,26 @@ This task does *not* touch your source metadata or source image files! Instead,

`$ bundle exec rake wax:clobber collection-name`

### wax:import:hocr

Reads a given hOCR file and writes a simplified YAML file into ```_data/annotations/<collection>```. Takes four arguments, in order: the path to the hOCR file, the collection name, the canvas name, and the granularity. The ```canvas``` name is the basename of the corresponding image. Granularity is one of ```word```, ```line```, or ```paragraph```. (This import functionality is based on Ocracoke.) Imported files are named ```<collection>_<canvas>_ocr_<granularity>.yaml```.
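
For orientation, the shape of the simplified YAML can be sketched from the keys that `to_yaml` builds in `lib/tasks/import/hocr.rb` in this pull request. The collection, canvas, and resource values below are made up for illustration; only the key names and URI patterns come from the diff.

```ruby
require 'yaml'

# Hypothetical values for illustration only; the key names and URI patterns
# mirror the hash built by HocrOpenAnnotationCreator#to_yaml in this PR.
collection  = 'documents'
canvas      = 'doc9031_1'
granularity = 'word'

uri_root = "{{ '/' | absolute_url }}img/derivatives/iiif"
label    = "#{collection}_#{canvas}_ocr_#{granularity}"

simplified_list = {
  'uri' => "#{uri_root}/annotation/#{label}.json",
  'collection' => collection,
  'canvas' => canvas,
  'label' => label,
  'target' => "#{uri_root}/canvas/#{collection}_#{canvas}.json",
  'resources' => [
    # one entry per recognized word, line, or paragraph
    { 'xywh' => '10,20,100,50', 'chars' => 'Example' }
  ]
}

puts simplified_list.to_yaml
```

The Liquid expression in `uri` and `target` is left unexpanded on purpose, so that Jekyll can resolve it at build time.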

### wax:annotations

Renders source files in ```_data/annotations``` to AnnotationList JSON files in ```img/derivatives/iiif```. Takes a collection name as its argument. The source files may be YAML (in the simplified format generated by ```wax:import:hocr```) or JSON. JSON files should be in the normal Wax pre-Jekyll format, with YAML headers:

```
---
layout: none
collection: <collection>
canvas: <canvas>
---
```
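
As a rough sketch of what such a JSON source file could look like on disk, the helper below prepends the front matter shown above to a JSON body. `with_front_matter` is a hypothetical name, not part of wax_tasks, and the payload is a placeholder rather than a complete annotation list.

```ruby
require 'json'

# Hypothetical helper (not part of wax_tasks): prefixes a JSON document with
# the YAML front matter described above so that Jekyll processes the file.
def with_front_matter(collection, canvas, payload)
  [
    '---',
    'layout: none',
    "collection: #{collection}",
    "canvas: #{canvas}",
    '---',
    JSON.pretty_generate(payload)
  ].join("\n") + "\n"
end

# Placeholder payload; a real file would carry the full annotation list.
puts with_front_matter('documents', 'doc9031_1',
                       '@type' => 'sc:AnnotationList', 'resources' => [])
```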

### wax:updatemanifest

Takes a collection name. Collects the ids of annotationlists associated with that collection and adds them to the appropriate canvases in the collection manifest using [```otherContent```](https://iiif.io/api/presentation/2.1/#canvas).
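
The diff does not include `Collection#add_annotationlists_to_manifest`, so the following is only a sketch of the idea under stated assumptions: the manifest is a parsed IIIF Presentation 2.1 hash, and annotation list ids are keyed by canvas `@id`. The name `add_other_content` is hypothetical.

```ruby
# Hypothetical sketch, not the implementation in this PR.
# manifest:        parsed IIIF Presentation 2.1 manifest (Hash)
# lists_by_canvas: { canvas @id => [annotation list @id, ...] } (assumed shape)
def add_other_content(manifest, lists_by_canvas)
  manifest.fetch('sequences', []).each do |sequence|
    sequence.fetch('canvases', []).each do |canvas|
      list_ids = lists_by_canvas.fetch(canvas['@id'], [])
      next if list_ids.empty?

      existing = canvas['otherContent'] || []
      # entries may be plain id strings or objects carrying an @id
      known_ids = existing.map { |entry| entry.is_a?(Hash) ? entry['@id'] : entry }
      additions = (list_ids - known_ids).uniq.map do |id|
        { '@id' => id, '@type' => 'sc:AnnotationList' }
      end
      canvas['otherContent'] = existing + additions
    end
  end
  manifest
end
```

Skipping ids that are already present keeps the task safe to run more than once against the same manifest.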

# Contributing

Fork/clone the repository. After making code changes, run the tests (`$ bundle exec rubocop` and `$ bundle exec rspec`) before submitting a pull request. You can enable verbose tests with `$ DEBUG=true bundle exec rspec`.
43 changes: 43 additions & 0 deletions lib/tasks/annotations.rake
@@ -0,0 +1,43 @@
# frozen_string_literal: true

require 'pathname'
require 'wax_tasks'

namespace :wax do
  desc 'generate annotationlists from local yaml/json files'
  task :annotations do
    args = ARGV.drop(1).each { |a| task a.to_sym }
    args.reject! { |a| a.start_with? '-' }

    raise WaxTasks::Error::MissingArguments, Rainbow("You must specify a collection after 'wax:annotations'").magenta if args.empty?

    site = WaxTasks::Site.new
    args.each { |a| site.generate_annotations(a) }
  end

  desc 'add annotationlist ids to the canvases in a collection manifest'
  task :updatemanifest do
    args = ARGV.drop(1).each { |a| task a.to_sym }
    args.reject! { |a| a.start_with? '-' }

    raise WaxTasks::Error::MissingArguments, Rainbow("You must specify a collection after 'wax:updatemanifest'").magenta if args.empty?

    site = WaxTasks::Site.new

    args.each do |collection_name|
      # NOTE: `find` returns nil when no configured collection has this name
      collection = site.collections.find { |c| c.name == collection_name }
      annotationdata_source = collection.annotationdata_source

      # TODO: just crawl the item directories
      files = Dir.glob("#{annotationdata_source}/**/*.{yaml,yml,json}").sort
      annotationlists = {}
      files.each do |file|
        # path like _data/annotations/documents/doc9031/doc9031_1.yaml
        filepath = Pathname.new(file)
        pid = filepath.dirname.basename.to_s # doc9031
        annotationlists[pid] ||= []
        annotationlists[pid] << file
      end

      collection.add_annotationlists_to_manifest(annotationlists)
    end
  end
end
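
The grouping loop in `updatemanifest` keys each annotation file by the name of its parent directory. A possible equivalent, shown here in isolation with made-up paths, uses `Enumerable#group_by`; `group_by_parent_dir` is a hypothetical name for illustration.

```ruby
require 'pathname'

# Groups annotation source files by the name of their parent directory,
# which the task treats as the item pid. Hypothetical helper, shown as a
# possible equivalent of the explicit loop in the rake task.
def group_by_parent_dir(paths)
  paths.sort.group_by { |path| Pathname.new(path).dirname.basename.to_s }
end

paths = [
  '_data/annotations/documents/doc9031/doc9031_1.yaml',
  '_data/annotations/documents/doc9031/doc9031_0.yaml',
  '_data/annotations/documents/doc12/doc12_0.json'
]
p group_by_parent_dir(paths)
```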
18 changes: 18 additions & 0 deletions lib/tasks/import.rake
@@ -0,0 +1,18 @@
# frozen_string_literal: true

require_relative './import/hocr.rb'

namespace :wax do
  namespace :import do
    # `desc` has to come before the task definition for Rake to attach it
    desc 'generate canvas-level annotationlist yaml file from hocr file'
    task :hocr, [:hocr_path, :collection, :canvas, :granularity] do |_t, args|
      # TODO: validate args

      hocr_annotations = WaxTasks::HocrOpenAnnotationCreator.new(args)
      hocr_annotations.save

      puts 'done'
    end
  end
end
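
One way the `TODO: validate args` could be approached is sketched below. `hocr_import_errors` and `GRANULARITIES` are hypothetical names, and the checks are assumptions drawn from the README text (four required arguments, granularity of word, line, or paragraph), not behaviour defined in this PR.

```ruby
# Allowed granularities, per the README section on wax:import:hocr
GRANULARITIES = %w[word line paragraph].freeze

# Hypothetical validator: returns a list of human-readable problems,
# which is empty when the arguments look usable.
def hocr_import_errors(args)
  errors = []
  %i[hocr_path collection canvas granularity].each do |key|
    errors << "missing argument: #{key}" if args[key].to_s.strip.empty?
  end

  path = args[:hocr_path].to_s
  errors << "file not found: #{path}" unless path.empty? || File.file?(path)

  granularity = args[:granularity].to_s
  unless granularity.empty? || GRANULARITIES.include?(granularity)
    errors << "unknown granularity: #{granularity}"
  end
  errors
end
```

Because this task declares named arguments, it is invoked with Rake's bracket syntax, for example `bundle exec rake 'wax:import:hocr[path/to/file.hocr,documents,doc9031_1,word]'`, unlike the other tasks in this PR, which read the collection name from `ARGV`.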
150 changes: 150 additions & 0 deletions lib/tasks/import/hocr.rb
@@ -0,0 +1,150 @@
# frozen_string_literal: true

require 'addressable/template'
require 'fileutils'
require 'json'
require 'nokogiri'
require 'yaml'

# adapted from Ocracoke:
# https://github.com/NCSU-Libraries/ocracoke/blob/master/app/processing_helpers/hocr_open_annotation_creator.rb
module WaxTasks
  # Converts an hOCR file into an annotation list (JSON) or into the
  # simplified YAML format read by wax:annotations
  class HocrOpenAnnotationCreator
    # hOCR class names keyed by the granularity argument
    SELECTORS = {
      'word' => 'ocrx_word',
      'line' => 'ocr_line',
      'paragraph' => 'ocr_par'
    }.freeze

    def initialize(args)
      @hocr = File.open(args[:hocr_path]) { |f| Nokogiri::XML(f) }
      @collection = args[:collection]
      @identifier = args[:canvas]
      @granularity = args[:granularity]

      @uri_root = "{{ '/' | absolute_url }}img/derivatives/iiif"
      @canvas_root = "#{@collection}_#{@identifier}"
      @label = "#{@canvas_root}_ocr_#{@granularity}"

      @canvas_uri = "#{@uri_root}/canvas/#{@canvas_root}.json"
      @list_uri = "#{@uri_root}/annotation/#{@label}.json"

      @selector = get_selector
    end

    def manifest_canvas_on_xywh(xywh)
      "#{@canvas_uri}#xywh=#{xywh}"
    end

    def get_selector
      # An empty selector would make contains(@class, '') match every
      # element, so reject unknown granularities instead
      SELECTORS.fetch(@granularity) do
        raise ArgumentError, "granularity must be one of: #{SELECTORS.keys.join(', ')}"
      end
    end

    def resources
      @hocr.xpath(".//*[contains(@class, '#{@selector}')]").map do |chunk|
        text = chunk.text.gsub("\n", ' ').squeeze(' ').strip
        next if text.empty?

        title_parts = chunk['title'].to_s.split('; ')
        xywh = '0,0,0,0'
        title_parts.each do |title_part|
          match_data = /bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/.match title_part
          next if match_data.nil? # not a bbox property, or a malformed one

          # hOCR gives two corners (x, y, x1, y1); IIIF wants x, y, w, h
          x = match_data[1].to_i
          y = match_data[2].to_i
          x1 = match_data[3].to_i
          y1 = match_data[4].to_i
          w = x1 - x
          h = y1 - y
          xywh = "#{x},#{y},#{w},#{h}"
        end
        annotation(text, xywh)
      end.compact
    end

    def annotation_list
      {
        :"@context" => 'http://iiif.io/api/presentation/2/context.json',
        :"@id" => annotation_list_id,
        :"@type" => 'sc:AnnotationList',
        # IIIF Presentation 2.x uses `label`, not `@label`
        label: "OCR text with granularity of #{@granularity}",
        resources: resources
      }
    end

    # No file extension here: annotation_list_id appends '.json', and
    # annotation_id appends '/<xywh>'
    def annotation_list_id_base
      "#{@uri_root}/canvas/#{@collection}/#{@identifier}-annotation-list-#{@granularity}"
    end

    def annotation_list_id
      annotation_list_id_base + '.json'
    end

    def annotation(chars, xywh)
      {
        :"@id" => annotation_id(xywh),
        :"@type" => 'oa:Annotation',
        motivation: 'sc:painting',
        resource: {
          :"@type" => 'cnt:ContentAsText',
          format: 'text/plain',
          chars: chars
        },
        # TODO: use canvas_url_template
        on: on_canvas(xywh)
      }
    end

    def annotation_id(xywh)
      File.join annotation_list_id_base, xywh
    end

    def on_canvas(xywh)
      manifest_canvas_on_xywh(xywh)
    end

    def to_json(*args)
      annotation_list.to_json(*args)
    end

    def id
      @identifier
    end

    def to_yaml
      yaml_list = {
        'uri' => @list_uri,
        'collection' => @collection,
        'canvas' => @identifier,
        'label' => @label,
        'target' => @canvas_uri,
        'resources' => []
      }
      annotation_list[:resources].each do |resource|
        yaml_list['resources'] << {
          # the xywh is the last path segment of the annotation id
          'xywh' => resource[:@id].sub(/.*\/(.*)/, '\1'),
          'chars' => resource[:resource][:chars]
        }
      end
      yaml_list.to_yaml
    end

    def save
      # TODO: handle item as distinct from collection
      dir = "./_data/annotations/#{@collection}/#{@collection}"
      FileUtils.mkdir_p(dir)
      # TODO: do not overwrite existing file without asking
      File.open("#{dir}/#{@label}.yaml", 'w') do |file|
        file.write(to_yaml)
      end
    end
  end
end
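
The core arithmetic in `resources` turns an hOCR bounding box (two corners) into the `x,y,w,h` form used in the `#xywh=` fragment. A standalone sketch follows; `bbox_to_xywh` is a hypothetical name and the sample `title` string is invented.

```ruby
# hOCR stores a box as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_PATTERN = /bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/

# Hypothetical helper: returns "x,y,w,h", or nil when the title carries
# no parseable bbox.
def bbox_to_xywh(title)
  match = BBOX_PATTERN.match(title.to_s)
  return nil if match.nil?

  x0, y0, x1, y1 = match.captures.map(&:to_i)
  "#{x0},#{y0},#{x1 - x0},#{y1 - y0}"
end

puts bbox_to_xywh('bbox 10 20 110 70; x_wconf 95') # prints 10,20,100,50
```

Returning `nil` rather than a `0,0,0,0` placeholder is a design choice for this sketch only; it lets the caller decide whether to skip the chunk or fall back to a default.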
2 changes: 2 additions & 0 deletions lib/wax_tasks.rb
@@ -13,6 +13,8 @@
require 'safe_yaml'

# relative
require_relative 'wax_tasks/annotation'
require_relative 'wax_tasks/annotationlist'
require_relative 'wax_tasks/asset'
require_relative 'wax_tasks/collection'
require_relative 'wax_tasks/config'
32 changes: 32 additions & 0 deletions lib/wax_tasks/annotation.rb
@@ -0,0 +1,32 @@
# frozen_string_literal: true

# A single annotation targeting a region of a canvas
class Annotation
  def initialize(annotationlist_id, canvas_id, xywh,
                 resource = {}, options = {})
    @annotationlist_id = annotationlist_id
    @canvas_id = canvas_id
    @xywh = xywh # TODO: validate xywh
    @type = 'oa:Annotation'
    @motivation = options[:motivation] || 'sc:painting'
    @resource =
      {
        :@type => resource[:type] || 'cnt:ContentAsText',
        chars: resource[:chars] || '',
        format: resource[:format] || 'text/plain'
        # TODO: extend or subclass this as needed for other kinds of annotations
      }
    @resource[:language] = resource[:language] unless resource[:language].nil?
  end

  def to_hash
    {
      :@context => 'http://iiif.io/api/presentation/2/context.json',
      :@id => @annotationlist_id + '#' + @xywh,
      :@type => @type,
      motivation: @motivation,
      resource: @resource,
      # NOTE: @xywh is appended as-is, so a caller has to pass the `xywh=`
      # prefix itself to get a `#xywh=x,y,w,h` fragment
      on: @canvas_id + '#' + @xywh
    }
  end
end
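
The `TODO: validate xywh` in the constructor could be handled with a small predicate along these lines. `valid_xywh?` is a hypothetical name, and requiring a non-zero width and height is an assumption for this sketch, not something the PR specifies.

```ruby
# Four comma-separated non-negative integers: x,y,w,h
XYWH_PATTERN = /\A\d+,\d+,\d+,\d+\z/

# Hypothetical validator. Assumption: a region needs a positive width and
# height, so a placeholder such as '0,0,0,0' is treated as invalid.
def valid_xywh?(value)
  return false unless value.is_a?(String) && XYWH_PATTERN.match?(value)

  _x, _y, width, height = value.split(',').map(&:to_i)
  width.positive? && height.positive?
end
```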
60 changes: 60 additions & 0 deletions lib/wax_tasks/annotationlist.rb
@@ -0,0 +1,60 @@
# frozen_string_literal: true

module WaxTasks
  # An annotation list built from the simplified YAML format
  class AnnotationList
    attr_reader :canvas, :label

    def initialize(annotation_list)
      # input is in format of annotation list yaml
      @uri = annotation_list['uri']
      @collection = annotation_list['collection']
      @canvas = annotation_list['canvas']
      @label = annotation_list['label']
      @target = annotation_list['target']

      @type = 'sc:AnnotationList'
      # tolerate a source file with no `resources` key
      @resources = (annotation_list['resources'] || []).map do |resource|
        {
          :@type => resource['type'] || 'cnt:ContentAsText',
          chars: resource['chars'] || '',
          format: resource['format'] || 'text/plain',
          xywh: resource['xywh'] || ''
          # TODO: extend or subclass this as needed for other kinds of annotations
        }
      end
    end

    def to_json(*args)
      {
        :@context => 'http://iiif.io/api/presentation/2/context.json',
        :@id => @uri,
        :@type => @type,
        label: @label,
        resources: @resources.map do |resource|
          {
            :@type => 'oa:Annotation',
            motivation: 'sc:painting',
            resource: {
              :@type => resource[:@type],
              format: resource[:format],
              chars: resource[:chars]
            },
            on: "#{@target}#xywh=#{resource[:xywh]}"
          }
        end
      }.to_json(*args)
    end

    # FIXME: `dir`, `@pid`, and `@hash` are not defined in this class, so
    # this method raises NameError as written; it still needs to be adapted
    # to write the annotation list itself
    def save
      path = "#{dir}/#{Utils.slug(@pid)}.md"
      if File.exist? path
        0
      else
        FileUtils.mkdir_p File.dirname(path)
        File.open(path, 'w') { |f| f.puts "#{@hash.to_yaml}---" }
        1
      end
    end
  end
end