Adds BlacklightIiifSearch to UniversalViewer. (#2296)
* Adds BlacklightIiifSearch to UniversalViewer.
* Correct error.
* Adds and corrects rspec and adds hostname to URL helper.
* adds require.
* Moving classes to a better distributor file.
* Switch concat method.
* Remove invalid UTF-8 text.
* Try a different tactic.
* trying a different service encapsulation.
* Another tactic change.
* add search to config.
* changes activation location.
* puts search service on work level.
* revert changes to config.
* delivers iiif search url the most convenient way.
* rubo
* Adds transcript_text_tesi to default search.
* Improves matching capabilities.
* Try to make highlighting work.
* Sanitizes solr document url.
* match id structure to our manifest.
* Reverts possibly superfluous config.
* Reverts possibly superfluous config part 2.
* Adds documentation and license information.
* adds rspec.
Showing 15 changed files with 353 additions and 9 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,131 @@ | ||
# frozen_string_literal: true | ||
require 'active_support/core_ext/module/delegation' | ||
require 'json' | ||
require 'nokogiri' | ||
 | ||
# NOTE: This model is largely derived from IiifPrint's (v3.0.1) | ||
# IiifPrint::TextExtraction::AltoReader class. Minor changes have been made to bring | ||
# the code into RuboCop compliance. The IiifPrint gem is licensed under the | ||
# Apache License 2.0. At the time of adopting this licensed work into this application, | ||
# Commercial use, Modification, and Private use were listed under this Gem's Permissions. | ||
# The referenced License can be found here: | ||
# https://github.com/scientist-softserv/iiif_print/blob/v3.0.1/LICENSE | ||
module Curate | ||
  # Module for text extraction | ||
  module TextExtraction | ||
    # Class to obtain plain text and JSON word-coordinates from ALTO source | ||
    class AltoReader | ||
      attr_accessor :source, :doc_stream | ||
      delegate :text, to: :doc_stream | ||
 | ||
      # SAX Document Stream class to gather text and word tokens from ALTO | ||
      class AltoDocStream < Nokogiri::XML::SAX::Document | ||
        attr_accessor :text, :words | ||
 | ||
        def initialize(image_width = nil) | ||
          super() | ||
          # scaling matters: | ||
          @image_width = image_width | ||
          @scaling = 1.0 # pt to px, if ALTO using points | ||
          # plain text buffer: | ||
          @text = '' | ||
          # list of word hashes, each containing word + coordinates: | ||
          @words = [] | ||
        end | ||
 | ||
        # Return coordinates from String element attribute hash | ||
        # | ||
        # @param attrs [Hash] hash containing ALTO `String` element attributes. | ||
        # @return [Array] Array of position x, y, width, height in px. | ||
        def s_coords(attrs) | ||
          height = scale_value((attrs['HEIGHT'] || 0).to_i) | ||
          width = scale_value((attrs['WIDTH'] || 0).to_i) | ||
          hpos = scale_value((attrs['HPOS'] || 0).to_i) | ||
          vpos = scale_value((attrs['VPOS'] || 0).to_i) | ||
          [hpos, vpos, width, height] | ||
        end | ||
 | ||
        # Derive the ALTO-unit-to-pixel scaling factor from the Page element's WIDTH. | ||
        # | ||
        # @param attrs [Array] Array of key, value pair Arrays. | ||
        def compute_scaling(attrs) | ||
          return if @image_width.nil? | ||
          match = attrs.find { |e| e[0].casecmp?('WIDTH') } | ||
          # Array#find returns nil (not an empty Array) when nothing matches. | ||
          return if match.nil? | ||
          page_width = match[1].to_i | ||
          # A zero page width would make the division in #scale_value raise. | ||
          return if page_width.zero? || @image_width == page_width | ||
          @scaling = page_width / @image_width.to_f | ||
        end | ||
 | ||
        def scale_value(v) | ||
          (v / @scaling).to_i | ||
        end | ||
 | ||
        # Callback for element start, implementation of which ignores | ||
        # non-String elements. | ||
        # | ||
        # @param name [String] element name. | ||
        # @param attrs [Array] Array of key, value pair Arrays. | ||
        def start_element(name, attrs = []) | ||
          values = attrs.to_h | ||
          compute_scaling(attrs) if name == 'Page' | ||
          return if name != 'String' | ||
          token = values['CONTENT'] | ||
          @text += token | ||
          @words << { | ||
            word: token, | ||
            coordinates: s_coords(values) | ||
          } | ||
        end | ||
 | ||
        # Callback for element end, used here to manage endings of lines and | ||
        # blocks. | ||
        # | ||
        # @param name [String] element name. | ||
        def end_element(name) | ||
          @text += " " if name == 'String' | ||
          @text += "\n" if name == 'TextBlock' | ||
          @text += "\n" if name == 'TextLine' | ||
        end | ||
 | ||
        # Callback for completion of parsing ALTO, used to normalize generated | ||
        # text content (strip unneeded whitespace incidental to output). | ||
        def end_document | ||
          # postprocess @text to remove trailing spaces on lines | ||
          @text = @text.split("\n").map(&:strip).join("\n") | ||
          # remove trailing whitespace at end of buffer | ||
          @text.strip! | ||
        end | ||
      end | ||
 | ||
      # Construct with either a path to an ALTO XML file or the XML source itself, | ||
      # then parse the document. | ||
      # | ||
      # @param xml [String] path to an ALTO XML file, or ALTO XML source | ||
      # @param image_width [Integer, nil] pixel width of the page image, used for scaling | ||
      # @param image_height [Integer, nil] pixel height of the page image | ||
      def initialize(xml, image_width = nil, image_height = nil) | ||
        @source = isxml?(xml) ? xml : File.read(xml) | ||
        @image_width = image_width | ||
        @image_height = image_height | ||
        @doc_stream = AltoDocStream.new(image_width) | ||
        parser = Nokogiri::XML::SAX::Parser.new(doc_stream) | ||
        parser.parse(@source) | ||
      end | ||
 | ||
      # Determine if source parameter is path or xml | ||
      # | ||
      # @param xml [String] either path to xml file or xml source | ||
      # @return [true, false] true if string appears to be XML source, not path | ||
      def isxml?(xml) | ||
        xml.lstrip.start_with?('<') | ||
      end | ||
 | ||
      # Output JSON flattened word coordinates | ||
      # | ||
      # @return [String] JSON serialization of flattened word coordinates | ||
      def json | ||
        words = @doc_stream.words | ||
        Curate::TextExtraction::WordCoordsBuilder.json_coordinates_for( | ||
          words: words, | ||
          width: @image_width, | ||
          height: @image_height | ||
        ) | ||
      end | ||
    end | ||
  end | ||
end |
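
The scaling step in `AltoDocStream` is easy to misread, so here is a minimal standalone sketch of the arithmetic, assuming an ALTO page whose `WIDTH` is 2000 units and a derivative image 1000 px wide. The helper names `scaling_factor` and `scale_coords` are hypothetical and exist only for this illustration; they are not part of the commit.

```ruby
# Hypothetical helpers that mirror the arithmetic of compute_scaling and
# scale_value above, without Nokogiri or instance state.

# Returns the divisor that converts ALTO units into image pixels.
# Falls back to 1.0 (no scaling) when either width is unknown or unusable.
def scaling_factor(page_width, image_width)
  return 1.0 if image_width.nil? || page_width.nil? || page_width.zero?
  return 1.0 if image_width == page_width
  page_width / image_width.to_f
end

# Applies the divisor to an [hpos, vpos, width, height] array and truncates
# each value to an Integer, as scale_value does.
def scale_coords(coords, scaling)
  coords.map { |v| (v / scaling).to_i }
end

scaling = scaling_factor(2000, 1000)
puts scaling                                           # 2.0
puts scale_coords([200, 400, 120, 40], scaling).inspect # [100, 200, 60, 20]
```

Because the factor is `page_width / image_width`, coordinates are divided by it rather than multiplied; a page twice as wide as the image halves every value.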
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
# frozen_string_literal: true | ||
 | ||
# NOTE: This model is largely derived from IiifPrint's (v3.0.1) | ||
# IiifPrint::TextExtraction::WordCoordsBuilder class. Minor changes have been made to bring | ||
# the code into RuboCop compliance. The IiifPrint gem is licensed under the | ||
# Apache License 2.0. At the time of adopting this licensed work into this application, | ||
# Commercial use, Modification, and Private use were listed under this Gem's Permissions. | ||
# The referenced License can be found here: | ||
# https://github.com/scientist-softserv/iiif_print/blob/v3.0.1/LICENSE | ||
module Curate | ||
  # Module for text extraction (OCR or otherwise) | ||
  module TextExtraction | ||
    class WordCoordsBuilder | ||
      # @param words [Array<Hash>] an array of hash objects that have the keys `:word` and `:coordinates`. | ||
      # @param width [Integer] the width of the "canvas" on which the words appear. | ||
      # @param height [Integer] the height of the "canvas" on which the words appear. | ||
      # @return [String] a JSON encoded string. | ||
      def self.json_coordinates_for(words:, width: nil, height: nil) | ||
        new(words, width, height).to_json | ||
      end | ||
 | ||
      def initialize(words, width = nil, height = nil) | ||
        @words = words | ||
        @width = width | ||
        @height = height | ||
      end | ||
 | ||
      # Output JSON flattened word coordinates | ||
      # | ||
      # Accepts and ignores generator arguments so that the method stays | ||
      # compatible with callers that invoke #to_json with options. | ||
      # | ||
      # @return [String] JSON serialization of flattened word coordinates | ||
      def to_json(*_args) | ||
        coordinates = {} | ||
        @words.each do |w| | ||
          word_chars = w[:word] | ||
          word_coords = w[:coordinates] | ||
          if coordinates[word_chars] | ||
            coordinates[word_chars] << word_coords | ||
          else | ||
            coordinates[word_chars] = [word_coords] | ||
          end | ||
        end | ||
        payload = { width: @width, height: @height, coords: coordinates } | ||
        JSON.generate(payload) | ||
      end | ||
    end | ||
  end | ||
end |
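
To make the shape of the `WordCoordsBuilder` output concrete, the following sketch reproduces the grouping step with plain Ruby and the standard `json` library. The method names `group_word_coords` and `word_coords_json` and the sample words are assumptions made for illustration only.

```ruby
require 'json'

# Hypothetical stand-in for the grouping loop in WordCoordsBuilder#to_json:
# every occurrence of a word contributes one [x, y, w, h] array under that word.
def group_word_coords(words)
  words.each_with_object({}) do |w, grouped|
    (grouped[w[:word]] ||= []) << w[:coordinates]
  end
end

# Builds the same payload layout the class emits: width, height, coords.
def word_coords_json(words, width: nil, height: nil)
  JSON.generate({ width: width, height: height, coords: group_word_coords(words) })
end

sample_words = [
  { word: 'Emory',   coordinates: [10, 20, 50, 12] },
  { word: 'library', coordinates: [70, 20, 60, 12] },
  { word: 'Emory',   coordinates: [10, 40, 50, 12] }
]

puts word_coords_json(sample_words, width: 1000, height: 1500)
```

Repeated words share one key, so a consumer can pick the n-th occurrence of a term by index, which is how the annotation behavior uses `hl_index`.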
77 changes: 77 additions & 0 deletions
app/models/concerns/blacklight_iiif_search/iiif_search_annotation_behavior.rb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
# frozen_string_literal: true | ||
 | ||
# Blacklight IIIF Search v1.0.0 Override: per this application's instructions, | ||
# this module must be overridden if coordinates will be provided within the results | ||
# of this Gem's search API. It was also necessary to override #annotation_id and | ||
# #canvas_uri_for_annotation so that we can match the format of each canvas' @id | ||
# url value. | ||
 | ||
# customizable behavior for IiifSearchAnnotation | ||
module BlacklightIiifSearch | ||
  module AnnotationBehavior | ||
    ## | ||
    # Create a URL for the annotation | ||
    # @return [String] | ||
    def annotation_id | ||
      "#{emory_iiif_id_url}/canvas/#{document[:id]}/annotation/#{hl_index}" | ||
    end | ||
 | ||
    ## | ||
    # Create a URL for the canvas that the annotation refers to | ||
    # @return [String] | ||
    def canvas_uri_for_annotation | ||
      "#{emory_iiif_id_url}/canvas/#{document[:id]}" + coordinates | ||
    end | ||
 | ||
    # NOTE: The methods #coordinates, #fetch_and_parse_coords, and #default_coords below are largely derived | ||
    # from IiifPrint's (v3.0.1) IiifPrint::BlacklightIiifSearch::AnnotationDecorator module methods of the | ||
    # same name. The methods have been refactored to function according to our expectations. The IiifPrint Gem | ||
    # application is licensed under the Apache License 2.0. At the time of adopting this licensed work into | ||
    # this application, Commercial use, Modification, and Private use were listed under this Gem's Permissions. | ||
    # The referenced License can be found here: | ||
    # https://github.com/scientist-softserv/iiif_print/blob/v3.0.1/LICENSE | ||
 | ||
    ## | ||
    # return a string like "#xywh=100,100,250,20" | ||
    # corresponding to coordinates of query term on image | ||
    # @return [String] | ||
    def coordinates | ||
      coords_json = fetch_and_parse_coords | ||
      return default_coords unless coords_json.present? && coords_json['coords'].present? && query.present? | ||
 | ||
      query_terms = query.split.map(&:downcase) | ||
      # Regexp.union escapes each term, so punctuation in the query is matched literally. | ||
      term_pattern = Regexp.union(query_terms) | ||
      matches = coords_json['coords'].select do |k, _v| | ||
        k.downcase.match?(term_pattern) | ||
      end | ||
      coords_array = matches.values.flatten(1)[hl_index] | ||
 | ||
      coords_array.present? ? "#xywh=#{coords_array.join(',')}" : default_coords | ||
    end | ||
 | ||
    private | ||
 | ||
    ## | ||
    # a default set of coordinates | ||
    # @return [String] | ||
    def default_coords | ||
      '#xywh=0,0,0,0' | ||
    end | ||
 | ||
    ## | ||
    # return the parsed JSON word-coordinates stored on the Solr document | ||
    # @return [Hash, nil] nil when the field is blank or is not valid JSON | ||
    def fetch_and_parse_coords | ||
      coords = document['alto_xml_tesi'] | ||
      return nil if coords.blank? | ||
      begin | ||
        JSON.parse(coords) | ||
      rescue JSON::ParserError | ||
        nil | ||
      end | ||
    end | ||
 | ||
    def emory_iiif_id_url | ||
      "http://#{ENV['HOSTNAME'] || 'localhost:3000'}/iiif/#{parent_document[:id]}/manifest" | ||
    end | ||
  end | ||
end |
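
The lookup performed by `#coordinates` can be sketched outside Rails as below. This is an illustration under stated assumptions, not the module's code: `xywh_fragment` and `DEFAULT_FRAGMENT` are hypothetical names, the sample JSON is invented, and `Regexp.union` is used so that query terms are matched literally.

```ruby
require 'json'

DEFAULT_FRAGMENT = '#xywh=0,0,0,0'

# Hypothetical standalone version of the coordinate lookup: parse the stored
# word-coordinates JSON, keep the words that contain any query term, and
# return the hl_index-th bounding box as a media-fragment string.
def xywh_fragment(coords_json, query, hl_index)
  return DEFAULT_FRAGMENT if coords_json.to_s.empty? || query.to_s.strip.empty?

  coords = JSON.parse(coords_json)['coords']
  return DEFAULT_FRAGMENT if coords.nil? || coords.empty?

  pattern = Regexp.union(query.split.map(&:downcase))
  matches = coords.select { |word, _boxes| word.downcase.match?(pattern) }
  box = matches.values.flatten(1)[hl_index]
  box ? "#xywh=#{box.join(',')}" : DEFAULT_FRAGMENT
rescue JSON::ParserError
  DEFAULT_FRAGMENT
end

sample = JSON.generate({
  'width' => 1000,
  'height' => 1500,
  'coords' => {
    'Emory' => [[10, 20, 50, 12], [10, 40, 50, 12]],
    'library' => [[70, 20, 60, 12]]
  }
})

puts xywh_fragment(sample, 'emory', 1)   # #xywh=10,40,50,12
puts xywh_fragment(sample, 'archive', 0) # #xywh=0,0,0,0
```

The fragment is appended to the canvas URI, so a viewer that understands `#xywh=` can draw the highlight box; the all-zero default keeps the URI well-formed when no coordinates are found.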
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
# frozen_string_literal: true | ||
 | ||
# SearchBuilder for full-text searches with highlighting and snippets | ||
class IiifSearchBuilder < Blacklight::SearchBuilder | ||
  include Blacklight::Solr::SearchBuilderBehavior | ||
 | ||
  self.default_processor_chain += [:ocr_search_params] | ||
 | ||
  # set params for ocr field searching | ||
  def ocr_search_params(solr_parameters = {}) | ||
    solr_parameters[:facet] = false | ||
    solr_parameters[:hl] = true | ||
    solr_parameters[:'hl.fl'] = blacklight_config.iiif_search[:full_text_field] | ||
    solr_parameters[:'hl.fragsize'] = 100 | ||
    solr_parameters[:'hl.snippets'] = 10 | ||
  end | ||
end |
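
As a rough illustration of what `ocr_search_params` contributes to a request, the sketch below applies the same settings to a plain Hash. The method name `with_ocr_highlighting` is hypothetical, and using `transcript_text_tesi` as the full-text field is an assumption drawn from the commit message; the real value comes from `blacklight_config.iiif_search[:full_text_field]`.

```ruby
# Hypothetical standalone version of the processor-chain step: it mutates and
# returns the Solr parameter Hash instead of relying on Blacklight.
def with_ocr_highlighting(solr_parameters, full_text_field)
  solr_parameters[:facet] = false            # facets are not needed for IIIF search
  solr_parameters[:hl] = true                # turn on Solr highlighting
  solr_parameters[:'hl.fl'] = full_text_field # field to highlight
  solr_parameters[:'hl.fragsize'] = 100      # approximate snippet length in characters
  solr_parameters[:'hl.snippets'] = 10       # maximum snippets per field
  solr_parameters
end

params = with_ocr_highlighting({ q: 'emory' }, 'transcript_text_tesi')
puts params.inspect
```

Each returned snippet becomes one annotation in the search response, so `hl.snippets` effectively caps the number of hits reported per page.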