Commit

Adds BlacklightIiifSearch to UniversalViewer. (#2296)
* Adds BlacklightIiifSearch to UniversalViewer.

* Correct error.

* Adds and corrects rspec and adds hostname to URL helper.

* adds require.

* Moving classes to a better distributor file.

* Switch concat method.

* Remove invalid UTF-8 text.

* Try a different tactic.

* trying a different service encapsulation.

* Another tactic change.

* add search to config.

* changes activation location.

* puts search service on work level.

* revert changes to config.

* delivers iiif search url the most convenient way.

* rubo

* Adds transcript_text_tesi to default search.

* Improves matching capabilities.

* Try to make highlighting work.

* Sanitizes solr document url.

* match id structure to our manifest.

* Reverts possibly superfluous config.

* Reverts possibly superfluous config part 2.

* Adds documentation and license information.

* adds rspec.
bwatson78 authored Dec 12, 2024
1 parent a3b22bd commit 50ea34c
Showing 15 changed files with 353 additions and 9 deletions.
1 change: 1 addition & 0 deletions Gemfile
Original file line number Diff line number Diff line change
@@ -10,6 +10,7 @@ git_source(:github) do |repo_name|
end

gem 'archivesspace-client'
gem 'blacklight_iiif_search'
gem 'bootsnap', require: false
gem 'bootstrap-sass', '~> 3.0'
gem 'bulkrax'
9 changes: 9 additions & 0 deletions Gemfile.lock
@@ -1412,6 +1412,10 @@ GEM
      bootstrap-sass (~> 3.0)
      openseadragon (>= 0.2.0)
      rails
    blacklight_iiif_search (1.0.0)
      blacklight (~> 6.0)
      iiif-presentation
      rails (>= 4.2, < 6)
    bootsnap (1.13.0)
      msgpack (~> 1.2)
    bootstrap-sass (3.4.1)
@@ -1792,6 +1796,10 @@ GEM
    ice_nine (0.11.2)
    iiif-image-api (0.2.0)
      activesupport
    iiif-presentation (1.1.0)
      activesupport (>= 3.2.18)
      faraday (>= 0.9)
      json
    iiif_manifest (1.1.1)
      activesupport (>= 4)
    iso8601 (0.9.1)
@@ -2373,6 +2381,7 @@ PLATFORMS
DEPENDENCIES
  archivesspace-client
  bixby (~> 3.0.1)
  blacklight_iiif_search
  bootsnap
  bootstrap-sass (~> 3.0)
  bulkrax
16 changes: 15 additions & 1 deletion app/controllers/catalog_controller.rb
@@ -17,7 +17,19 @@ def self.modified_field
    solr_name('system_modified', :stored_sortable, type: :date)
  end

  # CatalogController-scope behavior and configuration for BlacklightIiifSearch
  include BlacklightIiifSearch::Controller

  configure_blacklight do |config|
    # configuration for Blacklight IIIF Content Search
    config.iiif_search = {
      full_text_field: 'transcript_text_tesi', # FileSet field
      object_relation_field: 'is_page_of_ssi', # FileSet field
      supported_params: %w[q page],
      autocomplete_handler: 'iiif_suggest',
      suggester_name: 'iiifSuggester'
    }

    config.view.gallery.partials = [:index_header, :index]
    config.view.masonry.partials = [:index]
    config.view.slideshow.partials = [:index]
@@ -35,10 +47,12 @@ def self.modified_field
    config.http_method = :post

    ## Default parameters to send to solr for all search-like requests. See also SolrHelper#solr_search_params
    # NOTE: transcript_text_tesi is needed here because the `iiif_search` path utilizes the default `search` qt to
    # match terms in Full-Text search-enabled FileSets.
    config.default_solr_params = {
      qt: "search",
      rows: 10,
-      qf: "title_tesim description_tesim creator_tesim keyword_tesim"
+      qf: "title_tesim description_tesim creator_tesim keyword_tesim transcript_text_tesi"
    }

    # solr field configuration for document/show views
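As a rough illustration of the `supported_params` setting above, the following sketch builds a request path for a work's search endpoint. The helper name `iiif_search_path`, the work id, and the extra `motivation` parameter are hypothetical, and the sketch assumes that parameters outside `supported_params` are simply ignored by the endpoint.

```ruby
require 'uri'

# Parameter names copied from `supported_params` in config.iiif_search.
SUPPORTED_PARAMS = %w[q page].freeze

# Hypothetical helper (not part of this commit): drops unsupported keys and
# builds the request path for a work's IIIF search endpoint.
def iiif_search_path(work_id, params)
  allowed = params.select { |key, _value| SUPPORTED_PARAMS.include?(key.to_s) }
  "/catalog/#{work_id}/iiif_search?#{URI.encode_www_form(allowed)}"
end

puts iiif_search_path('123abc', { q: 'emory university', page: 2, motivation: 'painting' })
# Prints: /catalog/123abc/iiif_search?q=emory+university&page=2
```

The `/catalog/:id/iiif_search` path shape matches the one assembled by `SolrDocument#work_iiif_search_url` in this commit.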
5 changes: 3 additions & 2 deletions app/indexers/curate/file_set_indexer.rb
@@ -1,5 +1,4 @@
# frozen_string_literal: true

module Curate
  class FileSetIndexer < Hyrax::FileSetIndexer
    def generate_solr_document
@@ -93,9 +92,11 @@ def preservation_event_value(preservation_event)
      preservation_event.pluck(:value).first
    end

    # All fields assigned here are utilized by BlacklightIiifSearch.
    def full_text_fields(solr_doc)
-      solr_doc['alto_xml_tesi'] = object.alto_xml if object.alto_xml.present?
+      solr_doc['alto_xml_tesi'] = Curate::TextExtraction::AltoReader.new(object.alto_xml).json if object.alto_xml.present?
      solr_doc['transcript_text_tesi'] = object.transcript_text if object.transcript_text.present?
      solr_doc['is_page_of_ssi'] = object.parent.id if object.parent.present?
    end
  end
end
131 changes: 131 additions & 0 deletions app/lib/curate/text_extraction/alto_reader.rb
@@ -0,0 +1,131 @@
# frozen_string_literal: true
require 'active_support/core_ext/module/delegation'
require 'json'
require 'nokogiri'

# NOTE: This model is largely derived from IiifPrint's (v3.0.1)
#   IiifPrint::TextExtraction::AltoReader class. Minor changes have been made to bring
#   the code into Rubocop compliancy. The IiifPrint Gem application is licensed under the
#   Apache License 2.0. At the time of adopting this licensed work into this application,
#   Commercial use, Modification, and Private use were listed under this Gem's Permissions.
#   The referenced License can be found here:
#   https://github.com/scientist-softserv/iiif_print/blob/v3.0.1/LICENSE
module Curate
  # Module for text extraction
  module TextExtraction
    # Class to obtain plain text and JSON word-coordinates from ALTO source
    class AltoReader
      attr_accessor :source, :doc_stream
      delegate :text, to: :doc_stream

      # SAX Document Stream class to gather text and word tokens from ALTO
      class AltoDocStream < Nokogiri::XML::SAX::Document
        attr_accessor :text, :words

        def initialize(image_width = nil)
          super()
          # scaling matters:
          @image_width = image_width
          @scaling = 1.0 # pt to px, if ALTO using points
          # plain text buffer:
          @text = ''
          # list of word hash, containing word+coord:
          @words = []
        end

        # Return coordinates from String element attribute hash
        #
        # @param attrs [Hash] hash containing ALTO `String` element attributes.
        # @return [Array] Array of position x, y, width, height in px.
        def s_coords(attrs)
          height = scale_value((attrs['HEIGHT'] || 0).to_i)
          width = scale_value((attrs['WIDTH'] || 0).to_i)
          hpos = scale_value((attrs['HPOS'] || 0).to_i)
          vpos = scale_value((attrs['VPOS'] || 0).to_i)
          [hpos, vpos, width, height]
        end

        def compute_scaling(attrs)
          return if @image_width.nil?
          match = attrs.find { |e| e[0].casecmp?('WIDTH') }
          # `find` returns nil (not an empty Array) when no WIDTH attribute exists
          return if match.nil?
          page_width = match[1].to_i
          return if @image_width == page_width
          @scaling = page_width / @image_width.to_f
        end

        def scale_value(v)
          (v / @scaling).to_i
        end

        # Callback for element start, implementation of which ignores
        #   non-String elements.
        #
        # @param name [String] element name.
        # @param attrs [Array] Array of key, value pair Arrays.
        def start_element(name, attrs = [])
          values = attrs.to_h
          compute_scaling(attrs) if name == 'Page'
          return if name != 'String'
          token = values['CONTENT']
          @text += token
          @words << {
            word: token,
            coordinates: s_coords(values)
          }
        end

        # Callback for element end, used here to manage endings of lines and
        #   blocks.
        #
        # @param name [String] element name.
        def end_element(name)
          @text += " " if name == 'String'
          @text += "\n" if name == 'TextBlock'
          @text += "\n" if name == 'TextLine'
        end

        # Callback for completion of parsing ALTO, used to normalize generated
        #   text content (strip unneeded whitespace incidental to output).
        def end_document
          # postprocess @text to remove trailing spaces on lines
          @text = @text.split("\n").map(&:strip).join("\n")
          # remove trailing whitespace at end of buffer
          @text.strip!
        end
      end

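      # Construct with either a path or XML source, and process the document
      #
      # @param xml [String] path to an ALTO XML file, or the XML source itself
      # @param image_width [Integer, nil] pixel width of the image, used for scaling
      # @param image_height [Integer, nil] pixel height of the image
      def initialize(xml, image_width = nil, image_height = nil)
        @source = isxml?(xml) ? xml : File.read(xml)
        @image_width = image_width
        @image_height = image_height
        @doc_stream = AltoDocStream.new(image_width)
        parser = Nokogiri::XML::SAX::Parser.new(doc_stream)
        parser.parse(@source)
      end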

      # Determine if source parameter is path or xml
      #
      # @param xml [String] either path to xml file or xml source
      # @return [true, false] true if string appears to be XML source, not path
      def isxml?(xml)
        xml.lstrip.start_with?('<')
      end

      # Output JSON flattened word coordinates
      #
      # @return [String] JSON serialization of flattened word coordinates
      def json
        words = @doc_stream.words
        Curate::TextExtraction::WordCoordsBuilder.json_coordinates_for(
          words: words,
          width: @image_width,
          height: @image_height
        )
      end
    end
  end
end
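The scaling arithmetic in `AltoDocStream` can be tried in isolation. The sketch below is a stand-alone approximation rather than the class itself: `scaling_for` and `scale_coords` are hypothetical names, and the page and image widths are made-up values.

```ruby
# Hypothetical stand-alone version of the scaling arithmetic in AltoDocStream.
# page_width:  WIDTH attribute of the ALTO <Page> element
# image_width: pixel width of the image the coordinates should map onto
def scaling_for(page_width, image_width)
  return 1.0 if image_width.nil? || image_width == page_width
  page_width / image_width.to_f
end

# Returns [x, y, width, height], each divided by the scaling factor and
# truncated to an Integer; missing attributes default to 0.
def scale_coords(attrs, scaling)
  %w[HPOS VPOS WIDTH HEIGHT].map { |key| ((attrs[key] || 0).to_i / scaling).to_i }
end

scaling = scaling_for(2400, 1200) # 2.0
p scale_coords({ 'HPOS' => '200', 'VPOS' => '100', 'WIDTH' => '500', 'HEIGHT' => '40' }, scaling)
# Prints: [100, 50, 250, 20]
```

When no image width is supplied the factor stays at 1.0, so ALTO coordinates pass through unchanged.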
47 changes: 47 additions & 0 deletions app/lib/curate/text_extraction/word_coords_builder.rb
@@ -0,0 +1,47 @@
# frozen_string_literal: true

# NOTE: This model is largely derived from IiifPrint's (v3.0.1)
#   IiifPrint::TextExtraction::WordCoordsBuilder class. Minor changes have been made to bring
#   the code into Rubocop compliancy. The IiifPrint Gem application is licensed under the
#   Apache License 2.0. At the time of adopting this licensed work into this application,
#   Commercial use, Modification, and Private use were listed under this Gem's Permissions.
#   The referenced License can be found here:
#   https://github.com/scientist-softserv/iiif_print/blob/v3.0.1/LICENSE
module Curate
  # Module for text extraction (OCR or otherwise)
  module TextExtraction
    class WordCoordsBuilder
      # @param words [Array<Hash>] an array of hash objects that have the keys `:word` and `:coordinates`.
      # @param width [Integer] the width of the "canvas" on which the words appear.
      # @param height [Integer] the height of the "canvas" on which the words appear.
      # @return [String] a JSON encoded string.
      def self.json_coordinates_for(words:, width: nil, height: nil)
        new(words, width, height).to_json
      end

      def initialize(words, width = nil, height = nil)
        @words = words
        @width = width
        @height = height
      end

      # Output JSON flattened word coordinates
      #
      # @return [String] JSON serialization of flattened word coordinates
      def to_json
        coordinates = {}
        @words.each do |w|
          word_chars = w[:word]
          word_coords = w[:coordinates]
          if coordinates[word_chars]
            coordinates[word_chars] << word_coords
          else
            coordinates[word_chars] = [word_coords]
          end
        end
        payload = { width: @width, height: @height, coords: coordinates }
        JSON.generate(payload)
      end
    end
  end
end
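For orientation, here is a small sketch of the payload shape that `WordCoordsBuilder#to_json` produces. It regroups the words with `each_with_object` instead of calling the class, and the words, coordinates, and canvas dimensions are invented sample values.

```ruby
require 'json'

# Sample input in the shape AltoDocStream#words collects (values are made up).
words = [
  { word: 'Emory', coordinates: [100, 50, 250, 20] },
  { word: 'University', coordinates: [360, 50, 300, 20] },
  { word: 'Emory', coordinates: [100, 400, 250, 20] }
]

# Group every occurrence of a word under a single key, as to_json does.
coords = words.each_with_object({}) do |w, grouped|
  (grouped[w[:word]] ||= []) << w[:coordinates]
end

payload = JSON.generate({ width: 1200, height: 1800, coords: coords })
puts payload
```

Each key in `coords` maps to a list of `[x, y, width, height]` boxes, one per occurrence, which is what lets a later lookup pick the nth hit for a word.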
@@ -0,0 +1,77 @@
# frozen_string_literal: true

# Blacklight IIIF Search v1.0.0 Override: per this application's instructions,
#   this module must be overridden if coordinates will be provided within the results
#   of this Gem's search API. It was also necessary to override #annotation_id and
#   #canvas_uri_for_annotation so that we can match the format of each canvas' @id
#   url value.

# customizable behavior for IiifSearchAnnotation
module BlacklightIiifSearch
  module AnnotationBehavior
    ##
    # Create a URL for the annotation
    # @return [String]
    def annotation_id
      "#{emory_iiif_id_url}/canvas/#{document[:id]}/annotation/#{hl_index}"
    end

    ##
    # Create a URL for the canvas that the annotation refers to
    # @return [String]
    def canvas_uri_for_annotation
      "#{emory_iiif_id_url}/canvas/#{document[:id]}" + coordinates
    end

    # NOTE: The methods #coordinates, #fetch_and_parse_coords, and #default_coords below are largely derived
    #   from IiifPrint's (v3.0.1) IiifPrint::BlacklightIiifSearch::AnnotationDecorator module methods of the
    #   same name. The methods have been refactored to function according to our expectations. The IiifPrint Gem
    #   application is licensed under the Apache License 2.0. At the time of adopting this licensed work into
    #   this application, Commercial use, Modification, and Private use were listed under this Gem's Permissions.
    #   The referenced License can be found here:
    #   https://github.com/scientist-softserv/iiif_print/blob/v3.0.1/LICENSE

    ##
    # return a string like "#xywh=100,100,250,20"
    # corresponding to coordinates of query term on image
    # @return [String]
    def coordinates
      coords_json = fetch_and_parse_coords
      return default_coords unless coords_json.present? && coords_json['coords'].present? && query.present?

      query_terms = query.split(' ').map(&:downcase)
      # Regexp.union escapes each term, so punctuation in the query is matched literally.
      pattern = Regexp.union(query_terms)
      matches = coords_json['coords'].select do |k, _v|
        k.downcase =~ pattern
      end
      coords_array = matches&.values&.flatten(1)&.[](hl_index)

      coords_array.present? ? "#xywh=#{coords_array.join(',')}" : default_coords
    end

    private

    ##
    # a default set of coordinates
    # @return [String]
    def default_coords
      '#xywh=0,0,0,0'
    end

    ##
    # return the JSON word-coordinates file contents
    # @return [JSON]
    def fetch_and_parse_coords
      coords = document['alto_xml_tesi']
      return nil if coords.blank?
      begin
        JSON.parse(coords)
      rescue JSON::ParserError
        nil
      end
    end

    def emory_iiif_id_url
      "http://#{ENV['HOSTNAME'] || 'localhost:3000'}/iiif/#{parent_document[:id]}/manifest"
    end
  end
end
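The lookup performed by `#coordinates` can be exercised without Blacklight. The sketch below is a simplified stand-alone version: `xywh_fragment` is a hypothetical name, the JSON is sample data in the shape stored in `alto_xml_tesi`, and plain Ruby nil checks replace the ActiveSupport `present?` calls. It escapes the query terms with `Regexp.union` so that punctuation in a query is matched literally.

```ruby
require 'json'

# Sample word-coordinates JSON (invented values).
coords_json = JSON.parse(<<~PAYLOAD)
  {"width":1200,"height":1800,
   "coords":{"Emory":[[100,50,250,20],[100,400,250,20]],"University":[[360,50,300,20]]}}
PAYLOAD

# Hypothetical helper: returns the media fragment for the hl_index-th hit,
# or the all-zero default when nothing matches.
def xywh_fragment(coords_json, query, hl_index)
  pattern = Regexp.union(query.split.map(&:downcase))
  matches = coords_json['coords'].select { |word, _boxes| word.downcase.match?(pattern) }
  box = matches.values.flatten(1)[hl_index]
  box ? "#xywh=#{box.join(',')}" : '#xywh=0,0,0,0'
end

puts xywh_fragment(coords_json, 'emory', 1)   # second occurrence of "Emory"
puts xywh_fragment(coords_json, 'missing', 0) # falls back to the default
```

Because matching is by substring on the lowercased key, a query term also hits longer words that contain it, which mirrors the behavior of the method above.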
2 changes: 1 addition & 1 deletion app/models/file_set.rb
@@ -93,7 +93,7 @@ def alto_xml
  end

  def transcript_text
-    transcript_file&.content&.to_s if transcript_file&.file_name&.first&.include?('.txt')
+    transcript_file&.content&.force_encoding('UTF-8') if transcript_file&.file_name&.first&.include?('.txt')
  end

  private
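The switch from `to_s` to `force_encoding('UTF-8')` matters when the file content arrives as a binary string. The sketch below assumes that is the case here (the commit does not say so explicitly) and uses a made-up sample value.

```ruby
# A binary (ASCII-8BIT) string whose bytes happen to be valid UTF-8.
raw = "caf\xC3\xA9".b

# String#to_s returns the receiver itself, so the encoding tag is unchanged.
same = raw.to_s
puts same.encoding == Encoding::BINARY # true

# force_encoding relabels the bytes as UTF-8 without transcoding them.
# It mutates its receiver, so the sketch works on a copy.
text = raw.dup.force_encoding('UTF-8')
puts text.encoding         # UTF-8
puts text.valid_encoding?  # true
```

Note that `force_encoding` only relabels the bytes; content that is not actually UTF-8 would still report `valid_encoding?` as false.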
17 changes: 17 additions & 0 deletions app/models/iiif_search_builder.rb
@@ -0,0 +1,17 @@
# frozen_string_literal: true

# SearchBuilder for full-text searches with highlighting and snippets
class IiifSearchBuilder < Blacklight::SearchBuilder
  include Blacklight::Solr::SearchBuilderBehavior

  self.default_processor_chain += [:ocr_search_params]

  # set params for ocr field searching
  def ocr_search_params(solr_parameters = {})
    solr_parameters[:facet] = false
    solr_parameters[:hl] = true
    solr_parameters[:'hl.fl'] = blacklight_config.iiif_search[:full_text_field]
    solr_parameters[:'hl.fragsize'] = 100
    solr_parameters[:'hl.snippets'] = 10
  end
end
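To see what `ocr_search_params` contributes to a request, the sketch below applies the same assignments to a plain hash. `IIIF_SEARCH_CONFIG` is a hypothetical stand-in for `blacklight_config.iiif_search`, and unlike the builder method this version returns the hash so it can be inspected outside a Blacklight processor chain.

```ruby
# Hypothetical stand-in for blacklight_config.iiif_search.
IIIF_SEARCH_CONFIG = { full_text_field: 'transcript_text_tesi' }.freeze

# Same assignments as IiifSearchBuilder#ocr_search_params, returning the hash.
def ocr_search_params(solr_parameters = {})
  solr_parameters[:facet] = false        # facets are not needed for in-document search
  solr_parameters[:hl] = true            # turn on Solr highlighting
  solr_parameters[:'hl.fl'] = IIIF_SEARCH_CONFIG[:full_text_field]
  solr_parameters[:'hl.fragsize'] = 100  # approximate snippet size in characters
  solr_parameters[:'hl.snippets'] = 10   # upper bound on snippets per field
  solr_parameters
end

params = ocr_search_params({ q: 'emory', qt: 'search' })
p params
```

The existing `q` and `qt` entries are left alone; only faceting and highlighting keys are added.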
9 changes: 9 additions & 0 deletions app/models/solr_document.rb
@@ -162,4 +162,13 @@ def human_readable_visibility
  def source_collection_title
    self['source_collection_title_ssim']
  end

  # Added here since the SolrDocument is easily available within app/views/manifest/manifest.json.jbuilder partial.
  def work_iiif_search_url
    return ('http://localhost:3000/catalog/' + self['id'] + '/iiif_search') if ENV['IIIF_SERVER_URL'].blank?
    parsed_iiif_url = URI.parse(ENV['IIIF_SERVER_URL'])
    base_path = parsed_iiif_url.to_s[/\A.*(?=#{parsed_iiif_url.path}\z)/]

    base_path + '/catalog/' + self['id'] + '/iiif_search'
  end
end
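The base-URL derivation in `work_iiif_search_url` strips the path from `IIIF_SERVER_URL` with a lookahead. The sketch below reproduces that step as a free function so it can be run outside Rails; the function signature, the example host `iiif.example.org`, and the work id are assumptions, and `Regexp.escape` is added by the sketch. It also assumes the configured URL carries no query string.

```ruby
require 'uri'

# Stand-alone approximation of SolrDocument#work_iiif_search_url.
# id and iiif_server_url are passed in instead of read from self / ENV.
def work_iiif_search_url(id, iiif_server_url)
  if iiif_server_url.nil? || iiif_server_url.empty?
    return "http://localhost:3000/catalog/#{id}/iiif_search"
  end

  parsed = URI.parse(iiif_server_url)
  # Keep everything that precedes the trailing path component.
  base = parsed.to_s[/\A.*(?=#{Regexp.escape(parsed.path)}\z)/]
  "#{base}/catalog/#{id}/iiif_search"
end

puts work_iiif_search_url('123abc', 'https://iiif.example.org/iiif/2')
# Prints: https://iiif.example.org/catalog/123abc/iiif_search
puts work_iiif_search_url('123abc', nil)
# Prints: http://localhost:3000/catalog/123abc/iiif_search
```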