NOTE: Currently this repository only accommodates HTML, TEI, webs (web scraped HTML), and VRA. The process to customize them should be nearly identical, so for the purposes of this documentation, we will be working with TEI.
If you are already familiar with XSLT / XPath, you may be interested in comparing Ruby / Nokogiri equivalents with XSLT behavior.
This document may be easier to follow if you have a working knowledge of the following in Ruby:
- variables
- arrays
- hashes
- string methods
- (optional) classes and inheritance
In general, the data repository is set up so that you may override ANYTHING inside a class or module in the scripts/ruby (sub)directories by putting it in your collection's script directory. Your file only needs the class / module name at the top and a method with the exact name of the one you are overriding; the rest of the original file's contents will not be altered. This is how we will customize the TEI to Elasticsearch transformation. If you are customizing a VRA script, just substitute that into the filenames and classes below.
If a file does not already exist at scripts/overrides/tei_to_es.rb, create one that looks like this:
class TeiToEs
end
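To make the override mechanism concrete, here is a minimal sketch of how reopening a Ruby class replaces one method and leaves the others alone. DemoToEs and its methods are hypothetical stand-ins for illustration, not the repository's actual TeiToEs code.

```ruby
# Stand-in for a class defined in the shared scripts (hypothetical).
class DemoToEs
  def title
    "default title"
  end

  def rights
    "default rights"
  end
end

# Stand-in for a collection override file: same class name at the top,
# and only the method we want to change.
class DemoToEs
  def rights
    "collection rights"
  end
end

doc = DemoToEs.new
puts doc.title   # prints "default title"
puts doc.rights  # prints "collection rights"
```

Because Ruby classes stay open, the second definition only swaps out `rights`; `title` keeps its original behavior.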
There are several types of overrides that you can do in this class:
- xpaths
- the json being sent to Elasticsearch
- field behavior
You may also
- add new xpaths
- add new fields
- access the @options from the config files
- read in additional XML files (annotations, references, etc)
I'll go through them in sections.
If you require more flexibility than simply changing some similar xpaths or getting a list of items, the following may interest you.
In the tei_to_es.rb file, you have access to several instance variables:
@filename
@id
@json
@options
@parent_xml
@xml
@xpaths
@filename is the name of the file this (sub)document is pulling from, without an extension.
@id is often associated with the filename. It is created with the get_id method in TeiToEs, whose default behavior may be overridden. The @id is generally NOT the filename when you're dealing with subdocuments, like those from a personography file.
@json is the JSON structure being built to send to Elasticsearch. You may directly manipulate it as you would a traditional hash, but keep in mind that its contents may not be populated yet. You will need to look at the scripts/ruby/lib/tei_to_es/request.rb file to see the conventional json build order.
Typically I would recommend altering this in the postprocessing step if you need to work with the json directly, but every collection is different, so feel free to poke around with it and use it directly!
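Here is a rough sketch of why the postprocessing step is the safest place to touch the json. DemoRequest, assemble_fields, and build are hypothetical names, and treating postprocessing as a method that runs after the fields are built is an assumption made for this example; check request.rb for the real build order.

```ruby
# Hypothetical stand-in for the object that builds the Elasticsearch request.
class DemoRequest
  def initialize
    @json = {}
  end

  # Hypothetical field-building step; real field methods would read from the XML.
  def assemble_fields
    @json["title"] = "  O Pioneers!  "
    @json["creator"] = []
  end

  # Runs after every field has been populated, so @json is safe to inspect here.
  def postprocessing
    @json["title"] = @json["title"].strip
    @json.delete("creator") if @json["creator"].empty?
  end

  def build
    assemble_fields
    postprocessing
    @json
  end
end

puts DemoRequest.new.build.inspect
```

Had the cleanup run before assemble_fields, `@json["title"]` would still be nil, which is the "may not be populated yet" problem in practice.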
The @options are created by combining the config files and any parameters you passed in via the command line. You can see all the options being used by adding -v when you post. This means that anything you add to the config file can be accessed when generating the ES request. For example, you might add:
# public.yml
default:
  copyright_info: Your statement here
# tei_to_es.rb
def rights
  @options["copyright_info"]
end
To see the entire list of @options available to your collection, run this command:
print_options -e [environment]
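If it helps to picture how the options come together, the following is a minimal sketch, not the repository's real loader. It assumes the "default" section is applied first, then the environment section, then command line parameters; demo_options and that precedence order are assumptions for illustration.

```ruby
require "yaml"

# A tiny config in the same shape as public.yml.
CONFIG = YAML.safe_load(<<~YAML)
  default:
    copyright_info: Your statement here
    uri: actualsite.unl.edu
  development:
    uri: cdrhdev.unl.edu
YAML

# Hypothetical helper: later hashes win when the same key appears twice.
def demo_options(config, environment, cli_params = {})
  config.fetch("default", {})
        .merge(config.fetch(environment, {}))
        .merge(cli_params)
end

options = demo_options(CONFIG, "development", { "verbose" => true })
puts options["copyright_info"]  # prints "Your statement here"
puts options["uri"]             # prints "cdrhdev.unl.edu"
```

The point is only that a key added under default is visible in every environment unless that environment redefines it.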
@parent_xml can usually be ignored. It is only relevant if you are working with subdocuments, like from a personography file or encyclopedia, or perhaps if you wanted to split a single-file book's chapters into individual search items. Please see the documentation on subdocuments TODO docs.
@xml is your document's XML as a Nokogiri object. You can use methods like get_text which are built into this repository, or you can operate on the object directly.
If you really wanted to, you could add to / alter the XML itself, if that would somehow help you in your quest. It will not alter the actual TEI documents on the filesystem unless you really know what you're doing and intentionally try to alter the original files. There's no risk of that happening accidentally, so feel free to alter the XML object as preprocessing if you need to!
@xpaths is the hash which contains all of the xpaths your document should be using. Typically you will use it to access either the default xpaths in scripts/ruby/lib/tei_to_es/xpaths.rb or any you've overridden / added.
@xpaths["title"]
There may be times that you can't accomplish everything you want using the built-in tools provided by datura, for example if you're selecting a person's name and need to parse the parent node's attributes to build an object. Then it's time to remember that @xml is a Nokogiri::XML::Element.
For the most part, however, you should be able to accomplish your goals using the methods in Getting XPath Contents.
At scripts/ruby/lib/tei_to_es/xpaths.rb you can find a list of all the default xpaths being used by the default fields. If you run your collection with no customization, the script will do the best it can with these defaults. But let's say that your xml file doesn't have the publisher information in the same place as the default.
# default xpath
"publisher" => "/TEI/teiHeader/fileDesc/sourceDesc/bibl[1]/publisher[1]"
You would add the following code to your collection's tei_to_es.rb file:
def override_xpaths
  # create an empty container for your overrides
  xpaths = {}
  # add the publisher xpath
  xpaths["publisher"] = "/TEI/teiHeader/fileDesc/publicationStmt/publisher[1]"
  # the last line should always be your new xpaths
  xpaths
end
You may also define your xpaths like this, which is a bit simpler. However, using this system may make it more difficult to override fields such as source with nested xpaths. This is a case where it will be helpful if you understand Ruby syntax for Hashes!
def override_xpaths
  {
    "publisher" => "/TEI/teiHeader/fileDesc/publicationStmt/publisher[1]"
  }
end
Now your collection will use your xpath but otherwise behave the same way as before. That is, if it was returning a list of publishers previously, it will still attempt to return a list. In our example, we know there is only [1] (the first) publisher being returned, so it's a good idea to specify that again in our override.
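One way to picture what happens to your overrides is a plain hash merge. The sketch below assumes the defaults and the result of override_xpaths are combined with Hash#merge; that is an assumption for illustration, and the "rights" xpath is a made-up placeholder rather than the real default.

```ruby
# Hypothetical subset of the default xpaths.
DEFAULT_XPATHS = {
  "publisher" => "/TEI/teiHeader/fileDesc/sourceDesc/bibl[1]/publisher[1]",
  "rights"    => "/TEI/teiHeader/fileDesc/publicationStmt/availability" # placeholder xpath
}.freeze

# The collection only lists what it wants to change.
def override_xpaths
  { "publisher" => "/TEI/teiHeader/fileDesc/publicationStmt/publisher[1]" }
end

xpaths = DEFAULT_XPATHS.merge(override_xpaths)
puts xpaths["publisher"]  # prints "/TEI/teiHeader/fileDesc/publicationStmt/publisher[1]"
puts xpaths["rights"]     # prints "/TEI/teiHeader/fileDesc/publicationStmt/availability"
```

Keys you do not mention keep their default value, and DEFAULT_XPATHS itself is never modified.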
But let's say we need to override something that has several options! How would we change this?
{
  "creator" => [
    "/TEI/teiHeader/fileDesc/titleStmt/author",
    "//persName[@type = 'author']",
    "/TEI/teiHeader/fileDesc/sourceDesc/bibl/author",
    "/TEI/teiHeader/fileDesc/sourceDesc/biblStruct/monogr/author",
    "//correspDesc/correspAction[@type='sentBy']/persName"
  ],
}
Well, if you don't need all those options, you can just entirely redefine it:
{
  "creator" => "//our/new/xpath"
}
You can also add a whole list of xpaths, possibly including some of the xpaths that will otherwise be overridden:
{
  "creator" => [
    "//persName[@type='author']",
    "//our/new/xpath",
  ]
}
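Since a field's xpath can be either a single string or a list, code that consumes it has to cope with both shapes. A small sketch of one way to do that with Ruby's Kernel#Array follows; demo_xpath_list is a hypothetical helper, not something the repository is known to provide.

```ruby
# Hypothetical helper: always hand back a list of xpaths,
# whether the value was one string or already a list.
def demo_xpath_list(value)
  Array(value)
end

single = demo_xpath_list("//our/new/xpath")
several = demo_xpath_list(["//persName[@type='author']", "//our/new/xpath"])

puts single.length   # prints 1
puts several.length  # prints 2
```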
Another type of complicated xpath is demonstrated by the source field. Assuming that you only need to change a few values in source, you may want to only alter them, but you are, of course, able to overwrite the entire source hash if desired.
xpaths = {}
# overwrite ALLLLL the source fields
xpaths["source"] = {
  # field => xpath
}
# overwrite just some of the fields
# (xpaths["source"] must already be a hash here, as it is after the assignment above)
xpaths["source"]["author"] = "new/xpath"
xpaths["source"]["title"] = "new/xpath"
xpaths
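When you only want to change a couple of the nested source xpaths, one option is to start from the default nested hash and merge your changes into a copy. This is a sketch under assumptions: DEFAULT_SOURCE_XPATHS, its placeholder xpaths, and demo_override_source are all invented for the example.

```ruby
# Hypothetical defaults with a nested "source" hash (placeholder xpaths).
DEFAULT_SOURCE_XPATHS = {
  "source" => {
    "author"    => "/TEI/old/author",
    "title"     => "/TEI/old/title",
    "publisher" => "/TEI/old/publisher"
  }
}.freeze

# Build an override that changes two nested keys and keeps the rest.
def demo_override_source(defaults)
  xpaths = {}
  xpaths["source"] = defaults["source"].merge(
    "author" => "new/xpath",
    "title"  => "new/xpath"
  )
  xpaths
end

source = demo_override_source(DEFAULT_SOURCE_XPATHS)["source"]
puts source["author"]     # prints "new/xpath"
puts source["publisher"]  # prints "/TEI/old/publisher"
```

Assigning a brand new hash to "source" instead would drop "publisher" entirely, which is why the nested case needs a little more care than flat fields.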
Datura's lib/datura/to_es/tei_to_es/fields.rb has a list of functions which generate fields used by Elasticsearch. You can override them to add new behavior. Inside these overrides you can access information stored in the config file by using the @options object. You may also do comparisons, check content of xpaths, and more!
The following are helpers created to make your life easier:
- get_text: grab the text of one or many xpaths
- get_list: create a list of text from one or many xpaths
- get_elements: get the xml elements from one or many xpaths
Here's how they might be used:
<TEI>
  <person dept="science">Jadzia Dax</person>
  <person dept="engineering">Geordi LaForge</person>
</TEI>
get_list("/TEI/person")
#=> ["Jadzia Dax", "Geordi LaForge"]
get_text("/TEI/person")
#=> "Jadzia Dax; Geordi LaForge"
get_elements("/TEI/person")
#=> <person dept="science">Jadzia Dax</person>, <person dept="engineering">Geordi LaForge</person>
By default, all of the above methods operate on the @xml object, which is typically the specific TEI file. However, you may ask them to operate on a different file or sub-set with the xml keyword. This may be helpful if you are working with other documents, such as authority files, or if you are working within elements in your TEI document, such as pages or personography entries.
# let's say that you need to pull in a title from an authority file
# named `@works`:
work_id = get_text("/ref/@id")
get_text("//works/work[@id='#{work_id}']", xml: @works)
get_text and get_list parameters:
- keep_tags (optional): defaults to false, pass in true if you want to convert italics, bold, underline to HTML
- xml (optional): defaults to entire document XML, or pass in your own XML object
get_text only:
- delimiter (optional): defaults to ";", the separator between multiple items
This looks like the following:
get_list(xpaths, keep_tags: [keep_tags], xml: [xml])
get_text(xpaths, keep_tags: [keep_tags], xml: [xml], delimiter: [delimiter])
By default, get_text and get_list will strip out XML from the results of the xpath. If you would like to preserve italics, bold, and underlining, pass in the "keep_tags" parameter to convert them from TEI to HTML:
<TEI>
  <body>She wrote the book <hi rend="italics">My 100 Year Old Moth</hi></body>
</TEI>
get_text(@xpaths["text"], keep_tags: true)
#=> "She wrote the book <em>My 100 Year Old Moth</em>"
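To get a feel for what keep_tags is doing, here is a deliberately simplified sketch that handles italics only and works on a string rather than on a parsed document. demo_keep_italics is a hypothetical function, and the regular expressions are an illustration of the idea, not how datura implements the conversion.

```ruby
# Hypothetical, italics-only imitation of keep_tags: true.
def demo_keep_italics(fragment)
  # Turn TEI italics into HTML <em>.
  with_em = fragment.gsub(/<hi rend="italics">(.*?)<\/hi>/m, '<em>\1</em>')
  # Strip every remaining tag except <em> and </em>.
  with_em.gsub(/<(?!\/?em\b)[^>]+>/, "")
end

tei = '<body>She wrote the book <hi rend="italics">My 100 Year Old Moth</hi></body>'
puts demo_keep_italics(tei)
# prints: She wrote the book <em>My 100 Year Old Moth</em>
```

Regular expressions are fine for a toy like this, but for real documents you would want to work with the parsed XML instead.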
You can also customize the delimiter used when get_text encounters multiple results (it defaults to ";"):
get_text("/TEI/person", delimiter: ",")
#=> "Jadzia Dax, Geordi LaForge"
get_text(@xpaths["people"], delimiter: " &")
#=> "Jadzia Dax & Geordi LaForge"
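If you would like to experiment with this behavior outside of datura, the sketch below imitates get_list and get_text using REXML, which ships with standard Ruby installs, instead of Nokogiri. demo_get_list and demo_get_text are hypothetical stand-ins, and joining with the delimiter plus a space is an assumption inferred from the outputs shown above.

```ruby
require "rexml/document"

DEMO_XML = REXML::Document.new(<<~XML)
  <TEI>
    <person dept="science">Jadzia Dax</person>
    <person dept="engineering">Geordi LaForge</person>
  </TEI>
XML

# Hypothetical stand-in for get_list: the text of every node matching the xpath.
def demo_get_list(xpath, xml: DEMO_XML)
  REXML::XPath.match(xml, xpath).map { |node| node.text.to_s.strip }
end

# Hypothetical stand-in for get_text: the list joined by the delimiter and a space.
def demo_get_text(xpath, xml: DEMO_XML, delimiter: ";")
  demo_get_list(xpath, xml: xml).join("#{delimiter} ")
end

puts demo_get_text("/TEI/person")                   # prints "Jadzia Dax; Geordi LaForge"
puts demo_get_text("/TEI/person", delimiter: ",")   # prints "Jadzia Dax, Geordi LaForge"
puts demo_get_text("/TEI/person", delimiter: " &")  # prints "Jadzia Dax & Geordi LaForge"
```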
You can use get_elements to access additional attributes and information from xpaths.
people = get_elements("/TEI/person")
#=> <person dept="science">Jadzia Dax</person>, <person dept="engineering">Geordi LaForge</person>
# using nokogiri directly
people.each do |p|
  puts p.text
  puts p["dept"]
end
# using datura methods
people.each do |p|
  puts get_text(".", xml: p)
  puts get_text("@dept", xml: p)
end
#=> Jadzia Dax
#=> science
#=> Geordi LaForge
#=> engineering
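Printing is handy for exploring, but in a field you will usually want to collect the element data into hashes. A small sketch of that step follows, again using REXML from a standard Ruby install as a stand-in for Nokogiri; DEMO_PEOPLE_XML and demo_people are hypothetical names.

```ruby
require "rexml/document"

DEMO_PEOPLE_XML = REXML::Document.new(<<~XML)
  <TEI>
    <person dept="science">Jadzia Dax</person>
    <person dept="engineering">Geordi LaForge</person>
  </TEI>
XML

# Hypothetical helper: one hash per matching element, built from text and attributes.
def demo_people(xml)
  REXML::XPath.match(xml, "/TEI/person").map do |person|
    { "name" => person.text, "dept" => person.attributes["dept"] }
  end
end

demo_people(DEMO_PEOPLE_XML).each do |person|
  puts "#{person["name"]} (#{person["dept"]})"
end
# prints:
# Jadzia Dax (science)
# Geordi LaForge (engineering)
```

The same map-to-hash shape is what the nested fields later in this document rely on.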
Now that you know how to set xpaths and get their values, let's move to the part where you override field behavior.
You can override any of the fields found in datura's /lib/datura/to_es/tei_to_es/fields.rb by copying them into your tei_to_es.rb file and changing what they return.
Here is a very basic example which changes to a hardcoded string response:
# default version
def rights
  get_text(@xpaths["rights"])
end
# collection version
def rights
  "For rights information, visit collectionname.unl.edu/rights"
end
Oh shoot, but your collection actually has information encoded in the TEI document itself which varies from file to file! Not to worry, you can override the field AND use one of your xpaths!
# collection version
def rights
  # file_has_criteria? is a placeholder: write your own check here
  if file_has_criteria?
    get_text(@xpaths["rights"])
  else
    "CC 4.0 License"
  end
end
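As one concrete possibility for the criteria, you could fall back to a hardcoded statement only when the xpath comes back empty. The sketch below uses a hypothetical DemoRights class with a stubbed get_text so it can run on its own; in your real override you would call datura's get_text with @xpaths["rights"] instead.

```ruby
# Hypothetical class with a stubbed get_text, so the example is self-contained.
class DemoRights
  def initialize(rights_text)
    @rights_text = rights_text
  end

  # Stub standing in for the real helper: returns whatever we were given.
  def get_text(_xpath)
    @rights_text
  end

  # Use the encoded statement when there is one, otherwise a default.
  def rights
    text = get_text("rights xpath goes here")
    text.nil? || text.strip.empty? ? "CC 4.0 License" : text
  end
end

puts DemoRights.new("Public domain").rights  # prints "Public domain"
puts DemoRights.new("").rights               # prints "CC 4.0 License"
```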
There might be circumstances where what you want to be in a field might depend on the contents of multiple xpaths or options, etc.
default:
  uri: actualsite.unl.edu
  display: true
development:
  uri: cdrhdev.unl.edu
def somefield
  if @options["display"]
    "#{@options["uri"]}/#{@id}"
  else
    "Site unavailable"
  end
end
Note for devs: You may also access the raw @xml object, @id, and @file. @file is an instance of FileTei, which inherits from FileType, so feel free to use available attributes / methods related to that class.
This is the most difficult type of override you'll likely encounter. Some of Elasticsearch's fields are "nested," meaning that a field can hold a list of objects, each with its own descriptive sub-fields. For example:
"id": "cat.8219",
"authors": [
  {
    "name": "Willa, Cather",
    "role": "Author",
    "id": "cat.per0001"
  },
  {
    "name": "So and So",
    "role": "Editor",
    "id": "cat.per0218"
  }
]
Fields like creator, contributor, and person fall into this nested category. The following will walk you through how one of these fields is built, using one of the more complicated versions.
# 1
def contributor
  # we're ready to start customizing this field
end
# 2
def contributor
  # let's start by grabbing whatever we've got at the xpath
  contribs = get_elements(@xpaths["contributor"])
end
# 3
def contributor
  # we need to do something with each contributor in order to pull
  # out information like identifier and role, so we add `.map` to iterate
  contribs = get_elements(@xpaths["contributor"]).map do |ele|
  end
end
# 4
def contributor
  # with each contributor element, we can grab the "id" and "role"
  contribs = get_elements(@xpaths["contributor"]).map do |ele|
    {
      "id" => get_text("@id", xml: ele),
      "name" => get_text(".", xml: ele),
      "role" => get_text("@role", xml: ele)
    }
  end
end
# 5
def contributor
  contribs = get_elements(@xpaths["contributor"]).map do |ele|
    {
      "id" => get_text("@id", xml: ele),
      "name" => get_text(".", xml: ele),
      "role" => get_text("@role", xml: ele)
    }
  end
  # finally, we take our list of contributors and crush down any
  # that are not unique. We don't need one person's name showing up 3 times,
  # after all!
  contribs.uniq
end
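It is worth knowing exactly what that final uniq does with a list of hashes. The sketch below uses made-up contributor data (the "Translator" role is invented for the example) to show that Array#uniq compares whole hashes, so the same person with a different role is kept as a separate entry.

```ruby
# Made-up data: the same person listed three times, once with a different role.
CONTRIBS = [
  { "id" => "cat.per0218", "name" => "So and So", "role" => "Editor" },
  { "id" => "cat.per0218", "name" => "So and So", "role" => "Editor" },
  { "id" => "cat.per0218", "name" => "So and So", "role" => "Translator" }
].freeze

# Plain uniq drops only entries whose keys and values all match.
by_content = CONTRIBS.uniq
# uniq with a block keeps the first entry for each id.
by_id = CONTRIBS.uniq { |contrib| contrib["id"] }

puts by_content.length  # prints 2
puts by_id.length       # prints 1
```

Whether you want one entry per person or one per person and role depends on your collection, so pick the form of uniq that matches.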
Hopefully the majority of your field requirements are much less complicated than needing to recreate / imitate these nested fields!