Skip to content

Commit

Permalink
Merge pull request #127 from collectionspace/ignore-multi
Browse files Browse the repository at this point in the history
Add multiple_recs_found config option
  • Loading branch information
kspurgin authored Sep 23, 2021
2 parents fa7145d + 20bba94 commit 0955c6a
Show file tree
Hide file tree
Showing 9 changed files with 100 additions and 24 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

This project bumps the version number for any changes (including documentation updates and refactorings). Versions with only documentation or refactoring changes may not be released. Versions with bugfixes will be released. Changes made to unreleased versions will be indicated by version number under each release that includes those changes.

## [Unreleased]
## [Unreleased] - i.e. pushed to main branch but not yet tagged as a release
### Added
- `multiple_recs_found` batch configuration option added to allow batch deletion of duplicate records. This defaults to `fail`, which means if there are two or more existing records sharing the same ID, the batch importer will not transfer anything for that ID. In rare cases, however, you may really need to delete duplicates, and now you can. The batch importer will transfer your update or delete to the first result found via a search for the record ID. See [the batch configuration options documentation](https://github.com/collectionspace/collectionspace-mapper/blob/main/doc/batch_configuration.adoc) for more information.

## [2.4.9] - 2021-09-03
### Changed
Expand Down
22 changes: 22 additions & 0 deletions doc/batch_configuration.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ A JSON config hash may be passed to a new `Mapper::DataHandler` to control vario
"delimiter": ";",
"subgroup_delimiter": "^^",
"response_mode": "verbose",
"multiple_recs_found": "fail",
"check_terms" : true,
"check_record_status" : true,
"force_defaults": false,
Expand Down Expand Up @@ -80,6 +81,27 @@ If `verbose`, `Mapper::Response` also has the following attributes, which may be
- *Data type*: string
- *Allowed values*: `normal`, `verbose`

== multiple_recs_found

Controls what happens when the mapper looks up the status (new vs. existing) of the record being mapped in your CollectionSpace instance, and more than one record with the same ID is found.

If `fail`, the mapper returns an error for that record. You will not be able to transfer that record with the batch importer.

`fail` is the default because it is generally unsafe to update or delete a record when it's not clear which record should be updated.

WARNING: Do not use this option at all if you are not 100% certain of what it does. It has the potential to be very destructive to your data.

There may be odd cases where you end up with true duplicate records, in your system, however. The `use_first` value for this config option was added to enable batch deletion of known duplicate records. If your records with the same ID are not actually duplicates, this can be very destructive, so *use with care*.

If using this option to enable batch deletes of records with duplicate ids, you have no control over which record with the given id will be deleted. If they are true duplicate records, that is fine. Note that, only one record with a given ID is ever updated or deleted at a time via the CSV importer. If you had 3 records with the same id, and you used this option to do a delete transfer, you will still have 2 records with the same id in the system.

While it is possible to use this setting to batch update existing records that do not have unique ids, it is strongly discouraged. You will not have any control over _which_ of the records with a non-unique id is updated. If the records sharing an ID were not duplicate records, you may be updating the wrong record. If they were duplicates, they won't be after you update one, but you will still have duplicate ids in the system.

- *Required?:* no
- *Defaults to:* fail
- *Data type*: string
- *Allowed values*: `fail`, `use_first`

== check_terms

If `true`, looks up each term via `collectionspace-refcache`. If found, uses existing refname. If not found, searches for term via cspace-services API and uses existing refname if found. If term not found in refcache or services API, builds a new refname, uses that in the record, adds it to refcache, and returns the term with `found=false` in `Response::Terms`.
Expand Down
3 changes: 2 additions & 1 deletion lib/collectionspace/mapper/config.rb
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ module Mapper
# or non-hierarchichal relationships via module extension
# :reek:InstanceVariableAssumption - instance variables are set during initialization
class Config
attr_reader :delimiter, :subgroup_delimiter, :response_mode, :force_defaults, :check_record_status,
attr_reader :delimiter, :subgroup_delimiter, :response_mode, :multiple_recs_found, :force_defaults, :check_record_status,
:check_terms, :date_format, :two_digit_year_handling, :transforms, :default_values,
:record_type
# todo: move default config in here
Expand All @@ -18,6 +18,7 @@ class Config
DEFAULT_CONFIG = { delimiter: '|',
subgroup_delimiter: '^^',
response_mode: 'normal',
multiple_recs_found: 'fail',
check_terms: true,
check_record_status: true,
force_defaults: false,
Expand Down
14 changes: 9 additions & 5 deletions lib/collectionspace/mapper/data_handler.rb
Original file line number Diff line number Diff line change
Expand Up @@ -167,11 +167,15 @@ def set_record_status(response)
else
status = searchresult[:status]
response.record_status = status
if status == :existing
response.csid = searchresult[:csid]
response.uri = searchresult[:uri]
response.refname = searchresult[:refname]
end
return if status == :new

response.csid = searchresult[:csid]
response.uri = searchresult[:uri]
response.refname = searchresult[:refname]
num_found = searchresult[:multiple_recs_found]
return unless num_found

response.add_multi_rec_found_warning(num_found)
end
end

Expand Down
11 changes: 11 additions & 0 deletions lib/collectionspace/mapper/response.rb
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,17 @@ def normal
def xml
doc ? doc.to_xml : nil
end

def add_multi_rec_found_warning(num_found)
warnings << {
category: :multiple_records_found_for_id,
field: nil,
type: nil,
subtype: nil,
value: nil,
message: "#{num_found} records found for #{identifier}. Using first record found: #{uri}"
}
end
end
end
end
Expand Down
33 changes: 24 additions & 9 deletions lib/collectionspace/mapper/tools/record_status_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -38,21 +38,30 @@ def lookup(value)

ct = count_results(response)
if ct == 0
report = { status: :new }
reportable_result
elsif ct == 1
report = {
status: :existing,
csid: response.parsed[@response_top][@response_nested]['csid'],
uri: response.parsed[@response_top][@response_nested]['uri'],
refname: response.parsed[@response_top][@response_nested]['refName']
}
reportable_result(response.parsed[@response_top][@response_nested])
elsif ct > 1
raise CollectionSpace::Mapper::MultipleCsRecordsFoundError.new(ct)
raise CollectionSpace::Mapper::MultipleCsRecordsFoundError.new(ct) unless use_first?

item = response.parsed[@response_top][@response_nested].first
num_found = response.parsed[@response_top][@response_nested].length
reportable_result(item).merge({ multiple_recs_found: num_found})
end
report
end

private

def reportable_result(item = nil)
return { status: :new } unless item

{
status: :existing,
csid: item['csid'],
uri: item['uri'],
refname: item['refName']
}
end

def lookup_non_relationship(value)
@client.find(
Expand All @@ -62,6 +71,12 @@ def lookup_non_relationship(value)
field: @search_field
)
end

def use_first?
return true if @mapper.batchconfig.multiple_recs_found == 'use_first'

false
end

def count_results(response)
unless response.result.success?
Expand Down
2 changes: 1 addition & 1 deletion lib/collectionspace/mapper/version.rb
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
module CollectionSpace
module Mapper
VERSION = '2.4.9'
VERSION = '2.5.0'
end
end
2 changes: 1 addition & 1 deletion spec/collectionspace/mapper/config_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@
let(:with_hash) { described_class.new(config: confighash) }
let(:with_nothing) { described_class.new }
let(:with_array) { described_class.new(config: [2, 3]) }
let(:expected_hash) { {:delimiter=>";", :subgroup_delimiter=>"^^", :response_mode=>"verbose", :force_defaults=>false, :check_record_status=>true, :check_terms=>true, :date_format=>"month day year", :two_digit_year_handling=>"convert to four digit", :transforms=>{"collection"=>{:special=>["downcase_value"], :replacements=>[{:find=>" ", :replace=>"-", :type=>"plain"}]}}, :default_values=>{"publishto"=>"DPLA;Omeka", "collection"=>"library-collection"}} }
let(:expected_hash) { {:delimiter=>";", :subgroup_delimiter=>"^^", :response_mode=>"verbose", :multiple_recs_found=>'fail', :force_defaults=>false, :check_record_status=>true, :check_terms=>true, :date_format=>"month day year", :two_digit_year_handling=>"convert to four digit", :transforms=>{"collection"=>{:special=>["downcase_value"], :replacements=>[{:find=>" ", :replace=>"-", :type=>"plain"}]}}, :default_values=>{"publishto"=>"DPLA;Omeka", "collection"=>"library-collection"}} }
let(:invalid_response) { {response_mode: 'mouthy'} }
let(:with_invalid_response) { described_class.new(config: invalid_response) }

Expand Down
33 changes: 27 additions & 6 deletions spec/collectionspace/mapper/tools/record_status_service_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -20,9 +20,11 @@

describe '#lookup' do
context 'when mapper is for an authority' do
let(:mapper) { CollectionSpace::Mapper::RecordMapper.new(mapper: get_json_record_mapper(
'spec/fixtures/files/mappers/release_6_1/core/core_6-1-0_person-local.json'
)) }
let(:mapper) do
CollectionSpace::Mapper::RecordMapper.new(mapper: get_json_record_mapper(
'spec/fixtures/files/mappers/release_6_1/core/core_6-1-0_person-local.json'
))
end

context 'and one result is found' do
let(:report) { service.lookup('John Doe') }
Expand Down Expand Up @@ -50,11 +52,30 @@
end

context 'and multiple results found' do
# if this test fails, verify there are two person/local authority records for 'Inkpot Guineafowl'
# if these tests fail, verify there are two person/local authority records for 'Inkpot Guineafowl'
# in core.dev
# you may need to re-create them if they have been removed
it 'raises error because we cannot know what to do with imported record' do
expect{ service.lookup('Inkpot Guineafowl') }.to raise_error(CollectionSpace::Mapper::MultipleCsRecordsFoundError)
context 'with default config' do
it 'raises error because we cannot know what to do with imported record' do
expect{ service.lookup('Inkpot Guineafowl') }.to raise_error(CollectionSpace::Mapper::MultipleCsRecordsFoundError)
end
end

context 'with multiple_recs_found = use first in batchconfig' do
let(:json) do
uri = 'spec/fixtures/files/mappers/release_6_1/core/core_6-1-0_person-local.json'
get_json_record_mapper(uri)
end
let(:mapper) do
CollectionSpace::Mapper::RecordMapper.new(
mapper: json,
batchconfig: { multiple_recs_found: 'use_first' }
)
end
let(:result) { service.lookup('Inkpot Guineafowl').keys.any?(:multiple_recs_found) }
it 'returns result with count of records found' do
expect(result).to be true
end
end
end
end
Expand Down

0 comments on commit 0955c6a

Please sign in to comment.