duplicate results #67
Comments
Interesting. The short answer is certainly yes. I wonder if this is a data record issue or a names issue.
Nicely spotted! As far as I can tell, the root cause is that the taxonomic paths are different, even though, clearly, the same taxa are referenced. For instance, compare:
Effechecka does not currently do taxonomic integration as part of the checklist building process. This integration is definitely possible, but is currently out of scope.
My suspicion about the duplicate results is that the occurrences were "cleaned", or otherwise had their taxonomic path changed, as part of an internal cleanup or a transfer to a different system (inaturalist -> gbif). Please note that after resolving the taxa to eol ids using resolver.globalnames.org and removing duplicate ids, this issue should go away.
Just to add to this: taxonomic integration is not done mainly because of the added complexity and computational overhead. If you can think of a method to compare taxonomic paths in a smarter way than a string comparison and some simple string transformations, please do holler.
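For what it's worth, here is a minimal sketch of the "simple string transformations" idea, not anything effechecka does today. It assumes the first path segment is the kingdom and the last is the full scientific name with authorship, as in the examples in this issue; `taxon_key` is a hypothetical helper name. It ignores the intermediate ranks (where Tracheophyta vs. Magnoliophyta differ) and drops parentheses around the authorship.

```python
import re

def taxon_key(path):
    """Build a comparison key from a pipe-delimited taxonomic path.

    Assumption: first segment is the kingdom, last segment is the
    scientific name with authorship. Intermediate ranks are ignored.
    """
    parts = [p.strip() for p in path.split("|")]
    kingdom, name = parts[0], parts[-1]
    name = re.sub(r"[()]", "", name)           # drop parentheses around authorship
    name = re.sub(r"\s+", " ", name).strip()   # collapse repeated whitespace
    return (kingdom.lower(), name.lower())

a = "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata Nikolsky, 1897"
b = "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata (Nikolsky, 1897)"
print(taxon_key(a) == taxon_key(b))  # True
```

One caveat: in zoological names the parentheses around the author carry meaning (the species was originally described in a different genus), so stripping them is lossy. It seems acceptable for spotting likely duplicates, but it is not a substitute for real name resolution.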
So that leaves the question.....
Yeah, arg. I think we should leave them in with a caveat in the remarks. They're useful mostly for determining whether records in a checklist are trustworthy, and that only calls for ~order of magnitude of sample size.
Oh, sorry, you mean: should you be summing the counts across duplicate taxa? Hmm... Arg. I'm not sure, but I stand by my caveat-in-remarks idea.
Yes, I've been leaving them in, but not adding them. I don't know enough about where the data come from to know if I should be summing the record count across the duplicate taxa. It shouldn't affect the presence data, so I guess that's a positive?
Yeah. Somewhat arbitrarily, how about not summing, and selecting the highest individual result?
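A rough sketch of the "don't sum, keep the highest" suggestion, in case it helps. `merge_by_max` and `last_segment_key` are hypothetical names, and the key used to decide that two paths are "the same taxon" (last path segment with parentheses removed) is an assumption for illustration only.

```python
def last_segment_key(path):
    # Assumption: the last pipe-delimited segment is the scientific name.
    return path.split("|")[-1].replace("(", "").replace(")", "").strip().lower()

def merge_by_max(records, key=last_segment_key):
    """Collapse records sharing a key, keeping the one with the highest recordcount."""
    best = {}
    for rec in records:
        k = key(rec["taxon"])
        # Keep the first record seen on ties; never sum the counts.
        if k not in best or rec["recordcount"] > best[k]["recordcount"]:
            best[k] = rec
    return list(best.values())

records = [
    {"taxon": "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.", "recordcount": 276},
    {"taxon": "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.", "recordcount": 276},
    {"taxon": "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata Nikolsky, 1897", "recordcount": 19},
    {"taxon": "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata (Nikolsky, 1897)", "recordcount": 19},
]
merged = merge_by_max(records)
```

With the examples from this issue, `merged` holds one record per species (276 for Cicer arietinum, 19 for Eremias lineolata). Taking the maximum avoids double counting if the duplicates really are the same occurrences under two paths, at the cost of undercounting if they are not.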
A bit more info:
Thanks for the example. I agree that while the names differ character by character, they are semantically the same. The names are listed separately because effechecka does not do any kind of name processing. In this case,
I'm getting duplicate results from effechecka. The following are some examples
{"taxon": "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.", "recordcount": 276}, {"taxon": "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.", "recordcount": 276}
and
{"taxon": "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Vicia|faba|Vicia faba L.", "recordcount": 135}, {"taxon": "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Vicia|faba|Vicia faba L.", "recordcount": 135}
and
{"taxon": "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata Nikolsky, 1897", "recordcount": 19}, {"taxon": "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata (Nikolsky, 1897)", "recordcount": 19}
I think it's strange that the record counts are exactly the same for each one. Does this mean I can just ignore one of them?