duplicate results #67

diatomsRcool · 2017-06-28T19:30:02Z

I'm getting duplicate results from effechecka. The following are some examples:
{"taxon": "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.", "recordcount": 276}, {"taxon": "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.", "recordcount": 276}

and

{"taxon": "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Vicia|faba|Vicia faba L.", "recordcount": 135}, {"taxon": "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Vicia|faba|Vicia faba L.", "recordcount": 135}

and

{"taxon": "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata Nikolsky, 1897", "recordcount": 19}, {"taxon": "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata (Nikolsky, 1897)", "recordcount": 19}

I think it's strange that the record counts are exactly the same for each one. Does this mean I can just ignore one of them?

jhammock · 2017-06-28T19:42:52Z

Interesting. The short answer is certainly yes. I wonder if this is a data record issue or a names issue.

jhpoelen · 2017-06-28T21:29:07Z

Nicely spotted!

As far as I can tell, the root cause is that the taxonomic paths are different, even though, clearly, the same taxa are referenced.

For instance, compare:

Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.
Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.

Effechecka does not currently do taxonomic integration as part of the checklist building process. This integration is definitely possible, but it is currently out of scope.

jhpoelen · 2017-06-28T21:32:22Z

My suspicion about the duplicate results is that the occurrences were "cleaned", or otherwise had their taxonomic path changed, as part of an internal cleanup or a transfer to a different system (inaturalist -> gbif).

Please note that after resolving the taxa using resolver.globalnames.org to eol ids and removing duplicate ids, this issue should go away.

jhpoelen · 2017-06-28T23:29:46Z

Just to add to this - the reason why taxonomic integration is not done is mainly added complexity and computational overhead. If you can think of a method to compare taxonomic paths in a smarter way than a string comparison and some simple string transformations, please do holler.
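As an illustration of the kind of simple string transformation mentioned above, here is a minimal sketch that compares two paths by their final segment only, so a differing phylum no longer matters. The function name `name_key` is hypothetical and is not part of effechecka. It assumes the last pipe-delimited segment is the full scientific name, and it would wrongly merge homonyms from different kingdoms.

```python
def name_key(taxon_path: str) -> str:
    """Return the last '|' segment with whitespace collapsed (hypothetical helper)."""
    last_segment = taxon_path.split("|")[-1]
    return " ".join(last_segment.split())


# Both paths are copied from the examples in this thread.
a = "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L."
b = "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L."

same_full_path = a == b                    # plain string comparison
same_name = name_key(a) == name_key(b)     # comparison on the final segment
```

This handles the Cicer arietinum and Vicia faba pairs, where only the phylum differs. It does not handle the Eremias lineolata pair, where the final segments themselves differ by parentheses.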

diatomsRcool · 2017-07-06T19:59:30Z

So that leaves the question.....
Should I be adding the record counts? At first I thought I should not add them because the duplicate taxa had the same record count, but now I'm seeing duplicate taxa with different record counts. Thoughts?

jhammock · 2017-07-06T20:13:20Z

yeah, arg. I think we should leave them in with a caveat in the remarks. They're useful mostly for determining whether records in a checklist are trustworthy, and that only calls for ~order of magnitude of sample size.

jhammock · 2017-07-06T20:14:48Z

oh, sorry- you mean, should you be summing the counts across duplicate taxa. Hmm... Arg. I'm not sure, but I stand by my caveat in remarks idea.

diatomsRcool · 2017-07-06T20:26:22Z

Yes, I've been leaving them in, but not adding them. I don't know enough about where the data come from to know if I should be summing the record count across the duplicate taxa. It shouldn't affect the presence data, so I guess that's a positive?

jhammock · 2017-07-06T20:38:10Z

Yeah. Somewhat arbitrarily, how about not summing, and selecting the highest individual result?
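A minimal sketch of that suggestion, assuming results are dicts shaped like the examples in this thread and that duplicates are grouped by the final segment of the taxon path. The function name `keep_highest` is hypothetical, and the second pair of records below uses made-up placeholder names and counts purely to show the differing-count case.

```python
def keep_highest(results):
    """Keep one record per name: the one with the highest recordcount (no summing)."""
    best = {}
    for record in results:
        key = record["taxon"].split("|")[-1]  # assumption: last segment is the name
        if key not in best or record["recordcount"] > best[key]["recordcount"]:
            best[key] = record
    return list(best.values())


results = [
    # Copied from this thread: same name, same count.
    {"taxon": "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Vicia|faba|Vicia faba L.", "recordcount": 135},
    {"taxon": "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Vicia|faba|Vicia faba L.", "recordcount": 135},
    # Hypothetical placeholders: same name, different counts.
    {"taxon": "KingdomX|PhylumA|Genus species Auth.", "recordcount": 7},
    {"taxon": "KingdomX|PhylumB|Genus species Auth.", "recordcount": 12},
]

deduplicated = keep_highest(results)
```

When counts tie, the first record seen is kept, which matches the "somewhat arbitrary" spirit of the rule.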

diatomsRcool · 2018-01-05T15:29:40Z

jhpoelen · 2018-01-05T16:01:08Z

Thanks for the example.

I agree that while the names differ character by character, they are semantically the same. The names are listed separately because effechecka does not do any kind of name processing. In this case, Lophuromys aquilus True, 1892 and Lophuromys aquilus (True, 1892) are listed separately because of the parentheses. Adding a post-processing step (in a fancier version of effechecka or in some downstream processing) using globalnames.org or some other taxonomic name parser would help remove these duplicates. I hope that we'll get a chance to work on this at some point.
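For the parentheses case specifically, a crude downstream step could look like the sketch below. This is a regex stand-in, not a real taxonomic name parser, and `strip_author_parens` is a hypothetical helper. Parentheses around an author carry nomenclatural meaning, so the result is only suitable as a comparison key, and the regex would also unwrap a parenthesised subgenus.

```python
import re


def strip_author_parens(name: str) -> str:
    """Remove parentheses but keep their contents, for use as a comparison key only."""
    return re.sub(r"\(([^()]*)\)", r"\1", name)


# Both names are taken from this thread.
with_parens = "Lophuromys aquilus (True, 1892)"
without_parens = "Lophuromys aquilus True, 1892"

keys_match = strip_author_parens(with_parens) == strip_author_parens(without_parens)
```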

jhpoelen added the external data issue label Jul 4, 2017
