This repository has been archived by the owner on Jul 3, 2019. It is now read-only.

duplicate results #67

Open
diatomsRcool opened this issue Jun 28, 2017 · 11 comments

Comments

@diatomsRcool

I'm getting duplicate results from effechecka. Here are some examples:
{"taxon": "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.", "recordcount": 276}, {"taxon": "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.", "recordcount": 276}

and

{"taxon": "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Vicia|faba|Vicia faba L.", "recordcount": 135}, {"taxon": "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Vicia|faba|Vicia faba L.", "recordcount": 135}

and

{"taxon": "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata Nikolsky, 1897", "recordcount": 19}, {"taxon": "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata (Nikolsky, 1897)", "recordcount": 19}

I think it's strange that the record counts are exactly the same within each pair. Does this mean I can just ignore one of them?

@jhammock

Interesting. The short answer is certainly yes. I wonder if this is a data record issue or a names issue.

@jhpoelen
Owner

jhpoelen commented Jun 28, 2017

Nicely spotted!

As far as I can tell, the root cause is that the taxonomic paths are different, even though, clearly, the same taxa are referenced.

For instance, compare:

Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.
Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.
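To make the difference between two such paths explicit, here is a minimal Python sketch that splits each path on `|` and reports the positions where the segments disagree. The function name `differing_segments` is made up for illustration and is not part of effechecka.

```python
from itertools import zip_longest


def differing_segments(path_a, path_b, sep="|"):
    """Return (index, segment_a, segment_b) for each position where the paths differ."""
    # zip_longest pads the shorter path with None, so extra segments are reported too.
    pairs = zip_longest(path_a.split(sep), path_b.split(sep))
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]


path_1 = "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L."
path_2 = "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L."

print(differing_segments(path_1, path_2))
# [(1, 'Tracheophyta', 'Magnoliophyta')]
```

Only the second segment (the phylum-level name) differs here; everything from class down to the full name string is identical.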

Effechecka does not currently do taxonomic integration as part of the checklist building process. This integration is definitely possible, but it is currently out of scope.

@jhpoelen
Owner

My suspicion is that the duplicate results appear because the occurrences were "cleaned", or otherwise had their taxonomic path changed, as part of an internal cleanup or a transfer to a different system (inaturalist -> gbif).

Please note that after resolving the taxa using resolver.globalnames.org to eol ids and removing duplicate ids, this issue should go away.

@jhpoelen
Owner

jhpoelen commented Jun 28, 2017

Just to add to this - the reason why taxonomic integration is not done is mainly added complexity and computational overhead. If you can think of a method to compare taxonomic paths in a smarter way than a string comparison and some simple string transformations, please do holler.
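As one example of a "simple string transformation", the sketch below groups records by kingdom, genus and species epithet instead of by the full path. It assumes the eight-segment layout seen in the examples in this thread (kingdom first; genus, epithet and full name last); real effechecka output may contain paths with other shapes, and the helper names `species_key` and `group_by_species` are hypothetical.

```python
from collections import defaultdict


def species_key(path, sep="|"):
    """Build a comparison key from kingdom, genus and epithet (assumed positions)."""
    parts = [p.strip().lower() for p in path.split(sep)]
    if len(parts) < 4:
        # Path too short for the assumed layout: fall back to the whole path.
        return tuple(parts)
    return (parts[0], parts[-3], parts[-2])


def group_by_species(records):
    """Group records (dicts with a 'taxon' path) that share the same species key."""
    groups = defaultdict(list)
    for rec in records:
        groups[species_key(rec["taxon"])].append(rec)
    return dict(groups)


records = [
    {"taxon": "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.", "recordcount": 276},
    {"taxon": "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Cicer|arietinum|Cicer arietinum L.", "recordcount": 276},
    {"taxon": "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata Nikolsky, 1897", "recordcount": 19},
    {"taxon": "Animalia|Chordata|Reptilia|Squamata|Lacertidae|Eremias|lineolata|Eremias lineolata (Nikolsky, 1897)", "recordcount": 19},
]

for key, members in group_by_species(records).items():
    print(key, len(members))
```

This ignores the intermediate ranks and the authorship string entirely, so it would also merge names that a taxonomist might want to keep apart. It is a cheap heuristic, not a substitute for real taxonomic integration.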

@diatomsRcool
Author

So that leaves the question.....
Should I be adding the record counts? At first I thought I should not add them because the duplicate taxa had the same record count, but now I'm seeing duplicate taxa with different record counts. Thoughts?

@jhammock

jhammock commented Jul 6, 2017

yeah, arg. I think we should leave them in with a caveat in the remarks. They're useful mostly for determining whether records in a checklist are trustworthy, and that only calls for ~order of magnitude of sample size.

@jhammock

jhammock commented Jul 6, 2017

oh, sorry- you mean, should you be summing the counts across duplicate taxa. Hmm... Arg. I'm not sure, but I stand by my caveat in remarks idea.

@diatomsRcool
Author

Yes, I've been leaving them in, but not adding them. I don't know enough about where the data come from to know if I should be summing the record count across the duplicate taxa. It shouldn't affect the presence data, so I guess that's a positive?

@jhammock

jhammock commented Jul 6, 2017

Yeah. Somewhat arbitrarily, how about not summing, and selecting the highest individual result?
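A rough Python sketch of that "keep the highest, don't sum" rule follows. How duplicates are detected is left to a `key` function supplied by the caller; the `genus_epithet` key used here is only an assumption based on the path layout in the examples above, and both function names are hypothetical.

```python
def genus_epithet(path, sep="|"):
    """Assumed layout: genus and epithet are the 3rd- and 2nd-to-last segments."""
    return tuple(path.lower().split(sep)[-3:-1])


def highest_count_per_taxon(records, key):
    """For each duplicate group, keep the single record with the largest recordcount."""
    best = {}
    for rec in records:
        k = key(rec["taxon"])
        # Strict '>' means that on a tie the first record seen is kept.
        if k not in best or rec["recordcount"] > best[k]["recordcount"]:
            best[k] = rec
    return list(best.values())


records = [
    {"taxon": "Plantae|Tracheophyta|Magnoliopsida|Fabales|Fabaceae|Vicia|faba|Vicia faba L.", "recordcount": 135},
    {"taxon": "Plantae|Magnoliophyta|Magnoliopsida|Fabales|Fabaceae|Vicia|faba|Vicia faba L.", "recordcount": 135},
]

kept = highest_count_per_taxon(records, key=genus_epithet)
print([r["recordcount"] for r in kept])
# [135]
```

Taking the maximum gives a lower bound on the true count whether or not the duplicate rows overlap, which fits the order-of-magnitude use described earlier in the thread.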

@diatomsRcool
Author

A bit more info:
I'm also seeing duplicates when the name string is different, not just the higher level classification.
name: Lophuromys aquilus True, 1892; path: Animalia|Chordata|Mammalia|Rodentia|Muridae|Lophuromys|aquilus|Lophuromys aquilus True, 1892; record count: 734
name: Lophuromys aquilus (True, 1892); path: Animalia|Chordata|Mammalia|Rodentia|Muridae|Lophuromys|aquilus|Lophuromys aquilus (True, 1892); record count: 734

@jhpoelen
Owner

jhpoelen commented Jan 5, 2018

Thanks for the example.

I agree that while the names differ character by character, they are semantically the same. The names are listed separately because effechecka does not do any kind of name processing. In this case, Lophuromys aquilus True, 1892 and Lophuromys aquilus (True, 1892) are listed separately because of the parentheses. Adding a post-processing step (in a fancier version of effechecka or in some downstream processing) using globalnames.org or some other taxonomic name parser would help remove these duplicates. I hope that we'll get a chance to work on this at some point.
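Until such a step exists, a downstream script could approximate it for the parentheses case with a plain string normalization. The sketch below is not a taxonomic name parser and does not call globalnames.org; it only builds a comparison key by dropping parentheses, collapsing whitespace and lowercasing. The function name `authorship_insensitive_key` is hypothetical.

```python
import re


def authorship_insensitive_key(name):
    """Comparison key that ignores parentheses, case and extra whitespace."""
    no_parens = re.sub(r"[()]", "", name)
    return " ".join(no_parens.split()).lower()


a = "Lophuromys aquilus True, 1892"
b = "Lophuromys aquilus (True, 1892)"

print(authorship_insensitive_key(a))
# lophuromys aquilus true, 1892
print(authorship_insensitive_key(a) == authorship_insensitive_key(b))
# True
```

In zoological nomenclature the parentheses around an author carry meaning (the species was originally described in a different genus), so the key should only be used for matching duplicates; the original name strings should be kept in the output.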


3 participants