Missing OTTs result in missing images and links in the Extinct tree #103
Comments
I switched to QIDs and it works very well. The big downside is that the extinct tree then uses IDs that don't match those of the main tree, which can add friction when linking from one to the other. For now, it is still a good approach, but it's worth discussing alternatives. One approach is to simply leave those nodes without OTTs, and fix all the OneZoom issues that happen in the absence of OTTs:
The downside is that it may be a fair bit of work to fix everything properly, and without causing perf or functionality regressions. Another approach is to use whatever real OTTs we can get, and then use 'fake OTTs' for the missing ones. Sub-approaches:
To summarize, we have the following options:
|
I'm nervous about switching to QIDs throughout. It'll make a mess of the code base that would take a long time to recover from. Hearsay, but friends have been burnt by assuming these are stable before. Most of the issues @davidebbo has found are things that would be better off fixed, ignoring the extinct tree. I'd definitely prefer if we allowed
...but this is a pretty big side-quest. Negative OTT for the extinct tree is a bodge I can get behind. I doubt it'd just work though. We'd need some extra logic in places it really needs to be an OTT, not just an internal identifier. There are likely to be lurking regexes that assume |
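To make the negative-OTT idea concrete, here is a minimal sketch, assuming taxa arrive as (name, OTT-or-None) pairs. The function names and both regex patterns are hypothetical and only illustrate the kind of check that could trip over a negative ID; they are not taken from the OneZoom code base.

```python
import re


def assign_ids(taxa):
    """Keep real OTTs and hand out negative placeholders for missing ones.

    taxa: list of (name, ott_or_None) pairs (hypothetical input shape).
    Returns a dict mapping name -> integer ID.
    """
    ids = {}
    next_fake = -1
    for name, ott in taxa:
        if ott is None:
            ids[name] = next_fake  # placeholder, not a real OTT
            next_fake -= 1
        else:
            ids[name] = ott
    return ids


def is_real_ott(identifier):
    """Only positive IDs should be sent to anything that needs a true OTT."""
    return identifier > 0


# Illustrative guess at a "lurking" pattern: it only accepts unsigned digits,
# so a negative placeholder would not match.
DIGITS_ONLY = re.compile(r"^\d+$")
# A signed variant that accepts both real and placeholder IDs.
SIGNED = re.compile(r"^-?\d+$")
```

Places that really need an OTT (such as the OpenTree link) could then guard on `is_real_ott`, while everything else treats the value as an opaque identifier.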
To clarify, the 'switching to QIDs' that I'm using doesn't touch the code base at all, as it just jams QIDs in the OTT columns. See the first comment in this issue, which has more details. It's a cheap point-in-time solution to get around all those issues in the extinct tree for now.
Is there in fact anywhere that expects it to be an actual OTT? |
The OpenTree link when in expert mode does, but I can't think of much else. |
Didn't know there was an expert mode! But yeah, I found it looking at sources: https://www.onezoom.org/life_expert/, and I see the Open Tree tab. Anyway, I'm not concerned about this being broken for the extinct tree for now. I just thought of an alternate solution: give a list of missing taxa to our OT friends and ask them to create OTTs for them. After all, creating OTT IDs should be cheap and easy. It's placing them in the tree that's hard, and at this point, we don't need them to be. Or maybe I'm being naive about this? @hyanwong do you think this is worth pursuing? |
A good thing there would be to incorporate the extinct studies that have those taxa in their definition. I think the study curator allows you to coin new taxa: I don't know if those get allocated new OTTs, but I presume they do. I don't know if there is a UI for adding OTTs, but perhaps there should be (or I suspect you can add them as exceptions into an OpenTree GH repo). Do these missing taxa have NCBI or GBIF identifiers on wikidata (I presume they at least have a GBIF)? @jrosindell is chatting to Emily Jane McTavish, and could probably ask her. |
I looked at the first 3 and they don't have either:
I'm not sure what to make of it. They could be bogus taxa, or those DBs might just be incomplete. But googling any of these 3 taxa yields quite a few results, leading me to think that they are at least somewhat well recognized. |
Actually, 2 of these are in gbif. It's just that wikidata doesn't have the link: |
OK, well, I guess we can add them to wikidata ourselves! It might be tedious to do this by hand, but perhaps we can automate it. |
Ideally, gbif would run a process that adds their IDs to all wikidata entries that don't have one, rather than having everyone fix up a few for their own needs. But stepping back, for our needs, we don't really need gbif to be in wikidata. We mostly just need stable OTT IDs to exist for all taxa that we're using. |
Yes, but ideally we need a way to map the OTT to wikipedia, and my guess for those missing taxa is that will be via a GBIF identifier |
In the extinct tree, it is not necessary since it all starts from Wikipedia, so we always have that. That is how I'm able to use QIDs as primary identifier today, and have everything work well. When it comes down to it, the only reason we need an OTT is to avoid all the issues listed above when we don't have any. Now obviously, having all those additional mappings like gbif available would be a good thing in general, but they are not blockers. |
Ah, right. But then we need to have a (non-taxon-name) way to map from Wikidata back to OTT, which will probably be via a GBIF identifier, right? |
I don't think we do. If the OTT exists, we'll have both the OTT (coming from add_ott_numbers_to_trees), and the QID (which we get intrinsically from being Wiki-first). So the mapping is there without needing GBIF. |
Ah, so you mean we are relying on the OpenTree TNRS to do the mapping. I see. It's sometimes good to try to avoid that where possible, but I guess that's the norm in the extinct tree. |
I'm confused. Isn't OpenTree TNRS what we're already using for the extant tree? That's what … Maybe easier to chat directly in the next meeting? |
We only use the TNRS for taxa that are "hand-crafted". The vast majority of the tree is straight from OpenTree, so has the OTTs embedded in the tree data, and we don't need to look up names. This is useful, because scientific names (especially above the species level) are not guaranteed to be unique: E.g. there are 3 or 4 taxa called "Ctenophora": a genus of flies, a phylum of early diverging animals, and something else that I forget. Even with species names we can have overlap: In the parts of the tree that we aren't getting from OpenTree, I sometimes hard-code the OTT, and sometimes (mostly just for species level names) leave the bare name for the TNRS to find an OTT for me, via the |
Ah I see. So what you mean is that the more reliable way to map would be: QIDs --> gbif --> OTT. Yes, I agree with that. To work, this requires:
|
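The QIDs --> gbif --> OTT route described above can be sketched as two dictionary lookups. This is only an illustration: the function name, both mapping tables, and where their data would come from are assumptions, and the IDs used in any example call would be made up.

```python
def qid_to_ott(qid, qid_to_gbif, gbif_to_ott):
    """Map a Wikidata QID to an OTT via a GBIF identifier, without using names.

    qid_to_gbif: dict of QID -> GBIF ID (assumed to come from Wikidata).
    gbif_to_ott: dict of GBIF ID -> OTT (assumed to come from the OpenTree taxonomy).
    Returns None when either hop is missing.
    """
    gbif_id = qid_to_gbif.get(qid)
    if gbif_id is None:
        return None  # Wikidata has no GBIF link for this taxon
    return gbif_to_ott.get(gbif_id)  # None if OpenTree has no matching taxon
```

Both hops can fail independently, which matches the earlier observation that some taxa exist in gbif while Wikidata simply lacks the link.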
Exactly. It's worth avoiding matching on names where possible, although sometimes it's unavoidable. |
Also, as we discussed before, if we could get OTTs consistently in Wikidata, it would be simpler and more reliable than having to make an extra hop through gbif. Presumably, this could be automated for all OTTs, though it's hard to do in a completely correct way, given the taxon fuzziness. |
Yep. It would be more direct to add all the OTTs to wikidata instead, and we should probably aim to do this. |
Here is the state of OTT matching with the current extinct tree:
And note that synonyms are not always reliable. For instance, Sphenacodon (a Synapsid) is treated as a synonym of Epigonus (a fish), which seems utterly random.
This results in major issues with the tree:
Now, thinking about solving this...
In the full tree, we start with OTT, and then use the taxonomy to get ncbi, gbif, irmng, then the provider ID CSV to get the EOL IDs, and finally the wiki dump to get QIDs.
But in the Extinct case, we start with wikipedia, and essentially get everything from there. So this complex chain is unnecessary, and results in bad behavior due to the missing OTTs.
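As a rough sketch of why a missing OTT cascades in the full-tree chain, the lookups can be modelled as below. Everything here is hypothetical: the function name, the shape of the three tables, and the assumption that EOL IDs and QIDs are keyed by (source, source ID) pairs. It is not the actual TreeBuild code.

```python
SOURCES = ("ncbi", "gbif", "irmng")


def resolve_ids(ott, taxonomy, eol_by_source, qid_by_source):
    """Follow OTT -> source IDs -> EOL ID and QID.

    taxonomy:      dict of OTT -> {"ncbi": ..., "gbif": ..., "irmng": ...}
    eol_by_source: dict of (source, source_id) -> EOL ID
    qid_by_source: dict of (source, source_id) -> QID
    """
    ids = {"ott": ott, "ncbi": None, "gbif": None, "irmng": None,
           "eol": None, "wikidata": None}
    if ott is None:
        return ids  # no OTT: nothing downstream can be looked up

    source_ids = taxonomy.get(ott, {})
    for src in SOURCES:
        ids[src] = source_ids.get(src)

    # Take the first source that yields an EOL ID, and likewise for the QID.
    for src in SOURCES:
        key = (src, ids[src])
        if ids["eol"] is None:
            ids["eol"] = eol_by_source.get(key)
        if ids["wikidata"] is None:
            ids["wikidata"] = qid_by_source.get(key)
    return ids
```

With `ott=None` every other field stays empty, which is the situation the Wiki-first extinct tree avoids by already holding the QID.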
One solution to this is to simply move away from OTTs, and instead use QIDs as our primary identifiers. To do this in a non-disruptive way that doesn't require many core code changes, we can just pretend that QIDs are OTTs. So in the DB, we'd just put QIDs wherever OTTs are used today, without even changing the schema.
In the ordered_leaves table, this would effectively mean that the ott and wikidata columns would have the same value.
The reason I think this will work is that OneZoom mostly uses the OTT as a unique and stable ID for taxa, but doesn't really ever do anything that expects it to be a true OTT.
Bottom line is that we can potentially make all this work with zero changes to the core OneZoom code base. We only need to change the Extinct TreeBuild logic to create the database in this way.
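A minimal sketch of what "pretend that QIDs are OTTs" could look like at build time, using an in-memory SQLite table. The three-column table is a deliberately reduced stand-in for ordered_leaves, the helper names are hypothetical, and storing the QID as its numeric part is an assumption.

```python
import sqlite3


def qid_to_int(qid):
    """Convert a QID string such as 'Q42' to its numeric part, 42."""
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a QID: {qid!r}")
    return int(qid[1:])


def build_leaf_rows(leaves):
    """leaves: iterable of (name, qid_string) pairs.

    Returns (name, ott, wikidata) rows where the ott column is simply
    filled with the numeric QID, so both ID columns hold the same value.
    """
    rows = []
    for name, qid in leaves:
        numeric = qid_to_int(qid)
        rows.append((name, numeric, numeric))
    return rows


def make_demo_db(leaves):
    """Create a reduced, hypothetical ordered_leaves table and populate it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ordered_leaves (name TEXT, ott INTEGER, wikidata INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ordered_leaves VALUES (?, ?, ?)", build_leaf_rows(leaves)
    )
    return conn
```

Because only the values change and not the schema, code that treats ott as an opaque unique key would keep working; anything that sends the value to OpenTree would not.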
I have not tried this, so let's discuss whether this might run into unexpected downsides.