Missing OTTs result in missing images and links in the Extinct tree #103

Open
davidebbo opened this issue Nov 12, 2024 · 21 comments
Labels
extinct tree Issues involving extinct species

Comments

@davidebbo
Collaborator

Here is the state of OTT matching with the current extinct tree:

Over all input files, 73.849% names were precisely matched, 4.358% matched via synonyms, and 21.792% have no match

And note that synonyms are not always reliable. For instance, Sphenacodon (a Synapsid) is treated as a synonym of Epigonus (a fish), which seems utterly random.

This results in major issues with the tree:

  • Missing images
  • Missing links, or links going to the wrong place when you click on a leaf

Now, thinking about solving this...

In the full tree, we start with OTT, and then use the taxonomy to get ncbi, gbif, irmng, then the provider ID CSV to get the EOL IDs, and finally the wiki dump to get QIDs.

But in the Extinct case, we start with wikipedia, and essentially get everything from there. So this complex chain is unnecessary, and results in bad behavior due to the missing OTTs.

One solution to this is to simply move away from OTTs, and instead use QIDs as our primary identifiers. To do this in a non-disruptive way that doesn't require many core code changes, we can just pretend that QIDs are OTTs. So in the DB, we'd just put QIDs wherever OTTs are used today, without even changing the schema.

In the ordered_leaves table, this would effectively mean that the ott and wikidata columns would have the same value.
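A minimal sketch of that idea, using an in-memory SQLite table as a stand-in for the real database. Only the ordered_leaves table name and its ott and wikidata columns come from the description above; the trimmed-down schema, the helper names, and the QID values are assumptions made up for illustration.

```python
import sqlite3


def qid_to_int(qid: str) -> int:
    """Turn a Wikidata ID such as 'Q1001' into the integer 1001."""
    if not qid.startswith("Q") or not qid[1:].isdigit():
        raise ValueError(f"not a QID: {qid!r}")
    return int(qid[1:])


def build_leaves(conn, taxa):
    """Fill a simplified ordered_leaves table, storing the QID in both ID columns."""
    conn.execute(
        "CREATE TABLE ordered_leaves ("
        "id INTEGER PRIMARY KEY, name TEXT, ott INTEGER, wikidata INTEGER)"
    )
    for name, qid in taxa:
        n = qid_to_int(qid)
        # The QID is written to the ott column as if it were an OTT.
        conn.execute(
            "INSERT INTO ordered_leaves (name, ott, wikidata) VALUES (?, ?, ?)",
            (name, n, n),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Placeholder taxa and QIDs, not real Wikidata entries.
build_leaves(conn, [("Taxon A", "Q1001"), ("Taxon B", "Q1002")])
rows = conn.execute(
    "SELECT name, ott, wikidata FROM ordered_leaves ORDER BY id"
).fetchall()
```

Every row ends up with identical ott and wikidata values, so any code that treats ott as an opaque unique key keeps working unchanged.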

The reason I think this will work is that OneZoom mostly uses the OTT as a unique and stable ID for taxa, but doesn't really ever do anything that expects it to be a true OTT.

Bottom line is that we can potentially make all this work with zero changes to the core OneZoom code base. We only need to change the Extinct TreeBuild logic to create the database in this way.

I have not tried this, so let's discuss whether this might run into unexpected downsides.

@davidebbo davidebbo added the extinct tree Issues involving extinct species label Nov 18, 2024
@davidebbo
Collaborator Author

I switched to QIDs and it works very well. The big downside is that the extinct tree then uses IDs that don't match those of the main tree, which can add friction when linking from one to the other. For now, it is still a good approach, but it's worth discussing alternatives.

One approach is to simply leave those nodes without OTTs, and fix all the OneZoom issues that happen in the absence of OTTs:

The downside is that it may be a fair bit of work to fix everything properly, and without causing perf or functionality regressions.

Another approach is to use whatever real OTTs we can get, and then use 'fake OTTs' for the missing ones. Sub-approaches:

  • Make up OTTs and persist them so they are stable. Use a number range that can't conflict with real OTTs. Possibly make them negative.
  • Use QIDs as missing OTTs. Possibly make them negative.
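As a rough illustration of the second sub-approach, here is a sketch that keeps the real OTT when one exists and otherwise falls back to the negated QID. The function names are hypothetical, and it assumes real OTTs are always positive integers, so the sign alone distinguishes the two kinds of ID.

```python
from typing import Optional


def leaf_id(ott: Optional[int], qid: int) -> int:
    """Return the real OTT when there is one, else the negated QID as a stand-in."""
    if ott is not None:
        return ott
    if qid <= 0:
        raise ValueError("QID must be a positive integer")
    return -qid


def is_real_ott(identifier: int) -> bool:
    """Assumption: real OTTs are positive, stand-ins are negative."""
    return identifier > 0
```

For example, `leaf_id(None, 123)` gives `-123`, while `leaf_id(456, 123)` gives `456`. Because the stand-in is derived from the QID, it needs no separate persistence step to stay stable between builds, as long as the QID itself does not change.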

To summarize, we have the following options:

  1. Use QIDs throughout
  2. Make the site work properly without OTTs
  3. Use generated OTTs when there are no OTTs
  4. Use QIDs in place of OTTs when there are no OTTs

/cc @hyanwong @jrosindell @lentinj

@lentinj
Collaborator

lentinj commented Nov 19, 2024

I'm nervous about switching to QIDs throughout. It'll make a mess of the code base that would take a long time to recover from. Hearsay, but friends have been burnt by assuming these are stable before.

Most of the issues @davidebbo has found are things that would be better off fixed, ignoring the extinct tree. I'd definitely prefer if we allowed @_qid=x as a valid pinpoint form, and fix each by:

  • #903: Get image by OZID/metacode instead of OTT (we'd have to restructure the database to allow this though, this is definitely the biggest one out of the set)
  • #904: Finding the wonky logic (there are various points client-side that make odd assumptions along these lines)
  • #905: leaf_linkouts accepts a pinpoint, returns a pop=wp_qid pop-up.
  • #907: Consider nodes with OTT or QID first, when deciding what goes in the URL. Ideally do #547 so we no longer need this code in the first place.

...but this is a pretty big side-quest.

Negative OTT for the extinct tree is a bodge I can get behind. I doubt it'd just work though. We'd need some extra logic in places it really needs to be an OTT, not just an internal identifier. There are likely to be lurking regexes that assume @([0-9]+)=(.+) too.
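To make the regex concern concrete, here is a small sketch that takes the quoted pattern at face value and compares it with a hypothetical variant that tolerates a leading minus sign. Neither pattern is taken from the OneZoom code base, and the input strings are invented.

```python
import re

# The pattern as quoted above: digits only, so no sign is allowed.
STRICT = re.compile(r"@([0-9]+)=(.+)")
# Hypothetical variant that also accepts a negative identifier.
SIGNED = re.compile(r"@(-?[0-9]+)=(.+)")


def parse(pattern, text):
    """Return (number, rest) when the whole string matches, else None."""
    m = pattern.fullmatch(text)
    return (int(m.group(1)), m.group(2)) if m else None
```

With these definitions, `parse(STRICT, "@-123=abc")` returns `None`, whereas `parse(SIGNED, "@-123=abc")` returns `(-123, "abc")`. Any digits-only pattern of this kind would silently reject negative stand-in IDs.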

@davidebbo
Collaborator Author

I'm nervous about switching to QIDs throughout. It'll make a mess of the code base that would take a long time to recover from

To clarify, the 'switching to QIDs' that I'm using doesn't touch the code base at all, as it just jams QIDs in the OTT columns. See the first comment in this issue, which has more details. It's a cheap point-in-time solution to get around all those issues in the extinct tree for now.

We'd need some extra logic in places it really needs to be an OTT, not just an internal identifier

Is there in fact anywhere that expects it to be an actual OTT?

@lentinj
Collaborator

lentinj commented Nov 19, 2024

Is there in fact anywhere that expects it to be an actual OTT?

The OpenTree link when in expert mode does, but I can't think of much else.

@davidebbo
Collaborator Author

The OpenTree link when in expert mode does, but I can't think of much else.

Didn't know there was an expert mode! But yeah, I found it looking at sources: https://www.onezoom.org/life_expert/, and I see the Open Tree tab. Anyway, I'm not concerned about this being broken for the extinct tree for now.

I just thought of an alternate solution: give a list of missing taxa to our OT friends and ask them to create OTTs for them. After all, creating OTT IDs should be cheap and easy. It's placing them in the tree that's hard, and at this point, we don't need them to be. Or maybe I'm being naive about this? @hyanwong do you think this is worth pursuing?

@hyanwong
Member

give a list of missing taxa to our OT friends and ask them to create OTTs for them

A good thing there would be to incorporate the extinct studies that have those taxa in their definition. I think the study curator allows you to coin new taxa: I don't know if those get allocated new OTTs, but I presume they do. I don't know if there is a UI for adding OTTs, but perhaps there should be (or I suspect you can add them as exceptions into an OpenTree GH repo). Do these missing taxa have NCBI or GBIF identifiers on wikidata? (I presume they at least have a GBIF.)

@jrosindell is chatting to Emily Jane McTavish, and could probably ask her.

@davidebbo
Collaborator Author

davidebbo commented Nov 19, 2024

Do these missing taxa have NCBI or GBIF identifiers on wikidata

I looked at the first 3 and they don't have either:

I'm not sure what to make of it. They could be bogus taxa, or those DBs might just be incomplete. But googling any of these 3 taxa yields quite a few results, leading me to think that they are at least somewhat well recognized.

@davidebbo
Collaborator Author

Actually, 2 of these are in gbif. It's just that wikidata doesn't have the link:

@hyanwong
Member

hyanwong commented Nov 19, 2024

OK, well, I guess we can add them to wikidata ourselves! It might be tedious to do this by hand, but perhaps we can automate it.

@davidebbo
Collaborator Author

OK, well, I guess we can add them to wikidata ourselves! It might be tedious to do this by hand, but perhaps we can automate it.

Ideally, gbif would run a process that adds their IDs to all wikidata entries that don't have one, rather than having everyone fix up a few for their own needs.

But stepping back, for our needs, we don't really need gbif to be in wikidata. We mostly just need stable OTT IDs to exist for all taxa that we're using.

@hyanwong
Member

hyanwong commented Nov 19, 2024

We mostly just need stable OTT IDs to exist for all taxa that we're using.

Yes, but ideally we need a way to map the OTT to wikipedia, and my guess for those missing taxa is that it will be via a GBIF identifier.

@davidebbo
Collaborator Author

Yes, but ideally we need a way to map the OTT to wikipedia

In the extinct tree, it is not necessary since it all starts from Wikipedia, so we always have that. That is how I'm able to use QIDs as primary identifier today, and have everything work well.

When it comes down to it, the only reason we need an OTT is to avoid all the issues listed above when we don't have any.

Now obviously, having all those additional mappings like gbif available would be a good thing in general, but they are not blockers.

@hyanwong
Member

Yes, but ideally we need a way to map the OTT to wikipedia

In the extinct tree, it is not necessary since it all starts from Wikipedia, so we always have that.

Ah, right. But then we need to have a (non-taxon-name) way to map from Wikidata back to OTT, which will probably be via a GBIF identifier, right?

@davidebbo
Collaborator Author

Ah, right. But then we need to have a (non-taxon-name) way to map from Wikidata back to OTT

I don't think we do. If the OTT exists, we'll have both the OTT (coming from add_ott_numbers_to_trees), and the QID (which we get intrinsically from being Wiki-first). So the mapping is there without needing GBIF.

@hyanwong
Member

Ah, so you mean we are relying on the OpenTree TNRS to do the mapping. I see. It's sometimes good to try to avoid that where possible, but I guess that's the norm in the extinct tree.

@davidebbo
Collaborator Author

Ah, so you mean we are relying on the OpenTree TNRS to do the mapping. I see. It's sometimes good to try to avoid that where possible, but I guess that's the norm in the extinct tree.

I'm confused. Isn't OpenTree TNRS what we're already using for the extant tree? That's what add_ott_numbers_to_trees does, which is original code, and is used in the regular tree build process.

Maybe easier to chat directly in the next meeting?

@hyanwong
Member

hyanwong commented Nov 19, 2024

We only use the TNRS for taxa that are "hand-crafted". The vast majority of the tree is straight from OpenTree, so has the OTTs embedded in the tree data, and we don't need to look up names. This is useful, because scientific names (especially above the species level) are not guaranteed to be unique. E.g. there are 3 or 4 taxa called "Ctenophora": a genus of flies, a phylum of early diverging animals, and something else that I forget. Even with species names we can have overlap: Pieris japonica is both a butterfly (I think) and a bush. These are called hemihomonyms. (This example seems to have changed: there's a more up-to-date list at https://species.wikimedia.org/wiki/Category:Species-level_hemihomonyms, including e.g. Ficus variegata.)

In the parts of the tree that we aren't getting from OpenTree, I sometimes hard-code the OTT, and sometimes (mostly just for species level names) leave the bare name for the TNRS to find an OTT for me, via the add_ott_numbers_to_trees script (the hand-crafted files often specify a higher-level group within which to look for names, to avoid the Ficus variegata sort of problem).
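The effect of restricting a name lookup to a higher-level group can be sketched with a toy in-memory table. This does not call the TNRS or reproduce add_ott_numbers_to_trees; the records, IDs, flat group tags, and function names below are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (name, made-up ID, flat tag for the enclosing group).
TAXA = [
    ("Ficus variegata", 3, "Plantae"),
    ("Ficus variegata", 4, "Animalia"),
    ("Sphenacodon", 5, "Animalia"),
]


def ambiguous_names(taxa):
    """List the names that map to more than one ID."""
    by_name = defaultdict(set)
    for name, taxon_id, _group in taxa:
        by_name[name].add(taxon_id)
    return sorted(n for n, ids in by_name.items() if len(ids) > 1)


def resolve(taxa, name, group=None):
    """Return the single matching ID, or None if there is no match or several."""
    hits = [tid for n, tid, g in taxa if n == name and (group is None or g == group)]
    return hits[0] if len(hits) == 1 else None
```

Here `resolve(TAXA, "Ficus variegata")` returns `None` because the bare name is ambiguous, while `resolve(TAXA, "Ficus variegata", "Plantae")` returns `3`. Refusing to guess on an ambiguous name is a deliberate choice in this sketch, since a wrong match is harder to spot than a missing one.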

@davidebbo
Collaborator Author

Ah I see. So what you mean is that the more reliable way to map would be: QIDs --> gbif --> OTT.

Yes, I agree with that. To work, this requires:

  • The OTT to exist
  • The Taxonomy file to map it to a gbif
  • That gbif to exist in the Wikidata entry
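The three requirements above can be sketched as a two-hop lookup that reports which hop failed. The dictionaries stand in for data extracted from Wikidata and from the taxonomy file; their contents, the ID values, and the function name are assumptions for illustration only.

```python
def map_qid_to_ott(qid, wikidata_gbif, taxonomy_gbif_to_ott):
    """Follow QID -> gbif -> OTT and return (ott, reason)."""
    gbif = wikidata_gbif.get(qid)
    if gbif is None:
        # Third requirement not met: the Wikidata entry has no gbif ID.
        return None, "no gbif ID on the Wikidata entry"
    ott = taxonomy_gbif_to_ott.get(gbif)
    if ott is None:
        # First or second requirement not met: no OTT, or no gbif mapping for it.
        return None, "no OTT mapped to this gbif ID"
    return ott, "ok"


# Placeholder data, not real identifiers.
wikidata_gbif = {1001: 50001, 1002: 50002}
taxonomy_gbif_to_ott = {50001: 700001}
```

With this data, QID 1001 resolves to `(700001, "ok")`, QID 1002 stops at the second hop, and QID 1003 stops at the first. Recording the reason makes it easier to see whether the gap lies in Wikidata or in the taxonomy.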

@hyanwong
Member

Exactly. It's worth avoiding matching on names where possible, although sometimes it's unavoidable.

@davidebbo
Collaborator Author

Also, as we discussed before, if we could get OTTs consistently in Wikidata, it would be simpler and more reliable than having to make an extra hop through gbif. Presumably, this could be automated for all OTTs, though it's hard to do in a completely correct way, given the taxon fuzziness.

@hyanwong
Member

Yep. It would be more direct to add all the OTTs to wikidata instead, and we should probably aim to do this.


3 participants