This is a list of real-world identifier issues; it aims to be representative rather than exhaustive. The list could be used to:
- Convince funders of the problem
- Provide a set of references for a paper or specification
- See what can be done to improve informatics/tooling around identifiers

Contributions from anyone are warmly welcome.

Reported by | Reported about | Problems referenced | Problem category |
---|---|---|---|
EBI Ontology Lookup Service (OLS) | various ontologies | underscore-delimited vs colon-delimited forms, case sensitivity | search, delimiters |
Not clear | Darwin Core Triples | Institutional code collisions among Darwin Core triples | collisions, institution identifiers |
PrefixCommons | NCBI | Number of short-form and HTTP URI permutations found in the wild for a single NCBI Gene identifier | data integration, text mining |
General (Wikipedia entry) | Web-at-large | 17 different ways in which URLs could be determined to be equivalent, some of them lossy | data integration |
Biostars | HGNC | Mapping between similar entities across databases | mapping |
Human Phenotype Ontology | OMIM | Prefix heterogeneity (OMIM vs MIM); special processors have to be built to collapse the variants | prefix variation, data integration |
Monarch Initiative | TAIR | TAIR prefix variation difficult to resolve | type-specificity |
Stian | EU grants | No obvious documentation for permalinks in EU grants, nor any correlation between destination URL and project ID | documentation |
H. pylori paper | HP protein identifiers | Naming problems that result from embedded meaning in identifiers and evolving scientific knowledge | embedded meaning |
PrefixCommons | HGNC | Co-occurring identifier complexities in HGNC (multiple entity types, multiple identifier types, prefixed/unprefixed versions, type-specific URLs without type-specific determinism in local IDs) | type-specificity |
WebProNews | eBay | Need for location-independent IDs | data integration |
PrefixCommons | Zenodo | No rollup of impact across all DOI versions | DOI versions |
Monarch Initiative | Monarch's ingest of FlyBase | Faulty ingest process resulted in fly and human genes being considered equivalents instead of orthologs | data integration |
Monarch Initiative | EBI-OLS | Tricky to support searches of identifiers because of the standard query-parsing behavior of Solr | data applications |
Ziemann et al. | Several journals | Gene name corruption in supplementary data affects roughly 20% of papers with supplementary Excel gene lists | data quality |
D. Natale | NCBI's Gene database | A large number of identifiers went stale because strains were declared "out of scope" or for other reasons. In some cases no alternative is offered. Example 1: https://www.ncbi.nlm.nih.gov/gene/?term=5203950. Example 2: https://www.ncbi.nlm.nih.gov/gene/?term=1165308 | data stability |
Monarch Initiative | MassIVE DB | Hashed links like http://massive.ucsd.edu/ProteoSAFe/result.jsp?task=f847302a49e34ab89ebf3ecc2250be96&view=advanced_view, especially when surrounded by a lot of implementation-specific cruft, do not inspire confidence; they even look as though they may be session-specific. Local IDs are supported in more deterministic URIs, but these are virtually unfindable except through trial and error, e.g. https://gnps.ucsd.edu/ProteoSAFe/dataset_id_redirect.jsp?massiveid=MSV000079621 | persistence, documentation |
Monarch Initiative | Incoming links | Other sites link to us using different conventions about leading zeros; e.g. https://monarchinitiative.org/disease/DOID:0050202 isn't correctly formed and leads to a 404 | persistence, integration |
Gene Ontology | Duplicated prefixes in EBI RDF platform | Prefixes for GO IDs are double-encoded and lead to a 404 (EBISPOT/RDF-platform#3) | persistence, integration |
Monarch Initiative | OMIM links to ClinicalTrials.gov | Lack of identified 'hooks' into ClinicalTrials.gov means that searching for an entity leads to false positives | integration |
Monarch Initiative | Link rot | Ruins Halloween | persistence |
Gene Ontology | Gene Ontology xrefs | Russian-doll nesting of ID-minting authorities | integration |
Monarch Initiative | Prefix collision | FB is used as a prefix for both FaceBase and FlyBase | integration |
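
Several of the rows above (underscore vs colon delimiters, case sensitivity, OMIM vs MIM) come down to one identifier being written in more than one way. The following is a minimal Python sketch of how such variants might be collapsed into a single `prefix:local_id` form. The `PREFIX_SYNONYMS` table and the `normalize_curie` function are hypothetical names introduced here for illustration, not part of any existing tool, and the contents of the synonym table are an assumption that a real project would have to curate.

```python
import re

# Hypothetical synonym table (an assumption for this sketch): maps each
# upper-cased prefix variant to the single canonical prefix to emit.
PREFIX_SYNONYMS = {
    "MIM": "OMIM",
    "OMIM": "OMIM",
    "GO": "GO",
    "DOID": "DOID",
}

# A prefixed identifier: a prefix, then ':' or '_' as delimiter, then a local ID.
_PREFIXED_ID = re.compile(r"^([A-Za-z][A-Za-z0-9.]*)[:_](\S+)$")


def normalize_curie(raw: str) -> str:
    """Return `raw` rewritten as CANONICAL_PREFIX:local_id.

    Raises ValueError if the input has no recognizable prefix or if the
    prefix is not in PREFIX_SYNONYMS.
    """
    match = _PREFIXED_ID.match(raw.strip())
    if match is None:
        raise ValueError(f"not a prefixed identifier: {raw!r}")
    prefix, local_id = match.groups()
    # Prefix lookup is case-insensitive; the local ID is left untouched.
    canonical = PREFIX_SYNONYMS.get(prefix.upper())
    if canonical is None:
        raise ValueError(f"unknown or ambiguous prefix: {prefix!r}")
    return f"{canonical}:{local_id}"
```

With this table, `normalize_curie("MIM:154700")` returns `"OMIM:154700"` and `normalize_curie("GO_0008150")` returns `"GO:0008150"`. The local ID is kept as a string so that leading zeros survive, and a colliding prefix such as FB is deliberately left out of the table so that it raises an error instead of being silently resolved to the wrong database.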