Name on page incorrectly identified #65

Open
mlichtenberg opened this issue Dec 29, 2020 · 3 comments

Comments

@mlichtenberg

The following feedback was reported by a BHL user:

"https://www.biodiversitylibrary.org/page/2334415 claims that Gastrochaena (Spengleri) Tryon 1862 is a species shown on the page. Gastrochaena is not present; this is a wrong guess based on the species name spengleri. Several of the scientific names actually on the page are not found. I'm sure that having ' in the middle of most of them in the original text does not help locate them. But the program inventing false hits is definitely not a good thing. The rate of false hits might be decreased somewhat if it knew not to look for names in papers published before the name was published."

The OCR text in question is located at https://www.biodiversitylibrary.org/pagetext/2334415. I have confirmed that these results are coming from gnfinder, and are not left over from previous name-finding tools.

@dimus
Member

dimus commented Jan 3, 2021

The problem arises from the following verification of "Spengleri":

{
  "inputId": "ddef4511-d05e-5cca-93a1-711b6e5d6451",
  "input": "Spengleri",
  "matchType": "Exact",
  "bestResult": {
    "dataSourceId": 172,
    "dataSourceTitleShort": "PaleoBioDB",
    "curation": "Curated",
    "recordId": "62300",
    "entryDate": "2020-06-05",
    "matchedName": "Gastrochaena (Spengleri) Tryon 1862",
    "matchedCardinality": 1,
    "matchedCanonicalSimple": "Spengleri",
    "matchedCanonicalFull": "Gastrochaena subgen. Spengleri",
    "currentRecordId": "62300",
    "currentName": "Gastrochaena (Spengleri) Tryon 1862",
    "currentCardinality": 1,
    "currentCanonicalSimple": "Spengleri",
    "currentCanonicalFull": "Gastrochaena subgen. Spengleri",
    "isSynonym": false,
    "editDistance": 0,
    "stemEditDistance": 0,
    "matchType": "Exact"
  },
  "dataSourcesNum": 1,
  "curation": "Curated"
}

This happens because there is a subgenus "Spengleri" whose parent genus is "Gastrochaena". The word "Spengleri" does appear on the page, so the name is reported as found. One example of the name is at https://www.mindat.org/taxon-P62300.html

It is common practice to write a subgenus name in the format 'Genus (Subgenus)'.

I suspect that seeing so much additional information for a false positive is confusing for BHL users, and sadly, we will always have false positives (hopefully fewer of them over time). Maybe it would be better to provide just the canonical form of the name, "Spengleri", instead of the full name that matched, "Gastrochaena (Spengleri) Tryon 1862"?
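To illustrate the suggestion, here is a minimal sketch of picking the simple canonical form out of a verification record. The field names ("bestResult", "matchedCanonicalSimple", "input") are taken from the JSON above; the helper name `display_name` and the fallback to the input string are assumptions made for this example, not part of gnfinder.

```python
import json


def display_name(verification: dict) -> str:
    """Return the simple canonical form of the best result.

    Falls back to the original input string when there is no best result
    (hypothetical behavior chosen for this sketch).
    """
    best = verification.get("bestResult") or {}
    return best.get("matchedCanonicalSimple") or verification.get("input", "")


# Trimmed copy of the verification record quoted above.
record = json.loads("""
{
  "input": "Spengleri",
  "matchType": "Exact",
  "bestResult": {
    "matchedName": "Gastrochaena (Spengleri) Tryon 1862",
    "matchedCanonicalSimple": "Spengleri",
    "matchedCanonicalFull": "Gastrochaena subgen. Spengleri"
  }
}
""")

print(display_name(record))  # Spengleri
```

With this approach the page would list "Spengleri" rather than "Gastrochaena (Spengleri) Tryon 1862". It would still be a false positive, but it would no longer claim a genus and authorship that are absent from the text.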

@dimus
Member

dimus commented Jan 3, 2021

Even if a verified name is "real", we can never guarantee that the verification's best result is not a homonym, with a totally different name being the one mentioned on the page. This is another reason to output only the simple canonical form instead of the full name from the best result.

@dimus
Member

dimus commented Jan 3, 2021

Using a year to filter out some false positives would probably help. However, making it helpful rather than a source of more false negatives and false positives is not a simple task, taking into account that for many BHL items the year is unknown, uncertain, or the result of an input error. The years that we get from name verification are also shaky ground because of homonyms. So introducing filtering by year would require a lot of thought.

See also https://github.com/gnames/gnumsfind

It is a project that I started in order to research finding years and page numbers in texts. It might help to a degree with determining probable year ranges for items without years.
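As a rough sketch of the year-based filtering discussed above, assuming the authorship year can be read from the matched name string and that an item's publication year may be missing: the function names, the regular expression, and the item years used here are all hypothetical, and the sketch does nothing about the homonym problem.

```python
import re
from typing import Optional

# Four-digit years from 1500 to 2099 (an assumed range for this sketch).
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def authorship_year(matched_name: str) -> Optional[int]:
    """Return the last four-digit year found in a name string, if any."""
    years = YEAR_RE.findall(matched_name)
    return int(years[-1]) if years else None


def plausible(matched_name: str, item_year: Optional[int]) -> bool:
    """Keep a match unless both years are known and the name postdates the item."""
    name_year = authorship_year(matched_name)
    if item_year is None or name_year is None:
        # Unknown year on either side: do not filter, to avoid new false negatives.
        return True
    return name_year <= item_year


name = "Gastrochaena (Spengleri) Tryon 1862"
print(plausible(name, 1850))  # False: item is older than the name
print(plausible(name, 1900))  # True
print(plausible(name, None))  # True: item year unknown, keep the match
```

Keeping matches whenever a year is missing is a deliberately conservative choice: it filters less, but it does not turn uncertain metadata into additional false negatives.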
