Possible categorization error: dataset try #270

Open
beatrizmilz opened this issue Apr 4, 2024 · 0 comments

Hello!

OTN is a great project, thank you all for it.

This issue aims to document a possible error in the "resolved" categorization.

While using the dataset, Thiago (@thiago-goncalves-souza) and I noticed a possible categorization error in the try dataset (https://opentraits.org/datasets/try).

If we filter OTN to keep only rows that come from the try dataset AND belong to the Animalia kingdom (resolveKingdomName == "Animalia"), we get 5,311 rows.

# Download the data from
# https://github.com/open-traits-network/otn-taxon-trait-summary/blob/main/traits.csv.gz
# and decompress it to traits.csv (readr::read_csv() can also read the .gz file directly)
otn_raw <-
  readr::read_csv("traits.csv")

otn_dataset_try <- otn_raw |>
  # filter only the animal kingdom
  dplyr::filter(resolveKingdomName == "Animalia") |>
  dplyr::filter(datasetId == "https://opentraits.org/datasets/try")


dplyr::glimpse(otn_dataset_try)
# Rows: 5,311
# Columns: 31
# $ taxonIdVerbatim        <chr> "1669", "1669", "1669", "1669", "1669", "1…
# $ scientificNameVerbatim <chr> "Agathis philippinensis", "Agathis philipp…
# $ resolvedTaxonId        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ resolvedTaxonName      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ parentTaxonId          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ family                 <chr> "Araucariaceae", "Araucariaceae", "Araucar…
# $ phylum                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ traitIdVerbatim        <dbl> 37, 3400, 759, 98, 3401, 43, 22, 17, 4, 38…
# $ traitNameVerbatim      <chr> "Leaf phenology type", "Plant growth form …
# $ bucketId               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ bucketName             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ counts                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ datasetId              <chr> "https://opentraits.org/datasets/try", "ht…
# $ numberOfRecords        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 3, …
# $ curator                <chr> "https://opentraits.org/members/brian-s-ma…
# $ accessDate             <date> 2022-08-19, 2022-08-19, 2022-08-19, 2022-…
# $ comment                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ relationName           <chr> "HAS_ACCEPTED_NAME", "HAS_ACCEPTED_NAME", …
# $ resolvedExternalId     <chr> "COL:6635V", "COL:6635V", "COL:6635V", "CO…
# $ resolvedName           <chr> "Agathis philippinensis", "Agathis philipp…
# $ resolvedRank           <chr> "species", "species", "species", "species"…
# $ resolvedCommonNames    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
# $ resolvedPath           <chr> "Biota | Animalia | Arthropoda | Insecta |…
# $ resolvedPathIds        <chr> "COL:5T6MX | COL:N | COL:RT | COL:H6 | COL…
# $ resolvedPathNames      <chr> "unranked | kingdom | phylum | class | ord…
# $ resolvedExternalUrl    <chr> "https://www.catalogueoflife.org/data/taxo…
# $ resolveKingdomName     <chr> "Animalia", "Animalia", "Animalia", "Anima…
# $ resolvedPhylumName     <chr> "Arthropoda", "Arthropoda", "Arthropoda", …
# $ resolvedFamilyName     <chr> "Braconidae", "Braconidae", "Braconidae", …
# $ providedTraitName      <chr> "Leaf phenology type", "Plant growth form …
# $ resolvedTraitName      <chr> "Phenology", "Morphology", "UNCATEGORIZED_…
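
In the output above, the verbatim `family` column says "Araucariaceae" while `resolvedFamilyName` says "Braconidae" for the same record. A minimal sketch of how such disagreements could be listed is shown below. It runs on a toy data frame rather than the real file: the first row copies values visible in the `glimpse()` output, and the second row is made up for contrast. The object names `otn_toy` and `family_mismatch` are illustrative only.

```r
# Toy rows: the first mirrors values shown in the glimpse() output;
# the second is invented to show a record where both families agree.
otn_toy <- data.frame(
  scientificNameVerbatim = c("Agathis philippinensis", "Quercus robur"),
  family                 = c("Araucariaceae", "Fagaceae"),
  resolvedFamilyName     = c("Braconidae", "Fagaceae"),
  resolveKingdomName     = c("Animalia", "Plantae")
)

# Keep rows where both family columns are present but disagree,
# then collapse repeated taxa to one row each.
family_mismatch <- otn_toy |>
  dplyr::filter(!is.na(family), !is.na(resolvedFamilyName)) |>
  dplyr::filter(family != resolvedFamilyName) |>
  dplyr::distinct(scientificNameVerbatim, family,
                  resolvedFamilyName, resolveKingdomName)

print(family_mismatch)
```

On the real data, replacing `otn_toy` with `otn_dataset_try` should give one row per affected taxon, which may be easier to review than the 5,311 raw rows.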

But some of the traits look like plant traits:

otn_dataset_try |>
  dplyr::count(datasetId,
               resolveKingdomName,
               providedTraitName,
               sort = TRUE) |> 
  head() 

| datasetId | resolveKingdomName | providedTraitName | n |
|---|---|---|---:|
| https://opentraits.org/datasets/try | Animalia | Plant growth form | 482 |
| https://opentraits.org/datasets/try | Animalia | Leaf type | 257 |
| https://opentraits.org/datasets/try | Animalia | Leaf compoundness | 255 |
| https://opentraits.org/datasets/try | Animalia | Plant woodiness | 255 |
| https://opentraits.org/datasets/try | Animalia | Leaf phenology type | 178 |
| https://opentraits.org/datasets/try | Animalia | Leaf area (in case of compound leaves: leaflet | 161 |

Here are some of the most frequent categories that appear in resolvedPhylumName/resolvedName from this query:

otn_dataset_try |>
  dplyr::count(datasetId,
               resolveKingdomName,
               resolvedPhylumName,
               resolvedName,
               sort = TRUE) |> 
  head()

| datasetId | resolveKingdomName | resolvedPhylumName | resolvedName | n |
|---|---|---|---|---:|
| https://opentraits.org/datasets/try | Animalia | Mollusca | Ficus | 162 |
| https://opentraits.org/datasets/try | Animalia | Chordata | Salix | 118 |
| https://opentraits.org/datasets/try | Animalia | Arthropoda | Eugenia | 117 |
| https://opentraits.org/datasets/try | Animalia | Arthropoda | Inga | 117 |
| https://opentraits.org/datasets/try | Animalia | Arthropoda | Viola | 94 |
| https://opentraits.org/datasets/try | Animalia | Chordata | Phyllanthus | 88 |
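
One rough way to size the problem beyond the top six names is a suffix heuristic. Family names ending in "-aceae" follow the botanical naming convention, whereas animal families end in "-idae", so a verbatim `family` ending in "-aceae" on a row resolved to Animalia is suspicious. The sketch below is only a heuristic under that assumption (the "-aceae" ending is also used for fungi, algae and bacteria), and it runs on made-up toy rows. The names `otn_toy` and `suspect_rows` are illustrative.

```r
# Toy rows (invented for illustration): only the first one combines
# a botanical-style family name with an Animalia resolution.
otn_toy <- data.frame(
  family             = c("Araucariaceae", "Araucariaceae", "Braconidae", NA),
  resolveKingdomName = c("Animalia", "Plantae", "Animalia", "Animalia"),
  resolvedPhylumName = c("Arthropoda", "Tracheophyta", "Arthropoda", "Chordata")
)

# Flag rows resolved to Animalia whose verbatim family ends in "aceae".
suspect_rows <- otn_toy |>
  dplyr::filter(resolveKingdomName == "Animalia",
                !is.na(family),
                grepl("aceae$", family))

# Summarise the flagged rows by the phylum they were resolved to.
suspect_rows |>
  dplyr::count(resolvedPhylumName, sort = TRUE)
```

Because the filter does not depend on `datasetId`, the same check could be run on `otn_raw` to see whether other datasets are affected too.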