Data #1

Open
seltmann opened this issue Jun 2, 2015 · 3 comments

Comments

@seltmann
Member

seltmann commented Jun 2, 2015

@jhpoelen @mjcollin I added a small dataset of parasitoid data here. I can provide messier data as well, but thought this might be a good start. The aec:associatedFloraNotes and aec:associatedFaunaNotes fields hold the verbatim text from the labels.

@jhpoelen

jhpoelen commented Jun 3, 2015

Thanks @seltmann! I'm hoping to put it to use (at least as an example) in the next couple of days. Do you happen to have a methods paper of sorts that describes your text extraction process?

@seltmann
Member Author

seltmann commented Jun 3, 2015

@jhpoelen No, I don't have a methods paper, although the general idea I tried was the same as in the PLOS ONE paper "Utilizing Descriptive Statements from BHL". It's semi-manual: proposed structures are offered to a person to review, and the script somehow "learns" from the decisions the person makes. I did this in a very rudimentary way by creating dictionaries of meaningful words and stop-word lists.

I first went through the notes and created a dictionary of words/definitions for the association relationships, based on the dataset I am using (since all note sets differ somewhat depending on taxa). That was done by randomly picking notes and having a person define which part of the note was meaningful, along with its definition (or mapping to an ontology term). "Ex." is Latin for "from", and I think in the majority of cases, particularly with parasitoids, it can be assumed to map to "emerged_from". The cool thing is that there is actually a great deal of repetition, so the dictionary needs very few entries; the challenge comes with all of the variations on each entry. "Ex." could be entered as "Ex ", "Ex.", "ex", or "ex.", for example. If the note is only a scientific name, the default is the generic "associates_with".
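
To make the variation problem concrete, here is a minimal sketch of the relationship dictionary, not the actual script. It assumes the variants differ only in case, surrounding whitespace, and a trailing period; the names `RELATION_TERMS`, `normalize_token`, and `relation_for` are made up for illustration, and only the "ex" to "emerged_from" entry and the "associates_with" default come from the description above.

```python
# Hypothetical relationship dictionary: normalized label word -> relation term.
# Only the "ex" entry is taken from the discussion; a real dictionary would be
# filled in by a person reviewing randomly sampled notes.
RELATION_TERMS = {"ex": "emerged_from"}
DEFAULT_RELATION = "associates_with"


def normalize_token(token: str) -> str:
    """Collapse variants such as 'Ex ', 'Ex.', 'ex', 'ex.' to the single key 'ex'."""
    return token.strip().rstrip(".").lower()


def relation_for(note: str) -> str:
    """Return the relation for the first dictionary word in the note, else the default."""
    for token in note.split():
        term = RELATION_TERMS.get(normalize_token(token))
        if term is not None:
            return term
    return DEFAULT_RELATION


if __name__ == "__main__":
    for example in ["Ex. Manduca sexta", "ex Manduca sexta", "Manduca sexta"]:
        print(example, "->", relation_for(example))
```

Normalizing before lookup keeps the dictionary itself small: one entry per meaningful word rather than one per spelling.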

A second dictionary was created based on insect scientific names (host names) and plant scientific names. I used taxon names we had in our database, but I suspect a name service could help with this, although I did not try to incorporate one.
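
A rough sketch of how such a name dictionary could be used follows. It assumes names are two-word binomials and that a plain in-memory lookup stands in for the database; `KNOWN_NAMES`, `find_names`, and the two example species are placeholders chosen for illustration, not entries from the dataset.

```python
# Hypothetical name dictionary: lowercase scientific name -> kind of organism.
# In practice this would be loaded from a taxon table or a name service.
KNOWN_NAMES = {
    "manduca sexta": "insect",
    "quercus alba": "plant",
}


def find_names(note):
    """Return (name, kind) for each adjacent word pair in the note that is a known name."""
    # Strip common punctuation so "sexta," still matches "sexta".
    words = [w.strip(".,;:()") for w in note.split()]
    hits = []
    for first, second in zip(words, words[1:]):
        candidate = f"{first} {second}"
        kind = KNOWN_NAMES.get(candidate.lower())
        if kind is not None:
            hits.append((candidate, kind))
    return hits


if __name__ == "__main__":
    print(find_names("ex. Manduca sexta on Quercus alba"))
```

This only catches exact two-word matches; misspellings, abbreviated genera, and subspecies would need extra handling.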

I hard-coded some observations about how people write host information on labels. For example, if a person uses "ex.", the words that follow are most often the name of the host, and that host is an insect for this dataset, based on its biology. It would be very interesting to have the script learn label structure as well, but I never went that far.
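
The "ex." rule could look roughly like the sketch below. It assumes the host name is at most the two words right after "ex." and that every host in this dataset is an insect; `extract_host` and its `max_words` parameter are hypothetical, not part of the original script.

```python
def extract_host(note, max_words=2):
    """Apply the hard-coded rule: the words right after 'ex.' name the host.

    Returns a dict describing the association, or None if the rule does not fire.
    """
    words = note.split()
    for i, word in enumerate(words):
        # Accept the spelling variants "Ex", "Ex.", "ex", "ex.".
        if word.strip().rstrip(".").lower() == "ex":
            following = [w.strip(".,;:") for w in words[i + 1 : i + 1 + max_words]]
            host = " ".join(w for w in following if w)
            if host:
                # Assumption for this dataset: hosts are insects, based on biology.
                return {"relation": "emerged_from", "host": host, "host_kind": "insect"}
    return None


if __name__ == "__main__":
    print(extract_host("Ex. Manduca sexta"))
    print(extract_host("on Quercus alba"))
```

A note like "ex cocoon of ..." would return the wrong words under this rule, which is why the result still goes to a person for review.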

@debpaul

debpaul commented Jun 3, 2015

Hi Katja,

Wish you were here at the hackathon!
See the list of pitches we're working on right now:

https://docs.google.com/document/d/1ushqk5r5llQVVEcYNhOehiVIVHZG51vbG-Yau1L1oHY/edit?usp=sharing

:-)
Deb

-- Upcoming iDigBio Events https://www.idigbio.org/calendar
-- Deborah Paul, iDigBio Technology Specialist
Institute for Digital Information, 234 LSB
Florida State University
Tallahassee, Florida 32306
850-644-6366
