best way to use gnfinder to find names in tabulated data and get results tabulated as in origin #120
Hi @abubelinha, one way to do this locally is to set up a pipe in Python that talks to the command-line gnfinder on your computer. It would be similar to https://github.com/gnames/gnparser#pipes. Making 2500 separate calls to the API also does not sound too strenuous for the service.
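A minimal sketch of the local approach, assuming `gnfinder` is on your `PATH` and reads text from standard input. The helper name `find_names` and its `command` parameter are hypothetical, and no gnfinder flags are shown here; check `gnfinder -h` for the options that control output format and verification. Note that this starts one process per call rather than keeping a long-lived pipe open.

```python
import subprocess


def find_names(text, command=("gnfinder",)):
    """Send `text` to a command-line tool on stdin and return its stdout.

    `command` defaults to a bare `gnfinder` call (an assumption); pass extra
    flags in the tuple once you have confirmed them with `gnfinder -h`.
    """
    result = subprocess.run(
        list(command),
        input=text,
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # work with str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling `find_names(cell)` once per table cell keeps each result tied to its row, at the cost of about 2500 process launches.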
Thanks @dimus. Anyway, I had not realized that gnfinder returns the start/end position of each name found in the long text string. That could be very useful for my use case.
If you do not mind using the start/end positions, everything should work in one go. However, take #38 into account. If your file is tab-separated, everything will work; if it is comma-separated, you would probably need to preprocess the file and add a space after each comma.
Good point! "Blah blah blah Scientificname_A blah blah Scientificname_B blah blah I use I try to figure out what will happen if the taxon name is right at the end or beginning of the cell (if no separator is added, the two names will be concatenated). Perhaps a space before and after the separator would be better (so 3 characters instead of just one)?
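The idea of joining all cells into one text and mapping each name back to its row by start position can be sketched as follows. This is only an illustration: `join_cells`, `row_for_offset` and the 3-character separator (space, tab, space) are hypothetical, and it assumes the start positions reported by gnfinder are character offsets into the joined text, which should be verified against real output.

```python
import bisect

SEPARATOR = " \t "  # space, tab, space: keeps names in adjacent cells apart


def join_cells(rows):
    """Join (specimen_id, cell_text) pairs into a single text.

    Returns the joined text, the start offset of each cell, and the IDs
    in the same order as the offsets.
    """
    parts, starts, ids = [], [], []
    offset = 0
    for specimen_id, cell in rows:
        starts.append(offset)
        ids.append(specimen_id)
        parts.append(cell)
        offset += len(cell) + len(SEPARATOR)
    return SEPARATOR.join(parts), starts, ids


def row_for_offset(start, starts, ids):
    """Return the specimen ID of the cell that contains offset `start`."""
    index = bisect.bisect_right(starts, start) - 1
    return ids[index]


rows = [("ID1", "Found Pinus pinea here"), ("ID2", "Quercus robur")]
text, starts, ids = join_cells(rows)
# Stand-in for a start position reported by the name finder:
quercus_start = text.index("Quercus")
print(row_for_offset(quercus_start, starts, ids))
```

With this layout a name at the very end of one cell and a name at the very beginning of the next are always separated by whitespace, so they cannot be concatenated.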
Originally gnfinder was made to detect names in BHL, so it uses a space of any kind as a separator between words. The
Several spaces are OK.
CSV and TSV files should work fine, because they are going to be normalized to plain text with spaces.
Hello
I am planning to use gnfinder to process a column from a table with about 2500 rows.
So, in fact, what I need to pass to gnfinder is each cell of the second column, to extract names from it and return matches against some preferred name sources. But of course, I need to keep the returned info associated with each specimen ID (the first column in my table).
I was planning to use the API but I suppose I could try to use the CLI if it is more suitable to this purpose.
Thanks a lot
EDIT: I am not sure whether this is related to #56, but I am not using R dataframes, just processing a CSV file in Python.