Dataset Preprocessing, additional Export Formats and Options #85

Open
FrankGrimm opened this issue Oct 30, 2020 · 0 comments
FrankGrimm commented Oct 30, 2020

Meta-issue to prepare the system for external ML components #27

The CSV export already exists but should be improved to support the integration of external ML components.

Additional export formats

  • metadata (JSON)
    • task metadata description
    • vocabulary table
  • JSON
  • ARFF
  • libsvm
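As an illustration of the libsvm target, here is a minimal sketch of how one sample could be serialized. The helper names `build_vocabulary` and `to_libsvm_line`, the bag-of-words counts as feature values, and the integer labels are assumptions made for this example, not part of the existing exporter.

```python
from typing import Dict, List


def build_vocabulary(docs: List[List[str]]) -> Dict[str, int]:
    """Map each distinct token to a feature index.

    Indices start at 1, which is the usual convention in libsvm files.
    """
    vocab: Dict[str, int] = {}
    for tokens in docs:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_libsvm_line(label: int, tokens: List[str], vocab: Dict[str, int]) -> str:
    """Serialize one sample as '<label> <index>:<count> ...'.

    Tokens missing from the vocabulary are skipped (an assumption of this sketch).
    Feature indices are written in ascending order.
    """
    counts: Dict[int, int] = {}
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            counts[idx] = counts.get(idx, 0) + 1
    features = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{label} {features}".rstrip()
```

The same vocabulary mapping could back the vocabulary table in the JSON metadata export, so that feature indices in the libsvm file can be mapped back to terms.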

Make sure this is flexible enough to support CoNLL-U later on for token-level tasks (this might require hard-coding a few assumptions about preprocessing and separating tokenization from generic feature extraction).

All implementations should allow for bulk downloads, as well as API access with pagination.
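One way to serve both access modes from the same code path is to implement the bulk download as an iteration over pages. The sketch below assumes an in-memory sequence of rows and a hypothetical response shape (`page`, `page_size`, `total`, `total_pages`, `items`); a real implementation would page at the database level instead.

```python
from typing import Any, Dict, Iterator, Sequence


def paginate(rows: Sequence[Any], page: int, page_size: int) -> Dict[str, Any]:
    """Return one 1-based page of rows plus the paging metadata."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    total = len(rows)
    total_pages = (total + page_size - 1) // page_size  # ceiling division
    start = (page - 1) * page_size
    return {
        "page": page,
        "page_size": page_size,
        "total": total,
        "total_pages": total_pages,
        "items": list(rows[start:start + page_size]),
    }


def iter_bulk(rows: Sequence[Any], page_size: int) -> Iterator[Any]:
    """Bulk download expressed as a walk over all pages."""
    page = 1
    while True:
        result = paginate(rows, page, page_size)
        yield from result["items"]
        if page >= result["total_pages"]:
            break
        page += 1
```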

Filters:

  • all annotations vs. curated gold annotations only
  • all fields vs. sample_index, id column, text column, gold annotation only
  • preprocessed vs. raw text vs. both
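The three filters above could be bundled into a single options object that every export format shares. This is only a sketch: the `ExportOptions` class, the row keys (`text_preprocessed`, `gold`, `annotations`) and the rule that gold-only exports skip samples without a gold annotation are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator

# Columns kept by the reduced field set (hypothetical key names).
MINIMAL_FIELDS = ("sample_index", "id", "text", "text_preprocessed", "gold")


@dataclass(frozen=True)
class ExportOptions:
    gold_only: bool = False       # curated gold annotations only
    minimal_fields: bool = False  # reduced column set instead of all fields
    text_mode: str = "raw"        # "raw", "preprocessed" or "both"


def apply_filters(
    rows: Iterable[Dict[str, Any]], opts: ExportOptions
) -> Iterator[Dict[str, Any]]:
    """Yield filtered copies of the rows; the input rows are not modified."""
    if opts.text_mode not in ("raw", "preprocessed", "both"):
        raise ValueError(f"unknown text_mode: {opts.text_mode!r}")
    for row in rows:
        if opts.gold_only and row.get("gold") is None:
            continue  # assumption: samples without gold are skipped
        out = dict(row)
        if opts.gold_only:
            out.pop("annotations", None)
        if opts.text_mode == "raw":
            out.pop("text_preprocessed", None)
        elif opts.text_mode == "preprocessed":
            out.pop("text", None)
        if opts.minimal_fields:
            out = {k: v for k, v in out.items() if k in MINIMAL_FIELDS}
        yield out
```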

Data model changes

  • add DatasetContent fields to store preprocessed values (e.g. tokenized text)
  • add a separate table for a feature vocabulary (in combination with the above, this builds a simple feature store); include metrics like term frequency (tf) and document frequency (df)
  • add an inherited Annotation sub-class that stores model predictions and confidence values (or uncertainty measures); this might require a separate entity if we also want to store topic model or clustering results
  • figure out what to do with model artifacts for continuous learning scenarios
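To make the vocabulary table more concrete, the sketch below computes tf and df from already tokenized samples. `VocabularyEntry` and `build_vocabulary_table` are hypothetical names; here tf is the total number of occurrences of a term across the dataset and df is the number of samples that contain it, which is one common reading of those metrics.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class VocabularyEntry:
    term: str
    tf: int  # total occurrences across all samples
    df: int  # number of samples containing the term


def build_vocabulary_table(
    tokenized_docs: Iterable[List[str]],
) -> Dict[str, VocabularyEntry]:
    """Aggregate term and document frequencies over tokenized samples."""
    tf: Counter = Counter()
    df: Counter = Counter()
    for tokens in tokenized_docs:
        tf.update(tokens)       # every occurrence counts
        df.update(set(tokens))  # each sample counts at most once per term
    return {term: VocabularyEntry(term, tf[term], df[term]) for term in sorted(tf)}
```

In a relational schema, each entry would become one row keyed by dataset and term, with the tokenized text stored on DatasetContent as the input.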

FrankGrimm added the enhancement label Oct 30, 2020
FrankGrimm added this to the v1.4.x milestone Oct 30, 2020
FrankGrimm self-assigned this Oct 30, 2020