
Annif: Research backends #21

Open
hortongn opened this issue Apr 6, 2023 · 2 comments
hortongn commented Apr 6, 2023

Learn more about backends in Annif and narrow in on the backend(s) we may want to use.

https://github.com/NatLibFi/Annif/wiki

hortongn moved this from Triage to Todo in App Dev AI Project Apr 6, 2023
hortongn changed the title from "Research backends" to "Annif: Research backends" Apr 13, 2023
haitzlm self-assigned this Apr 13, 2023

haitzlm commented Apr 14, 2023

**Backend summaries:**

  • TF-IDF: The TF-IDF (Term Frequency-Inverse Document Frequency) backend weights each term by how often it appears in a document relative to how rare it is across the training corpus, and suggests subjects by comparing these weighted term vectors. It is a fast, simple associative method that makes a good baseline (see the first sketch after this list).
  • fastText: The fastText backend wraps the fastText library, a shallow neural-network text classifier. Because it uses subword (character n-gram) information, it copes well with misspellings and out-of-vocabulary words, and it trains quickly on large corpora.
  • Omikuji: The Omikuji backend wraps the Omikuji library, an efficient implementation of tree-based extreme multi-label classification algorithms in the Parabel/Bonsai family. It scales to large subject vocabularies but generally needs a substantial training corpus.
  • MLLM: The MLLM (Maui-like Lexical Matching) backend is a lexical method modeled on the Maui tool: it matches terms occurring in the document text against labels in the subject vocabulary and ranks the candidate subjects with a statistical model. It works best when documents explicitly mention the vocabulary terms.
  • STWFSA: The STWFSA backend wraps the stwfsapy library (Simple Thesaurus Weighted Finite State Automata), which builds a weighted finite-state automaton from vocabulary labels and uses it to find and score subject matches in the text. Like MLLM, it is a lexical rather than associative approach.
  • YAKE: The YAKE (Yet Another Keyword Extractor) backend uses unsupervised keyword extraction to pull keyphrases out of a document and then matches them against the subject vocabulary. It needs no training data.
  • SVC: The SVC (Support Vector Classifier) backend is a classic machine learning approach to text classification. It finds hyperplanes that separate the classes in the feature space and is best suited to classification-style vocabularies with a relatively small number of classes.
  • Fusion/Ensemble backends: These backends combine the results of other backends into a final suggestion. The basic ensemble backend averages the scores from its source projects, while the nn_ensemble backend uses a small neural network to learn how to weight and combine them.
  • PAV: The PAV (Pool Adjacent Violators) backend is an ensemble-type backend that applies isotonic regression to calibrate the scores coming from a source project, learning a monotonic mapping from raw scores to estimated probabilities that a suggested subject is correct (see the second sketch after this list).
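
The two sketches below illustrate the general ideas behind the TF-IDF and PAV backends using scikit-learn; they are generic illustrations with made-up toy data, not Annif's internal implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: one pseudo-document of representative text per subject,
# plus a new document we want subject suggestions for.
subject_docs = {
    "cats": "cats felines kittens purring fur whiskers",
    "dogs": "dogs puppies barking fetch leash kennel",
}
new_doc = "a kitten was purring on the warm fur blanket"

vectorizer = TfidfVectorizer()
subject_matrix = vectorizer.fit_transform(subject_docs.values())
doc_vector = vectorizer.transform([new_doc])

# Score each subject by cosine similarity between TF-IDF vectors.
scores = cosine_similarity(doc_vector, subject_matrix)[0]
for subject, score in zip(subject_docs, scores):
    print(f"{subject}: {score:.3f}")
```

Pool Adjacent Violators is the algorithm behind isotonic regression, so scikit-learn's IsotonicRegression can stand in for the score-calibration idea (the scores and labels below are invented):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Raw scores from a source backend and whether each suggested subject
# turned out to be correct (1) or not (0) on a validation set.
raw_scores = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])
was_correct = np.array([0, 0, 1, 0, 1, 1, 0, 1])

# Learn a monotonic mapping from raw scores to calibrated probabilities.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, was_correct)

print(calibrator.predict([0.3, 0.5, 0.85]))
```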

Special backends:

  • The HTTP backend lets Annif call an external suggestion service over a REST API, so models hosted elsewhere can be used as if they were local backends (see the sketch after this list).
  • The Dummy backend always returns the same fixed result regardless of the input document, which makes it useful as a trivial baseline and for testing the pipeline.
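
As a concrete illustration of this kind of HTTP integration, here is a minimal Python sketch that posts a document to the REST suggest endpoint of a running Annif instance (the style of API exposed by api.annif.org); the base URL and project id are placeholders, and the exact path, parameters and response fields should be checked against the Annif wiki and its Swagger docs:

```python
import requests

# Placeholder values -- point these at a real Annif instance and project.
BASE_URL = "http://localhost:5000/v1"
PROJECT_ID = "yso-en"

response = requests.post(
    f"{BASE_URL}/projects/{PROJECT_ID}/suggest",
    data={"text": "A study of machine learning methods for subject indexing."},
)
response.raise_for_status()

# Each suggestion is expected to carry a subject URI, label and score.
for result in response.json().get("results", []):
    print(result["score"], result["uri"], result["label"])
```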
