The Apache OpenNLP library provides binary models for processing natural language text. This repository is intended for the distribution of model files as Maven artifacts.
For additional information, visit the OpenNLP Home Page.
You can use OpenNLP with many languages. Additional demo models are provided here.
The models are fully compatible with the latest OpenNLP release. They can be used for testing or getting started.
Note
Please train your own models for all other, specialized use cases.
Documentation, including JavaDocs, code usage and command-line interface examples, is available here.
You can also follow our mailing lists for news and updates.
We provide Tokenizer, Sentence Detector and Part-of-Speech Tagger models for the following 32 languages:
- Armenian
- Basque
- Bulgarian
- Catalan
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Georgian
- German
- Greek
- Icelandic
- Italian
- Kazakh
- Korean
- Latvian
- Norwegian
- Polish
- Portuguese
- Romanian
- Russian
- Serbian
- Slovak
- Slovenian
- Spanish
- Swedish
- Turkish
- Ukrainian
These models are compatible with OpenNLP >= 1.0.0. Further details are available at the OpenNLP Models page and in the CHANGELOG.
In addition, we provide a Language Detector model, which is able to detect 103 languages, identified by their ISO 639-3 codes. It works well with longer texts that contain at least two sentences in the same language.
It is compatible with OpenNLP >= 1.8.3. Model details are available here.
The Universal Dependencies (UD) community provides a framework for consistent annotation of grammar across different human languages. The project is developing cross-linguistically consistent treebank annotation for 150+ languages.
You can import UD-based model artifacts directly via Maven, SBT or Gradle, for instance:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-models-pos-de</artifactId>
<version>${opennlp.models.version}</version>
</dependency>
Corresponding artifacts exist for all 32 supported languages, listed on the Apache OpenNLP Model page.
The broader langdetect model can be referenced like this:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-models-langdetect</artifactId>
<version>${opennlp.models.version}</version>
</dependency>
With SBT:
libraryDependencies += "org.apache.opennlp" % "opennlp-models-langdetect" % "${opennlp.models.version}"
With Gradle:
implementation group: "org.apache.opennlp", name: "opennlp-models-langdetect", version: "${opennlp.models.version}"
For more details, please check our documentation.
All released sentence detection, tokenization, lemmatizer, and POS tagging models were, and can be, trained via the ud-train.sh script.
It is located in the opennlp-models-training-ud directory in this repository.
Before training UD-based OpenNLP models, the execution environment needs the latest OpenNLP release and the latest set of UD treebanks.
Download the corresponding archive files and extract both into the directory in which the training script resides.
Rename both folders to match the values of the OPENNLP_HOME and UD_HOME variables.
Important
Check the version strings in both variables and adjust them to the versions you have actually downloaded.
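Before starting a long training run, it can help to confirm that both folders are where the variables say they are. The following is a minimal sketch, assuming OPENNLP_HOME and UD_HOME hold the paths of the two extracted folders; the function name check_home_dirs is hypothetical and not part of ud-train.sh.

```shell
# Hypothetical pre-flight check; not part of ud-train.sh.
# Returns 0 if every given directory exists, 1 otherwise.
check_home_dirs() {
  local status=0
  local dir
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "missing directory: $dir" >&2
      status=1
    fi
  done
  return "$status"
}

# Example usage (assumes both variables are set as described above):
# check_home_dirs "$OPENNLP_HOME" "$UD_HOME" || exit 1
```

Reporting every missing folder, rather than stopping at the first, shows both problems in one pass.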
Next, select what type of models should be trained. By default, the script defines:
TRAIN_TOKENIZER="true"
TRAIN_POSTAGGER="true"
TRAIN_SENTDETECT="true"
TRAIN_LEMMATIZER="true"
To switch off a certain type, set the corresponding variable to false.
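Editing the variable by hand is usually enough. For scripted setups, a small helper could flip a flag in place. This is only a sketch: the function disable_training_flag is hypothetical, and it assumes each flag sits at the start of a line in exactly the NAME="true" form shown above.

```shell
# Hypothetical helper; not part of ud-train.sh.
# Rewrites NAME="true" to NAME="false" in the given script file.
disable_training_flag() {
  local file="$1"
  local name="$2"
  local tmp
  tmp="$(mktemp)" || return 1
  if sed "s/^${name}=\"true\"/${name}=\"false\"/" "$file" > "$tmp"; then
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$file"
  else
    rm -f "$tmp"
    return 1
  fi
  rm -f "$tmp"
}

# Example usage: skip lemmatizer training.
# disable_training_flag ud-train.sh TRAIN_LEMMATIZER
```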
By default, the treebanks of the 32 supported languages are included in the MODELS variable of the script.
If only a smaller or different (sub-)set is required, this variable can simply be edited.
Each entry must follow the format <Language>|<2-digit-locale-code>|<UD treebank name>, for example: English|en|EWT or Swedish|sv|Talbanken.
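To illustrate the entry format, the sketch below splits a single entry into its three fields and rejects malformed input. The function parse_model_entry is hypothetical and only mirrors the format described above; it does not reproduce how ud-train.sh itself reads the MODELS variable.

```shell
# Hypothetical validator for one MODELS entry of the form
# <Language>|<2-digit-locale-code>|<UD treebank name>.
# Prints the three fields separated by spaces; returns 1 on malformed input.
parse_model_entry() {
  local entry="$1"
  local language rest code treebank
  case "$entry" in
    *"|"*"|"*) ;;  # at least two separators present
    *) echo "malformed entry: $entry" >&2; return 1 ;;
  esac
  language="${entry%%|*}"   # text before the first separator
  rest="${entry#*|}"        # text after the first separator
  code="${rest%%|*}"
  treebank="${rest#*|}"
  # A third separator means there are too many fields.
  case "$treebank" in *"|"*) treebank="" ;; esac
  if [ -z "$language" ] || [ -z "$code" ] || [ -z "$treebank" ]; then
    echo "malformed entry: $entry" >&2
    return 1
  fi
  printf '%s %s %s\n' "$language" "$code" "$treebank"
}

# Example usage:
# parse_model_entry "English|en|EWT"
```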
Note
The full list of supported languages and related treebanks is available here. However, even for a treebank listed on the UD page, training OpenNLP models might not succeed. If training succeeds, check the evaluation logs (*.eval) to see whether the computed accuracy meets your expectations.
Once you're done with the preparations, check the ud-train.conf file. With this config file, you can adjust the number of threads used for certain training steps.
Moreover, it is possible to adjust the number of iterations (default: 150) to achieve (slightly) better model performance.
Make sure the ud-train.sh script is executable.
On Unix-like environments, this can be achieved by setting the execute bit: chmod 744 ud-train.sh.
Tip
As model training can be a long-running task, depending on CPU type and number of CPU cores, the script should be started inside a screen instance.
Finally, execute the script by invoking ./ud-train.sh and start brewing and enjoying some ☕.
The script logs each training (and evaluation) step per selected language / treebank, thus allowing progress tracking.
After a training step succeeds, a corresponding evaluation step is executed. If you want to skip it, set EVAL_AFTER_TRAINING to false.
If the evaluation is run, the resulting performance (accuracy) is written to files ending with .eval.
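Once a run has finished, the evaluation files can be reviewed in one pass. The sketch below is an assumption-laden convenience: the function show_eval_results is hypothetical, and because the internal layout of the .eval files is not described here, it simply prints each file verbatim under its name.

```shell
# Hypothetical helper; not part of ud-train.sh.
# Prints every *.eval file in the given directory (default: current one).
# Returns 1 if no evaluation file was found.
show_eval_results() {
  local dir="${1:-.}"
  local file
  local found=0
  for file in "$dir"/*.eval; do
    [ -f "$file" ] || continue   # skips the unexpanded pattern if nothing matches
    found=1
    echo "== $file"
    cat "$file"
  done
  [ "$found" -eq 1 ]
}

# Example usage:
# show_eval_results . || echo "no evaluation results yet"
```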
When adding new models to the pom.xml, make sure to also add them to the expected-models.txt file located in opennlp-models-test.
In addition, make sure a sha256 hash is computed on each binary artifact.
The corresponding value must be set or updated correctly for each model type and language.
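One way to compute such a hash is sketched below. It assumes that either sha256sum (common on Linux) or shasum (common on macOS) is installed; the function name sha256_of is hypothetical, and where the resulting value has to be recorded is not covered by this sketch.

```shell
# Hypothetical helper; prints the SHA-256 hex digest of a file.
# Uses sha256sum when available and falls back to shasum -a 256.
sha256_of() {
  local file="$1"
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$file" | cut -d ' ' -f 1
  elif command -v shasum >/dev/null 2>&1; then
    shasum -a 256 "$file" | cut -d ' ' -f 1
  else
    echo "no SHA-256 tool found" >&2
    return 1
  fi
}

# Example usage: print the digest of every model binary in a directory
# (the *.bin pattern is an assumption about how the artifacts are named).
# for f in ./*.bin; do printf '%s  %s\n' "$(sha256_of "$f")" "$f"; done
```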
The Apache OpenNLP project is developed by volunteers and is always looking for new contributors to work on all parts of the project. Every contribution is welcome and needed to make it better. A contribution can be anything from a small documentation typo fix to a new component.
If you would like to get involved, please follow the instructions here.