NoSketch Engine is an open-source project combining Manatee and Bonito and Crystal into a powerful and free corpus management and search system. It does not provide however word sketches, thesaurus, keyword computation etc. like the commercial offering https://www.sketchengine.eu/ .
For further information, source and binary packages see https://nlp.fi.muni.cz/trac/noske
You need to provide the verticals for your corpus or corpora and configuration files.
On startup the process for compiling the verticals into data files NoSkE uses is automatically triggered.
The list of the corpora that should be compiled on startup is supplied using CORPLIST
environment variable.
You usually mount them into the container like
docker run --rm -it -v Q:\path\to\your\verticals:/var/lib/manatee/data/verticals -v Q:\path\to\your\configuration-files:/var/lib/manatee/registry -p 8080:8080 -e CORPLIST=my_corpus ghcr.io/acdh-oeaw/noske-ubi9/noske:5.71.15-2.225.8-2-open
or on Linux/MacOS
docker run --rm -it -v $(pwd)/verticals:/var/lib/manatee/data/verticals -v $(pwd)/configuration-files:/var/lib/manatee/registry -p 8080:8080 -e CORPLIST=my_corpus ghcr.io/acdh-oeaw/noske-ubi9/noske:5.71.15-2.225.8-2-open
On Kubernetes you can use a Persistent Volume Claim for the verticals and a config map for the configuration/registry for example. You probably need to populate the verticals using some service pod or initializer.
A corpus registry/configuration file could start with something like the following:
NAME "my_corpus"
## Path where to store the index files
PATH "/var/lib/manatee/data/my_corpus"
## Path where to find vertical files
VERTICAL "| ls /var/lib/manatee/data/verticals/my_corpus/*.txt | xargs cat"
[... Attribute definitions etc. ]
List the configured corpora
CORPLIST
: The list of corpora you configured and want to serve (space separated).
Logging:
If you want to define where to put log files you can set two environment variables
HTTPD_ERROR_LOGFILE
: path to some log fileHTTPD_ACCESS_LOGFILE
: path to some log file
Note that /dev/stdout
and /dev/stderr
do not work. Something similar to /dev/stdout is the default for the access log.
If you need authentication you can configure BASIC authentication by creating first and then mounting a htpasswd file. These two environment variables need to be set:
HTPASSWD_FILE
: where lighttpd will find that file. For example/var/lib/bonito/htpasswd
PASSWD_REALM
: Something that identifies this instance (will show in the password prompt of browsers)
If you enable authentication then every user can have their own settings for the UI. Also this enables users to create their own private subcorpora in the UI. These have to be saved somewhere so two additional directories need to be mounted to the container:
docker run --rm -it -v Q:\path\to\your\verticals:/var/lib/manatee/data/verticals `
-v Q:\path\to\your\configuration-files:/var/lib/manatee/registry `
-v Q:\path\to\your\htpasswd:/var/lib/bonito/htpasswd `
-v Q:\path\to\your\users-subcorpora:/var/lib/bonito/subcorp `
-v Q:\path\to\your\users-options:/var/lib/bonito/options `
-p 8080:8080 -e CORPLIST=my_corpus `
-e HTPASSWD_FILE=/var/lib/bonito/htpasswd -e PASSWD_REALM=my_noske `
ghcr.io/acdh-oeaw/noske-ubi9/noske:5.71.15-2.225.8-2-open
or on Linux/MacOS
docker run --rm -it -v $(pwd)/verticals:/var/lib/manatee/data/verticals \
-v $(pwd)/configuration-files:/var/lib/manatee/registry \
-v $(pwd)/htpasswd:/var/lib/bonito/htpasswd \
-v $(pwd)/users-subcorpora:/var/lib/bonito/subcorp \
-v $(pwd)/users-options:/var/lib/bonito/options \
-p 8080:8080 -e CORPLIST=my_corpus \
-e HTPASSWD_FILE=/var/lib/bonito/htpasswd -e PASSWD_REALM=my_noske \
ghcr.io/acdh-oeaw/noske-ubi9/noske:5.71.15-2.225.8-2-open
Bonito now is an API only. This container is set up in a way that favours API access over CORS security. It mirrors every Origin
request header back as Access-Control-Allow-Origin
header in the response.
A container based on Debian with NoSkE built from its source is available from ELTE-DH/NoSketch-Engine-Docker
RYCHLÝ, Pavel. Manatee/Bonito-A Modular Corpus Manager. In: RASLAN. 2007. p. 65-70.
KILGARRIFF, Adam, et al. The Sketch Engine: Ten Years on. Lexicography, 2014, 1.1: 7-36.
We would like to thank everyone that made these remarkable piece of software possible.
The sources used to build this container are available in the https://github.com/acdh-oeaw/noske-ubi9 repository.
The Dockerfile just automates installing dependencies, downloading of the official SRC rpms and rebuilding it in a EL 9 environment (Rocky 9 and UBI9).