Version 5.3 and above always crash (fasttext) #25

Open
Aculeasis opened this issue Oct 15, 2021 · 8 comments

@Aculeasis

erikvl87/languagetool:5.2 works fine:

The following configuration is passed to LanguageTool:
fasttextBinary=/fasttext/fasttext
fasttextModel=/fasttext/lid.176.bin
languageModel=/ngrams
+ java -Xms512m -Xmx2g -cp languagetool-server.jar org.languagetool.server.HTTPServer --port 8010 --public --allow-origin '*' --config config.properties
2021-10-15 21:53:06 +0000 INFO  org.languagetool.server.DatabaseAccess Not setting up database access, dbDriver is not configured
2021-10-15 21:53:06 +0000 WARNING: running in HTTP mode, consider running LanguageTool behind a reverse proxy that takes care of encryption (HTTPS)
2021-10-15 21:53:06 +0000 WARNING: running in public mode, LanguageTool API can be accessed without restrictions!
2021-10-15 21:53:07 +0000 INFO  org.languagetool.language.LanguageIdentifier Started fasttext process for language identification: Binary /fasttext/fasttext with model @ /fasttext/lid.176.bin
2021-10-15 21:53:07 +0000 Setting up thread pool with 10 threads
2021-10-15 21:53:07 +0000 Starting LanguageTool 5.2 (build date: 2020-12-30 14:55, eb572bf) server on http://localhost:8010...
2021-10-15 21:53:07 +0000 Server started

But newer versions (5.3 and above) crash on startup:

fasttextBinary=/fasttext/fasttext
fasttextModel=/fasttext/lid.176.bin
languageModel=/ngrams
fasttextBinary=/fasttext/fasttext
fasttextModel=/fasttext/lid.176.bin
languageModel=/ngrams
fasttextBinary=/fasttext/fasttext
fasttextModel=/fasttext/lid.176.bin
languageModel=/ngrams
fasttextBinary=/fasttext/fasttext
fasttextModel=/fasttext/lid.176.bin
languageModel=/ngrams
fasttextBinary=/fasttext/fasttext
fasttextModel=/fasttext/lid.176.bin
languageModel=/ngrams
fasttextBinary=/fasttext/fasttext
fasttextModel=/fasttext/lid.176.bin
languageModel=/ngrams
fasttextBinary=/fasttext/fasttext
fasttextModel=/fasttext/lid.176.bin
languageModel=/ngrams
+ java -Xms512m -Xmx2g -cp languagetool-server.jar org.languagetool.server.HTTPServer --port 8010 --public --allow-origin '*' --config config.properties
2021-10-15 22:01:37.371 +0000 INFO  org.languagetool.server.DatabaseAccess Not setting up database access, dbDriver is not configured
2021-10-15 22:01:37 +0000 WARNING: running in HTTP mode, consider running LanguageTool behind a reverse proxy that takes care of encryption (HTTPS)
2021-10-15 22:01:37 +0000 WARNING: running in public mode, LanguageTool API can be accessed without restrictions!
Exception in thread "main" java.lang.RuntimeException: Could not start LanguageTool HTTP server on localhost, port 8010
	at org.languagetool.server.HTTPServer.main(HTTPServer.java:153)
Caused by: org.languagetool.server.PortBindingException: LanguageTool HTTP server could not be started on host "null", port 8010.
Maybe something else is running on that port already?
	at org.languagetool.server.HTTPServer.<init>(HTTPServer.java:119)
	at org.languagetool.server.HTTPServer.main(HTTPServer.java:147)
Caused by: java.lang.RuntimeException: Could not start fasttext process for language identification @ /fasttext/fasttext with model @ /fasttext/lid.176.bin
	at org.languagetool.language.LanguageIdentifier.enableFasttext(LanguageIdentifier.java:118)
	at org.languagetool.server.TextChecker.<init>(TextChecker.java:109)
	at org.languagetool.server.V2TextChecker.<init>(V2TextChecker.java:45)
	at org.languagetool.server.LanguageToolHttpHandler.<init>(LanguageToolHttpHandler.java:74)
	at org.languagetool.server.HTTPServer.<init>(HTTPServer.java:105)
	... 1 more
Caused by: java.io.IOException: Cannot run program "/fasttext/fasttext": error=2, No such file or directory
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
	at org.languagetool.language.FastText.<init>(FastText.java:43)
	at org.languagetool.language.LanguageIdentifier.enableFasttext(LanguageIdentifier.java:115)
	... 5 more
Caused by: java.io.IOException: error=2, No such file or directory
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
	at java.base/java.lang.ProcessBuilder.start(Process

I built fasttext from here and downloaded lid.176.bin from here (I think).
My docker runner:

docker run -d --name="Languagetool" \
-p 8081:8010/tcp \
-e Java_Xms=512m \
-e Java_Xmx=2g \
-e langtool_languageModel=/ngrams \
-e langtool_fasttextModel=/fasttext/lid.176.bin \
-e langtool_fasttextBinary=/fasttext/fasttext \
-v "/mnt/hdd1/languagetool/ngrams":"/ngrams" \
-v "/mnt/hdd1/languagetool/fasttext":"/fasttext" \
--restart=unless-stopped \
erikvl87/languagetool:5.2

docker version:

Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.8
 Git commit:        20.10.7-0ubuntu1~20.04.2
 Built:             Fri Oct  1 14:07:06 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       20.10.7-0ubuntu1~20.04.2
  Built:            Fri Oct  1 03:27:17 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.5.2-0ubuntu1~20.04.3
  GitCommit:        
 runc:
  Version:          1.0.0~rc95-0ubuntu1~20.04.2
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        

So, what am I doing wrong?

@Erikvl87 Erikvl87 self-assigned this Oct 16, 2021
@dprothero

I'm still working my way through this and haven't gotten all the way there yet, but I did resolve the "No such file or directory" issue: the fasttext binary has to be built on Alpine Linux to work. I'll post my completed setup once I get it working. Now I'm getting java.lang.OutOfMemoryError while loading the ngram data for language identification.
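
For anyone hitting the same error: "error=2, No such file or directory" on a file that clearly exists usually points at the dynamic loader named inside the binary rather than at the binary itself. A binary built on a glibc distribution asks for a loader that an Alpine (musl) based container does not ship. The sketch below is one rough way to check this on the machine where fasttext was built. It assumes `readelf` (binutils) is installed, and the helper names `elf_interp` and `libc_of_interp` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not part of LanguageTool or the Docker image.

# Print the ELF interpreter (dynamic loader) that a binary requests.
# Assumes `readelf` from binutils is installed; prints nothing for
# static binaries or when readelf is missing.
elf_interp() {
  readelf -l "$1" 2>/dev/null |
    sed -n 's/.*Requesting program interpreter: \(.*\)\]$/\1/p'
}

# Classify a loader path as musl (Alpine) or glibc (Debian, Ubuntu, ...).
libc_of_interp() {
  case "$1" in
    */ld-musl-*) echo musl ;;
    */ld-linux*) echo glibc ;;
    "")          echo static-or-unknown ;;
    *)           echo unknown ;;
  esac
}

# Example call (not executed here):
#   libc_of_interp "$(elf_interp /mnt/hdd1/languagetool/fasttext/fasttext)"
```

If this reports glibc for the mounted binary, rebuilding it on Alpine (or linking it statically) is the likely fix.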

@dprothero

If you create a Dockerfile in an empty folder with these contents:

FROM alpine as ftbuild

RUN apk update && apk add \
        build-base \
        wget \
        git \
        unzip \
        && rm -rf /var/cache/apk/*

RUN git clone https://github.com/facebookresearch/fastText.git /tmp/fastText && \
  rm -rf /tmp/fastText/.git* && \
  mv /tmp/fastText/* / && \
  cd / && \
  make

RUN wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

RUN wget https://languagetool.org/download/ngram-lang-detect/model_ml50_new.zip

FROM erikvl87/languagetool

COPY --chown=languagetool --from=ftbuild /fasttext .
COPY --chown=languagetool --from=ftbuild /model_ml50_new.zip .
COPY --chown=languagetool --from=ftbuild /lid.176.bin .

ENV Java_Xms=512m
ENV Java_Xmx=1500m
ENV langtool_fasttextBinary=/LanguageTool/fasttext
ENV langtool_ngramLangIdentData=/LanguageTool/model_ml50_new.zip
ENV langtool_fasttextModel=/LanguageTool/lid.176.bin

You can then build it with:

docker build -t docker-languagetool-fasttext .

And then you would run it like so (this is based on the command you provided above):

docker run -d --name="Languagetool" \
-p 8081:8010/tcp \
-e langtool_languageModel=/ngrams \
-v "/mnt/hdd1/languagetool/ngrams":"/ngrams" \
--restart=unless-stopped \
docker-languagetool-fasttext
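
Once the container is up, a quick smoke test is to send a request with `language=auto` so the server has to run language detection. This is only a sketch: it assumes `curl` is installed and that the server is reachable on the published port 8081 from the command above, and `lt_check_auto` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: send text to a LanguageTool server and let it
# detect the language itself (language=auto).
#   $1 = text to check
#   $2 = base URL (optional, defaults to the port published above)
lt_check_auto() {
  curl -s \
    --data-urlencode "text=$1" \
    --data-urlencode "language=auto" \
    "${2:-http://localhost:8081}/v2/check"
}

# Example call (not executed here):
#   lt_check_auto "Das ist ein kurzer Test."
```

The JSON response should include a `language` entry describing what was detected. If fasttext failed to start, the server log is the place to look.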

@Aculeasis
Author

Yes, it starts, and I have the same problem with java.lang.OutOfMemoryError:

java.lang.OutOfMemoryError: Java heap space
	at org.apache.lucene.util.fst.FST.<init>(FST.java:387)
	at org.apache.lucene.util.fst.FST.<init>(FST.java:313)
	at org.apache.lucene.codecs.blocktree.FieldReader.<init>(FieldReader.java:91)
	at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:231)
	at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:446)
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:261)
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:341)
	at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:104)
	at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:65)
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:58)
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:50)
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:731)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:50)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
	at org.languagetool.languagemodel.LuceneSingleIndexLanguageModel$LuceneSearcher.<init>(LuceneSingleIndexLanguageModel.java:241)
	at org.languagetool.languagemodel.LuceneSingleIndexLanguageModel$LuceneSearcher.<init>(LuceneSingleIndexLanguageModel.java:229)
	at org.languagetool.languagemodel.LuceneSingleIndexLanguageModel.getCachedLuceneSearcher(LuceneSingleIndexLanguageModel.java:182)
	at org.languagetool.languagemodel.LuceneSingleIndexLanguageModel.addIndex(LuceneSingleIndexLanguageModel.java:118)
	at org.languagetool.languagemodel.LuceneSingleIndexLanguageModel.<init>(LuceneSingleIndexLanguageModel.java:95)
	at org.languagetool.languagemodel.LuceneLanguageModel.<init>(LuceneLanguageModel.java:65)
	at org.languagetool.Language.initLanguageModel(Language.java:180)
	at org.languagetool.language.English.getLanguageModel(English.java:144)
	at org.languagetool.JLanguageTool.activateLanguageModelRules(JLanguageTool.java:566)
	at org.languagetool.server.Pipeline.activateLanguageModelRules(Pipeline.java:121)
	at org.languagetool.server.PipelinePool.createPipeline(PipelinePool.java:204)
	at org.languagetool.server.PipelinePool.getPipeline(PipelinePool.java:180)
	at org.languagetool.server.TextChecker.getPipelineResults(TextChecker.java:757)
	at org.languagetool.server.TextChecker.getRuleMatches(TextChecker.java:711)
	at org.languagetool.server.TextChecker.access$000(TextChecker.java:56)
	at org.languagetool.server.TextChecker$1.call(TextChecker.java:427)
	at org.languagetool.server.TextChecker$1.call(TextChecker.java:420)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)

@Erikvl87
Owner

Erikvl87 commented Nov 3, 2021

Sorry to have kept you waiting. Unfortunately, I haven't had time to look into this yet. I'll do my best to take a look soon.
Meanwhile, would the solution provided by @dprothero work in combination with increasing the memory options?

You can do this by increasing the Java_Xms and Java_Xmx variables. In the Dockerfile example given above, that means increasing these lines (e.g. to 1g and 2g respectively):

ENV Java_Xms=512m
ENV Java_Xmx=1500m

Alternatively, take a look at the Java heap size settings explained over here:
https://github.com/Erikvl87/docker-languagetool#java-heap-size
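
The heap values can also be overridden at run time with `-e`, without rebuilding the image. Below is a minimal sketch that sanity-checks the two values and prints the resulting command instead of running it. The helper names `to_mib` and `lt_run_cmd` are made up, the image name and paths are taken from the examples earlier in this thread, and only `m` and `g` suffixes are handled.

```shell
#!/bin/sh
# Hypothetical helpers for choosing heap values.

# Convert a JVM-style size (e.g. 512m, 2g) to MiB.
# Only the m/M and g/G suffixes are supported in this sketch.
to_mib() {
  case "$1" in
    *[gG]) echo $(( ${1%?} * 1024 )) ;;
    *[mM]) echo "${1%?}" ;;
    *)     echo "unsupported size: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) a docker run command with the given initial ($1)
# and maximum ($2) heap. Fails if the initial heap exceeds the maximum,
# which the JVM would reject at startup.
lt_run_cmd() {
  xms=$1
  xmx=$2
  min=$(to_mib "$xms") || return 1
  max=$(to_mib "$xmx") || return 1
  if [ "$min" -gt "$max" ]; then
    echo "Java_Xms ($xms) must not exceed Java_Xmx ($xmx)" >&2
    return 1
  fi
  echo "docker run -d --name=Languagetool -p 8081:8010/tcp" \
       "-e Java_Xms=$xms -e Java_Xmx=$xmx" \
       "-e langtool_languageModel=/ngrams" \
       "-v /mnt/hdd1/languagetool/ngrams:/ngrams" \
       "docker-languagetool-fasttext"
}

# Example call (not executed here):
#   lt_run_cmd 1g 2g
```

Values passed with `-e` take precedence over the `ENV` defaults baked into the image, so this is a convenient way to experiment with different sizes.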

@Erikvl87
Owner

@Aculeasis, the solution provided by @dprothero seems to work here as well.

I think the example above is useful to include in the README.md so I will keep this ticket open until I've updated the readme file.

@Aculeasis
Author

Sorry for the delay.
I set 1g and 2g. It works, but it sometimes fails.
So I set 2g and 4g, and it works well. But isn't 4 GB too much?

@Erikvl87
Owner

Erikvl87 commented Nov 22, 2021

@Aculeasis That is a question for the official LanguageTool developers. From what I could find, they don't have an official set of requirements regarding memory configuration:

There's no general rule, it depends on the number of languages being used, the concurrent requests, the text length etc. 2600MB should be enough for most use cases, if you don't have that much, try with less and see how that works.

Source: languagetool-org/languagetool#902 (comment)

@FarisZR

FarisZR commented Jun 7, 2022

Is there a reason this can't be included in the docker image?
