
Commit

wip
mariofusco committed Dec 2, 2024
1 parent 9044696 commit d00f39a
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions _posts/2024-11-29-quarkus-jlama.adoc
@@ -17,22 +17,22 @@ The features provided by these specialized models need to be integrated into the

== How and why to execute LLM inference in pure Java with Jlama

https://github.com/tjake/Jlama[Jlama] is a library that allows executing LLM inference in pure Java. It supports many LLM model families like Llama, Mistral, Qwen2 and Granite. It also implements out of the box many useful LLM-related features like tool calling, embeddings, mixture of experts and even distributed inference.
https://github.com/tjake/Jlama[Jlama] is a library that allows executing LLM inference in pure Java. It supports many LLM model families like Llama, Mistral, Qwen2 and Granite. It also implements out of the box many useful LLM-related features like function calling, model quantization, mixture of experts and even distributed inference.
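
To give an idea of what this looks like in practice, here is a minimal sketch of Jlama's plain Java API, modeled on the examples in the Jlama README; the exact class and method signatures are assumptions and should be verified against the Jlama release you use.

[source,java]
----
// Minimal sketch of direct Jlama usage (based on the project's README examples;
// verify class and method names against the Jlama version in use)
import java.io.File;
import java.util.UUID;

import com.github.tjake.jlama.model.AbstractModel;
import com.github.tjake.jlama.model.ModelSupport;
import com.github.tjake.jlama.model.functions.Generator;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.SafeTensorSupport;
import com.github.tjake.jlama.safetensors.prompt.PromptContext;

public class JlamaExample {

    public static void main(String[] args) throws Exception {
        // Download the model from Hugging Face, or reuse the local copy if already present
        File localModelPath =
                SafeTensorSupport.maybeDownloadModel("./models", "tjake/Llama-3.2-1B-Instruct-JQ4");

        // Load the model, keeping the weights quantized (int8) in memory
        AbstractModel model = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

        // Build the prompt with the model's own chat template when one is available
        PromptContext ctx = model.promptSupport().isPresent()
                ? model.promptSupport().get().builder()
                        .addSystemMessage("You are a helpful assistant.")
                        .addUserMessage("What is the best season to plant avocados?")
                        .build()
                : PromptContext.of("What is the best season to plant avocados?");

        // Generate at most 256 tokens with temperature 0; the callback could stream partial tokens
        Generator.Response response = model.generate(UUID.randomUUID(), ctx, 0.0f, 256, (token, time) -> {});
        System.out.println(response.responseText);
    }
}
----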

Jlama is well integrated with Quarkus through the https://quarkus.io/extensions/io.quarkiverse.langchain4j/quarkus-langchain4j-jlama/[dedicated langchain4j-based extension]. Note that, for performance reasons, Jlama uses the https://openjdk.org/jeps/469[Vector API], which is still incubating in Java 23 and will very likely be released as a supported feature in Java 25.
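
With the extension, the embedded LLM is configured like any other Quarkus service and consumed through a LangChain4j AI service. The sketch below assumes the quarkus-langchain4j-jlama extension is on the classpath; the interface and prompt text are purely illustrative, and the configuration key in the comment is an assumption to be checked against the extension documentation.

[source,java]
----
// Minimal AI service sketch, assuming the quarkus-langchain4j-jlama extension is on the classpath.
// The embedded model would be selected in application.properties, e.g. (key name to be verified
// against the extension docs):
//   quarkus.langchain4j.jlama.chat-model.model-name=tjake/Llama-3.2-1B-Instruct-JQ4
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import io.quarkiverse.langchain4j.RegisterAiService;

@RegisterAiService
public interface Summarizer {

    @SystemMessage("You are a helpful assistant that writes concise summaries.")
    @UserMessage("Summarize the following text in one short paragraph: {text}")
    String summarize(String text);
}
----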

In essence, Jlama makes it possible to serve an LLM in Java, directly embedded in the same JVM that runs your Java application. But why could this be useful? It is actually desirable in many use cases and brings a number of relevant advantages, such as the following:

. *Similar lifecycle between model and app*: There are use cases where the model and the application using it share the same lifecycle, so that the development of a new feature in the application also requires a change in the model. Similarly, since prompts are very dependent on the model, when the model is updated, even through fine-tuning, your prompts may need to be revised. In these situations, having the model embedded in the application helps simplify the versioning and traceability of the development cycle.
. *Fast development/prototyping*: Not having to install, configure and interact with an external server can make the development of an LLM-based Java application much easier.
. *Easy model testing*: Running the LLM inference embedded in the JVM also makes it easier to test different models and their integration during the development phase.
. *Security/Portability/Performance*: Performing the model inference in the same JVM instance that runs the application using it eliminates the need to interact with the LLM only through REST calls, which not only may be impossible in specific secure contexts, but also comes with the performance cost of an avoidable remote call.
. *Legacy support*: The former point is especially beneficial for legacy users still running monolithic applications, who can in this way also include LLM-based capabilities in those applications without changing their architecture or platform.
. *Security*: Performing the model inference in the same JVM instance that runs the application using it eliminates the need to interact with the LLM only through REST calls, thus preventing leaks of private data and allowing user authorization to be enforced at a much finer grain.
. *Monolithic applications support*: The former point is also beneficial for users still running monolithic applications, who can in this way include LLM-based capabilities in those applications without changing their architecture or platform.
. *Monitoring and Observability*: Running the LLM inference in pure Java also simplifies monitoring and observability, for instance by gathering statistics on the reliability and speed of the LLM responses.
. *Developer Experience*: Debuggability is simplified in the same way, allowing the Java developer to also navigate and debug the Jlama code if necessary.
. Distribution: Being able to run LLM inference embedded in the same Java process also makes it possible to include the model itself in the same fat jar as the application using it (even though this is probably advisable only in very specific circumstances).
. *Distribution*: Being able to run LLM inference embedded in the same Java process also makes it possible to include the model itself in the same application package as the application using it (even though this is probably advisable only in very specific circumstances).
. *Edge friendliness*: The possibility of implementing and deploying a self-contained, LLM-capable Java application also makes it a better fit than a client/server architecture for edge environments.
. *Embedding of auxiliary LLMs*: Many applications, especially those relying on agentic AI patterns, use many different LLMs at once. For instance, a smaller LLM could be used to validate and approve the responses of the main, bigger one. In this case a hybrid approach could be convenient, embedding the smaller auxiliary LLMs while still serving the main one through a dedicated server.
. *Similar lifecycle between model and app*: There are use cases where the model and the application using it share the same lifecycle, so that the development of a new feature in the application also requires a change in the model. In these situations, having the model embedded in the application helps simplify the development cycle.

== The site summarizer: a pure Java LLM-based application

