Jlama: Running LLMs Locally in Java

Jlama is an inference engine for Java that lets developers run large language models (LLMs) directly on a local machine, without calls to an external API. Version 0.8.4, used in the example below, is a library that can be embedded in a Java application; its jlama-native module supplies platform-specific native code.

Why This Matters

Many LLM applications currently depend on remote APIs, which introduces latency, cost, and data privacy concerns. Jlama addresses these drawbacks by running inference locally. The trade-off is that performance depends heavily on the local machine's hardware, and models that fit on a local machine are generally smaller than those offered by cloud providers.

Key Insights

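Java 21 and the Vector API: Jlama targets Java 21 and uses the Vector API for optimized performance. In Java 21 the Vector API is an incubator module (jdk.incubator.vector) rather than a preview language feature, so it has to be enabled explicitly, typically with --add-modules jdk.incubator.vector alongside --enable-preview.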
Model Loading: Jlama can load models directly from the local filesystem or download them automatically from Hugging Face.
Builder Pattern: Jlama exposes a fluent builder, generateBuilder(), for configuring generation parameters such as the session ID, the maximum number of tokens, and the temperature.
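
To make the builder idea concrete, here is a minimal, self-contained sketch of the pattern. GenerationSettings and its methods are hypothetical names invented for illustration; they are not part of Jlama's API, and the defaults and validation rules are assumptions rather than Jlama's behavior.

```java
import java.util.Objects;
import java.util.UUID;

// Hypothetical illustration of a fluent builder; NOT a Jlama class.
final class GenerationSettings {
    private final UUID session;
    private final int maxTokens;
    private final float temperature;

    private GenerationSettings(Builder b) {
        this.session = b.session;
        this.maxTokens = b.maxTokens;
        this.temperature = b.temperature;
    }

    UUID session() { return session; }
    int maxTokens() { return maxTokens; }
    float temperature() { return temperature; }

    static Builder builder() { return new Builder(); }

    static final class Builder {
        // Defaults are assumptions chosen for this sketch.
        private UUID session = UUID.randomUUID();
        private int maxTokens = 256;
        private float temperature = 0.0f;

        Builder session(UUID session) {
            this.session = Objects.requireNonNull(session, "session");
            return this; // returning this is what allows call chaining
        }

        Builder maxTokens(int maxTokens) {
            if (maxTokens <= 0) {
                throw new IllegalArgumentException("maxTokens must be positive");
            }
            this.maxTokens = maxTokens;
            return this;
        }

        Builder temperature(float temperature) {
            if (temperature < 0.0f) {
                throw new IllegalArgumentException("temperature must not be negative");
            }
            this.temperature = temperature;
            return this;
        }

        // Produces an immutable settings object from the collected values.
        GenerationSettings build() { return new GenerationSettings(this); }
    }
}
```

A caller would write GenerationSettings.builder().maxTokens(128).temperature(0.3f).build(). Each setter validates its own argument and returns the builder, so only the parameters that differ from the defaults need to be named.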

Working Example

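// Run with: --enable-preview --add-modules jdk.incubator.vector
// A wildcard import does not cover subpackages, so each Jlama package is imported explicitly.
import com.github.tjake.jlama.model.*;
import com.github.tjake.jlama.model.functions.Generator;
import com.github.tjake.jlama.safetensors.DType;
import com.github.tjake.jlama.safetensors.prompt.PromptContext;
import com.github.tjake.jlama.util.Downloader;
import java.io.File;
import java.io.IOException;
import java.util.UUID;

public class JlamaExample {
    public static void main(String[] args) throws IOException {
        // available models: https://huggingface.co/tjake
        AbstractModel model = loadModel("./models", "tjake/Llama-3.2-1B-Instruct-JQ4");
        PromptContext prompt = PromptContext.of("Why are llamas so cute?");
        // Configure the generation: session ID, prompt, token limit, temperature
        Generator.Response response = model.generateBuilder()
                .session(UUID.randomUUID())
                .promptContext(prompt)
                .ntokens(256)
                .temperature(0.3f)
                .generate();
        System.out.println(response.responseText);
    }

    static AbstractModel loadModel(String workingDir, String model) throws IOException {
        // Downloads the model from Hugging Face into workingDir
        File localModelPath = new Downloader(workingDir, model)
                .huggingFaceModel();
        // Load with F32 working memory and I8 working quantization
        return ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);
    }
}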

Practical Applications

Offline AI Assistance: Building local AI-powered assistants for scenarios with limited or no network connectivity.
Privacy-Focused Applications: Processing sensitive data locally without transmitting it to external servers.
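
For the offline scenario, an application may want to confirm that a model is already on disk before starting, so that no download is attempted. The following is a rough sketch of such a check. LocalModelCheck is a hypothetical helper, not part of Jlama, and it assumes that a usable model directory contains a config.json file and at least one .safetensors file, as is typical for safetensors models published on Hugging Face.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper for illustration; NOT part of Jlama.
final class LocalModelCheck {
    private LocalModelCheck() {}

    /**
     * Returns true if modelDir looks like a complete local model directory.
     * Assumption: "complete" means a config.json plus at least one
     * *.safetensors file; adjust this to the layout your models actually use.
     */
    static boolean looksLikeModelDir(Path modelDir) {
        if (!Files.isDirectory(modelDir)
                || !Files.isRegularFile(modelDir.resolve("config.json"))) {
            return false;
        }
        // Files.list must be closed, hence the try-with-resources block.
        try (Stream<Path> files = Files.list(modelDir)) {
            return files.anyMatch(
                    p -> p.getFileName().toString().endsWith(".safetensors"));
        } catch (IOException e) {
            // Treat an unreadable directory as "no usable model here".
            return false;
        }
    }
}
```

The caller passes the directory to inspect, so the sketch makes no assumption about where a downloader places its files. If the check fails while the machine is offline, the application can report a clear error instead of attempting a network call.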
