Local Models

Lattis serves local models through two engines: a bundled llama-server (llama.cpp) for GGUF models, and an optional mlx_lm.server for MLX models on Apple Silicon.

GGUF models (llama-server router)

Lattis runs llama-server in router mode as a child process. The router keeps several models resident at once and evicts the least-recently-used one when the resident limit is exceeded, so switching between resident models is fast.

Download from the Library

In the app’s Library, pick a model and download it. Weights are pulled directly from Hugging Face — no API token required for the curated catalogue. Each model lives in its own folder under the data directory (see Configuration & Storage).

You can also add a custom model by providing one or more direct GGUF URLs.

Resident set & swapping

The router loads a model on first use and keeps it resident. The maximum number of resident models is configurable; once that limit is exceeded, the least-recently-used model is evicted. The set that was resident at shutdown is restored on the next start, so Lattis comes back up in the same state.
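
The eviction rule described above can be pictured with a toy sketch in POSIX sh. This is not Lattis's router code: the names MAX_RESIDENT, resident and touch_model are hypothetical, the limit of 2 is arbitrary, and model ids are assumed to contain no spaces. It only illustrates "load on first use, evict the least-recently-used when over the limit".

```shell
# Toy LRU resident set (illustrative only, not the actual router logic).
MAX_RESIDENT=2   # hypothetical limit; the real limit is configurable in Lattis
resident=""      # space-separated model ids, most recently used last

# touch_model ID: record a use of ID, evicting the oldest entries if over the limit.
# Note: id, kept and m are plain globals, since POSIX sh has no "local".
touch_model() {
  id=$1
  kept=""
  # Drop any existing entry for ID so it can be re-appended as most recent.
  for m in $resident; do
    if [ "$m" != "$id" ]; then
      kept="$kept $m"
    fi
  done
  kept="$kept $id"
  # Reuse the function's positional parameters as a queue: $1 is the oldest id.
  set -- $kept
  while [ "$#" -gt "$MAX_RESIDENT" ]; do
    echo "evict $1"
    shift
  done
  resident="$*"
}
```

Calling touch_model for a, b, a and then c leaves a and c resident: b is evicted because a was used more recently.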

MLX models (Apple Silicon)

If Apple’s mlx_lm is installed, Lattis serves MLX models through a Metal-accelerated mlx_lm.server child process. MLX is detected at startup; when it isn’t available, MLX models are hidden.

Currently, only one MLX model is served at a time; loading another replaces it.

See Installation for setup and how Lattis locates a suitable Python interpreter.

Serving controls

You can drive the router from the Control API (the GUI uses the same endpoints):

# Load a model into the router
curl -X POST http://127.0.0.1:1234/control/load \
  -H 'Content-Type: application/json' -d '{"id":"qwen3-4b-instruct-2507"}'

# Unload it
curl -X POST http://127.0.0.1:1234/control/unload \
  -H 'Content-Type: application/json' -d '{"id":"qwen3-4b-instruct-2507"}'

Local and connected-cloud models appear together in GET /v1/models. Use a model by passing its id as the model field on any request — see the Public API.
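
As a rough sketch of that flow, the two helpers below list the available models and then send a request that names one of them. The base address matches the examples on this page. The /v1/chat/completions path and its request body are assumptions (an OpenAI-style endpoint) that this page does not confirm, so check the Public API before relying on them. The function names and the LATTIS_BASE variable are hypothetical.

```shell
# Base address, as used in the curl examples on this page.
LATTIS_BASE="${LATTIS_BASE:-http://127.0.0.1:1234}"

# List every model the server reports (local and connected-cloud together).
list_models() {
  curl -s "$LATTIS_BASE/v1/models"
}

# ask_model ID PROMPT: send one prompt, selecting the model by its id.
# Assumption: an OpenAI-style chat endpoint; the real path may differ.
# The prompt is pasted into the JSON unescaped, so keep it free of quotes.
ask_model() {
  curl -s "$LATTIS_BASE/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"model\":\"$1\",\"messages\":[{\"role\":\"user\",\"content\":\"$2\"}]}"
}
```

For example, run list_models to find an id, then ask_model qwen3-4b-instruct-2507 "Hello" to address that model.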