Skip to content

Local Models

Lattis serves local models through two engines: a bundled llama-server (llama.cpp) for GGUF models, and an optional mlx_lm.server for MLX models on Apple Silicon.

Lattis runs llama-server in router mode as a child process. The router keeps several models resident at once and swaps the least-recently-used out when you exceed the resident limit, so switching models is fast.

In the app’s Library, pick a model and download it. Weights are pulled directly from Hugging Face — no API token required for the curated catalogue. Each model lives in its own folder under the data directory (see Configuration & Storage).

You can also add a custom model by providing one or more direct GGUF URLs.

The router loads a model on first use and keeps it resident. The maximum number of resident models is configurable; beyond it, the least-recently-used model is evicted. The set that was resident at shutdown is restored on the next start, so the system comes back into the same state.

If Apple’s mlx_lm is installed, Lattis serves MLX models through an mlx_lm.server child process, Metal-accelerated. MLX is detected at startup — when it isn’t available, MLX models are simply hidden.

Today a single MLX model is served at a time; loading another swaps it.

See Installation for setup and how Lattis locates a suitable Python interpreter.

You can drive the router from the Control API (the GUI uses the same endpoints):

Terminal window
# Load a model into the router
curl -X POST http://127.0.0.1:1234/control/load \
-H 'Content-Type: application/json' -d '{"id":"qwen3-4b-instruct-2507"}'
# Unload it
curl -X POST http://127.0.0.1:1234/control/unload \
-H 'Content-Type: application/json' -d '{"id":"qwen3-4b-instruct-2507"}'

Local and connected-cloud models appear together in GET /v1/models. Use a model by passing its id as the model field on any request — see the Public API.