server: add presets (config) when using multiple models #17859

Conversation
ServeurpersoCom commented Dec 8, 2025 • edited
ngxson left a comment
this looks interesting, I will clean this up a bit and push a commit
tools/server/server-config.cpp Outdated
It would be nice if we could move part of this file into common/preset.cpp|h so it can be reused by other tools.
tools/server/server-models.cpp Outdated
```cpp
if (value == "false") {
    continue;
}
if (value == "true" || value.empty()) {
    child_env.push_back(key + "=");
```
I think leaving the original value for bool should be good? We can already handle these values using is_falsey / is_truthy in arg.cpp
Good point! I'll simplify the bool handling to pass through the original values (=true/=false) and let is_truthy/is_falsey handle the conversion
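A minimal sketch of that pass-through, with hypothetical stand-ins for the is_truthy / is_falsey helpers in common/arg.cpp (the exact accepted spellings are defined there):

```cpp
#include <string>
#include <vector>

// hypothetical stand-ins for the helpers in common/arg.cpp;
// the real ones may accept more spellings ("on", "off", ...)
static bool is_truthy(const std::string & v) { return v == "true"  || v == "1"; }
static bool is_falsey(const std::string & v) { return v == "false" || v == "0"; }

// forward the preset value verbatim; the child's arg parser decides
// later whether the flag is enabled via is_truthy / is_falsey
static void append_env(std::vector<std::string> & child_env,
                       const std::string & key, const std::string & value) {
    child_env.push_back(key + "=" + value); // e.g. "LLAMA_ARG_MLOCK=true"
}
```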
aldehir commented Dec 8, 2025 • edited
Here is a line-oriented approach for the parser:

```cpp
static const auto ini_parser = build_peg_parser([](auto & p) {
    // newline ::= "\r\n" / "\n" / "\r"
    auto newline = p.rule("newline", p.literal("\r\n") | p.literal("\n") | p.literal("\r"));

    // ws ::= [ \t]*
    auto ws = p.rule("ws", p.chars("[ \t]", 0, -1));

    // comment ::= [;#] (!newline .)*
    auto comment = p.rule("comment", p.chars("[;#]", 1, 1) + p.zero_or_more(p.negate(newline) + p.any()));

    // eol ::= ws comment? (newline / EOF)
    auto eol = p.rule("eol", ws + p.optional(comment) + (newline | p.end()));

    // ident ::= [a-zA-Z_] [a-zA-Z0-9_.-]*
    auto ident = p.rule("ident", p.chars("[a-zA-Z_]", 1, 1) + p.chars("[a-zA-Z0-9_.-]", 0, -1));

    // value ::= (!eol-start .)*
    auto eol_start = p.rule("eol-start", ws + (p.chars("[;#]", 1, 1) | newline | p.end()));
    auto value = p.rule("value", p.zero_or_more(p.negate(eol_start) + p.any()));

    // header-line ::= "[" ws ident ws "]" eol
    auto header_line = p.rule("header-line", "[" + ws + p.tag("section-name", p.chars("[^]]")) + ws + "]" + eol);

    // kv-line ::= ident ws "=" ws value eol
    auto kv_line = p.rule("kv-line", p.tag("key", ident) + ws + "=" + ws + p.tag("value", value) + eol);

    // comment-line ::= ws comment (newline / EOF)
    auto comment_line = p.rule("comment-line", ws + comment + (newline | p.end()));

    // blank-line ::= ws (newline / EOF)
    auto blank_line = p.rule("blank-line", ws + (newline | p.end()));

    // line ::= header-line / kv-line / comment-line / blank-line
    auto line = p.rule("line", header_line | kv_line | comment_line | blank_line);

    // ini ::= line* EOF
    auto ini = p.rule("ini", p.zero_or_more(line) + p.end());

    return ini;
});
```

I assume the changes were because of the weirdness in consuming spaces/comments. This should alleviate those concerns. And the visitor can really be something as simple as this:

```cpp
std::map<std::string, std::map<std::string, std::string>> cfg;
std::string current_section = "default";
std::string current_key;

ctx.ast.visit(result, [&](const auto & node) {
    if (node.tag == "section-name") {
        current_section = std::string(node.text);
        cfg[current_section] = {};
    } else if (node.tag == "key") {
        current_key = std::string(node.text);
    } else if (node.tag == "value" && !current_key.empty()) {
        cfg[current_section][current_key] = std::string(node.text);
        current_key.clear();
    }
});
```
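For reference, a tiny input this grammar accepts — comment lines, end-of-line comments, blank lines, and a final line without a trailing newline are all covered by the rules above (names illustrative):

```ini
# comment-line
[some-model]          ; header-line with trailing comment
ctx-size = 4096       # kv-line with trailing comment

temp = 0.8
```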
aldehir left a comment • edited
Looks good as far as parsing is concerned!
I will need to add an expect() helper to provide helpful error messages to users when they make a mistake. I can do that separately in another PR.
ServeurpersoCom commented Dec 8, 2025
Now it's in a basic working state with the new line-based PEG parser. I'm testing with my entire per-model configuration on the server to cover some edge cases, and then there's the @ngxson refactoring to do.
ServeurpersoCom commented Dec 8, 2025 • edited
Missing sampling parameters need .set_env() in common/arg.cpp (--temp, --top-p, --top-k, --min-p have no LLAMA_ARG_ env vars yet). Successfully migrated my llama-swap config (YAML) to config.ini via an LLM: llama-server preserved all custom parameters (ctx-size, n-cpu-moe, mmproj, -m ....Q6_K), applied global CLI defaults (-ngl 999, -fa, --mlock, -ctk/-ctv, etc.) to all models, and automatically reorganized sections/keys alphabetically to maintain a normalized format.
ngxson commented Dec 8, 2025
Hmm, yeah, I didn't notice that some env vars are missing. I think it will be cleaner if we default to using the longest arg form. Internally, the parser can accept all 3 forms: env var, short arg, and long arg; there is no chance that they will collide anyway. I'll push the change for this.
ServeurpersoCom commented Dec 8, 2025
Yes, it just needs the missing .set_env("LLAMA_ARG_TEMP") calls, etc. I'll wait for your change while I run some tests.
tools/server/README.md Outdated
```
llama-server --models-dir ./models_directory
```
The directory is scanned recursively, so nested vendor/model layouts such as `vendor_name/model_name/*.gguf` are supported. The model name in the router UI matches the relative path inside `--models-dir` (for example, `vendor_name/model_name`).
For visibility, I will remove recursive support from this PR because it's not related to config support - it should be added later via a dedicated PR
ServeurpersoCom Dec 8, 2025 • edited
Yes, no worries (I have to keep it on my side, otherwise it breaks my integration server). I'll adapt the configuration on my side to test this feature-atomic PR if necessary.
ngxson commented Dec 8, 2025 • edited
I moved most of the code inside common/preset.cpp. We're now using the term "preset", so I think it's easier to make the file name presets.ini. Since I'm now using the same […], we don't yet support repeated args or args with 2 values (like the API endpoint […]).

Things that still need improvement: […]
Alternatively, you can also add a GGUF-based preset (see next section)

### Model presets
@ServeurpersoCom I updated the docs with an example - lmk if this works in your case
tools/server/server-models.cpp Outdated
```cpp
    first_shard_file = file;
} else {
    model_file = file;
// ...
std::function<void(const std::string &, const std::string &)> scan_subdir =
```
Please remove the recursive implementation. It's unrelated to the current PR, and it's also unsafe as it doesn't handle circular symlinks or circular mount points.
ServeurpersoCom Dec 8, 2025 • edited
You can pick up the rest (except the recursion) and push --force; I won't touch the branch before tomorrow (rebase/test).
A two-level browsing system would be perfect for all cases (separate PR).
emjomi commented Dec 8, 2025
Hey guys! Sorry to interrupt, but are the LLAMA_ARG_ prefixes required for the keys? One more thing: maybe it's better to put the config in ~/.config/llama.cpp/? Thank you so much for what you're doing!
ServeurpersoCom commented Dec 8, 2025
No worries! With the last refactor the LLAMA_ARG_ prefixes are optional: you can use the short argument forms (e.g., ngl, c) or the long forms with dashes (e.g., n-gpu-layers, ctx-size) instead. All three formats are supported. Regarding config location: the preset file path is fully customizable via --models-preset, so you can place it wherever you prefer, including ~/.config/llama.cpp/presets.ini if that fits your workflow better. This is a WIP; I'll update the first message soon.
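For instance, all three of these key spellings resolve to the same --n-gpu-layers argument (model name illustrative; only one spelling should be used at a time, so the alternatives are shown commented out):

```ini
[some-model]
LLAMA_ARG_N_GPU_LAYERS = 999   ; env-var form
; n-gpu-layers = 999           ; long form
; ngl = 999                    ; short form
```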
ServeurpersoCom commented Dec 10, 2025
I'll update the PR documentation with the new implementation today: no more INI auto-generation, deep GGUF tree support without scanning, all 3 variable formats supported, standard Linux binary/INI relative paths, and --models-dir and --models-preset are now independent |
ServeurpersoCom commented Dec 10, 2025 • edited
llama-server: recursive GGUF loading

Replace flat directory scan with recursive traversal using std::filesystem::recursive_directory_iterator. Support for nested vendor/model layouts (e.g. vendor/model/*.gguf). Model name now reflects the relative path within --models-dir instead of just the filename. Aggregate files by parent directory via std::map before constructing local_model.
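A condensed sketch of that traversal (standard library only; note it does not guard against symlink cycles, which was flagged in review):

```cpp
#include <filesystem>
#include <map>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// group .gguf files by parent directory, keyed by the path relative
// to --models-dir, before the model list is built from each group
static std::map<std::string, std::vector<fs::path>>
scan_models_dir(const fs::path & models_dir) {
    std::map<std::string, std::vector<fs::path>> by_dir;
    for (const auto & entry : fs::recursive_directory_iterator(models_dir)) {
        if (!entry.is_regular_file() || entry.path().extension() != ".gguf") {
            continue;
        }
        // model name reflects the relative path, e.g. "vendor/model"
        const auto rel = fs::relative(entry.path().parent_path(), models_dir);
        by_dir[rel.generic_string()].push_back(entry.path());
    }
    return by_dir;
}
```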
server: address review feedback from @aldehir and @ngxson

PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check

Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic

Co-authored-by: aldehir <[email protected]>
Co-authored-by: ngxson <[email protected]>
This reverts commit e3832b4.
Co-authored-by: aldehir <[email protected]>
ServeurpersoCom force-pushed from bf2d94c to b36b3fe

ServeurpersoCom commented Dec 10, 2025 • edited
Dead code removed (~40 lines), rebased on latest master.
ngxson commented Dec 10, 2025
Nice, thanks for testing. @ServeurpersoCom unless you have anything else to add, I guess this is good to merge?
ServeurpersoCom commented Dec 10, 2025 • edited
There's probably still a minor bug: if I do --models-dir /root/.cache/llama.cpp I get […], but I can't get the model in .cache working from the .ini. Not all models have a manifest: […] WDYT? Merge and solve later?

OK, I found the last "problem". Standard -hf works! Don't use the --fim-qwen-1.5b-default / --fim-qwen-3b-default *-default CLI shortcuts to download them! But with ./llama-server -hf unsloth/Qwen3-0.6B-GGUF:Q6_K and my .ini, it works. OK for me!

The only strange thing is: […]
ServeurpersoCom commented Dec 10, 2025
For another issue: […]
ngxson commented Dec 10, 2025 • edited
These flags don't trigger manifest downloading, so it's easy to understand why the models don't get listed. But I think the current preset feature is going to replace these defaults very soon. In other words, the current PR was added to address this exact problem.

Edit: maybe not resolve it right away, but the idea is to write better presets in the future.
already approved changes related to PEG
Merged f32ca51 into ggml-org:master
* llama-server: recursive GGUF loading

  Replace flat directory scan with recursive traversal using std::filesystem::recursive_directory_iterator. Support for nested vendor/model layouts (e.g. vendor/model/*.gguf). Model name now reflects the relative path within --models-dir instead of just the filename. Aggregate files by parent directory via std::map before constructing local_model.

* server : router config POC (INI-based per-model settings)

* server: address review feedback from @aldehir and @ngxson

  PEG parser usage improvements:
  - Simplify parser instantiation (remove arena indirection)
  - Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
  - Fix last line without newline bug (+ operator instead of <<)
  - Remove redundant end position check

  Feature scope:
  - Remove auto-reload feature (will be separate PR per @ngxson)
  - Keep config.ini auto-creation and template generation
  - Preserve per-model customization logic

  Co-authored-by: aldehir <[email protected]>
  Co-authored-by: ngxson <[email protected]>

* server: adopt aldehir's line-oriented PEG parser

  Complete rewrite of INI parser grammar and visitor:
  - Use p.chars(), p.negate(), p.any() instead of p.until()
  - Support end-of-line comments (key=value # comment)
  - Handle EOF without trailing newline correctly
  - Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
  - Simplified visitor (no pending state, no trim needed)
  - Grammar handles whitespace natively via eol rule

  Business validation preserved:
  - Reject section names starting with LLAMA_ARG_*
  - Accept only keys starting with LLAMA_ARG_*
  - Require explicit section before key-value pairs

  Co-authored-by: aldehir <[email protected]>

* server: fix CLI/env duplication in child processes

  Children now receive minimal CLI args (executable, model, port, alias) instead of inheriting all router args. Global settings pass through LLAMA_ARG_* environment variables only, eliminating duplicate config warnings.

  Fixes: router args like -ngl, -fa were passed both via CLI and env, causing 'will be overwritten' warnings on every child spawn.

* add common/preset.cpp
* fix compile
* cont
* allow custom-path models
* add falsey check

* server: fix router model discovery and child process spawning

  - Sanitize model names: replace / and \ with _ for display
  - Recursive directory scan with relative path storage
  - Convert relative paths to absolute when spawning children
  - Filter router control args from child processes
  - Refresh args after port assignment for correct port value
  - Fallback preset lookup for compatibility
  - Fix missing argv[0]: store server binary path before base_args parsing

* Revert "server: fix router model discovery and child process spawning"

  This reverts commit e3832b4.

* clarify about "no-" prefix
* correct render_args() to include binary path
* also remove arg LLAMA_ARG_MODELS_PRESET for child
* add co-author for ini parser code

  Co-authored-by: aldehir <[email protected]>

* also set LLAMA_ARG_HOST
* add CHILD_ADDR
* Remove dead code

---------

Co-authored-by: aldehir <[email protected]>
Co-authored-by: ngxson <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: aldehir <[email protected]>
mdierolf commented Dec 12, 2025
Excellent work, but I have a bit of feedback (currently using the ggml-org:master branch):

- When specifying a preset file, the server still loads all the models in the models directory, and there appears to be no way to disable this behavior. If there is a way to disable it, it shouldn't be the default.
- When the models auto-load, it becomes impossible to define aliases for models, since the model name in the ini section header has to match the exact name in the GGUF to override the settings, and the -a param is not available (which makes sense).
- The proper behavior should be that either or both of the models-dir or the preset option is specified.
- Also, these flags should probably be named in a manner more consistent with their usage for routing, as models-preset does not convey that behavior. Perhaps --model-routes-file routes.ini, --auto-route-models [dir] (optional, accepts multiple values, should default to the llama cache dir).
ServeurpersoCom commented Dec 12, 2025
@mdierolf Thanks for the feedback. Just to clarify: when you say "the models directory", are you talking about the automatic scanning of ~/.cache/llama.cpp/ (the default cache), or a directory you specified with --models-dir? --models-dir is optional and only scans if you provide it, but the cache directory is always scanned for HuggingFace models.

If I understand correctly, you'd like the option to use only the models defined in your preset file, without any automatic scanning of the cache? That would mean adding a way to disable cache scanning when using --models-preset?
mdierolf commented Dec 12, 2025
Correct. It is unexpected behavior that passing in a preset file automatically triggers the scanning of my llama cache directory. I have tons of models stored there, and most of them I do not want to serve.

But after thinking about this further, I think the thing that seems off to me is that this functionality has little overlap with what llama-server does. It should be in a llama-router executable. llama-server already does way too much in one executable.
ServeurpersoCom commented Dec 12, 2025
That's what I think too; it makes coding more difficult, but I trust ngxson :) It also makes the binary very versatile and performant on small platforms.
ngxson commented Dec 12, 2025 • edited
You can point the LLAMA_CACHE env var to a different directory […]
We merge everything into llama-server […]
Just speaking the truth here: users use it, like it, and ask us to add more functions; we add them, and now users complain there are too many features. (I can link all the issues asking for the multiple-model serving feature if you want.) A classic problem with open-source projects, I think.
mdierolf commented Dec 14, 2025
Well, that's one way to do it. Godspeed and good luck
strawberrymelonpanda commented Dec 14, 2025
To be fair, "Users" are a very diverse group of people with differing needs, so there's no real way you'll get universal consensus.
mdierolf commented Dec 15, 2025
This. `[*]` `[modeliwant]`
Router per-model config
This PR implements INI-based per-model configuration for llama-server router mode, as discussed in #17850.
Summary
Advanced users can define custom configurations using an .ini file. Each model can have its own preset with custom parameters while inheriting router defaults for unspecified options.
Motivation
Multi-model inference servers for small/medium teams need declarative configuration with zero operational friction. Operators should be able to set global defaults via router CLI and override specific parameters per model in a simple text file.
Key Features
- Per-model presets in a plain INI file; unspecified options inherit the router defaults
- Three equivalent key spellings: LLAMA_ARG_* env-var names, long argument names, and short argument names
- Model discovery from the HF cache, --models-dir, and preset-defined custom paths
- Router-managed control args (model path, port, alias, host)
Usage
The router can combine multiple sources:
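For example (paths illustrative; flags taken from the examples elsewhere in this thread):

```sh
llama-server \
  --models-dir ./models \
  --models-preset ./presets.ini \
  -ngl 999 -fa --mlock -ctk q8_0 -ctv q8_0
```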
INI Format
Section names define model identifiers. Keys correspond to CLI arguments without leading dashes.
Supported key formats:
- LLAMA_ARG_* environment-variable names (e.g. LLAMA_ARG_CTX_SIZE)
- long argument names without leading dashes (e.g. ctx-size, n-gpu-layers)
- short argument names (e.g. c, ngl)
All three formats are equivalent and can be mixed in the same file.
Example presets.ini:
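A sketch with illustrative section names, paths, and values:

```ini
; each section overrides the router defaults for one model
[Qwen3-0.6B-Q6_K]
ctx-size = 16384
temp = 0.7

; a preset can also define a new model with a custom path
[my-private-model]
m = /models/private/model.Q6_K.gguf
LLAMA_ARG_N_GPU_LAYERS = 999
```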
How Models Are Loaded
The router discovers models from three sources:
1. The HuggingFace cache (~/.cache/llama.cpp), which is always scanned
2. A local directory passed via --models-dir (optional)
3. Models defined directly in the preset file passed via --models-preset
Model names from presets can match cached or local models to apply custom configurations, or define entirely new models with custom paths.
Argument Inheritance
When spawning a child process for a model, arguments from the router CLI, the model's preset section, and the router's own control handling are merged.
Priority order (highest to lowest):
1. Control args set by the router itself (passed on the child's command line)
2. Per-model preset values
3. Router global defaults (forwarded as LLAMA_ARG_* environment variables)
Control args automatically managed by router: model path, port, alias, host, and the --models-preset flag itself.
If a preset contains control args, they are removed with a warning.
Changes
New files:
- common/preset.cpp|h — INI preset parsing, moved out of the server so other tools can reuse it

Modified files:
- tools/server/server-models.cpp — model discovery and child process spawning
- tools/server/README.md — preset documentation and examples
- common/arg.cpp — argument/env-var mapping used by preset keys
Technical Details
INI parsing:
- Line-oriented PEG grammar (contributed by aldehir) with ;/# comments, end-of-line comments, and strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Handles a final line without a trailing newline; whitespace is consumed by the grammar's eol rule

Argument mapping:
- Keys accept LLAMA_ARG_* env-var names, long argument names, or short argument names; all three resolve through the existing parser in common/arg.cpp
- Boolean values are passed through verbatim and interpreted by is_truthy / is_falsey

Child process spawning:
- Children receive minimal CLI args (executable, model, port, alias); everything else is forwarded as LLAMA_ARG_* environment variables, avoiding duplicate-argument warnings
- Router control args (including --models-preset) are stripped from children; LLAMA_ARG_HOST is set per child
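A sketch of the merge step under those assumptions (names hypothetical; per-model preset values take precedence over the router globals, as described above):

```cpp
#include <map>
#include <string>
#include <vector>

// merge router globals with one model's preset values; the preset wins
// on conflicts, and the result is flattened into "KEY=value" strings
// suitable for the child's environment
static std::vector<std::string> build_child_env(
        const std::map<std::string, std::string> & globals,
        const std::map<std::string, std::string> & preset) {
    std::map<std::string, std::string> merged = globals;
    for (const auto & [key, value] : preset) {
        merged[key] = value; // per-model override
    }
    std::vector<std::string> env;
    for (const auto & [key, value] : merged) {
        env.push_back(key + "=" + value);
    }
    return env;
}
```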
Use Case Example
A development team runs an inference server with multiple models:
The presets.ini file defines per-model overrides:
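For instance, a sketch overriding just two of those globals for one model (names illustrative):

```ini
[big-model]
ctx-size = 32768   ; larger context for this model only
ctk = f16          ; override the global q8_0 K-cache type
```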
Global defaults (-ngl 999 -fa -ctk q8_0 etc.) apply to all models, but each preset can override specific parameters. The router automatically manages ports, aliases, and model paths.
Testing Status
Tested configurations:
- Full production config migrated from llama-swap (YAML → INI), preserving ctx-size, n-cpu-moe, mmproj, and custom model paths
- Global CLI defaults (-ngl 999, -fa, --mlock, -ctk/-ctv) applied across all models with per-model overrides
- HF-cached models downloaded via -hf (e.g. unsloth/Qwen3-0.6B-GGUF:Q6_K) combined with a preset file
Notes
Related Issues
Closes #17850
Related to #17470, #10932
Thanks to
Co-authored-by: aldehir (INI parser PEG grammar)
Co-authored-by: ngxson (llama-server integrated router/model children, preset refactoring, API design, argument system integration, ...)