server: add presets (config) when using multiple models #17859

Conversation
ServeurpersoCom commented Dec 8, 2025 • edited
ngxson left a comment
this looks interesting, I will clean this up a bit and push a commit
tools/server/server-config.cpp Outdated
It would be nice if we could move part of this file into common/preset.cpp|h so it can be reused by other tools.
tools/server/server-models.cpp Outdated
```cpp
if (value == "false") {
    continue;
}
if (value == "true" || value.empty()) {
    child_env.push_back(key + "=");
```
I think leaving the original value for bool should be good? We can already handle these values using is_falsey / is_truthy in arg.cpp
Good point! I'll simplify the bool handling to pass through the original values (=true/=false) and let is_truthy/is_falsey handle the conversion
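A minimal sketch of that pass-through, with hypothetical stand-ins for the is_truthy / is_falsey helpers in common/arg.cpp (the exact accepted spellings are defined there):

```cpp
#include <string>
#include <vector>

// hypothetical stand-ins for the helpers in common/arg.cpp;
// the real ones may accept more spellings ("on", "off", ...)
static bool is_truthy(const std::string & v) { return v == "true"  || v == "1"; }
static bool is_falsey(const std::string & v) { return v == "false" || v == "0"; }

// forward the preset value verbatim; the child's arg parser decides
// later whether the flag is enabled via is_truthy / is_falsey
static void append_env(std::vector<std::string> & child_env,
                       const std::string & key, const std::string & value) {
    child_env.push_back(key + "=" + value); // e.g. "LLAMA_ARG_MLOCK=true"
}
```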
aldehir commented Dec 8, 2025 • edited
Here is a line-oriented approach for the parser:

```cpp
static const auto ini_parser = build_peg_parser([](auto & p) {
    // newline ::= "\r\n" / "\n" / "\r"
    auto newline = p.rule("newline", p.literal("\r\n") | p.literal("\n") | p.literal("\r"));

    // ws ::= [ \t]*
    auto ws = p.rule("ws", p.chars("[ \t]", 0, -1));

    // comment ::= [;#] (!newline .)*
    auto comment = p.rule("comment", p.chars("[;#]", 1, 1) + p.zero_or_more(p.negate(newline) + p.any()));

    // eol ::= ws comment? (newline / EOF)
    auto eol = p.rule("eol", ws + p.optional(comment) + (newline | p.end()));

    // ident ::= [a-zA-Z_] [a-zA-Z0-9_.-]*
    auto ident = p.rule("ident", p.chars("[a-zA-Z_]", 1, 1) + p.chars("[a-zA-Z0-9_.-]", 0, -1));

    // value ::= (!eol-start .)*
    auto eol_start = p.rule("eol-start", ws + (p.chars("[;#]", 1, 1) | newline | p.end()));
    auto value = p.rule("value", p.zero_or_more(p.negate(eol_start) + p.any()));

    // header-line ::= "[" ws ident ws "]" eol
    auto header_line = p.rule("header-line", "[" + ws + p.tag("section-name", p.chars("[^]]")) + ws + "]" + eol);

    // kv-line ::= ident ws "=" ws value eol
    auto kv_line = p.rule("kv-line", p.tag("key", ident) + ws + "=" + ws + p.tag("value", value) + eol);

    // comment-line ::= ws comment (newline / EOF)
    auto comment_line = p.rule("comment-line", ws + comment + (newline | p.end()));

    // blank-line ::= ws (newline / EOF)
    auto blank_line = p.rule("blank-line", ws + (newline | p.end()));

    // line ::= header-line / kv-line / comment-line / blank-line
    auto line = p.rule("line", header_line | kv_line | comment_line | blank_line);

    // ini ::= line* EOF
    auto ini = p.rule("ini", p.zero_or_more(line) + p.end());

    return ini;
});
```

I assume the changes were because of the weirdness in consuming spaces/comments. This should alleviate those concerns. And the visitor can really be something as simple as this:

```cpp
std::map<std::string, std::map<std::string, std::string>> cfg;
std::string current_section = "default";
std::string current_key;

ctx.ast.visit(result, [&](const auto & node) {
    if (node.tag == "section-name") {
        current_section = std::string(node.text);
        cfg[current_section] = {};
    } else if (node.tag == "key") {
        current_key = std::string(node.text);
    } else if (node.tag == "value" && !current_key.empty()) {
        cfg[current_section][current_key] = std::string(node.text);
        current_key.clear();
    }
});
```
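For reference, a tiny input this grammar accepts — comment lines, end-of-line comments, blank lines, and a final line without a trailing newline are all covered by the rules above (names illustrative):

```ini
# comment-line
[some-model]          ; header-line with trailing comment
ctx-size = 4096       # kv-line with trailing comment

temp = 0.8
```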
aldehir left a comment • edited
Looks good as far as parsing is concerned!
I will need to add an expect() helper to provide helpful error messages to users when they make a mistake. I can do that separately in another PR.
ServeurpersoCom commented Dec 8, 2025
Now it's in a basic working state with the new line-based PEG parser. I'm testing with my entire per-model configuration on the server to cover some edge cases, and then there's the @ngxson refactoring to do.
ServeurpersoCom commented Dec 8, 2025 • edited
Missing sampling parameters need .set_env() in common/arg.cpp (--temp, --top-p, --top-k, --min-p have no LLAMA_ARG_ env vars yet). Successfully migrated my llama-swap config (YAML) to config.ini via an LLM: llama-server preserved all custom parameters (ctx-size, n-cpu-moe, mmproj, -m ....Q6_K), applied global CLI defaults (-ngl 999, -fa, --mlock, -ctk/-ctv, etc.) to all models, and automatically reorganized sections/keys alphabetically to maintain a normalized format.
ngxson commented Dec 8, 2025
Hmm, yeah, I didn't notice that some env vars are missing. I think it will be cleaner if we default to using the longest arg form. Internally, the parser can accept all 3 forms: env var, short arg, and long arg; there is no chance that they will collide anyway. I'll push the change for this.
ServeurpersoCom commented Dec 8, 2025
Yes, it just needs the missing .set_env("LLAMA_ARG_TEMP") calls, etc. I'll wait for your change while I run some tests.
tools/server/README.md Outdated
```
llama-server --models-dir ./models_directory
```
The directory is scanned recursively, so nested vendor/model layouts such as `vendor_name/model_name/*.gguf` are supported. The model name in the router UI matches the relative path inside `--models-dir` (for example, `vendor_name/model_name`).
For visibility, I will remove recursive support from this PR because it's not related to config support - it should be added later via a dedicated PR
ServeurpersoCom Dec 8, 2025 • edited
Yes, no worries (I have to keep it on my side, otherwise it breaks my integration server). I'll adapt the configuration on my side to test this feature-atomic PR if necessary.
ngxson commented Dec 8, 2025 • edited
I moved most of the code inside common/preset.cpp. We're now using the term "preset", so I think it's easier to make the file name presets.ini. Since I'm now using the same […], we don't yet support repeated args or args with 2 values (like the API endpoint […]).

Things that still need improvement: […]
Alternatively, you can also add a GGUF-based preset (see next section)

### Model presets
@ServeurpersoCom I updated the docs with an example - lmk if this works in your case
tools/server/server-models.cpp Outdated
```cpp
    first_shard_file = file;
} else {
    model_file = file;
// ...
std::function<void(const std::string &, const std::string &)> scan_subdir =
```
Please remove the recursive implementation. It's unrelated to the current PR, and it's also unsafe as it doesn't handle circular symlinks or circular mount points.
ServeurpersoCom Dec 8, 2025 • edited
You can pick up the rest (except the recursion) and push --force; I won't touch the branch before tomorrow (rebase/test).
A two-level browsing system would be perfect for all cases (separate PR).
emjomi commented Dec 8, 2025
Hey guys! Sorry to interrupt, but are the LLAMA_ARG_ prefixes required for the keys? One more thing: maybe it's better to put the config in ~/.config/llama.cpp/? Thank you so much for what you're doing!
ServeurpersoCom commented Dec 8, 2025
No worries! With the last refactor the LLAMA_ARG_ prefixes are optional: you can use the short argument forms (e.g., ngl, c) or the long forms with dashes (e.g., n-gpu-layers, ctx-size) instead. All three formats are supported. Regarding config location: the preset file path is fully customizable via --models-preset, so you can place it wherever you prefer, including ~/.config/llama.cpp/presets.ini if that fits your workflow better. This is a WIP; I'll update the first message soon.
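For instance, all three of these key spellings resolve to the same --n-gpu-layers argument (model name illustrative; only one spelling should be used at a time, so the alternatives are shown commented out):

```ini
[some-model]
LLAMA_ARG_N_GPU_LAYERS = 999   ; env-var form
; n-gpu-layers = 999           ; long form
; ngl = 999                    ; short form
```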
ServeurpersoCom commented Dec 10, 2025
I'll update the PR documentation with the new implementation today: no more INI auto-generation, deep GGUF tree support without scanning, all 3 variable formats supported, standard Linux binary/INI relative paths, and --models-dir and --models-preset are now independent |
ServeurpersoCom commented Dec 10, 2025 • edited
llama-server: recursive GGUF loading

Replace flat directory scan with recursive traversal using std::filesystem::recursive_directory_iterator. Support for nested vendor/model layouts (e.g. vendor/model/*.gguf). Model name now reflects the relative path within --models-dir instead of just the filename. Aggregate files by parent directory via std::map before constructing local_model.
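A condensed sketch of that traversal (standard library only; note it does not guard against symlink cycles, which was flagged in review):

```cpp
#include <filesystem>
#include <map>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// group .gguf files by parent directory, keyed by the path relative
// to --models-dir, before the model list is built from each group
static std::map<std::string, std::vector<fs::path>>
scan_models_dir(const fs::path & models_dir) {
    std::map<std::string, std::vector<fs::path>> by_dir;
    for (const auto & entry : fs::recursive_directory_iterator(models_dir)) {
        if (!entry.is_regular_file() || entry.path().extension() != ".gguf") {
            continue;
        }
        // model name reflects the relative path, e.g. "vendor/model"
        const auto rel = fs::relative(entry.path().parent_path(), models_dir);
        by_dir[rel.generic_string()].push_back(entry.path());
    }
    return by_dir;
}
```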
server: address review feedback from @aldehir and @ngxson

PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check

Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic

Co-authored-by: aldehir <[email protected]>
Co-authored-by: ngxson <[email protected]>
This reverts commit e3832b4.
Co-authored-by: aldehir <[email protected]>
ServeurpersoCom force-pushed from bf2d94c to b36b3fe

ServeurpersoCom commented Dec 10, 2025 • edited
Dead code removed (~40 lines), rebased on latest master.
ngxson commented Dec 10, 2025
Nice, thanks for testing. @ServeurpersoCom unless you have anything else to add, I guess this is good to merge?
ServeurpersoCom commented Dec 10, 2025 • edited
There's probably still a minor bug: if I do --models-dir /root/.cache/llama.cpp I get […], but I can't get the model in .cache working from the .ini. Not all models have a manifest: […] WDYT? Merge and solve later?

OK, I found the last "problem". Standard -hf works! Don't use the --fim-qwen-1.5b-default / --fim-qwen-3b-default *-default CLI shortcuts to download them! But with ./llama-server -hf unsloth/Qwen3-0.6B-GGUF:Q6_K and my .ini, it works. OK for me!

The only strange thing is: […]
ServeurpersoCom commented Dec 10, 2025
For another issue: […]
ngxson commented Dec 10, 2025 • edited
These flags don't trigger manifest downloading, so it's easy to understand why the models don't get listed. But I think the current preset feature is going to replace these defaults very soon. In other words, the current PR was added to address this exact problem.

Edit: maybe not resolve it right away, but the idea is to write better presets in the future.
already approved changes related to PEG
Merged f32ca51 into ggml-org:master
* llama-server: recursive GGUF loading

  Replace flat directory scan with recursive traversal using std::filesystem::recursive_directory_iterator. Support for nested vendor/model layouts (e.g. vendor/model/*.gguf). Model name now reflects the relative path within --models-dir instead of just the filename. Aggregate files by parent directory via std::map before constructing local_model.

* server : router config POC (INI-based per-model settings)

* server: address review feedback from @aldehir and @ngxson

  PEG parser usage improvements:
  - Simplify parser instantiation (remove arena indirection)
  - Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
  - Fix last line without newline bug (+ operator instead of <<)
  - Remove redundant end position check

  Feature scope:
  - Remove auto-reload feature (will be separate PR per @ngxson)
  - Keep config.ini auto-creation and template generation
  - Preserve per-model customization logic

  Co-authored-by: aldehir <[email protected]>
  Co-authored-by: ngxson <[email protected]>

* server: adopt aldehir's line-oriented PEG parser

  Complete rewrite of INI parser grammar and visitor:
  - Use p.chars(), p.negate(), p.any() instead of p.until()
  - Support end-of-line comments (key=value # comment)
  - Handle EOF without trailing newline correctly
  - Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
  - Simplified visitor (no pending state, no trim needed)
  - Grammar handles whitespace natively via eol rule

  Business validation preserved:
  - Reject section names starting with LLAMA_ARG_*
  - Accept only keys starting with LLAMA_ARG_*
  - Require explicit section before key-value pairs

  Co-authored-by: aldehir <[email protected]>

* server: fix CLI/env duplication in child processes

  Children now receive minimal CLI args (executable, model, port, alias) instead of inheriting all router args. Global settings pass through LLAMA_ARG_* environment variables only, eliminating duplicate config warnings.

  Fixes: router args like -ngl, -fa were passed both via CLI and env, causing 'will be overwritten' warnings on every child spawn.

* add common/preset.cpp
* fix compile
* cont
* allow custom-path models
* add falsey check

* server: fix router model discovery and child process spawning

  - Sanitize model names: replace / and \ with _ for display
  - Recursive directory scan with relative path storage
  - Convert relative paths to absolute when spawning children
  - Filter router control args from child processes
  - Refresh args after port assignment for correct port value
  - Fallback preset lookup for compatibility
  - Fix missing argv[0]: store server binary path before base_args parsing

* Revert "server: fix router model discovery and child process spawning"

  This reverts commit e3832b4.

* clarify about "no-" prefix
* correct render_args() to include binary path
* also remove arg LLAMA_ARG_MODELS_PRESET for child
* add co-author for ini parser code

  Co-authored-by: aldehir <[email protected]>

* also set LLAMA_ARG_HOST
* add CHILD_ADDR
* Remove dead code

---------

Co-authored-by: aldehir <[email protected]>
Co-authored-by: ngxson <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: aldehir <[email protected]>
mdierolf commented Dec 12, 2025
Excellent work, but I have a bit of feedback (currently using the ggml-org:master branch):

- When specifying a preset file, the server still loads all the models in the models directory, and there appears to be no way to disable this behavior. If there is a way to disable it, it shouldn't be the default.
- When the models auto-load, it becomes impossible to define aliases for models, since the model name in the ini section header has to match the exact name in the GGUF to override the settings, and the -a param is not available (which makes sense).
- The proper behavior should be that either or both of the models-dir or the preset option is specified.
- Also, these flags should probably be named in a manner more consistent with their usage for routing, as models-preset does not convey that behavior. Perhaps --model-routes-file routes.ini, --auto-route-models [dir] (optional, accepts multiple values, should default to the llama cache dir).
ServeurpersoCom commented Dec 12, 2025
@mdierolf Thanks for the feedback. Just to clarify: when you say "the models directory", are you talking about the automatic scanning of ~/.cache/llama.cpp/ (the default cache), or a directory you specified with --models-dir? --models-dir is optional and only scans if you provide it, but the cache directory is always scanned for HuggingFace models.

If I understand correctly, you'd like the option to use only the models defined in your preset file, without any automatic scanning of the cache? That would mean adding a way to disable cache scanning when using --models-preset?
mdierolf commented Dec 12, 2025
Correct. It is unexpected behavior that passing in a preset file automatically triggers the scanning of my llama cache directory. I have tons of models stored there, and most of them I do not want to serve.

But after thinking about this further, I think the thing that seems off to me is that this functionality has little overlap with what llama-server does. It should be in a llama-router executable. llama-server already does way too much in one executable.
ServeurpersoCom commented Dec 12, 2025
That's what I think too; it makes coding more difficult, but I trust ngxson :) It also makes the binary very versatile and performant on small platforms.
ngxson commented Dec 12, 2025 • edited
You can point the LLAMA_CACHE env var to a different directory […]
We merge everything into llama-server […]
Just speaking the truth here: users use it, like it, and ask us to add more functions; we add them, and now users complain there are too many features. (I can link all the issues asking for the multiple-model serving feature if you want.) A classic problem with open-source projects, I think.
mdierolf commented Dec 14, 2025
Well, that's one way to do it. Godspeed and good luck
strawberrymelonpanda commented Dec 14, 2025
To be fair, "Users" are a very diverse group of people with differing needs, so there's no real way you'll get universal consensus.
mdierolf commented Dec 15, 2025
This. `[*]` `[modeliwant]`
Router per-model config
This PR implements INI-based per-model configuration for llama-server router mode, as discussed in #17850.
Summary
Advanced users can define custom configurations using an .ini file. Each model can have its own preset with custom parameters while inheriting router defaults for unspecified options.
Motivation
Multi-model inference servers for small/medium teams need declarative configuration with zero operational friction. Operators should be able to set global defaults via router CLI and override specific parameters per model in a simple text file.
Key Features
- Per-model presets in a plain INI file; unspecified options inherit the router defaults
- Three equivalent key spellings: LLAMA_ARG_* env-var names, long argument names, and short argument names
- Model discovery from the HF cache, --models-dir, and preset-defined custom paths
- Router-managed control args (model path, port, alias, host)
Usage
The router can combine multiple sources:
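For example (paths illustrative; flags taken from the examples elsewhere in this thread):

```sh
llama-server \
  --models-dir ./models \
  --models-preset ./presets.ini \
  -ngl 999 -fa --mlock -ctk q8_0 -ctv q8_0
```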
INI Format
Section names define model identifiers. Keys correspond to CLI arguments without leading dashes.
Supported key formats:
- LLAMA_ARG_* environment-variable names (e.g. LLAMA_ARG_CTX_SIZE)
- long argument names without leading dashes (e.g. ctx-size, n-gpu-layers)
- short argument names (e.g. c, ngl)
All three formats are equivalent and can be mixed in the same file.
Example presets.ini:
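A sketch with illustrative section names, paths, and values:

```ini
; each section overrides the router defaults for one model
[Qwen3-0.6B-Q6_K]
ctx-size = 16384
temp = 0.7

; a preset can also define a new model with a custom path
[my-private-model]
m = /models/private/model.Q6_K.gguf
LLAMA_ARG_N_GPU_LAYERS = 999
```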
How Models Are Loaded
The router discovers models from three sources:
1. The HuggingFace cache (~/.cache/llama.cpp), which is always scanned
2. A local directory passed via --models-dir (optional)
3. Models defined directly in the preset file passed via --models-preset
Model names from presets can match cached or local models to apply custom configurations, or define entirely new models with custom paths.
Argument Inheritance
When spawning a child process for a model, arguments from the router CLI, the model's preset section, and the router's own control handling are merged.
Priority order (highest to lowest):
1. Control args set by the router itself (passed on the child's command line)
2. Per-model preset values
3. Router global defaults (forwarded as LLAMA_ARG_* environment variables)
Control args automatically managed by router: model path, port, alias, host, and the --models-preset flag itself.
If a preset contains control args, they are removed with a warning.
Changes
New files:
- common/preset.cpp|h — INI preset parsing, moved out of the server so other tools can reuse it

Modified files:
- tools/server/server-models.cpp — model discovery and child process spawning
- tools/server/README.md — preset documentation and examples
- common/arg.cpp — argument/env-var mapping used by preset keys
Technical Details
INI parsing:
- Line-oriented PEG grammar (contributed by aldehir) with ;/# comments, end-of-line comments, and strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Handles a final line without a trailing newline; whitespace is consumed by the grammar's eol rule

Argument mapping:
- Keys accept LLAMA_ARG_* env-var names, long argument names, or short argument names; all three resolve through the existing parser in common/arg.cpp
- Boolean values are passed through verbatim and interpreted by is_truthy / is_falsey

Child process spawning:
- Children receive minimal CLI args (executable, model, port, alias); everything else is forwarded as LLAMA_ARG_* environment variables, avoiding duplicate-argument warnings
- Router control args (including --models-preset) are stripped from children; LLAMA_ARG_HOST is set per child
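A sketch of the merge step under those assumptions (names hypothetical; per-model preset values take precedence over the router globals, as described above):

```cpp
#include <map>
#include <string>
#include <vector>

// merge router globals with one model's preset values; the preset wins
// on conflicts, and the result is flattened into "KEY=value" strings
// suitable for the child's environment
static std::vector<std::string> build_child_env(
        const std::map<std::string, std::string> & globals,
        const std::map<std::string, std::string> & preset) {
    std::map<std::string, std::string> merged = globals;
    for (const auto & [key, value] : preset) {
        merged[key] = value; // per-model override
    }
    std::vector<std::string> env;
    for (const auto & [key, value] : merged) {
        env.push_back(key + "=" + value);
    }
    return env;
}
```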
Use Case Example
A development team runs an inference server with multiple models:
The presets.ini file defines per-model overrides:
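For instance, a sketch overriding just two of those globals for one model (names illustrative):

```ini
[big-model]
ctx-size = 32768   ; larger context for this model only
ctk = f16          ; override the global q8_0 K-cache type
```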
Global defaults (-ngl 999 -fa -ctk q8_0 etc.) apply to all models, but each preset can override specific parameters. The router automatically manages ports, aliases, and model paths.
Testing Status
Tested configurations:
- Full production config migrated from llama-swap (YAML → INI), preserving ctx-size, n-cpu-moe, mmproj, and custom model paths
- Global CLI defaults (-ngl 999, -fa, --mlock, -ctk/-ctv) applied across all models with per-model overrides
- HF-cached models downloaded via -hf (e.g. unsloth/Qwen3-0.6B-GGUF:Q6_K) combined with a preset file
Notes
Related Issues
Closes #17850
Related to #17470, #10932
Thanks to
Co-authored-by: aldehir (INI parser PEG grammar)
Co-authored-by: ngxson (llama-server integrated router/model children, preset refactoring, API design, argument system integration, ...)