support GLM-4.5V and GLM-4.1V vision models #16600
Conversation
ddh0 commented Oct 15, 2025 • edited
need `clip.vision.rope.freq_base` for GLM-4.5V
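For context, here is a minimal sketch of how such a key could be read from an mmproj GGUF file using ggml's public GGUF API. This is not the actual clip.cpp code: the helper function is made up for illustration, it assumes the key is written as an f32 KV, and the 10000.0f fallback is illustrative only.

```cpp
// Hypothetical sketch: read the proposed `clip.vision.rope.freq_base` key
// from a GGUF file, falling back to an illustrative default when absent.
#include "gguf.h"
#include <cstdint>
#include <cstdio>

static float read_vision_rope_freq_base(const char * fname) {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to open %s\n", fname);
        return 10000.0f; // illustrative default only
    }
    float freq_base = 10000.0f;
    const int64_t kid = gguf_find_key(ctx, "clip.vision.rope.freq_base");
    if (kid >= 0) {
        freq_base = gguf_get_val_f32(ctx, kid);
    }
    gguf_free(ctx);
    return freq_base;
}
```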
ddh0 commented Oct 17, 2025 • edited
So, it turns out that vision in this model is based on Qwen3-VL, which still needs support from llama.cpp. I am pretty familiar with llama.cpp in general, but not with the vision side. I also just saw this thread (#16207) in which someone has posted a patch to get Qwen3-VL kinda-sorta-working in llama.cpp. I will take a look at that too and see if it is helpful - it might make more sense to get Qwen3-VL to a working state in llama.cpp first and only then start working on this PR on top of that. Not sure, just thinking out loud.
rujialiu commented Oct 18, 2025
Thanks for your work! @ddh0
ddh0 commented Oct 18, 2025
Thank you @rujialiu! I suspect your understanding of the […]. Also cc @ngxson (llama vision expert :))
rujialiu commented Oct 19, 2025
I have 0 understanding of […]
rujialiu commented Oct 20, 2025 • edited
@ddh0 I asked Claude Sonnet 4.5 to carefully inspect the modeling code.
It's so similar to Qwen2.5-VL, but why does the code re-use qwen3_vl_moe? Because Qwen2.5-VL doesn't have an MoE version 😄 So I guess it's ok to resume the work directly, based on https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix. It should be easy to adapt to whatever "llama_batch improvement" is merged into llama.cpp.
BTW: can we make sure the dense version (GLM-4.1V-9B-Thinking, #14495) is working first? It's much smaller and easier to compare results with.
ddh0 commented Oct 20, 2025 • edited
Thank you @rujialiu, that's all very helpful. I will take a look at GLM-4.1V-9B-Thinking and see if it can be incorporated into this PR. Is there a PR associated with the branch you linked (FMayran's QwenVL-causal-fix)?
rujialiu commented Oct 20, 2025 • edited
Of course! Hopefully @ngxson will find some time to fix the general problem (adding an internal token index for the causal check). Since you're familiar with the LLM part, you can take a look at our discussion in #15474 (the quickest way is to read it in bottom-up order until you understand). The issue and solution are conceptually very simple, but I'm not brave/skillful enough to touch […].
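For context, a minimal hypothetical sketch of the "internal token index for the causal check" idea being discussed: the causal mask is derived from a strictly increasing per-token index rather than from the (M-)RoPE position, which may repeat or jump for visual tokens. This is not llama.cpp's llama_batch code; all names are made up for illustration.

```cpp
// Illustrative only: causal masking keyed on an internal sequence index.
#include <cstdio>
#include <vector>

struct tok {
    int rope_pos; // position used for (M-)RoPE; visual tokens may share/repeat it
    int index;    // internal, strictly increasing index in the sequence
};

// token j may attend to token i iff i does not come after j in the sequence
static bool can_attend(const tok & i, const tok & j) {
    return i.index <= j.index; // NOT i.rope_pos <= j.rope_pos
}

int main() {
    // three text tokens, four visual tokens sharing a temporal rope_pos, one more text token
    std::vector<tok> seq = {
        {0, 0}, {1, 1}, {2, 2},
        {3, 3}, {3, 4}, {3, 5}, {3, 6},
        {4, 7},
    };
    for (size_t j = 0; j < seq.size(); ++j) {
        for (size_t i = 0; i < seq.size(); ++i) {
            printf("%c", can_attend(seq[i], seq[j]) ? '1' : '.');
        }
        printf("\n");
    }
    return 0;
}
```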
FMayran commented Oct 23, 2025
Now there is: #16745
still need to figure out what exactly needs to be changed...
ngxson commented Nov 6, 2025 • edited
This is essentially the same thing that LFM2 uses; you can copy most of the code from that model (already supported by mtmd). The key difference is in the projection stage, GLM4V uses: […]
```cpp
llm_build_glm4v::llm_build_glm4v(const llama_model & model, const llm_graph_params & params) : llm_graph_context(params) {
    //
    // TODO -- currently this is just copied from `llm_build_glm4` -- still WIP
```
normally, the text model of a "vision" model is just a normal text model, so you probably don't need to add a new arch for it (no need to change anything in the main llama.cpp code). The only thing that needs to change is mtmd.
Yeah, I am still not 100% sure if we need separate architectures for these vision models or not. The paper mentions:
> To further enhance spatial awareness on the language side, we extend RoPE to 3D-RoPE in the LLM.
I think what they're referring to as "3D-RoPE in the LLM" is actually M-RoPE, which glm4 and glm4_moe do not use.
Maybe M-RoPE could be conditionally incorporated into the existing llm_build_glm4 and llm_build_glm4_moe graph, but I thought it would be cleaner for the implementation of the vision models to be separate. I also did it this way following the pattern of Qwen3 / Qwen3VL being separate, as I think GLM is not too dissimilar from those.
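For context, a rough standalone illustration of what M-RoPE-style 3D position ids look like, loosely following the Qwen2-VL scheme; the exact rule GLM-4.5V uses should be checked against `apply_multimodal_rotary_pos_emb` in transformers. The names and grid handling below are illustrative, not llama.cpp's implementation.

```cpp
// Illustrative M-RoPE-style position ids: text tokens share one running
// position across all three components, image tokens get (temporal, row, col).
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

using pos3 = std::array<int, 3>; // {temporal, height, width}

int main() {
    std::vector<pos3> pos;
    int next = 0;

    // a few text tokens
    for (int i = 0; i < 3; ++i) { pos.push_back({next, next, next}); ++next; }

    // a 2x3 grid of image tokens: shared temporal index, per-patch row/column
    const int t = next, rows = 2, cols = 3;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            pos.push_back({t, t + r, t + c});
        }
    }
    next = t + std::max(rows, cols); // text resumes after the largest position used

    // a trailing text token
    pos.push_back({next, next, next});

    for (const pos3 & p : pos) {
        printf("(%d,%d,%d) ", p[0], p[1], p[2]);
    }
    printf("\n");
    return 0;
}
```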
also renamed `glm4v_moe.cpp` to `glm4v-moe.cpp` to match other model files
fuutott commented Dec 8, 2025
eitelnick commented Dec 9, 2025
Can you also add this to Ollama cloud with vision / multimodal support?
eelbaz commented Dec 11, 2025
eelbaz commented Dec 12, 2025
GLM support has landed and is open for review: #17967 - Enjoy!
ngxson commented Dec 13, 2025
@ddh0 are you still actively working on this PR? I'll have a look in the upcoming days
ngxson commented Dec 13, 2025
Nevermind, this PR doesn't have the tensor_mapping for mmproj that I need, so it's probably better for me to start from zero
ddh0 commented Dec 13, 2025
No, I sort of got stuck and wasn't sure how to proceed, and I also got a job so I have less free time now. As I'm sure you know, there have been some MRoPE fixes/additions since I first started this PR, so there is probably something I'm missing.
Sure, I would appreciate it if you took over; you know what you're doing more than I do.
eelbaz commented Dec 13, 2025
@ddh0 — Thanks for the great starting point, your work was super helpful! I just tried to contribute a working path to unblock the community while official maintainer-approved support for glm4.6v is in consideration/progress. Please feel free to use #17998 in any manner (the take-it-or-leave-it code works), if helpful. Apologies for the noise/thrash!

Add support for zai-org/GLM-4.5V and zai-org/GLM-4.1V-9B-Thinking vision models to llama.cpp. I currently only plan to support images + text, no video inputs in this PR.

The architecture is `Glm4vMoeForConditionalGeneration` (`"model_type": "glm4v_moe"`) / `Glm4vForConditionalGeneration` (`"model_type": "glm4v"`). Internally, these consist of an LLM (text model) and a ViT (vision adapter / multimodal projector):

LLM
- `model.language_model`
- `apply_multimodal_rotary_pos_emb`: applies rotary embeddings across temporal, height, and width dimensions for visual tokens

ViT
- `Aimv2VisionModel`, `model.visual`
- `Glm4vMoeVisionEmbeddings` module to handle varied image resolutions
- vision-side rotary embeddings (`apply_rotary_pos_emb_vision`)

Other notes:
References:
See also: