
Conversation

@ddh0 (Contributor) commented Oct 15, 2025

Add support for zai-org/GLM-4.5V and zai-org/GLM-4.1V-9B-Thinking vision models to llama.cpp. I currently only plan to support images + text, no video inputs in this PR.

The architecture is Glm4vMoeForConditionalGeneration ("model_type": "glm4v_moe") / Glm4vForConditionalGeneration ("model_type": "glm4v"). Internally, these consist of an LLM (text model) and a ViT (vision adapter / multimodal projector):

LLM

  • Based on GLM-4.5-Air / GLM-4-9B-0414
  • Tensor names start with model.language_model.
  • Uses a "multimodal 3D RoPE" - in apply_multimodal_rotary_pos_emb, it applies rotary embeddings across temporal, height, and width dimensions for visual tokens
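
For reference, a minimal sketch (not taken from this PR) of the chunked M-RoPE idea behind apply_multimodal_rotary_pos_emb: the rotary channels of each head are split into temporal/height/width sections, and each section takes its cos/sin from the corresponding position-id axis. This loosely follows the transformers Qwen2-VL/GLM4V code; the section sizes here are illustrative.

import torch

def rotate_half(x):
    # standard RoPE helper: rotate the two halves of the last dimension
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_mrope(q, k, cos, sin, mrope_section):
    # q, k: (batch, n_heads, seq_len, head_dim)
    # cos, sin: (3, batch, seq_len, head_dim), axis 0 indexed by (temporal, height, width) position ids
    # mrope_section: rotary channel pairs owned by each axis, e.g. [16, 24, 24] for head_dim = 128
    sections = list(mrope_section) * 2  # repeat so the t/h/w pattern covers both rotary halves
    cos = torch.cat([c[i % 3] for i, c in enumerate(cos.split(sections, dim=-1))], dim=-1).unsqueeze(1)
    sin = torch.cat([s[i % 3] for i, s in enumerate(sin.split(sections, dim=-1))], dim=-1).unsqueeze(1)
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin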

ViT

  • Adapted from apple/aimv2-huge-patch14-336:
    • Architecture Aimv2VisionModel
    • ~681M params
    • 24 layers
    • hidden_size (n_embd): 1536
    • intermediate_size (n_ff): 4096
    • image_size: 336
    • patch_size: 14
    • num_channels: 3
    • depth: 24
  • Tensor names start with model.visual.
  • Its 2D positional embeddings are dynamically adapted via bicubic interpolation within the Glm4vMoeVisionEmbeddings module to handle varied image resolutions (see the sketch after this list)
  • It also applies its own rotary position embeddings within the self-attention blocks (via apply_rotary_pos_emb_vision)
  • The ViT is the same for both models
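
For reference, a minimal sketch (not this PR's code) of the bicubic interpolation step: the learned position-embedding table for the native 336/14 = 24x24 grid is reshaped to 2D and resized to whatever patch grid the input image produces. Names and shapes are illustrative.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, src_grid, dst_h, dst_w):
    # pos_embed: (src_grid * src_grid, n_embd) learned table, e.g. (24 * 24, 1536)
    n_embd = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, src_grid, src_grid, n_embd).permute(0, 3, 1, 2)  # (1, n_embd, H, W)
    grid = F.interpolate(grid, size=(dst_h, dst_w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(dst_h * dst_w, n_embd)  # one row per patch

# e.g. a 448x448 image at patch_size 14 needs a 32x32 grid:
# new_pe = resize_pos_embed(pos_embed, 24, 32, 32)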

Other notes:

  • Native context length is 65,536 (as opposed to 131,072 for GLM-4.5-Air)
  • RoPE theta (θ): 10,000.0 (as opposed to 100,000.0 for GLM-4.5-Air)
  • The model supports video input, but this PR targets images only
  • The tokenizer has video-related special tokens that need to be handled during conversion (see the sketch below)
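
As a rough, hedged sketch of what "handle these during conversion" might involve: list the image/video-related added tokens straight from the HF tokenizer config so they can be mapped explicitly. The key names follow the usual tokenizer_config.json layout, but the exact token strings are assumptions - check the actual checkpoint.

import json

with open("tokenizer_config.json") as f:
    cfg = json.load(f)

# added_tokens_decoder maps token id -> {"content": ..., "special": ...}
for tok_id, info in cfg.get("added_tokens_decoder", {}).items():
    content = info.get("content", "")
    if "video" in content or "image" in content:
        print(tok_id, content, "special =", info.get("special", False))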

@github-actions bot added the python (python script changes) label Oct 15, 2025
@ddh0 (Contributor, Author) commented Oct 17, 2025

So, it turns out that vision in this model is based on Qwen3-VL, which still needs support from llama.cpp. I am pretty familiar with llama.cpp in general but not with mtmd, so I may not be able to get this PR done on my own. I will keep trying to hack at it when I have time, and I would appreciate any help I could get. :)

Also just saw this thread (#16207) in which someone has posted a patch to get Qwen3-VL kinda-sorta-working in llama.cpp. I will take a look at that too and see if it is helpful - it might make more sense to get Qwen3-VL to a working state in llama.cpp first and only then start working on this PR on top of that. Not sure, just thinking out loud.

@rujialiu

Thanks for your work! @ddh0
Based on the commit history, the qwen3vl imports are the result of a "use qwen data class to avoid repeat again" refactor, so it's probably not quite "based on Qwen3-VL". But anyway, I'm planning to dive into Qwen3-VL and GLM-4.5V later this month, and I hope I can help.

@ddh0 (Contributor, Author) commented Oct 18, 2025

I'm planning to dive into Qwen3-VL and GLM-4.5V later this month and I hope I can help.

Thank you @rujialiu! I suspect your understanding of the mtmd side of things is better than mine - I could use some guidance on what the next steps should be. I've paused working on this for now until I have a better understanding of what exactly needs to be done.

Also cc @ngxson (llama vision expert :))

@rujialiu

Thank you @rujialiu! I suspect your understanding of the mtmd side of things is better than mine - I could use some guidance on what the next steps should be. I've paused working on this for now until I have a better understanding of what exactly needs to be done.

I had zero understanding of mtmd before tackling the "inaccurate bbox" issue 😄
Many people helped me along the way, so let's learn and do things together!

@rujialiu commented Oct 20, 2025

@ddh0 I asked Claude Sonnet 4.5 to carefully inspect the transformers implementations and tell me the differences between Qwen2.5-VL, Qwen3-VL, and GLM4V (not as a single question; I asked several more specific questions and inspected the code it gave me). I'm not knowledgeable enough to check every detail, but it looks good to me. In short, GLM4V is very close to Qwen2.5-VL:

  • Same chunked RoPE (though with different names: M-RoPE vs 2D-RoPE/3D-RoPE). glm4v's apply_multimodal_rotary_pos_emb even refers to the Qwen-VL paper
  • Same max(t,h,w) logic (see the sketch below)
  • Same window attention / patch merging (because it reuses Qwen2_5_VLVisionAttention and Glm4vVisionPatchMerger, but I haven't checked this carefully)
  • (new) Learnable embeddings + bicubic interpolation (search for the "Perform bicubic interpolation" comment)
  • A few different constants, like hidden_size, rms_norm_eps, etc.

It's so similar to Qwen2.5-VL, so why does the code reuse qwen3_vl_moe? Because Qwen2.5-VL doesn't have an MoE version 😄
Maybe we only need to wire up GLM4.5V's LLM with its vision encoder in the most obvious way and we're done.
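
For reference, a simplified sketch of the shared max(t,h,w) rule for the M-RoPE position ids (roughly as in Qwen2.5-VL's get_rope_index, which GLM4V mirrors): visual tokens get (t, h, w) grid positions offset by the current text position, and the first text token after the image resumes from the max of all previous ids plus one. Single image, no temporal dimension, purely illustrative.

def mrope_positions(n_text_before, grid_h, grid_w, n_text_after):
    pos = []  # (t, h, w) triples; plain text uses the same value on all three axes
    for i in range(n_text_before):
        pos.append((i, i, i))
    start = n_text_before
    for h in range(grid_h):              # image patches: t fixed, h/w follow the grid
        for w in range(grid_w):
            pos.append((start, start + h, start + w))
    nxt = max(max(p) for p in pos) + 1   # resume text after max(t, h, w)
    for i in range(n_text_after):
        pos.append((nxt + i,) * 3)
    return pos

# e.g. mrope_positions(3, 2, 2, 2)[-1] == (6, 6, 6)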

So I guess it's ok to resume the work directly, based on https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix

It should be easy to adapt to whatever "llama_batch improvement" is merged into master later.

BTW: Can we make sure the dense version (GLM-4.1V-9B-Thinking, #14495) is working first? It's much smaller and easier to compare against transformers, and it looks like GLM-4.5V is no different aside from the LLM part.

@ddh0 (Contributor, Author) commented Oct 20, 2025

Thank you @rujialiu, that's all very helpful. I will take a look at GLM-4.1V-9B-Thinking and see if it can be incorporated into this PR.

Is there a PR associated with the branch you linked (FMayran:QwenVL-causal-fix)? If not, maybe we should start a separate PR to fix the vision implementation of these families of models in general and then come back to this PR to finish the model-specific parts.

@rujialiu commented Oct 20, 2025

Thank you @rujialiu, that's all very helpful. I will take a look at GLM-4.1V-9B-Thinking and see if it can be incorporated into this PR.

Is there a PR associated with the branch you linked (FMayran:QwenVL-causal-fix)? If not, maybe we should start a separate PR to fix the vision implementation of these families of models in general and then come back to this PR to finish the model-specific parts.

Of course! Hopefully @ngxson will find some time to fix the general problem (adding an internal token index for the causal check). Since you're familiar with the LLM part, you can take a look at our discussion in #15474 (the quickest way is to read it bottom-up until you understand). The issue and solution are conceptually very simple, but I'm not brave/skillful enough to touch llama-batch.h 😄

@FMayran

Thank you @rujialiu, that's all very helpful. I will take a look at GLM-4.1V-9B-Thinking and see if it can be incorporated into this PR.

Is there a PR associated with the branch you linked (FMayran:QwenVL-causal-fix)? If not, maybe we should start a separate PR to fix the vision implementation of these families of models in general and then come back to this PR to finish the model-specific parts.

now there is: #16745

@ddh0 changed the title from "support GLM-4.5V vision model" to "support GLM-4.5V and GLM-4.1V vision models" on Nov 5, 2025
@github-actions bot added the model (Model specific) label Nov 5, 2025
@ngxson (Collaborator) commented Nov 6, 2025

  • Its 2D positional embeddings are dynamically adapted via bicubic interpolation within the Glm4vMoeVisionEmbeddings module to handle varied image resolutions

  • It also applies its own rotary position embeddings within the self-attention blocks (via apply_rotary_pos_emb_vision)

This is essentially the same thing that LFM2 uses; you can copy most of the code from that model (which mtmd already supports).

The key difference is in the projection stage; GLM4V uses:


llm_build_glm4v::llm_build_glm4v(const llama_model & model, const llm_graph_params & params) : llm_graph_context(params){
//
// TODO -- currently this is just copied from `llm_build_glm4` -- still WIP

Collaborator:

Normally, the text model of a "vision" model is just a normal text model, so you probably don't need to add a new arch for it (no need to change anything in the main llama.cpp code). The only thing that needs to change is mtmd.

Contributor (Author):

Yeah, I am still not 100% sure if we need separate architectures for these vision models or not. The paper mentions:

To further enhance spatial awareness on the language side, we extend RoPE to 3D-RoPE in the LLM.

I think what they're referring to as "3D-RoPE in the LLM" is actually M-RoPE, which glm4 and glm4_moe do not use.

Maybe M-RoPE could be conditionally incorporated into the existing llm_build_glm4 and llm_build_glm4_moe graphs, but I thought it would be cleaner to keep the implementation of the vision models separate. I also did it this way following the pattern of Qwen3 / Qwen3VL being separate, as I think GLM is not too dissimilar from those.

@fuutott

https://huggingface.co/zai-org/GLM-4.6V

@pwilkin added the help wanted (Needs help from the community) label Dec 8, 2025
@eitelnick

Can you also add this to Ollama cloud with vision / multimodal support?

@eelbaz

llama cpp-mtmd
Almost there... In progress for community review.

@eelbaz

GLM support has landed and is open for review: #17967 - enjoy!

@ngxson (Collaborator)

@ddh0 are you still actively working on this PR? I'll have a look in the upcoming days.

@ngxson (Collaborator)

Nevermind, this PR doesn't have the tensor_mapping for mmproj that I need, so it's probably better for me to start from zero.

@ddh0 (Contributor, Author) commented Dec 13, 2025

@ddh0 are you still actively working on this PR? I'll have a look in upcoming days

No, I sort of got stuck and wasn't sure how to proceed, and I also got a job, so I have less free time now. As I'm sure you know, there have been some MRoPE fixes/additions since I first started this PR, so there is probably something I'm missing.

Nevermind, this PR doesn't have the tensor_mapping for mmproj that I need, so it's probably better for me to start from zero.

Sure, I would appreciate it if you took over, you know what you're doing more than I do.

@eelbaz

@ddh0: Thanks for the great starting point, your work was super helpful! I just tried to contribute a working path to unblock the community while official, maintainer-approved support for GLM-4.6V is in consideration/progress. Please feel free to use #17998 in any manner (the take-it-or-leave-it code works), if helpful. Apologies for the noise/thrash!


Labels

examples, help wanted (Needs help from the community), model (Model specific), python (python script changes)

8 participants

@ddh0, @rujialiu, @FMayran, @ngxson, @fuutott, @eitelnick, @eelbaz, @pwilkin