Conversation

@infil00p

This commit adds support for the R-4B model (YannQi/R-4B), a multimodal large language model with auto-thinking capabilities.

Changes:

  • convert_hf_to_gguf.py: Added RVisionModel and RTextModel classes to handle the R model architecture (RForConditionalGeneration)

    • RVisionModel uses LFM2 projector type with scale_factor=1 (no patch merging)
    • RTextModel extends Qwen3Model for the language component
    • Proper tensor name mapping for the projector (pre_norm, linear_1, linear_2); a rename sketch follows this list
  • tools/mtmd/clip.cpp: Modified build_patch_merge_permute() to support scale_factor=1, which skips patch merging for models that don't need it (see the NumPy sketch after this list for intuition)

    • R model uses SigLIP vision encoder with 729 tokens (27x27 patches)
    • Projector: LayerNorm → Linear → GELU → Linear (no patch downsampling)
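
For reference, a minimal, illustrative sketch of the projector renaming mentioned above: the HF-side names (multi_modal_projector.pre_norm, linear_1, linear_2) come from the R-4B checkpoint, while the GGUF-side names below are placeholders rather than the exact strings emitted by convert_hf_to_gguf.py.

```python
# Illustrative only: plain-Python sketch of the projector tensor renaming.
# The GGUF-side names here are hypothetical placeholders, not necessarily
# the exact names written by convert_hf_to_gguf.py.

PROJECTOR_NAME_MAP = {
    "multi_modal_projector.pre_norm": "mm.pre_norm",  # LayerNorm before the MLP
    "multi_modal_projector.linear_1": "mm.1",         # first Linear
    "multi_modal_projector.linear_2": "mm.2",         # second Linear (after GELU)
}

def map_projector_name(hf_name: str) -> str | None:
    """Return the GGUF tensor name for an HF projector tensor, or None if unmapped."""
    for hf_prefix, gguf_prefix in PROJECTOR_NAME_MAP.items():
        if hf_name.startswith(hf_prefix):
            return gguf_prefix + hf_name[len(hf_prefix):]  # keeps ".weight"/".bias"
    return None

# map_projector_name("multi_modal_projector.linear_1.weight") -> "mm.1.weight"
```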

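For intuition on the clip.cpp change: patch merging is essentially a space-to-depth reshape over the token grid. The NumPy sketch below is not the ggml code in build_patch_merge_permute(), just the equivalent array operation; it shows why scale_factor=1 reduces to a pass-through that keeps all 729 tokens (the hidden size of 1152 is an assumption).

```python
# NumPy analogue of patch merging (space-to-depth over the token grid).
# Not the ggml implementation in clip.cpp -- just the equivalent array operation.
import numpy as np

def patch_merge(tokens: np.ndarray, h: int, w: int, scale_factor: int) -> np.ndarray:
    """tokens: (h*w, dim) vision tokens laid out row-major on an h x w grid."""
    if scale_factor == 1:
        return tokens  # nothing to merge: all h*w tokens pass through unchanged
    dim = tokens.shape[-1]
    s = scale_factor
    grid = tokens.reshape(h, w, dim)
    # group each s x s block of neighbouring patches into one wider token
    grid = grid.reshape(h // s, s, w // s, s, dim)
    grid = grid.transpose(0, 2, 1, 3, 4)                   # (h/s, w/s, s, s, dim)
    return grid.reshape((h // s) * (w // s), s * s * dim)  # fewer, wider tokens

x = np.zeros((27 * 27, 1152), dtype=np.float32)  # 729 SigLIP tokens (dim assumed)
assert patch_merge(x, 27, 27, scale_factor=1).shape == (729, 1152)
assert patch_merge(x, 28, 28, scale_factor=2).shape == (196, 4 * 1152)
```
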
Architecture:

  • Base text model: Qwen3-4B
  • Vision encoder: SigLIP (384x384, patch size 14)
  • Projector: 2-layer MLP with pre-normalization (no patch merging)
  • Feature selection: full (keeps all 729 vision tokens)
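
A rough PyTorch sketch of the projector path described above; the dimensions are assumptions (SigLIP hidden size 1152, Qwen3-4B hidden size 2560), and the real values come from the model's config.json.

```python
# Minimal sketch of the R-4B projector: LayerNorm -> Linear -> GELU -> Linear.
# Dimensions are assumptions; the actual sizes come from the model's config.json.
import torch
import torch.nn as nn

class RProjector(nn.Module):
    def __init__(self, vision_dim: int = 1152, text_dim: int = 2560):
        super().__init__()
        self.pre_norm = nn.LayerNorm(vision_dim)        # normalizes raw SigLIP features
        self.linear_1 = nn.Linear(vision_dim, text_dim)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_dim, text_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, 729, vision_dim) -- all tokens kept, no merging
        x = self.pre_norm(vision_tokens)
        return self.linear_2(self.act(self.linear_1(x)))

# 384x384 image, patch size 14 -> 27x27 = 729 vision tokens per image
tokens = torch.randn(1, 27 * 27, 1152)
print(RProjector()(tokens).shape)  # torch.Size([1, 729, 2560])
```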

Tested with llama-mtmd-cli; the model successfully generates English responses with Chinese internal reasoning (<think> tags).

