llama : second attempt to refactor vision API #11292
Conversation
ngxson commented Jan 18, 2025 • edited
ngxson commented Jan 19, 2025
Hi @ggerganov @slaren, I would like to ask for an early review from you before proceeding further. What will be interesting to discuss here is the usage of the new API, as demoed in the newly added example.

I'm already able to make llava and mobilevlm work with it. Things that are different from the initial discussion in #8010:

And things that are still messy and will need more work:

I would love to hear your opinions about this. Thank you!
```cpp
if (ctx.ctx_ggml) {
    ggml_free(ctx.ctx_ggml);
}
ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead(),
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,
};
ctx.ctx_ggml = ggml_init(params);
ctx.output = ggml_dup_tensor(ctx.ctx_ggml, output_node);
ggml_backend_alloc_ctx_tensors_from_buft(ctx.ctx_ggml, ctx.model->buft);
ggml_backend_tensor_copy(output_node, ctx.output);
```
@slaren Not sure if there is a better way, but I'm using a hacky solution here.

Without a dedicated context (and `ggml_backend_tensor_copy`), the underlying buffer is reallocated before the next `llama_decode`, rendering the data unusable.
If the vision part uses the same scheduler as the `llama_context`, that's unavoidable. You could pre-allocate the tensor in a different buffer to avoid the copy, but that's an optimization that can be done later.
If we have a separate encoder context for the clip model, the decoder context could reference tensors from it directly. They would be interpreted as inputs for the decoder.
slaren commented Jan 20, 2025
I am just wondering, is there any reason to expose the patches/slices to the user at all? Can the user do anything with the patches other than just immediately call `llama_vision_encode` and throw them away? If not, then maybe that could be hidden entirely from the user and `llama_vision_encode` could take directly an image.
danbev commented Jan 20, 2025
@ngxson I'll take a closer look at this today, specifically at how this could work with a cross-attention model like Llama 3.2 Vision 👍 One thing that is related to this work is something we discussed about how these models should be provided. I initially thought that creating a single .gguf for Llama 3.2 which contained both the vision encoder and the language model would be the way to go, but as can be read in the linked discussion, having separate models is probably a better solution. It would be great to get some clarification regarding this and whether vision encoders should be separate .gguf models.
ngxson commented Jan 20, 2025 • edited
@slaren In my first proposal, I made …
ngxson commented Jan 20, 2025 • edited
Btw, I have repeatedly been asked about …
ggerganov left a comment • edited
Adding some thoughts that I have so far.
Continuing along the idea of having separate models and contexts for the encoder and the decoder, I think that with a proper llama_batch abstraction we can have the following API:
```cpp
// vision
patches0 = llama_vision_tokenize(ctx_enc_v, img0);
patches1 = llama_vision_tokenize(ctx_enc_v, img1);

llama_batch_add_image(batch_enc_v, patches0);
llama_batch_add_image(batch_enc_v, patches1);

llama_encode(ctx_enc_v, batch_enc_v);

embd_enc_v = llama_get_embeddings(ctx_enc_v);

// audio
mel0 = llama_audio_tokenize(ctx_enc_a, audio0);
mel1 = llama_audio_tokenize(ctx_enc_a, audio1);

llama_batch_add_audio(batch_enc_a, mel0);
llama_batch_add_audio(batch_enc_a, mel1);

llama_encode(ctx_enc_a, batch_enc_a);

embd_enc_a = llama_get_embeddings(ctx_enc_a);

// text + vision + audio
tokens0 = llama_tokenize(ctx_dec, tokens0);
tokens1 = llama_tokenize(ctx_dec, tokens1);

llama_batch_add_text      (batch_dec, tokens0);
llama_batch_add_embd_image(batch_dec, embd_enc_v);
llama_batch_add_embd_audio(batch_dec, embd_enc_a);
llama_batch_add_text      (batch_dec, tokens1);

llama_decode(ctx_dec, batch_dec);
```

For cross-attention models such as Llama 3.2 Vision and Whisper, the decoding context `ctx_dec` could be initialized with a reference to the encoder context:

```cpp
llama_context_params cparams_dec;

cparams_dec.ctx_cross[0] = ctx_enc_v;
cparams_dec.ctx_cross[1] = ctx_enc_a;
```

Edit: extended the example with audio input as well.
src/llama-vision.cpp Outdated
```cpp
static ggml_cgraph * clip_image_build_graph(clip_context & ctx, int batch_size, clip_image_size & image_size) {
    auto & model   = *ctx.model;
    auto & hparams = ctx.model->hparams;

    const int   hidden_size = hparams.hidden_size;
    const int   n_head      = hparams.n_head;
    const int   d_head      = hidden_size / n_head;
    const int   patch_size  = hparams.patch_size;
    const float eps         = hparams.eps;

    const int num_patches   = ((image_size.width / patch_size) * (image_size.height / patch_size));
    const int num_positions = num_patches + (model.class_embedding ? 1 : 0);

    LLAMA_LOG_DEBUG("%s: num_patches = %d\n", __func__, num_patches);
```
The clip graph should be constructed as any other graph in src/llama.cpp, llm_build_context.
I'm not sure how to do this right now, as I can't see how I can re-use the existing `build_*` functions to make the cgraph of vision models "blend in" with the rest of `llm_build_context`.

But what I did so far is to make an equivalent called `llama_vision_graph_builder`. This is meant to be a temporary solution, to simplify the migration in the near future.

Could you please have a look at my `llama_vision_graph_builder` to see how it can be merged into `llm_build_context`? Thanks!
src/llama-vision.cpp Outdated
```cpp
    delete p;
}

int32_t llama_vision_encode(struct llama_context * ctx, llama_vision_patches * p) {
```
Don't think we need a separate function - we should be able to reuse `llama_encode`.
ngxson Jan 21, 2025 • edited
Hmm I don't think we can do this right now, as it requires llama_batch to also accept image tokens.
Do you think it's ok to keep llama_vision_encode(llama_img_tokens &) and refactor llama_batch later on?
src/llama-vision.cpp Outdated
```cpp
struct llama_vision_patches * llama_vision_patches_init(
        struct llama_context * ctx,
        llama_vision_bitmap * bmp) {
    clip_context & vctx = ctx->vctx;
    if (vctx.model->hparams.arch == VISION_ARCH_MINICPMV) {
        return new llama_vision_patches(clip_image_preprocess_minicpmv(vctx, *bmp));
    }
    return new llama_vision_patches(clip_image_preprocess(vctx, *bmp));
}
```
ggerganov Jan 20, 2025 • edited
I agree that the analogy of "tokenization" in the context of vision models is the conversion of "images -> patches". So the patches could be considered as "image tokens" and it seems reasonable to have a separate function to create patches, since this would have to be performed on the CPU.
> I am just wondering, is there any reason to expose the patches/slices to the user at all? Can the user do anything with the patches other than just immediately call `llama_vision_encode` and throw them away? If not, then maybe that could be hidden entirely from the user and `llama_vision_encode` could take directly an image.
Even though the user cannot explicitly operate with the patches, it seems to make sense to expose this in order to be able to multi-thread the pre-processing step.
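As a rough illustration, exposing the pre-processing step would let an application parallelize it like the sketch below. This is not code from this PR; it assumes that `llama_vision_patches_init` (shown earlier in this review) can safely be called concurrently for different bitmaps:

```cpp
#include <future>
#include <vector>

#include "llama.h" // assuming the vision API from this PR is declared here

// Pre-process ("tokenize") a batch of images on the CPU in parallel.
static std::vector<llama_vision_patches *> preprocess_parallel(
        llama_context * ctx,
        const std::vector<llama_vision_bitmap *> & bitmaps) {
    std::vector<std::future<llama_vision_patches *>> futures;
    futures.reserve(bitmaps.size());
    for (llama_vision_bitmap * bmp : bitmaps) {
        // each image is pre-processed independently on its own thread
        futures.push_back(std::async(std::launch::async, [ctx, bmp]() {
            return llama_vision_patches_init(ctx, bmp);
        }));
    }
    std::vector<llama_vision_patches *> patches;
    patches.reserve(futures.size());
    for (auto & f : futures) {
        patches.push_back(f.get()); // the results can later be reused with different contexts
    }
    return patches;
}
```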
Note that we should also consider the case of Whisper in the context of this abstraction. The whisper model takes raw input audio in PCM format, which is first pre-processed into a mel spectrogram. This pre-processing step, similar to the image pre-processing for CLIP and the text tokenization in text models, is performed on the CPU and can be multi-threaded. Of course, any of the three types of pre-processing could be implemented on the GPU with enough effort, but the important aspect is that this pre-processing can be done in parallel for different inputs and, once computed, can be reused with different contexts.
In all cases, the pre-processed input is passed to the transformer graph and the first step is always to convert this input into embeddings. For text, this conversion is trivial - `ggml_get_rows(w, tokens)`. For Whisper, this process involves a couple of convolutions of the mel spectrogram:
For CLIP, this appears to be again a convolution operator applied to the pre-processed input (the image patches) in order to obtain the initial embeddings:
All these conversions of the pre-processed input (tokens, mel, patches) into the initial embeddings should be implemented in a single place: build_inp_embd().
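To make that concrete, here is a rough sketch of what a unified `build_inp_embd()` could look like in ggml terms. Everything below (struct names, tensor shapes, the dispatch enum) is an illustrative assumption, not actual llama.cpp code:

```cpp
#include "ggml.h"

// Illustrative types only - the real llama.cpp structures and names differ.
enum inp_type { INP_TEXT, INP_IMAGE, INP_AUDIO };

struct inp_model {
    ggml_tensor * tok_embd;     // [n_embd, n_vocab]   - text token embeddings
    ggml_tensor * patch_embd_w; // [p, p, 3, n_embd]   - CLIP patch embedding kernel
    ggml_tensor * conv1_w;      // [3, n_mel, n_embd]  - Whisper conv stem, layer 1
    ggml_tensor * conv2_w;      // [3, n_embd, n_embd] - Whisper conv stem, layer 2
    int           patch_size;
};

struct inp_data {
    inp_type      type;
    ggml_tensor * tokens; // I32 [n_tokens]        - tokenized text
    ggml_tensor * pixels; // F32 [W, H, 3]         - pre-processed image
    ggml_tensor * mel;    // F32 [n_frames, n_mel] - pre-processed audio
};

// Convert any pre-processed input (tokens, patches, mel) into the initial embeddings.
static ggml_tensor * build_inp_embd(ggml_context * ctx0, const inp_model & model, const inp_data & inp) {
    switch (inp.type) {
        case INP_TEXT:
            // text: trivial row lookup
            return ggml_get_rows(ctx0, model.tok_embd, inp.tokens);               // [n_embd, n_tokens]
        case INP_IMAGE: {
            // CLIP-style: 2D convolution over the image with stride = patch size
            ggml_tensor * cur = ggml_conv_2d(ctx0, model.patch_embd_w, inp.pixels,
                    model.patch_size, model.patch_size, 0, 0, 1, 1);              // [W/p, H/p, n_embd]
            cur = ggml_reshape_2d(ctx0, cur, cur->ne[0]*cur->ne[1], cur->ne[2]);  // [n_patches, n_embd]
            return ggml_cont(ctx0, ggml_transpose(ctx0, cur));                    // [n_embd, n_patches]
        }
        case INP_AUDIO: {
            // Whisper-style: two 1D convolutions over the mel spectrogram
            ggml_tensor * cur = ggml_gelu(ctx0, ggml_conv_1d_ph(ctx0, model.conv1_w, inp.mel, 1, 1));
            cur = ggml_gelu(ctx0, ggml_conv_1d_ph(ctx0, model.conv2_w, cur, 2, 1));
            return ggml_cont(ctx0, ggml_transpose(ctx0, cur));                    // [n_embd, n_frames/2]
        }
    }
    return nullptr;
}
```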
ngxson Jan 20, 2025 • edited
> I agree that the analogy of "tokenization" in the context of vision models is the conversion of "images -> patches". So the patches could be considered as "image tokens" and it seems reasonable to have a separate function to create patches
Makes sense then. I realized that I had always associated the notion of "token" with "text", but a quick Google search tells me that: "In LLMs, a token is a basic unit of input or output [...]"
In that sense, I would propose calling it `llama_vision_img_tokens` (though it can be a bit confusing, because the user may expect it to be a `std::vector` due to the plural "tokens"):
```cpp
// Structure represents the basic input unit of vision model
// This can be a processed image or slices of images under the hood
struct llama_vision_img_tokens;

// User must reserve N number of tokens in tokenized text prompt for each image
int32_t llama_vision_get_n_tokens(const llama_vision_img_tokens * img_tokens);
```
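For example, a caller could combine this with the "tokenize" function floated earlier in this thread roughly as follows. This is purely hypothetical glue; none of these functions exist yet:

```cpp
#include <cstdint>

#include "llama.h" // assuming the proposed vision API would live here

// Hypothetical: how many placeholder positions to reserve in the text prompt for one image.
static int32_t n_placeholder_tokens(llama_context * ctx, llama_vision_bitmap * bmp) {
    // llama_vision_tokenize is the name proposed earlier in this thread (not an existing function)
    llama_vision_img_tokens * img_tokens = llama_vision_tokenize(ctx, bmp);
    return llama_vision_get_n_tokens(img_tokens);
}
```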
danbev commented Jan 22, 2025
@ngxson Sorry about the delay. I've been able to "force" support for mllama using the latest vision API, that is, get an example working. I'm now going to iterate on this and try to figure out how cross-attention will work. Just wanted to let you know that some progress is being made. There is an issue I'm having with the vocab size which I'm not exactly sure how to handle. If anyone has some thoughts around this, please let me know.
ngxson commented Jan 22, 2025 • edited
@danbev No worries, I was busy with minicpm-v too. It's still not fully working (inference works, but the llava-uhd preprocessor is still missing). Will have a look at your implementation of mllama very soon.
ngxson commented Jan 22, 2025
So, minicpm-v template is more complicated because it contains bot the image and all the slices. Here is what it looks like in
To get rid of this complication, my idea is to have the embeddings of these tokens ( This will make this formatting transparent to the text tokenizer, but will require embeddings of these tokens to be stored as one-hot vectors in the vision model (of course we can use |
ngxson commented Jan 23, 2025
Ok so I managed to get minicpm-v kinda working out of the box with the API (no changes to user-space code are required). Upon giving it the Win XP wallpaper Bliss, it says:

It currently operates with a resized version of the image (like llava), so the performance will be bad for bigger images (with more details). I'll get llava-uhd to work, which breaks the image into slices and thus allows the LLM to "see" the image at different zoom levels, preserving details.
ngxson commented Mar 1, 2025 • edited
Try another image / format / resolution. I'd recommend you pinpoint the problem on your side first, to avoid spamming this thread with too much data. And again, nothing is guaranteed to work. This is a WIP. (I hid your comments because they take too much space and make the thread hard for me to follow.)
AIWintermuteAI commented Mar 1, 2025
Sure, no worries! I'll use collapsible text next time I need to post large logs, thanks for the reminder. I'll try testing with some more images and I guess see what can be done about …
AIWintermuteAI commented Mar 1, 2025 • edited
Update: … Then everything works!
ngxson commented Mar 1, 2025 • edited
OK so Phi-4-multimodal-instruct is a bit more messy. Traditional vision models are simple: just 2 separate transformers, one for the vision encoder and one for the language decoder. However, on Phi-4, embedding data from the vision/audio encoder must also be processed using a dedicated LoRA adapter applied on top of the language decoder.

Very technical details

Normal vision models:

```mermaid
flowchart TD
  image --> vision_transformer
  vision_transformer[[vision_transformer]] --> embd_input
  text_input --> embd_input
  embd_input --> text_transformer[[text_transformer]]
  text_transformer --> text_output
```

Phi-4 multimodal:

```mermaid
flowchart TD
  image --> vision_transformer[[vision_transformer]]
  vision_transformer --> embd_input
  audio --> audio_transformer[[audio_transformer]]
  audio_transformer --> embd_input
  text_input --> embd_input
  embd_input --> text_transformer
  subgraph text_transformer
    vision_LoRA[[vision_LoRA]]
    audio_LoRA[[audio_LoRA]]
    base_model[[base_model]]
  end
  text_transformer --> text_output
```

Diagram from the paper:

For now, I've been able to convert only the text/language part. Turns out, it's just a simple Phi-4-mini-instruct under the hood, so nothing interesting. This is also mentioned in the paper:
Update: the LoRA part is very complicated to implement right now, so it will be left for dedicated research/a PR in the future.

revert Phi-4-mm since we cannot support LoRA for now, too complicated
lucasjinreal commented Mar 9, 2025
Hello, is the qwen2.5 vl conversion script from raw safetensors into GGUF supported now? Also curious what the standard way to support a new model in convert_hf_to_gguf.py is; it looks a little bit tricky, needing to handle very specific tensor names in various model architectures.
matheusfrancisco commented Apr 2, 2025 • edited
Is there any intention to support unsloth/Llama-3.2-11B-Vision-Instruct?
Fix #8010
Supersede #9687
Important
Please do NOT upload GGUF files produced via this PR to the internet. Other people won't know how to use them and they will complain.
Then,
Goals of this PR:
- `llama_vision` API
- equivalent of the `Processor` class on the HF library

Things that will be done in follow-up PRs: