@ngxson (Collaborator) commented Dec 12, 2025

The goal of this PR is to allow more audio pre-processing mechanisms to be added to mtmd.

While the code is not very clean, this should already allow:


Key points

  • Each model's preprocessor now has its own subclass derived from mtmd_audio_preprocessor
  • A preprocessor can access hparams directly (to read audio params like n_mel, n_fft, etc.)
  • Each preprocessor also has its own initialize() function, which is called on model load to initialize global cache entries such as the sin/cos tables and the Hann window
  • The filter bank is now constructed dynamically thanks to @tdakhran's implementation of fill_mel_filterbank_matrix (the hard-coded values have been removed)
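
To make the key points above concrete, here is a rough sketch of that structure. It is not the code from this PR: every type and function name in it (audio_hparams_sketch, audio_preprocessor_sketch, whisper_like_preprocessor_sketch, hz_to_mel, mel_to_hz) is hypothetical, and the filter bank uses a plain HTK-style triangular construction as an assumption rather than the actual fill_mel_filterbank_matrix, whose signature is not shown in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the audio-related hparams (n_mel, n_fft, ...).
struct audio_hparams_sketch {
    int n_mel       = 128;
    int n_fft       = 400;
    int sample_rate = 16000;
};

// Hypothetical base class; the real mtmd_audio_preprocessor may differ.
struct audio_preprocessor_sketch {
    const audio_hparams_sketch & hparams; // read directly by subclasses

    explicit audio_preprocessor_sketch(const audio_hparams_sketch & hp) : hparams(hp) {}
    virtual ~audio_preprocessor_sketch() = default;

    // Called once on model load to fill per-model caches.
    virtual void initialize() = 0;
};

// HTK-style mel conversions (an assumption; other mel conventions exist).
static double hz_to_mel(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }
static double mel_to_hz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

struct whisper_like_preprocessor_sketch : audio_preprocessor_sketch {
    std::vector<float> hann;    // n_fft window coefficients
    std::vector<float> filters; // row-major [n_mel][n_fft / 2 + 1]

    using audio_preprocessor_sketch::audio_preprocessor_sketch;

    void initialize() override {
        const int n_fft = hparams.n_fft;
        const int n_mel = hparams.n_mel;
        assert(n_fft > 1 && n_mel > 0);

        // Periodic Hann window: 0.5 * (1 - cos(2*pi*i / n_fft)).
        const double pi = std::acos(-1.0);
        hann.resize(n_fft);
        for (int i = 0; i < n_fft; i++) {
            hann[i] = (float) (0.5 * (1.0 - std::cos(2.0 * pi * i / n_fft)));
        }

        // Filter edges: n_mel + 2 points evenly spaced on the mel scale
        // between 0 Hz and the Nyquist frequency, converted back to Hz.
        const int    n_bins  = n_fft / 2 + 1;
        const double mel_max = hz_to_mel(hparams.sample_rate / 2.0);
        std::vector<double> edges(n_mel + 2);
        for (int m = 0; m < n_mel + 2; m++) {
            edges[m] = mel_to_hz(mel_max * m / (n_mel + 1));
        }

        // Triangular filters: rise from lo to mid, fall from mid to hi.
        filters.assign((size_t) n_mel * n_bins, 0.0f);
        for (int m = 0; m < n_mel; m++) {
            const double lo = edges[m], mid = edges[m + 1], hi = edges[m + 2];
            for (int k = 0; k < n_bins; k++) {
                const double f    = (double) k * hparams.sample_rate / n_fft;
                const double up   = (f - lo) / (mid - lo);
                const double down = (hi - f) / (hi - mid);
                const double w    = std::fmin(up, down);
                if (w > 0.0) {
                    filters[(size_t) m * n_bins + k] = (float) w;
                }
            }
        }
    }
};
```

In this sketch a loader would construct the subclass that matches the model, call initialize() once, and then reuse hann and filters for every audio chunk, so nothing needs to be hard-coded per model.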

@ngxson (Collaborator, Author) commented Dec 12, 2025

Hmm, I think I can also upstream some changes from #17694, which would make your PR a bit shorter, @tdakhran

I will remove the pre-calculated filters and replace them with your version

Edit: since my goal is to implement conformer, I think I will end up copying a lot of code and refactoring it along the way

@ngxson marked this pull request as draft December 12, 2025 23:24
@ngxson marked this pull request as ready for review December 13, 2025 14:06
@ngxson (Collaborator, Author) commented Dec 13, 2025

@ggerganov This is ready for review. I only have basic knowledge of signal/audio processing, so I would appreciate it if you could take a deeper look to check that things are still correct compared to the original code from whisper.cpp

Note: this change also contains enough code for the LFM2-audio and gemma 3n audio preprocessors

Test results:

[audio] OK: ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio] OK: ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio] OK: ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
