mtmd: refactor audio preprocessing #17978

ngxson · 2025-12-12T22:58:19Z

The goal of this PR is to allow more audio pre-processing mechanisms to be added to mtmd.

While the code is not very clean yet, this should already make it possible to:

Simplify "model : add ASR support for LFM2-Audio-1.5B" (#17694)
Potentially support gemma 3n audio

Key points

Each model's preprocessor now has its own subclass derived from mtmd_audio_preprocessor
A preprocessor can access hparams directly (to read audio params like n_mel, n_fft, etc.)
Each preprocessor also has its own initialize() function, which is called on model load to initialize global cache entries such as the sin/cos tables and the Hann window
The filter bank is now constructed dynamically thanks to @tdakhran's implementation of fill_mel_filterbank_matrix (the hard-coded values are now removed)
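The structure described in the key points can be sketched roughly as follows. This is only an illustration, not the code from this PR: `audio_hparams_sketch`, `audio_preprocessor_sketch`, `whisper_like_preprocessor_sketch` and `window_frame` are hypothetical names, and the real mtmd_audio_preprocessor interface may look different. The sketch assumes a periodic Hann window and sin/cos tables of length n_fft.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the audio-related fields of the model hparams.
struct audio_hparams_sketch {
    int n_fft = 400;
    int n_mel = 128;
};

// Hypothetical base class; the real mtmd_audio_preprocessor may differ.
struct audio_preprocessor_sketch {
    const audio_hparams_sketch & hparams; // read directly by subclasses

    explicit audio_preprocessor_sketch(const audio_hparams_sketch & hp) : hparams(hp) {}
    virtual ~audio_preprocessor_sketch() = default;

    // Meant to be called once on model load to fill the caches.
    virtual void initialize() = 0;

    // Applies the cached window to one frame starting at `offset`.
    virtual std::vector<float> window_frame(const std::vector<float> & samples,
                                            size_t offset) const = 0;
};

// One subclass per model family (hypothetical example).
struct whisper_like_preprocessor_sketch : audio_preprocessor_sketch {
    std::vector<float> hann;     // cached Hann window
    std::vector<float> sin_vals; // cached sin table
    std::vector<float> cos_vals; // cached cos table

    using audio_preprocessor_sketch::audio_preprocessor_sketch;

    void initialize() override {
        const int n = hparams.n_fft;
        assert(n > 0);
        const double two_pi = 2.0 * std::acos(-1.0);
        hann.resize(n);
        sin_vals.resize(n);
        cos_vals.resize(n);
        for (int i = 0; i < n; i++) {
            const double theta = two_pi * i / n;
            hann[i]     = (float) (0.5 * (1.0 - std::cos(theta))); // periodic Hann
            sin_vals[i] = (float) std::sin(theta);
            cos_vals[i] = (float) std::cos(theta);
        }
    }

    std::vector<float> window_frame(const std::vector<float> & samples,
                                    size_t offset) const override {
        // Samples past the end of the input are treated as zeros.
        std::vector<float> out(hann.size(), 0.0f);
        for (size_t i = 0; i < out.size() && offset + i < samples.size(); i++) {
            out[i] = samples[offset + i] * hann[i];
        }
        return out;
    }
};
```

In this sketch the caller constructs the subclass with the hparams, calls initialize() once, and then reuses the cached tables for every frame; note that the hparams object must outlive the preprocessor because it is held by reference.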

ngxson · 2025-12-12T23:23:53Z

Hmm, I think I can also upstream some changes from #17694, which would make your PR a bit shorter, @tdakhran

I will remove the pre-calculated filters and replace them with your version

Edit: since my goal is to implement the conformer, I think I will end up copying a lot of code and refactoring it along the way
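For readers unfamiliar with what replacing pre-calculated filters with a dynamically built filter bank means, here is a minimal sketch of building triangular mel filters at load time. It is not fill_mel_filterbank_matrix from #17694: the function name `build_mel_filters_sketch`, the HTK-style mel formula, and the lack of any area normalization are all assumptions made for illustration, and the real implementation may use a different mel scale or normalization.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// HTK-style mel scale (an assumption; other variants exist).
static double hz_to_mel(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }
static double mel_to_hz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

// Hypothetical helper: returns a row-major [n_mel][n_fft/2 + 1] matrix of
// unnormalized triangular filters covering 0 Hz up to sample_rate / 2.
static std::vector<float> build_mel_filters_sketch(int n_mel, int n_fft, int sample_rate) {
    assert(n_mel > 0 && n_fft > 0 && sample_rate > 0);
    const int n_bins = n_fft / 2 + 1;
    const double mel_max = hz_to_mel(sample_rate / 2.0);

    // n_mel + 2 edge frequencies, evenly spaced on the mel scale.
    std::vector<double> edges(n_mel + 2);
    for (int i = 0; i < n_mel + 2; i++) {
        edges[i] = mel_to_hz(mel_max * i / (n_mel + 1));
    }

    std::vector<float> filters((size_t) n_mel * n_bins, 0.0f);
    for (int m = 0; m < n_mel; m++) {
        const double lo  = edges[m];
        const double mid = edges[m + 1];
        const double hi  = edges[m + 2];
        for (int k = 0; k < n_bins; k++) {
            const double f    = (double) k * sample_rate / n_fft; // bin center in Hz
            const double up   = (f - lo) / (mid - lo);            // rising slope
            const double down = (hi - f) / (hi - mid);            // falling slope
            const double w    = std::fmax(0.0, std::fmin(up, down));
            filters[(size_t) m * n_bins + k] = (float) w;
        }
    }
    return filters;
}
```

Because the matrix depends only on n_mel, n_fft and the sample rate, it can be computed once per model and cached instead of being shipped as a hard-coded table.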

Co-authored-by: Tarek <[email protected]>

ngxson · 2025-12-13T14:18:02Z

@ggerganov This is ready for review. I only have basic knowledge of signal/audio processing, so I would appreciate it if you could take a deeper look to check that things are still correct compared to the original code from whisper.cpp

Note: this change also contains enough code for the LFM2-audio and gemma 3n audio preprocessors

Test results:

[audio] OK: ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio] OK: ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio] OK: ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M

mtmd: refactor audio preprocessing
7cc2cf1

ngxson requested a review from ggerganov December 12, 2025 22:58

ngxson marked this pull request as draft December 12, 2025 23:24

github-actions bot added the examples label Dec 13, 2025

ngxson and others added 5 commits December 13, 2025 13:28

refactor
9e2cd84
Co-authored-by: Tarek <[email protected]>

wip
cea1a90

wip (2)
7b578b5

improve constructor
4b63e61

fix use_natural_log
93290e5

ngxson marked this pull request as ready for review December 13, 2025 14:06

fix padding for short input
1aaec3b
