
Conversation


@ngdxzy commented Dec 12, 2025

Description:

This PR implements true Q8_0 quantization for the Hexagon NPU backend, building on and integrating substantial previous work. The goal is to align the backend's behavior with the CPU implementation in llama.cpp and to improve the numerical accuracy of mixed-precision matmul operations.

Background:

In the current Hexagon NPU pipeline, quantization is performed on-the-fly during matrix multiplication, where FP32 activations are quantized and multiplied with already quantized weights. As a result, the quantization group size directly impacts the numerical behavior of these mixed-precision matmul operations, making alignment with the CPU Q8_0 scheme particularly important for correctness.

Previously, the Hexagon backend only supported quantize_block_fp32_q8x4, which uses a group size of 128. While functional, this does not match the standard Q8_0 definition used by the CPU backend in llama.cpp, leading to accuracy differences.
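
For context, the reference Q8_0 scheme on the CPU works on blocks of 32 floats: the per-block scale is max(|x|) / 127 and each value is rounded to an int8. Below is a minimal scalar sketch of that behavior; block_q8_0_sketch and quantize_block_q8_0_sketch are illustrative names rather than identifiers from this PR, the real format stores the scale as fp16, and the actual Hexagon kernels are vectorized HVX code.

// Minimal scalar sketch of the Q8_0 reference scheme (group size 32).
// Illustrative only: names below are not from this PR, and the real format
// stores the scale as fp16 rather than float.
#include <math.h>
#include <stdint.h>

#define QK8_0 32

typedef struct {
    float  d;            // per-block scale
    int8_t qs[QK8_0];    // quantized values in [-127, 127]
} block_q8_0_sketch;

static void quantize_block_q8_0_sketch(const float * x, block_q8_0_sketch * y) {
    float amax = 0.0f;                         // max(|x[i]|) over the block
    for (int i = 0; i < QK8_0; i++) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    const float d  = amax / 127.0f;            // largest magnitude maps to +/-127
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    y->d = d;
    for (int i = 0; i < QK8_0; i++) {
        y->qs[i] = (int8_t) roundf(x[i] * id);
    }
}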

What's new:

  1. Implemented true Q8_0 quantization kernels with smaller group sizes:
  • quantize_block_fp32_q8x1 with group size 32
  • quantize_block_fp32_q8x2 with group size 64
  2. Retained the original quantize_block_fp32_q8x4 implementation (group size 128) for compatibility and performance comparisons.
  3. Introduced a function-pointer-based dispatch mechanism to select the Q8 quantization kernel at runtime (a hypothetical sketch follows this list).
  • Enables dynamic switching between q8x1 / q8x2 / q8x4 without code duplication.
  • Facilitates future debugging, validation, and accuracy/performance trade-off studies.
  • Allows easier experimentation with different group sizes while keeping the call sites unchanged.
  4. Aligned scale computation and quantization behavior with the CPU Q8_0 implementation in llama.cpp.
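
A hypothetical sketch of the dispatch idea is shown below. The kernel names match those listed above, but the typedef, signatures, and selection logic are assumptions for illustration, not the PR's actual code.

// Hypothetical function-pointer dispatch over the three Q8 kernels.
// Signatures are assumed for illustration; only the kernel names come from the PR.
typedef void (*quantize_block_fn)(const float * src, void * dst);

void quantize_block_fp32_q8x1(const float * src, void * dst);  // group size 32
void quantize_block_fp32_q8x2(const float * src, void * dst);  // group size 64
void quantize_block_fp32_q8x4(const float * src, void * dst);  // group size 128

// Pick a kernel once at setup time; call sites go through the pointer,
// so switching the group size does not require touching the matmul code.
static quantize_block_fn select_q8_kernel(int group_size) {
    switch (group_size) {
        case 32: return quantize_block_fp32_q8x1;
        case 64: return quantize_block_fp32_q8x2;
        default: return quantize_block_fp32_q8x4;
    }
}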

Why this matters:

  • Aligns Hexagon NPU Q8_0 quantization with the CPU implementation in llama.cpp
  • Improves quantization accuracy by using smaller group sizes
  • Reduces numerical discrepancies between CPU and NPU backends
  • Preserves the original q8x4 path for performance-oriented use cases
  • Validated on the K projection of layer 0 in the Qwen3-0.6B model, showing a reduction of over 35% in relative L2 error with no observable performance regression.

Summary:

  • quantize_block_fp32_q8x1 → group size 32
  • quantize_block_fp32_q8x2 → group size 64
  • quantize_block_fp32_q8x4 → group size 128

This change aligns the Hexagon NPU backend with the true Q8_0 quantization scheme used on CPU, improving correctness while retaining flexibility for performance tuning and future experimentation.

@ngdxzy changed the title from "ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths)" to "ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU (add q8x1 / q8x2 paths) for more accurate mixed-precision matmul operations" on Dec 12, 2025
@ngdxzy changed the title to "ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations" on Dec 12, 2025
@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Dec 13, 2025
quantize_fp32_q8x4x2(&octx->src1, octx->src1_spad.data, &octx->src0_spad, n, i, octx->src1_nrows_per_thread);
// quantize_block_fp32_q8x4: use group size 128: tested on Qwen3:0.6B k proj layer 0 on 256 tokens, Relative L2 1.7%
// quantize_block_fp32_q8x2: use group size 64 : Relative L2 1.3%
// quantize_block_fp32_q8x1: use group size 32 : Relative L2 1.1%
Contributor

Sorry to pop up with a possibly off-topic question:
Is there a convenient way to test op-level L2 divergence in the current codebase? I'm aware of llama-perplexity, but that seems to be a model-level metric rather than something you can use for a single op.


relative L2 error, rather than absolute value, I guess.
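
For clarity, the metric quoted in this thread is presumably ||y - y_ref||_2 / ||y_ref||_2 computed over the op's output against a CPU reference. The helper below is an illustrative sketch of that definition, not code from the PR or from llama.cpp.

// Illustrative helper (not from the PR): relative L2 error of a backend output
// y against a reference output y_ref, i.e. ||y - y_ref||_2 / ||y_ref||_2.
#include <math.h>
#include <stddef.h>

static double relative_l2_error(const float * y, const float * y_ref, size_t n) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double diff = (double) y[i] - (double) y_ref[i];
        num += diff * diff;
        den += (double) y_ref[i] * (double) y_ref[i];
    }
    return den > 0.0 ? sqrt(num / den) : 0.0;
}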

Author

I think you make a good point. We performed perplexity analysis using llama-perplexity on the Qwen3-0.6B-Q4_0 model with the wikitext2-test-split dataset. The context length was set to 512, which is the default setting in llama-perplexity. The results are summarized below:

  • Current llama.cpp CPU reference
    (Q4_0 × FP32 → Q4_0 × Q8_0):
    23.9779 ± 0.20737

  • Current llama.cpp hexagon implementation
    (Q4_0 × FP32 → Q4_0 × Q8x4):
    24.0817 ± 0.20856

  • This PR on hexagon (Q4x1 / true Q8_0 path)
    (Q4_0 × FP32 → Q4_0 × Q8_0):
    23.9960 ± 0.20763

In addition, we observed an interesting timing result. The current implementation takes 31.97 minutes to complete the perplexity benchmark, while this PR completes it in 31.38 minutes. There may be some measurement noise here, but at the very least, this suggests that the PR does not introduce a performance regression while improving numerical alignment with the CPU reference.
