Conversation

@navmarri14 commented Dec 14, 2025

Purpose

This PR adds a tuned fused MoE kernel configuration for the GLM-4.6 MoE architecture on NVIDIA B300 GPUs using FP8 quantization.

Specifically, it targets the configuration:

Experts (E): 160
Sharded size (N): 192 for TP=8, 384 for TP=4, 768 for TP=2
Device: NVIDIA B300
Dtype: fp8_w8a8
Previously, vLLM lacked a static configuration for these shapes on B300, causing it to fall back to heuristics or require JIT tuning during startup. This config improves startup time and ensures the tuned kernel parameters are used for GLM-4 variants when running with tensor_parallel_size=2, 4, or 8.
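
For reference, the shapes above map onto one tuned config file per TP size. A minimal sketch of the filename convention these files typically follow (the helper function below is hypothetical, for illustration only; the device_name string is an assumption):

```python
# Illustrative sketch of the tuned-config filename convention; the helper name
# is hypothetical, only the E/N/device_name/dtype layout reflects this PR.
def moe_config_file_name(E: int, N: int, device_name: str, dtype: str) -> str:
    return f"E={E},N={N},device_name={device_name},dtype={dtype}.json"

# N is the sharded size, so it shrinks as the tensor-parallel degree grows.
for tp, n in {2: 768, 4: 384, 8: 192}.items():
    print(f"TP={tp}: {moe_config_file_name(160, n, 'NVIDIA_B300', 'fp8_w8a8')}")
```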

Test Plan
Generation:
The configuration was generated using the official benchmark script on an 8x B300 node:

TP=2

```bash
python benchmarks/kernels/benchmark_moe.py \
    --model /path/to/ZhipuAI/GLM-4.6-FP8 \
    --dtype fp8_w8a8 \
    --tp-size 2 \
    --tune \
    --trust-remote-code \
    --save-dir ./configs
```

TP=4

```bash
python benchmarks/kernels/benchmark_moe.py \
    --model /path/to/ZhipuAI/GLM-4.6-FP8 \
    --dtype fp8_w8a8 \
    --tp-size 4 \
    --tune \
    --trust-remote-code \
    --save-dir ./configs
```

TP=8

```bash
python benchmarks/kernels/benchmark_moe.py \
    --model /path/to/ZhipuAI/GLM-4.6-FP8 \
    --dtype fp8_w8a8 \
    --tp-size 8 \
    --tune \
    --trust-remote-code \
    --save-dir ./configs
```
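
Each run writes a JSON file into --save-dir keyed by batch size M, holding the best Triton launch parameters found for that M. A minimal sketch of the expected layout, shown as a Python dict with placeholder values (parameter names follow the usual fused-MoE tuning output; the real tuned values are in the files added by this PR):

```python
# Placeholder illustration of one generated config file's layout; the values
# below are examples only, not the tuned results from this PR.
example_tuned_config = {
    "triton_version": "3.5.1",  # metadata recorded by the tuning run
    "1": {                      # batch size M = 1
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 64,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1,
        "num_warps": 4,
        "num_stages": 3,
    },
    # ... one entry per benchmarked batch size ...
}
```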

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces tuned fused MoE kernel configurations for the GLM-4.6 MoE architecture on NVIDIA B300 GPUs with FP8 quantization. These configurations for different tensor parallelism sizes (2, 4, 8) will help improve startup times by avoiding the need for JIT tuning. The changes are well-described and the method for generating the configurations is clearly documented.

My main feedback is a minor issue across all three new configuration files: the specified Triton version 3.5.1 seems to be incorrect, as it does not correspond to a public Triton release. I've left comments with suggestions to correct or remove this for clarity and maintainability.

@@ -0,0 +1,147 @@
{
"triton_version": "3.5.1",

Severity: high

The specified Triton version 3.5.1 appears to be incorrect, as there is no public release with this version number. The latest public release of Triton is 3.0.0. This could be misleading for future developers. Please either correct it to the version you used for tuning or remove this line. Since this field is popped from the config before use, removing it would have no functional impact.

The same comment was left on the other two new configuration files.
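
For context, a minimal sketch of the pattern the review refers to ("this field is popped from the config before use"): metadata such as triton_version is dropped from the loaded JSON before the per-batch-size entries are consumed. The function name is illustrative and assumed, not vLLM's actual loader:

```python
import json

# Hedged sketch: strip metadata before using the per-batch-size kernel params.
def load_tuned_moe_config(path: str) -> dict[int, dict]:
    with open(path) as f:
        raw = json.load(f)
    raw.pop("triton_version", None)  # metadata only, not a kernel launch parameter
    return {int(m): params for m, params in raw.items()}
```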

@jeejeelee (Collaborator)

cc @mgoin

@ApostaC (Collaborator)

cc @mgoin
