tuned fused configs for B300 #30629
Conversation
navmarri14 commented Dec 14, 2025 • edited by github-actions bot
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the appropriate label to the PR or trigger the run themselves. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Code Review
This pull request introduces tuned fused MoE kernel configurations for the GLM-4.6 MoE architecture on NVIDIA B300 GPUs with FP8 quantization. These configurations for different tensor parallelism sizes (2, 4, 8) will help improve startup times by avoiding the need for JIT tuning. The changes are well-described and the method for generating the configurations is clearly documented.
My main feedback is a minor issue across all three new configuration files: the specified Triton version 3.5.1 seems to be incorrect, as it does not correspond to a public Triton release. I've left comments with suggestions to correct or remove this for clarity and maintainability.
@@ -0,0 +1,147 @@
{
  "triton_version": "3.5.1",
The specified Triton version 3.5.1 appears to be incorrect, as there is no public release with this version number. The latest public release of Triton is 3.0.0. This could be misleading for future developers. Please either correct it to the version you used for tuning or remove this line. Since this field is popped from the config before use, removing it would have no functional impact.
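For context on the "popped from the config before use" remark, here is a minimal sketch of a loader that treats triton_version as provenance metadata only. This is illustrative, not vLLM's actual loading code; the function name is made up for the example.

```python
import json

def load_fused_moe_config(path: str) -> dict:
    """Load a tuned fused-MoE config file and strip metadata fields.

    Illustrative sketch only; vLLM's real loader may differ in details.
    """
    with open(path) as f:
        config = json.load(f)
    # The recorded Triton version documents how the file was generated and is
    # removed before the per-batch-size kernel parameters are consumed.
    config.pop("triton_version", None)
    return config
```

Because the field is discarded at load time, correcting or removing it only affects readers of the file, not kernel selection.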
jeejeelee commented Dec 14, 2025
cc @mgoin
ApostaC commented Dec 16, 2025
cc @mgoin
Purpose
This PR adds a tuned fused MoE kernel configuration for the GLM-4.6 MoE architecture on NVIDIA B300 GPUs using FP8 quantization.
Specifically, it targets the configuration:
Experts (E): 160
Sharded size N=192 for TP=8, N=384 for TP=4, N=768 for TP=2
Device: NVIDIA B300
Dtype: fp8_w8a8
Previously, vLLM lacked a static configuration for these shapes on B300, causing it to fall back to heuristics or require JIT tuning during startup. These configs improve startup time and ensure optimal kernel parameters are used for GLM-4 variants when running with tensor_parallel_size=2, 4, or 8.
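For readers unfamiliar with these files, the sketch below illustrates the filename scheme and the per-batch-size structure of a tuned fused-MoE config. The device-name string and all numeric values are placeholders chosen for illustration, not the tuned parameters added by this PR.

```python
# Illustrative only: filename pattern and structure of a tuned fused-MoE config.
# The device-name string and every numeric value below are placeholders.
example_filename = "E=160,N=192,device_name=NVIDIA_B300,dtype=fp8_w8a8.json"  # TP=8 shard

example_config = {
    "triton_version": "3.5.1",  # provenance metadata written by the tuning script
    # Keys are batch sizes (M); values are Triton launch parameters for that M.
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4},
}
```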
Test Plan
Generation:
The configurations were generated using the official benchmark script on an 8x B300 node, once for each tensor-parallel size below (an illustrative invocation is sketched after this list):
TP=2
TP=4
TP=8
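The exact commands are collapsed in the PR, so the loop below is only an illustrative reconstruction of how vLLM's fused-MoE tuning script (benchmarks/kernels/benchmark_moe.py) is typically invoked; the model id and flag values are assumptions, not a copy of the author's commands.

```python
# Illustrative reconstruction, not the author's actual commands.
# Flag names follow common usage of benchmarks/kernels/benchmark_moe.py and the
# model id is an assumption; adjust both to your environment.
import subprocess

for tp in (2, 4, 8):
    subprocess.run(
        [
            "python", "benchmarks/kernels/benchmark_moe.py",
            "--model", "zai-org/GLM-4.6",  # assumed GLM-4.6 checkpoint id
            "--tp-size", str(tp),
            "--dtype", "fp8_w8a8",
            "--tune",
        ],
        check=True,
    )
```

Each run emits one JSON config (N=768, 384, and 192 for TP=2, 4, and 8 respectively), matching the three files added in this PR.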