scripts: add script to compare logprobs of llama.cpp against other frameworks #17947
base: master
Conversation
| "top_k": 1, | ||
| "max_tokens": 1, | ||
| "logprobs": 1, | ||
| "stream": False, |
Just a note that I'm no longer using the echo option, because support for it is hit-or-miss across frameworks.
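For context, a rough sketch of what such a request could look like against an OpenAI-compatible completions endpoint (the URL, model id, prompt, and response-field layout below are assumptions for illustration, not taken from the PR's script):

```python
# Minimal sketch (not the PR's actual script): query next-token logprobs from an
# OpenAI-compatible /v1/completions endpoint. URL, model id and prompt are placeholders.
import requests

payload = {
    "model": "my-model",              # hypothetical model id
    "prompt": "The quick brown fox",
    "top_k": 1,
    "max_tokens": 1,
    "logprobs": 1,
    "stream": False,
}

resp = requests.post("http://localhost:8080/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
logprobs = resp.json()["choices"][0]["logprobs"]
# OpenAI-style completions report the sampled token and its logprob here; the exact
# field layout may differ slightly between servers.
print(logprobs["tokens"][0], logprobs["token_logprobs"][0])
```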
pwilkin commented Dec 12, 2025
If this is going to be a general tool, I'd drop the prompt fragment about the tool call syntax, especially since we're not providing any tools.
ngxson commented Dec 12, 2025
Tools and chat templates are not handled here; we are using the raw completions endpoint. The goal is a bit like the perplexity test, where we don't actually care about the input text, we only care about the logits of the next predicted token.
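To illustrate the idea, here is a hedged sketch (not the PR's implementation) of comparing the next-token logprob reported by two servers for the same prompt; the URLs, ports, and response parsing are assumptions:

```python
# Sketch of the comparison idea (not the PR's actual code): send the same prompt to
# two OpenAI-compatible servers and compare the next-token logprobs they report.
import requests

def next_token_logprob(base_url: str, prompt: str) -> tuple[str, float]:
    payload = {"prompt": prompt, "top_k": 1, "max_tokens": 1, "logprobs": 1, "stream": False}
    r = requests.post(f"{base_url}/v1/completions", json=payload, timeout=60)
    r.raise_for_status()
    lp = r.json()["choices"][0]["logprobs"]
    return lp["tokens"][0], lp["token_logprobs"][0]

prompt = "The capital of France is"
tok_a, lp_a = next_token_logprob("http://localhost:8080", prompt)  # e.g. llama.cpp server
tok_b, lp_b = next_token_logprob("http://localhost:8000", prompt)  # e.g. another framework
print(f"{tok_a!r} {lp_a:.4f} vs {tok_b!r} {lp_b:.4f}  (diff {abs(lp_a - lp_b):.4f})")
```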
ggerganov left a comment • edited
Tangentially, I'm thinking about whether we can prepare a script that automates the comparison for a set of the most prominent models:
- Download latest HF repo of original model
- Convert to GGUF BF16
- Compare logprobs, generate report
- Delete models
- Goto 1. for next model
The main issue is to get hardware to run this periodically.
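A bare-bones sketch of what that loop could look like (the model list, output paths, and the compare script's CLI below are placeholders, not an agreed design; convert_hf_to_gguf.py is llama.cpp's converter):

```python
# Bare-bones sketch of the proposed loop; model list, paths and the compare step's
# CLI are placeholders.
import shutil
import subprocess
from huggingface_hub import snapshot_download

MODELS = ["Qwen/Qwen2.5-1.5B-Instruct"]  # example entry, not a decided set

for repo_id in MODELS:
    local_dir = f"/tmp/{repo_id.replace('/', '_')}"
    snapshot_download(repo_id, local_dir=local_dir)                 # 1. download latest HF repo
    gguf_path = f"{local_dir}-bf16.gguf"
    subprocess.run(["python", "convert_hf_to_gguf.py", local_dir,   # 2. convert to BF16 GGUF
                    "--outtype", "bf16", "--outfile", gguf_path], check=True)
    subprocess.run(["python", "scripts/compare_logprobs.py",        # 3. compare + report (hypothetical CLI)
                    "--gguf", gguf_path, "--hf", repo_id], check=True)
    shutil.rmtree(local_dir, ignore_errors=True)                    # 4. delete, continue with next model
```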
One more thing - this roughly tests the logprobs at a mostly empty context. It would be useful to be able to run the comparison at larger depth - f.ex. logprobs for 100 tokens at 32k tokens depth. The reason is that some models change their parameters at larger depths, and we should be able to test that too.
pwilkin commented Dec 12, 2025
@ggerganov The scripts by @danbev in

As for the second task, I've been thinking of the same thing recently; in fact, I've already started working on a branch for that as well. Since the

The biggest issue though is that after processing 20k+ tokens the logits will almost never match exactly - the slight numerical differences will accumulate too much to make this test reasonable (I mean, even the fact that we're doing norm calcs in F16 in CUDA while Transformers does them in F32 will probably be enough). Unless we somehow make an exact copy of the KV cache and only process the tokens, I don't think we can reliably get a reasonable result from that.
ngxson commented Dec 12, 2025
I think we could already somewhat automate this workflow by deploying everything to HF Inference Endpoints, which supports both vLLM and llama.cpp. The main downside is that we can only do this with publicly available weights, and it can be quite tricky to test a specific PR (since HFE only accepts publicly available docker images).

But still, I think we can already test some of the existing models (lmk which ones you think we should test). Will modify the script to pick N tokens at a deeper context length (unfortunately we cannot count exactly by tokens, because the runtime may not expose an API for counting them).
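A hedged sketch of how that deeper-context variant could work: approximate the depth in characters (since we may not be able to count tokens through the API), then score N consecutive positions while extending the context with one server's greedy pick so both servers always see the same prefix. The ~4 chars/token heuristic, the filler text, and the ports are all assumptions.

```python
# Sketch of a deep-context comparison; the chars/token heuristic, filler text and
# ports are assumptions, not part of the PR.
import requests

CHARS_PER_TOKEN = 4            # crude approximation, since we cannot count tokens via the API
TARGET_DEPTH_TOKENS = 32_000
N_MEASURE = 100

filler = "lorem ipsum dolor sit amet "
target_chars = TARGET_DEPTH_TOKENS * CHARS_PER_TOKEN
prompt = (filler * (target_chars // len(filler) + 1))[:target_chars]

def next_token_logprob(base_url: str, text: str) -> tuple[str, float]:
    payload = {"prompt": text, "top_k": 1, "max_tokens": 1, "logprobs": 1, "stream": False}
    r = requests.post(f"{base_url}/v1/completions", json=payload, timeout=600)
    r.raise_for_status()
    lp = r.json()["choices"][0]["logprobs"]
    return lp["tokens"][0], lp["token_logprobs"][0]

# Extend the prompt with llama.cpp's greedy pick so both servers always score the
# exact same context, then compare the logprobs position by position.
for _ in range(N_MEASURE):
    tok_a, lp_a = next_token_logprob("http://localhost:8080", prompt)   # llama.cpp (assumed port)
    _tok_b, lp_b = next_token_logprob("http://localhost:8000", prompt)  # other framework (assumed port)
    print(f"{tok_a!r:>20} {lp_a:9.4f} {lp_b:9.4f}  diff={abs(lp_a - lp_b):.4f}")
    prompt += tok_a
```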
ggerganov commented Dec 13, 2025
@ngxson I'll see if I can make a setup on my DGX Spark - need to learn how to run vllm though.

@pwilkin There are various advantages to this comparison:

The logprobs actually align quite well at long contexts, since the result for token N is no longer a function of the result for token N-1, and hence there is no accumulation of numerical differences.