Server: enable lookup decoding #6828
base: master
Conversation
JohannesGaessler commented Apr 22, 2024 • edited
phymbert commented Apr 22, 2024
Great work. As we discussed previously, the server's test coverage matters, and adding a new scenario in the test framework is mandatory.
Force-pushed from 87e5656 to 0ad7512
JohannesGaessler commented Apr 22, 2024
Are there already any tests that assert correctness for the server? I didn't see any, so as part of this implementation I would try to add some.
phymbert commented Apr 22, 2024
https://github.com/ggerganov/llama.cpp/tree/master/examples/server/tests
JohannesGaessler commented Apr 22, 2024
While writing tests I'm noticing that when using > 1 slots the results for a given seed are not consistent on master. @phymbert is this a known problem?
phymbert commented Apr 22, 2024
I was not aware, but this is not asserted in the parallel test suite AFAIK. Also, I recall that each architecture generates different results.
arnfaldur commented Apr 25, 2024 • edited
A research paper studying this exact technique was recently published and suggested for integration in issue #6813.
I haven't spent enough time reading
JohannesGaessler commented Apr 25, 2024
Thanks for the input. I saw the paper but didn't yet get around to reading it.
In practice my implementation also uses N-grams of varying sizes.
JohannesGaessler commented Apr 25, 2024
I forgot: if you want to play around with the llama.cpp implementation, take a look at
JamshedQurbonboev commented Apr 25, 2024
How much does this PR increase token generation speed? As far as I am aware, #5479 had a rather tiny speedup. And when do you think this PR will be ready to be merged?
JohannesGaessler commented Apr 25, 2024
It should be something like a 1.1-1.4x speedup for natural language with an essentially empty context. For source code or summarization it's going to be a lot more. The numbers in the OP are indicative of the long-term speedup using similar prompts once the dynamic cache fills up.
I'm aiming for the end of the week.
sorasoras commented Apr 26, 2024
Does this allow us to create a static cache during inference?
JohannesGaessler commented Apr 26, 2024
No, use
Force-pushed from 244508a to 0bf8277
JohannesGaessler commented Apr 28, 2024
Functionally I think everything is in order now. Unfortunately I think it's currently not possible to get bit-for-bit identical results with lookup decoding, since the results seem to change slightly when the batch size is varied, see #6950. For this reason there are no automated tests for lookup decoding that assert that the results do not change (because they do).
github-actions bot commented Apr 28, 2024 • edited
Force-pushed from ce22137 to 1d516d3
JohannesGaessler commented Apr 28, 2024 • edited
I'm very sorry, but it seems the numbers that I previously reported were incorrect. The speed I reported for "master" was actually the speed for this PR with --draft 0.
For Phi-2 on an RTX 4090 there is a regression relative to master because it is quite fast, so the constant overhead per token is too large relative to the speedup. I'll investigate performance for larger models/slower hardware.
JohannesGaessler commented Apr 28, 2024 • edited
Performance for LLaMA 3 70b on 3x RTX 4090 is looking much better.
Green-Sky commented Apr 29, 2024 • edited
Regarding performance, it seems your hashes for the lookup table are of low quality.
Edit: this is a world-class article on how to make a fast lookup table, including a pretty neat hash function: https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/
Edit 2: the way the hashes are combined in the ngram means that, of the 64 bits, only 32 bits have any entropy at all. A better hash would probably fix this, but hashes are often combined with an extra shift or another multiplication.
JohannesGaessler commented Apr 29, 2024
Thank you for the high-quality post. I definitely agree that the hashing is suboptimal; my main concern for now is to get something that works at all, and to also implement tests that assert this.
JohannesGaessler commented Apr 29, 2024 • edited
Prior to reading the hashing function blog post I wrote a simple implementation that just uses bit shifts and XORs, but that already results in much better performance.
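As a rough illustration, a shift/XOR combiner in that spirit might look like the sketch below. This is a minimal, self-contained sketch rather than the PR's exact code; the type names and the LLAMA_NGRAM_MAX value are assumptions made so the example compiles on its own.

```cpp
// Minimal sketch of a shift/XOR n-gram hash combiner in the spirit described
// above; type names and LLAMA_NGRAM_MAX are assumptions for this example.
#include <cstddef>
#include <cstdint>

using llama_token = int32_t;
constexpr int LLAMA_NGRAM_MAX = 4;

struct llama_ngram {
    llama_token tokens[LLAMA_NGRAM_MAX];
};

struct llama_ngram_hash_function {
    size_t operator()(const llama_ngram & ngram) const {
        size_t hash = ngram.tokens[0];
        for (int i = 1; i < LLAMA_NGRAM_MAX; ++i) {
            hash <<= 15;              // shift previously mixed-in tokens toward the high bits
            hash ^= ngram.tokens[i];  // XOR in the next token
        }
        return hash;
    }
};
```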
JamshedQurbonboev commented Apr 29, 2024
Thanks for improving the performance of llama.cpp. It seems that you were correct: lookup decoding improves speed but adds a constant overhead, so larger models benefit more from it. What does performance look like for 7-13b models on slower GPUs and CPU-only backends?
JohannesGaessler commented Apr 29, 2024
I think the model and prompt will be a bigger factor than the hardware as long as the hashing is fast enough. These are some numbers I get on my Epyc 7742 CPU with 8x 3200 MHz Micron DIMMs.
Note that the comparatively large speedups with LLaMA 3 70b are likely a product of heavy repetition since I am using the base model.
Force-pushed from 9684f4c to f009e40
JohannesGaessler commented May 2, 2024
I've added a test asserting that lookup decoding produces correct results. The sequences are the same for temperature 0, though the results are not going to be bit-for-bit identical. I've also investigated the performance for LLaMA 3 Instruct in more detail.
The speedup between LLaMA 3 Instruct 8b and 70b seems to be very similar. The current implementation is only faster for small numbers of slots, since there is comparatively less benefit to adding more tokens to the batch if you're already at 8 tokens per batch without any speculative decoding. Successive, similar runs with different seeds but a carried-over dynamic cache result in increasing performance over time; for a single slot the 8th run was ~1.2x faster than the first one.
From my side I would consider this PR ready to be merged if one last issue is resolved: whether n-gram lookup should be enabled or disabled by default. The default for the number of slots is 1, and for that case it is faster. However, due to the varying batch size it also causes nondeterministic results. I personally would tend more towards having n-gram lookup be disabled by default but do not have a strong opinion on it.
Green-Sky commented May 2, 2024
@JohannesGaessler can I convince you to quickly add an overload for std::hash<llama_token_t> and do a quick comparison? While the shift in the ngram hash shuffles the hash a bit, it is probably still pretty bad. Plus, this is a very small change.
JohannesGaessler commented May 2, 2024
I'm not sure what you mean by overload, but I'm happy to test suggested alternatives.
Green-Sky commented May 2, 2024
Try the following:

```diff
diff --git a/common/ngram-cache.h b/common/ngram-cache.h
index 6575ea05..df420e1f 100644
--- a/common/ngram-cache.h
+++ b/common/ngram-cache.h
@@ -37,13 +37,18 @@ struct llama_ngram {
     }
 };
 
+struct llama_token_hash_function {
+    size_t operator()(const llama_token token) const {
+        return token * 11400714819323198485llu;
+    }
+};
+
 struct llama_ngram_hash_function {
     size_t operator()(const llama_ngram & ngram) const {
-        size_t hash = ngram.tokens[0];
+        size_t hash = llama_token_hash_function{}(ngram.tokens[0]);
         for (int i = 1; i < LLAMA_NGRAM_MAX; ++i) {
-            hash <<= 15;
-            hash ^= ngram.tokens[i];
+            hash ^= llama_token_hash_function{}(ngram.tokens[i]);
         }
 
         return hash;
@@ -51,7 +56,7 @@ struct llama_ngram_hash_function {
 };
```

I went the route you went instead and used another callable type.
Please test :)
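For reference, below is a minimal sketch of how callable hash types like these are consumed by an std::unordered_map-based n-gram cache. The member layout of llama_ngram and the cache-part type are simplified assumptions for the sketch; the real definitions live in common/ngram-cache.h.

```cpp
// Sketch only: std::unordered_map takes the hash functor as its third template
// argument. llama_ngram's layout and the cache-part type are simplified here.
#include <cstdint>
#include <unordered_map>

using llama_token = int32_t;
constexpr int LLAMA_NGRAM_MAX = 4;  // assumed for this sketch

struct llama_ngram {
    llama_token tokens[LLAMA_NGRAM_MAX];

    bool operator==(const llama_ngram & other) const {
        for (int i = 0; i < LLAMA_NGRAM_MAX; ++i) {
            if (tokens[i] != other.tokens[i]) {
                return false;
            }
        }
        return true;
    }
};

struct llama_token_hash_function {
    size_t operator()(const llama_token token) const {
        return token * 11400714819323198485llu;  // Fibonacci (multiplicative) hashing
    }
};

struct llama_ngram_hash_function {
    size_t operator()(const llama_ngram & ngram) const {
        size_t hash = llama_token_hash_function{}(ngram.tokens[0]);
        for (int i = 1; i < LLAMA_NGRAM_MAX; ++i) {
            hash ^= llama_token_hash_function{}(ngram.tokens[i]);
        }
        return hash;
    }
};

// Stand-in for the cache-part type: how often each candidate next token was
// observed after a given n-gram.
using llama_ngram_cache_part = std::unordered_map<llama_token, int32_t>;
using llama_ngram_cache      = std::unordered_map<llama_ngram, llama_ngram_cache_part, llama_ngram_hash_function>;
```

With an alias like this, insertions and lookups (e.g. `cache[key][token] += 1`) transparently go through the custom hash functor.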
Force-pushed from 614c74d to d797930
JohannesGaessler commented May 3, 2024
I adopted the Fibonacci hash implementation. For LLaMA 3 q4_K_M on an RTX 4090 it's maybe a ~1% end-to-end speedup.
Green-Sky commented May 3, 2024
Different STL implementations will perform differently here.
Force-pushed from 393ec77 to 76a97c7

Review comment on the following lines:

```cpp
llama_ngram_cache nc_context;
std::vector<llama_token> draft;
std::vector<llama_token> context_tokens;
```
Isn't this the same as cache_tokens?
Ideally, the llama_sampling_context should maintain a history of the processed tokens. There are already steps in that direction via the std::vector<llama_token> prev; member and the llama_sampling_accept() API, but some more work would be needed (such as an API for removing discarded tokens). Not needed to be done in this PR - just clarifying the long-term goals.
Yes, I think it should be possible to refactor the code in such a way that the same vector is used for caching and lookup.
Force-pushed from 76a97c7 to 48e5699
JohannesGaessler commented May 9, 2024
I re-tested the performance on 1x RTX 4090 with CUDA graphs, but against my expectations I am seeing virtually no performance difference compared to before.
Force-pushed from 02ef2f9 to 71c98cc
sorasoras commented May 15, 2024
A quick question
JohannesGaessler commented May 15, 2024
The numbers for the
geekboood commented Aug 12, 2024
Hi, I wonder when this PR could be merged. In my scenario (data labeling), it brings a great improvement since my output and input have a lot of overlap. Below is my test result (using llama-lookup).
JohannesGaessler commented Aug 12, 2024
Do keep in mind that the speedup from lookup decoding will be smaller than the speedup you get from using multiple slots, and that these two features are essentially mutually exclusive.
etafund commented Sep 17, 2024
Also interested in when this PR could be merged. Thanks everyone.
JohannesGaessler commented Sep 18, 2024
As of right now I don't think it's worthwhile to add lookup decoding at all. The speedup (or lack thereof) is too inconsistent and situational.
etafund commented Sep 18, 2024
For my situation, which is CPU inference on an AMD Epyc Genoa using Llama 3.1 8B as the draft model for Llama 3.1 405B, it makes a significant difference consistently. 🤷♂️
JohannesGaessler commented Sep 18, 2024
That is not what this PR is about. This PR is about trying to infer the next token without any draft model at all.
etafund commented Sep 18, 2024
I see. I was confused by ggerganov's comment here, then. Thank you, Johannes.
eeroel commented Mar 1, 2025
@JohannesGaessler I implemented something related here: #12127. Do you want to check whether it makes sense to you, and whether some ideas from this PR could be reused there, for better performance for example?




This PR aims to enable lookup decoding for the llama.cpp server in the same way as it is used in examples/lookup, see #5479. To recapitulate, the implementation tries to guess the next few tokens that will be generated using simple text statistics. I think the current implementation works but it is difficult to properly benchmark. The intended way for it to work is:

These are the results I get from examples/server/bench.py using an RTX 4090 and various static lookup caches and an initially empty dynamic cache (results table with columns "master", "PR", and "--draft 0").

Edit: the table was labeled incorrectly. The speedup was not relative to master but relative to --draft 0, which included the overhead for no benefit.

It does seem to provide a speedup, but adding a static lookup cache does not seem to help (the caches are created either from Wikitext 103 or from 164 million tokens generated with Mistral q8_0). Assuming there are no bugs, what I think is happening is that the dataset for the benchmark (see the server bench README) is very repetitive, so using a static cache pulls the drafts away from these very repetitive patterns and reduces the speed. Also, for Phi-2 in particular I think that I simply don't have enough input data for the static cache to get sufficiently precise text statistics (since it has a larger vocabulary size). Regarding the latter, I recently built a machine with 6x RTX 4090, so I think I will be able to significantly scale up the rate at which I can produce synthetic text (I was previously using 3x P40 and 1x RX 6800).

In this PR I also changed the interface of llama_ngram_cache_load to be more in line with the rest of llama.cpp; I'll maybe change the interface of some of the other functions as well.

Also: is it somehow possible to retrieve the tokens that were previously fed to the model? I'm currently manually tracking this in server.cpp, but this adds the potential for error.
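To make the "simple text statistics" idea concrete, here is a rough, self-contained sketch of n-gram lookup drafting. The helper names and data structures are hypothetical rather than the ngram-cache API used by this PR, and the sketch uses a single n-gram size, whereas the actual implementation mixes n-grams of varying sizes and keeps separate context, dynamic, and static caches.

```cpp
// Rough sketch of n-gram lookup drafting (hypothetical helper names): record
// which token followed each n-gram of the context, then draft the most
// frequent continuation of the current suffix.
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

using llama_token = int32_t;
using ngram       = std::vector<llama_token>;
// n-gram -> (next token -> how often it was observed after that n-gram)
using ngram_stats = std::map<ngram, std::unordered_map<llama_token, int>>;

// Update the statistics from the tokens processed so far.
static void stats_update(ngram_stats & stats, const std::vector<llama_token> & ctx, size_t N) {
    for (size_t i = N; i < ctx.size(); ++i) {
        const ngram key(ctx.begin() + (i - N), ctx.begin() + i);
        stats[key][ctx[i]] += 1;
    }
}

// Draft up to n_draft tokens by greedily following the most frequent continuations.
static std::vector<llama_token> stats_draft(const ngram_stats & stats, std::vector<llama_token> ctx, size_t N, int n_draft) {
    std::vector<llama_token> draft;
    for (int d = 0; d < n_draft && ctx.size() >= N; ++d) {
        const ngram key(ctx.begin() + (ctx.size() - N), ctx.end());
        const auto it = stats.find(key);
        if (it == stats.end()) {
            break; // no statistics for this suffix -> stop drafting
        }
        llama_token best       = -1;
        int         best_count = 0;
        for (const auto & [tok, count] : it->second) {
            if (count > best_count) {
                best       = tok;
                best_count = count;
            }
        }
        if (best_count == 0) {
            break;
        }
        draft.push_back(best);
        ctx.push_back(best); // extend the speculative context with the drafted token
    }
    return draft;
}
```

As with speculative decoding in general, the drafted tokens are then verified by the model in a single batch: accepted tokens save decoding steps, while rejected ones are simply discarded.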