Closed
Description
I found that \n\n, when followed by another character, tokenizes as ['\n', '\n'] ([198, 198]) instead of ['\n\n'] ([271]).
(I'm using Llama3 for this example, but this extends to other models as well)
Here's an example prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|> You're Psy, user's assistant, and a master of concise replies.<|eot_id|><|start_header_id|>user<|end_header_id|> Write a short poem<|eot_id|><|start_header_id|>assistant<|end_header_id|>

If I switch the template to use \n\n\n\n (1038) it tokenizes as ['\n\n\n', '\n'] ([1432, 198]):
(Note: I know there have been efforts to make special tokens render, but as I understand it they currently have no textual representation, so you can ignore tokens like 128000, 128006 and 128007 in the sequences above.)
In C# I patch the issue like so:
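    var tokensCount = NativeApi.llama_tokenize(model, bytesPtr, bytes.Length, tokensPtr, tokenBuffer.Length, add_bos, special);
    var list = new List<LLamaToken>();
    for (int i = 0; i < tokensCount; i++)
    {
        // Hack: ['\n', '\n'] --> ['\n\n']
        // The i + 1 bound check keeps the last token from being compared
        // against an unused slot past the end of the tokenized output.
        if (i + 1 < tokensCount && tokenBuffer[i] == 198 && tokenBuffer[i + 1] == 198)
        {
            list.Add(271);
            i++; // skip the second '\n' that was just merged
        }
        else
        {
            list.Add(tokenBuffer[i]);
        }
    }
    return list.ToArray();

(This ignores every other \n merge and handles only \n\n, which is the common case in the template.)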
