Add support for BERT embedding models #5423
Conversation
```python
self.block_count = self.hparams["num_hidden_layers"]

def set_gguf_parameters(self):
    # TODO(cebtenzzre): merge with parent class
```
Note to self: resolve this before merge
iacore commented Jun 29, 2024 • edited
have you... have you forgotten about this...
Co-authored-by: Jared Van Bortel <[email protected]>
ggerganov commented Feb 9, 2024
When I was playing with `./build/bin/main -m models/bge-base-en-v1.5/ggml-model-f16.gguf -p "This is a ggml"` it tokenizes to: […] Seems like a […]
iamlemec commented Feb 9, 2024
Ah yeah, that was a bug in […]
iamlemec commented Feb 10, 2024
I have batched embedding working now (`bert-batched`). Basically just matmul an […]. Should I push this to this PR, or wait until this goes through and start a new one?
llama.cpp (Outdated)
```diff
-// the output is always the last tensor in the graph
-struct ggml_tensor * res = gf->nodes[gf->n_nodes - 1];
-GGML_ASSERT(strcmp(res->name, "result_output") == 0);
+// get logits and embeddings
+struct ggml_tensor * res        = ggml_graph_get_tensor(gf, "result_output");
+struct ggml_tensor * embeddings = ggml_graph_get_tensor(gf, "result_norm");
```
Using `ggml_graph_get_tensor` is not recommended here because it does a `strcmp` against every tensor in the graph, which can become noticeable in terms of speed. For now, we should be "poking" at the last few tensors to find what we need - not great, but it will improve in the future
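For illustration, a minimal sketch of that "poking" approach (a hypothetical helper, using the `n_nodes`/`nodes` fields that `ggml_cgraph` exposed at the time):

```cpp
#include <string.h>

// Look for a named output among the last few graph nodes instead of
// strcmp-ing the entire graph as ggml_graph_get_tensor does.
static struct ggml_tensor * get_output_tensor(struct ggml_cgraph * gf, const char * name) {
    const int n_check = 4; // assumption: outputs sit at the tail of the graph
    for (int i = gf->n_nodes - 1; i >= 0 && i >= gf->n_nodes - n_check; --i) {
        if (strcmp(gf->nodes[i]->name, name) == 0) {
            return gf->nodes[i];
        }
    }
    return NULL; // caller decides whether a missing output is fatal
}
```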
ggerganov left a comment
Let's fix the `ggml_graph_get_tensor` comment and merge. After that, we can look into batching support in a separate PR
iamlemec commented Feb 27, 2024
It looks like […]
mofanke commented Mar 10, 2024
I tried BAAI/bge-m3, but it does not work right now, because the model architecture is XLMRobertaModel rather than BERT, and its `"tokenizer_class"` is `"XLMRobertaTokenizer"`.
cebtenzzre commented Mar 12, 2024
You could open a feature request if you haven't already.
* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <[email protected]>
Co-authored-by: Jared Van Bortel <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
mofanke commented Mar 14, 2024 • edited by cebtenzzre
Already done: #6007
hiepxanh commented Mar 17, 2024
What do you mean? I think the PR does not support it yet. I tried converting today and saw this one.
mofanke commented Mar 18, 2024
Aha, I'm sorry for causing you confusion. I just meant that I opened a feature request.
In order to get support for BERT-based sentence embedding models like BAAI/bge-base-en-v1.5, mixedbread-ai/mxbai-embed-large-v1, or others, update llama.cpp from b1696 (2023-12-12, https://github.com/ggerganov/llama.cpp/releases/tag/b1696) to the current latest release b2581 (2024-03-30, https://github.com/ggerganov/llama.cpp/releases/tag/b2581). BERT support was added to llama.cpp in February 2024: ggml-org/llama.cpp#5423
jkgenser commented Apr 2, 2024
So if I fine-tune a BERT model for a classification task, it would not work to convert it to GGML? I've been watching this work and am really excited to be able to deploy my fine-tuned BERT models on llama.cpp
beyondskyway commented Apr 23, 2024
Same when converting https://huggingface.co/maidalun1020/bce-embedding-base_v1/tree/main
ggerganov commented May 13, 2024
Where is the reference implementation of XLMRobertaModel?
iamlemec commented May 13, 2024
I think the original is here at fairseq: https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/roberta. There's also an implementation in […]. But there are differences in the tokenization that have driven me slightly mad trying to understand. The model file is called […]
ggerganov commented May 13, 2024
Thanks!
Maybe the […]: https://huggingface.co/BAAI/bge-m3/blob/main/tokenizer_config.json#L3
sragrawal commented Jun 24, 2024
Hi all, is there any plan to support XLMRobertaModel? https://huggingface.co/intfloat/multilingual-e5-small works very well for multilingual embeddings for its size (https://huggingface.co/spaces/mteb/leaderboard). Please let me know if I should open a new issue for this.
iamlemec commented Jun 25, 2024
@sragrawal I believe that Unigram support from #8089 will get us most of the way there on the […]
grigohas commented Oct 9, 2024
Hello, is there a workflow for how to build and run BERT through llama.cpp?
iacore commented Oct 9, 2024 • edited
I wrote about it here. Not sure what "workflow" you are referring to.
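For readers asking about a workflow: llama.cpp ships an `embedding` example (`./build/bin/embedding -m model.gguf -p "some text"`), and the C API route looks roughly like the sketch below. This is a hedged sketch assuming a mid-2024 API; the model filename is illustrative, and exact function names and struct fields may differ in your version.

```cpp
#include "llama.h"

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    llama_backend_init();

    // model path is an assumption; use any converted BERT embedding GGUF
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("bge-base-en-v1.5-f16.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                    // request embedding output
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN; // one pooled vector per sequence
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // tokenize the prompt (error handling omitted for brevity)
    const char * text = "This is a ggml";
    std::vector<llama_token> tokens(512);
    const int n = llama_tokenize(model, text, strlen(text),
                                 tokens.data(), (int32_t) tokens.size(),
                                 /*add_special*/ true, /*parse_special*/ false);

    // build a single-sequence batch
    llama_batch batch = llama_batch_init(n, 0, 1);
    for (int i = 0; i < n; i++) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = true; // non-causal model: all tokens contribute
    }
    batch.n_tokens = n;

    llama_decode(ctx, batch);

    const float * emb = llama_get_embeddings_seq(ctx, 0); // n_embd floats
    printf("dim = %d, emb[0] = %f\n", llama_n_embd(model), emb[0]);

    llama_batch_free(batch);
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```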
grigohas commented Oct 15, 2024
Is there a way to use llama.cpp to generate text with BERT?
iacore commented Nov 3, 2024
BERT is not an LLM, afaik.
Following discussion in #2872, this adds support for the BERT model architecture. Built on top of various contributions from @skeskinen, @xyzhang626, and @cebtenzzre. Includes:
* WordPiece tokenizer `llm_tokenize_wpm`. Needed for slightly different behavior from SentencePiece. On conversion, the vocab is mapped from the `##subword` scheme to the `▁` prefix scheme to allow for unified vocab mappings.
* New flag `bert.attention.causal` that controls whether the attention mask is causal or not (default is `true`). Also `tokenizer.ggml.token_type_count`, which accounts for token type info, though these are typically ignored in actual computations.
* `build_bert` for graph construction. This is fairly standard. The only difference is the pooling layer at the end. Currently it will pool the entire batch; ideally, it could be made to pool only within a sequence (see the sketch after this list).
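On the pooling point, a minimal sketch of how mean pooling can be expressed with ggml ops (illustrative, not the PR's exact code; `cur`, `inp_mean`, and `ctx0` follow llama.cpp naming conventions, and `inp_mean` would be filled with per-sequence weights at eval time):

```cpp
// cur:      [n_embd, n_tokens]   last-layer hidden states
// inp_mean: [n_tokens, n_seqs]   row s holds 1/len(s) at sequence s's token
//                                positions and 0 elsewhere
// pooled:   [n_embd, n_seqs]     one embedding per sequence
struct ggml_tensor * pooled = ggml_mul_mat(ctx0,
        ggml_cont(ctx0, ggml_transpose(ctx0, cur)), // -> [n_tokens, n_embd]
        inp_mean);
```

Note that this is also the "just matmul" trick mentioned above for batched embeddings: a single matrix multiply pools every sequence in the batch at once.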
In terms of which models actually work, the main limitation is tokenization. I have tested with `all-MiniLM-L6-v2` and `BAAI/bge-*-*-v1.5` (`small`, `base`, and `large`, plus `en` and `zh`) and they seem to work, and the embedding numbers look similar to the Huggingface implementations. The newer `BAAI/bge-m3` uses a SentencePiece tokenizer, so it should be doable, but I haven't tested it.
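Since tokenization is the stated limitation, here is a minimal sketch of the greedy longest-match core of WordPiece, the scheme `llm_tokenize_wpm` implements (illustrative only and byte-level for simplicity; the real tokenizer also handles lowercasing, punctuation splitting, the `##`-to-`▁` vocab mapping noted above, and `[UNK]` fallback):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Greedy longest-match WordPiece over a single pre-split word.
// Returns an empty vector if the word cannot be covered (caller emits [UNK]).
static std::vector<int> wordpiece(const std::string & word,
                                  const std::unordered_map<std::string, int> & vocab) {
    std::vector<int> ids;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        int id = -1;
        for (; end > start; --end) {
            // continuation pieces carry the "##" prefix in BERT vocabs
            const std::string piece =
                (start == 0 ? "" : "##") + word.substr(start, end - start);
            auto it = vocab.find(piece);
            if (it != vocab.end()) { id = it->second; break; }
        }
        if (id == -1) {
            return {}; // no piece matched at this position
        }
        ids.push_back(id);
        start = end;
    }
    return ids;
}
```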