Add support for BERT embedding models #5423
Conversation
```python
self.block_count = self.hparams["num_hidden_layers"]

def set_gguf_parameters(self):
    # TODO(cebtenzzre): merge with parent class
```
Note to self: resolve this before merge
iacore commented Jun 29, 2024 • edited
have you... have you forgotten about this...
Co-authored-by: Jared Van Bortel <[email protected]>
ggerganov commented Feb 9, 2024
When I was playing with `./build/bin/main -m models/bge-base-en-v1.5/ggml-model-f16.gguf -p "This is a ggml"` it tokenizes to: […] Seems like a […]
iamlemec commented Feb 9, 2024
Ah yeah, that was a bug in […]
iamlemec commented Feb 10, 2024
I have batched embedding working now (`bert-batched`). Basically just matmul an […]. Should I push this to this PR, or wait until this goes through and start a new one?
llama.cpp (Outdated)
```diff
-// the output is always the last tensor in the graph
-struct ggml_tensor * res = gf->nodes[gf->n_nodes - 1];
-GGML_ASSERT(strcmp(res->name, "result_output") == 0);
+// get logits and embeddings
+struct ggml_tensor * res        = ggml_graph_get_tensor(gf, "result_output");
+struct ggml_tensor * embeddings = ggml_graph_get_tensor(gf, "result_norm");
```
Using `ggml_graph_get_tensor` is not recommended here because it does a `strcmp` against every tensor in the graph, which can become noticeable in terms of speed. For now, we should be "poking" at the last few tensors to find what we need - not great, but it will improve in the future
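For illustration, a minimal sketch of that "poking" approach (a hypothetical helper, using the `n_nodes`/`nodes` fields that `ggml_cgraph` exposed at the time):

```cpp
#include <string.h>

// Look for a named output among the last few graph nodes instead of
// strcmp-ing the entire graph as ggml_graph_get_tensor does.
static struct ggml_tensor * get_output_tensor(struct ggml_cgraph * gf, const char * name) {
    const int n_check = 4; // assumption: outputs sit at the tail of the graph
    for (int i = gf->n_nodes - 1; i >= 0 && i >= gf->n_nodes - n_check; --i) {
        if (strcmp(gf->nodes[i]->name, name) == 0) {
            return gf->nodes[i];
        }
    }
    return NULL; // caller decides whether a missing output is fatal
}
```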
ggerganov left a comment
Let's fix the `ggml_graph_get_tensor` comment and merge. After that, we can look into batching support in a separate PR
iamlemec commented Feb 27, 2024
It looks like […]
mofanke commented Mar 10, 2024
I tried BAAI/bge-m3, but it does not work right now, because the model architecture is XLMRobertaModel rather than BERT, and its `"tokenizer_class"` is `"XLMRobertaTokenizer"`.
cebtenzzre commented Mar 12, 2024
You could open a feature request if you haven't already.
* BERT model graph construction (build_bert)
* WordPiece tokenizer (llm_tokenize_wpm)
* Add flag for non-causal attention models
* Allow for models that only output embeddings
* Support conversion of BERT models to GGUF
* Based on prior work by @xyzhang626 and @skeskinen

---------

Co-authored-by: Jared Van Bortel <[email protected]>
Co-authored-by: Jared Van Bortel <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
mofanke commented Mar 14, 2024 • edited by cebtenzzre
Already done: #6007
hiepxanh commented Mar 17, 2024
What do you mean? I think the PR does not support it yet. I tried converting today and saw this one.
mofanke commented Mar 18, 2024
Aha, I'm sorry for causing you confusion. I just meant that I opened a feature request.
In order to get support for BERT-based sentence embedding models like BAAI/bge-base-en-v1.5, mixedbread-ai/mxbai-embed-large-v1, or others, update llama.cpp from b1696 (2023-12-12, https://github.com/ggerganov/llama.cpp/releases/tag/b1696) to the current latest release b2581 (2024-03-30, https://github.com/ggerganov/llama.cpp/releases/tag/b2581). BERT support was added to llama.cpp in February 2024: ggml-org/llama.cpp#5423
jkgenser commented Apr 2, 2024
So if I fine-tune a BERT model for a classification task, it would not work to convert it to GGML? I've been watching this work and am really excited to be able to deploy my fine-tuned BERT models on llama.cpp
beyondskyway commented Apr 23, 2024
Same when converting https://huggingface.co/maidalun1020/bce-embedding-base_v1/tree/main
ggerganov commented May 13, 2024
Where is the reference implementation of XLMRobertaModel?
iamlemec commented May 13, 2024
I think the original is here at fairseq: https://github.com/facebookresearch/fairseq/blob/main/fairseq/models/roberta. There's also an implementation in […]. But there are differences in the tokenization that have driven me slightly mad trying to understand. The model file is called […]
ggerganov commented May 13, 2024
Thanks!
Maybe the […]: https://huggingface.co/BAAI/bge-m3/blob/main/tokenizer_config.json#L3
sragrawal commented Jun 24, 2024
Hi all, is there any plan to support XLMRobertaModel? https://huggingface.co/intfloat/multilingual-e5-small works very well for multilingual embeddings for its size (https://huggingface.co/spaces/mteb/leaderboard). Please let me know if I should open a new issue for this.
iamlemec commented Jun 25, 2024
@sragrawal I believe that Unigram support from #8089 will get us most of the way there on the […]
grigohas commented Oct 9, 2024
Hello, is there a workflow for how to build and run BERT through llama.cpp?
iacore commented Oct 9, 2024 • edited
I wrote about it here. Not sure what "workflow" you are referring to.
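For readers asking about a workflow: llama.cpp ships an `embedding` example (`./build/bin/embedding -m model.gguf -p "some text"`), and the C API route looks roughly like the sketch below. This is a hedged sketch assuming a mid-2024 API; the model filename is illustrative, and exact function names and struct fields may differ in your version.

```cpp
#include "llama.h"

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    llama_backend_init();

    // model path is an assumption; use any converted BERT embedding GGUF
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("bge-base-en-v1.5-f16.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                    // request embedding output
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN; // one pooled vector per sequence
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // tokenize the prompt (error handling omitted for brevity)
    const char * text = "This is a ggml";
    std::vector<llama_token> tokens(512);
    const int n = llama_tokenize(model, text, strlen(text),
                                 tokens.data(), (int32_t) tokens.size(),
                                 /*add_special*/ true, /*parse_special*/ false);

    // build a single-sequence batch
    llama_batch batch = llama_batch_init(n, 0, 1);
    for (int i = 0; i < n; i++) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = true; // non-causal model: all tokens contribute
    }
    batch.n_tokens = n;

    llama_decode(ctx, batch);

    const float * emb = llama_get_embeddings_seq(ctx, 0); // n_embd floats
    printf("dim = %d, emb[0] = %f\n", llama_n_embd(model), emb[0]);

    llama_batch_free(batch);
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```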
grigohas commented Oct 15, 2024
Is there a way to use llama.cpp to generate text with BERT?
iacore commented Nov 3, 2024
BERT is not an LLM, afaik.
Following discussion in #2872, this adds support for the BERT model architecture. Built on top of various contributions from @skeskinen, @xyzhang626, and @cebtenzzre. Includes:
* WordPiece tokenizer `llm_tokenize_wpm`. Needed for slightly different behavior from SentencePiece. On conversion, the vocab is mapped from the `##subword` scheme to the `▁` prefix scheme to allow for unified vocab mappings.
* New flag `bert.attention.causal` that controls whether the attention mask is causal or not (default is `true`). Also `tokenizer.ggml.token_type_count`, which accounts for token type info, though these are typically ignored in actual computations.
* `build_bert` for graph construction. This is fairly standard. The only difference is the pooling layer at the end. Currently it will pool the entire batch; ideally, it could be made to pool only within a sequence (see the sketch after this list).
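On the pooling point, a minimal sketch of how mean pooling can be expressed with ggml ops (illustrative, not the PR's exact code; `cur`, `inp_mean`, and `ctx0` follow llama.cpp naming conventions, and `inp_mean` would be filled with per-sequence weights at eval time):

```cpp
// cur:      [n_embd, n_tokens]   last-layer hidden states
// inp_mean: [n_tokens, n_seqs]   row s holds 1/len(s) at sequence s's token
//                                positions and 0 elsewhere
// pooled:   [n_embd, n_seqs]     one embedding per sequence
struct ggml_tensor * pooled = ggml_mul_mat(ctx0,
        ggml_cont(ctx0, ggml_transpose(ctx0, cur)), // -> [n_tokens, n_embd]
        inp_mean);
```

Note that this is also the "just matmul" trick mentioned above for batched embeddings: a single matrix multiply pools every sequence in the batch at once.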
In terms of which models actually work, the main limitation is tokenization. I have tested with `all-MiniLM-L6-v2` and `BAAI/bge-*-*-v1.5` (`small`, `base`, and `large`, plus `en` and `zh`) and they seem to work, and the embedding numbers look similar to the Huggingface implementations. The newer `BAAI/bge-m3` uses a SentencePiece tokenizer, so it should be doable, but I haven't tested it.
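Since tokenization is the stated limitation, here is a minimal sketch of the greedy longest-match core of WordPiece, the scheme `llm_tokenize_wpm` implements (illustrative only and byte-level for simplicity; the real tokenizer also handles lowercasing, punctuation splitting, the `##`-to-`▁` vocab mapping noted above, and `[UNK]` fallback):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Greedy longest-match WordPiece over a single pre-split word.
// Returns an empty vector if the word cannot be covered (caller emits [UNK]).
static std::vector<int> wordpiece(const std::string & word,
                                  const std::unordered_map<std::string, int> & vocab) {
    std::vector<int> ids;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        int id = -1;
        for (; end > start; --end) {
            // continuation pieces carry the "##" prefix in BERT vocabs
            const std::string piece =
                (start == 0 ? "" : "##") + word.substr(start, end - start);
            auto it = vocab.find(piece);
            if (it != vocab.end()) { id = it->second; break; }
        }
        if (id == -1) {
            return {}; // no piece matched at this position
        }
        ids.push_back(id);
        start = end;
    }
    return ids;
}
```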