EN | 简体中文
Sponsored by Mixedbread
For more detailed usage, please read the 📘 documentation: https://angle.readthedocs.io/en/latest/index.html
📢 Train/Infer Powerful Sentence Embeddings with AnglE. This library is from the paper: AnglE: Angle-optimized Text Embeddings. It allows for training state-of-the-art BERT/LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework, allowing for inferring a variety of transformer-based sentence embeddings.
Loss:
- 📐 AnglE loss (ACL24)
- ⚖ Contrastive loss
- 📏 CoSENT loss
- ☕️ Espresso loss (ICLR 2025, a.k.a. 2DMSE; details: README_ESE)
Backbones:
- BERT-based models (BERT, RoBERTa, ModernBERT, etc.)
- LLM-based models (LLaMA, Mistral, Qwen, etc.)
- Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELMo, etc.; refer to: https://github.com/WhereIsAI/BiLLM)
Training:
- Single-GPU training
- Multi-GPU training
📅 May 16, 2024 | Paper "AnglE: Angle-optimized Text Embeddings" is accepted by ACL 2024 Main Conference.
📅 Mar 13, 2024 | Paper "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings" is accepted by NAACL 2024 Main Conference.
📅 Mar 8, 2024 | 🍞 mixedbread's embedding (mixedbread-ai/mxbai-embed-large-v1) achieves SOTA on the MTEB Leaderboard with an average score of 64.68! The model is trained using AnglE. Congrats mixedbread!
📅 Dec 4, 2023 | Our universal sentence embedding WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64! The model is trained using AnglE.
📅 Dec, 2023 | AnglE achieves SOTA performance on the STS Benchmark (Semantic Textual Similarity)!
BERT-based models:
| 🤗 HF | Max Tokens | Pooling Strategy | Scenario |
|---|---|---|---|
| WhereIsAI/UAE-Large-V1 | 512 | cls | English, General-purpose |
| WhereIsAI/UAE-Code-Large-V1 | 512 | cls | Code Similarity |
| WhereIsAI/pubmed-angle-base-en | 512 | cls | Medical Similarity |
| WhereIsAI/pubmed-angle-large-en | 512 | cls | Medical Similarity |
LLM-based models:
| 🤗 HF (lora weight) | Backbone | Max Tokens | Prompts | Pooling Strategy | Scenario |
|---|---|---|---|---|---|
| SeanLee97/angle-llama-13b-nli | NousResearch/Llama-2-13b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
| SeanLee97/angle-llama-7b-nli-v2 | NousResearch/Llama-2-7b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
💡 You can find more third-party embeddings trained with AnglE in the HuggingFace Collection.
Use uv:

```bash
uv pip install -U angle-emb
```

or pip:

```bash
pip install -U angle-emb
```

Option A: With Prompts (for Retrieval Tasks)
Use prompts with `{text}` as a placeholder. Check the available prompts via `Prompts.list_prompts()`.
```python
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode query with prompt, documents without prompt
qv = angle.encode(['what is the weather?'], to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
    'The weather is great!',
    'it is rainy today.',
    'i am going to bed'
], to_numpy=True)

# Calculate similarity
for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))
```

Option B: Without Prompts (for Similarity Tasks)
```python
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode documents
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
])

# Calculate pairwise similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i + 1:]:
        print(cosine_similarity(dv1, dv2))
```

For LoRA-based models, specify both the backbone model and LoRA weights. Always set `is_llm=True` for LLM models.
```python
import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load LLM with LoRA weights
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
    pooling_strategy='last',
    is_llm=True,
    torch_dtype=torch.float16
).cuda()

# Encode with prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt=Prompts.A)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i + 1:]:
        print(cosine_similarity(dv1, dv2))
```

Enable bidirectional LLMs with `apply_billm=True` and specify the model class.
```python
import os
import torch
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Set BiLLM environment variable
os.environ['BiLLM_START_INDEX'] = '31'

# Load BiLLM model
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
    pooling_strategy='last',
    is_llm=True,
    apply_billm=True,
    billm_model_class='LlamaForCausalLM',
    torch_dtype=torch.float16
).cuda()

# Encode with custom prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt='The representative word for sentence {text} is:"')

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i + 1:]:
        print(cosine_similarity(dv1, dv2))
```

Truncate layers and embedding dimensions for flexible model compression.
```python
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()

# Truncate to specific layer
angle = angle.truncate_layer(layer_index=22)

# Encode with truncated embedding size
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], embedding_size=768)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i + 1:]:
        print(cosine_similarity(dv1, dv2))
```

Load any transformer-based model (e.g., sentence-transformers, BAAI/bge, etc.) using AnglE.
```python
from angle_emb import AnglE

# Load third-party model
model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()

# Encode text
vec = model.encode('hello world', to_numpy=True)
print(vec)
```

Speed up inference with the `batched` library (recommended for large-scale processing).
```bash
uv pip install batched
```

```python
import batched
from angle_emb import AnglE

# Load model
model = AnglE.from_pretrained("WhereIsAI/UAE-Large-V1", pooling_strategy='cls').cuda()

# Enable dynamic batching
model.encode = batched.dynamically(model.encode, batch_size=64)

# Encode large batch
vecs = model.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
] * 50)
```

💡 For complete details, see the official training documentation.
AnglE supports three dataset formats. Choose based on your task:
| Format | Columns | Description | Use Case |
|---|---|---|---|
| Format A | text1, text2, label | Paired texts with similarity scores (0-1) | Similarity scoring |
| Format B | query, positive | Query-document pairs | Retrieval without hard negatives |
| Format C | query, positive, negative | Query with positive and negative samples | Contrastive learning |
Notes:
- All formats use HuggingFace `datasets.Dataset`
- `text1`, `text2`, `query`, `positive`, and `negative` can be `str` or `List[str]` (random sampling for lists)
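As an illustration, the sketch below builds a tiny Format C dataset with `datasets.Dataset.from_list`; only the column names (`query`, `positive`, `negative`) come from the table above, and the example rows are made up.

```python
from datasets import Dataset

# Illustrative Format C rows (column names per the table above; contents are made up)
rows = [
    {
        'query': 'what is the weather?',
        'positive': 'The weather is great!',
        'negative': 'i am going to bed',
    },
    {
        'query': 'how do I install angle-emb?',
        'positive': 'Run pip install -U angle-emb.',
        'negative': 'The weather is very good!',
    },
]

train_ds = Dataset.from_list(rows)
print(train_ds)  # features: ['query', 'positive', 'negative'], num_rows: 2
```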
Single GPU:
```bash
CUDA_VISIBLE_DEVICES=0 angle-trainer --help
```

Multi-GPU with FSDP:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --main_process_port 2345 \
  --config_file examples/FSDP/fsdp_config.yaml \
  -m angle_emb.angle_trainer \
  --gradient_checkpointing 1 \
  --use_reentrant 0 \
  ...
```

Multi-GPU (Standard):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
  --multi_gpu \
  --num_processes 4 \
  --main_process_port 2345 \
  -m angle_emb.angle_trainer \
  --model_name_or_path YOUR_MODEL \
  --train_name_or_path YOUR_DATASET \
  ...
```

📁 More examples: examples/Training
```python
from datasets import load_dataset
from angle_emb import AnglE

# Step 1: Load pretrained model
angle = AnglE.from_pretrained(
    'SeanLee97/angle-bert-base-uncased-nli-en-v1',
    max_length=128,
    pooling_strategy='cls'
).cuda()

# Step 2: Prepare dataset (Format A example)
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {
    "text1": str(obj["sentence1"]),
    "text2": str(obj['sentence2']),
    "label": obj['score']
})
ds = ds.select_columns(["text1", "text2", "label"])

# Step 3: Train the model
angle.fit(
    train_ds=ds['train'].shuffle(),
    valid_ds=ds['validation'],
    output_dir='ckpts/sts-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    save_steps=100,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=1,
    loss_kwargs={
        'cosine_w': 1.0,
        'ibn_w': 1.0,
        'angle_w': 0.02,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 20
    },
    fp16=True,
    logging_steps=100
)

# Step 4: Evaluate
corrcoef = angle.evaluate(ds['test'])
print('Spearman\'s corrcoef:', corrcoef)
```

| Model Type | CLI Flags | Description |
|---|---|---|
| LLM | --is_llm 1 + LoRA params | Must manually enable LLM mode |
| BiLLM | --apply_billm 1 --billm_model_class LlamaForCausalLM | Bidirectional LLMs (guide) |
| Espresso (ESE) | --apply_ese 1 --ese_kl_temperature 1.0 --ese_compression_size 256 | Matryoshka-style embeddings |
| Format | Flag | Applies To |
|---|---|---|
| Format A | --text_prompt "text:{text}" | Both text1 and text2 |
| Format B/C | --query_prompt "query:{text}" | query field |
| Format B/C | --doc_prompt "document:{text}" | positive and negative fields |
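Combining the flags from the two tables above, a single-GPU LLM training command for a Format B/C dataset might look like the sketch below; the model and dataset names are placeholders, and the trailing `...` stands for LoRA parameters and the usual training hyperparameters.

```bash
# Sketch only: model/dataset names are placeholders; flags are taken from the tables above.
CUDA_VISIBLE_DEVICES=0 angle-trainer \
  --model_name_or_path NousResearch/Llama-2-7b-hf \
  --train_name_or_path YOUR_FORMAT_B_DATASET \
  --is_llm 1 \
  --query_prompt "query:{text}" \
  --doc_prompt "document:{text}" \
  ...
```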
Adapt old datasets without modification:
```bash
# CLI
--column_rename_mapping "text:query"
```

```python
# Python
column_rename_mapping={"text": "query"}
```

Convert trained models to sentence-transformers format:
```bash
python scripts/convert_to_sentence_transformers.py --help
```

| Format | Recommendation |
|---|---|
| Format A | Increase cosine_w or decrease ibn_w |
| Format B | Only tune ibn_w and ibn_tau |
| Format C | Set cosine_w=0, angle_w=0.02, and configure cln_w + ibn_w |
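For example, a Format C run could pass loss weights to `angle.fit` as in the sketch below; `cosine_w=0` and `angle_w=0.02` come from the table, while the `ibn_w`, `cln_w`, and tau values are illustrative assumptions to tune for your data.

```python
# Loss weights for a Format C dataset, following the recommendations above.
# cosine_w=0 and angle_w=0.02 are from the table; ibn_w/cln_w/tau values are assumed starting points.
loss_kwargs = {
    'cosine_w': 0.0,
    'angle_w': 0.02,
    'ibn_w': 1.0,   # assumed, tune for your data
    'cln_w': 1.0,   # assumed, tune for your data
    'ibn_tau': 20,
    'angle_tau': 20,
}

# Pass as angle.fit(..., loss_kwargs=loss_kwargs), as in the training example above.
```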
Prevent Catastrophic Forgetting:
- Set `teacher_name_or_path` for knowledge distillation
- Use the same model path for self-distillation

⚠️ Ensure the teacher and student use the same tokenizer
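A sketch of the self-distillation setup on the CLI, assuming `teacher_name_or_path` is exposed as an `--teacher_name_or_path` trainer flag (check `angle-trainer --help`); the model and dataset names are placeholders.

```bash
# Self-distillation sketch: reuse the student's own path as the teacher
# to mitigate catastrophic forgetting (teacher and student share a tokenizer).
CUDA_VISIBLE_DEVICES=0 angle-trainer \
  --model_name_or_path WhereIsAI/UAE-Large-V1 \
  --teacher_name_or_path WhereIsAI/UAE-Large-V1 \
  --train_name_or_path YOUR_DATASET \
  ...
```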
| Task | Status | Notes |
|---|---|---|
| Training | ⚠️ Partial | SentenceTransformers has an AnglE loss, but use the official angle_emb for best results |
| Inference | ✅ Full | Convert trained models: examples/convert_to_sentence_transformers.py |
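As a quick illustration of the inference side, the sketch below loads an AnglE-trained model through the sentence-transformers API; it assumes the checkpoint is published in (or has been converted to) sentence-transformers format.

```python
from sentence_transformers import SentenceTransformer, util

# Load an AnglE-trained model via sentence-transformers
# (assumes the checkpoint is available in sentence-transformers format).
model = SentenceTransformer('WhereIsAI/UAE-Large-V1')

embeddings = model.encode([
    'The weather is great!',
    'The weather is very good!',
])
print(util.cos_sim(embeddings[0], embeddings[1]))
```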
If you use our code and pre-trained models, please support us by citing our work as follows:
```bibtex
@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}
```

| 📅 | Description |
|---|---|
| 2025 Jan | v0.6.0 - Major refactoring 🎉: • Removed AngleDataTokenizer - no need to pre-tokenize datasets! • Removed DatasetFormats class - use string literals ('A', 'B', 'C') • Removed auto-detection of LLM models - set is_llm manually • Renamed --prompt_template to --text_prompt (Format A only) • Added --query_prompt and --doc_prompt for Format B/C • Added --column_rename_mapping to adapt old datasets without modification • Updated data formats: Format B/C now use query, positive, negative fields • Support list-based sampling in Format B/C • Updated examples to use accelerate launch • See MIGRATION_GUIDE.md for upgrade instructions |
| 2024 May 21 | support Espresso Sentence Embeddings |
| 2024 Feb 7 | support training with only positive pairs (Format C: query, positive) |
| 2023 Dec 4 | Release a universal English sentence embedding model: WhereIsAI/UAE-Large-V1 |
| 2023 Nov 2 | Release an English pretrained model: SeanLee97/angle-llama-13b-nli |
| 2023 Oct 28 | Release two Chinese pretrained models: SeanLee97/angle-roberta-wwm-base-zhnli-v1 and SeanLee97/angle-llama-7b-zhnli-v1; add Chinese README.md |
If you have any questions or suggestions, please feel free to contact us via email: [email protected]
This project is licensed under the MIT License. For the pretrained models, please refer to the corresponding license of the models.