
EN | 简体中文

AnglE 📐

Sponsored by Mixedbread

For more detailed usage, please read the 📘 documentation: https://angle.readthedocs.io/en/latest/index.html

Paper: https://arxiv.org/abs/2309.12871 · PyPI: angle-emb · Docs: https://angle.readthedocs.io/en/latest/index.html

📢 Train and infer powerful sentence embeddings with AnglE. This library accompanies the paper AnglE: Angle-optimized Text Embeddings. It lets you train state-of-the-art BERT- and LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework that supports inference with a variety of transformer-based sentence embedding models.

✨ Features

Loss:

  • 📐 AnglE loss (ACL24)
  • ⚖ Contrastive loss
  • 📏 CoSENT loss
  • ☕️ Espresso loss (ICLR 2025, a.k.a. 2DMSE; details: README_ESE)

Backbones:

  • BERT-based models (BERT, RoBERTa, ModernBERT, etc.)
  • LLM-based models (LLaMA, Mistral, Qwen, etc.)
  • Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELMo, etc.; refer to: https://github.com/WhereIsAI/BiLLM)

Training:

  • Single-GPU training
  • Multi-GPU training

More features will be added in the future. Pull requests are welcome: http://makeapullrequest.com

🏆 Achievements

📅 May 16, 2024 | Paper "AnglE: Angle-optimized Text Embeddings" is accepted by ACL 2024 Main Conference.

📅 Mar 13, 2024 | Paper "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings" is accepted by NAACL 2024 Main Conference.

📅 Mar 8, 2024 | 🍞 mixedbread's embedding model (mixedbread-ai/mxbai-embed-large-v1) achieves SOTA on the MTEB Leaderboard with an average score of 64.68! The model was trained using AnglE. Congrats, mixedbread!

📅 Dec 4, 2023 | Our universal sentence embedding model WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64! The model was trained using AnglE.

📅 Dec, 2023 | AnglE achieves SOTA performance on the STS (Semantic Textual Similarity) Benchmark!

🤗 Official Pretrained Models

BERT-based models:

| 🤗 HF | Max Tokens | Pooling Strategy | Scenario |
|---|---|---|---|
| WhereIsAI/UAE-Large-V1 | 512 | cls | English, General-purpose |
| WhereIsAI/UAE-Code-Large-V1 | 512 | cls | Code Similarity |
| WhereIsAI/pubmed-angle-base-en | 512 | cls | Medical Similarity |
| WhereIsAI/pubmed-angle-large-en | 512 | cls | Medical Similarity |

LLM-based models:

| 🤗 HF (LoRA weight) | Backbone | Max Tokens | Prompts | Pooling Strategy | Scenario |
|---|---|---|---|---|---|
| SeanLee97/angle-llama-13b-nli | NousResearch/Llama-2-13b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
| SeanLee97/angle-llama-7b-nli-v2 | NousResearch/Llama-2-7b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |

💡 You can find more third-party embeddings trained with AnglE in the HuggingFace Collection.

🚀 Quick Start

⬇️ Installation

Using uv:

uv pip install -U angle-emb

Or using pip:

pip install -U angle-emb

🔍 Inference

1️⃣ BERT-based Models

Open In Colab

Option A: With Prompts (for Retrieval Tasks)

Use a prompt with {text} as the placeholder for the input text. The available prompts can be listed via Prompts.list_prompts() (see the snippet after the example below).

from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode query with prompt, documents without prompt
qv = angle.encode(['what is the weather?'], to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
    'The weather is great!',
    'it is rainy today.',
    'i am going to bed'
], to_numpy=True)

# Calculate similarity
for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))
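
As mentioned above, the built-in prompt templates can be listed before choosing one. A minimal sketch:

from angle_emb import Prompts

# List the built-in prompt templates (e.g. Prompts.A and Prompts.C used in this README)
Prompts.list_prompts()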

Option B: Without Prompts (for Similarity Tasks)

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode documents
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
])

# Calculate pairwise similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

2️⃣ LLM-based Models

Open In Colab

For LoRA-based models, specify both the backbone model and LoRA weights. Always set is_llm=True for LLM models.

import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load LLM with LoRA weights
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
    pooling_strategy='last',
    is_llm=True,
    torch_dtype=torch.float16
).cuda()

# Encode with prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt=Prompts.A)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

3️⃣ BiLLM-based Models

Open In Colab

Enable bidirectional LLMs with apply_billm=True and specify the model class.

import os
import torch
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Set BiLLM environment variable
os.environ['BiLLM_START_INDEX'] = '31'

# Load BiLLM model
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
    pooling_strategy='last',
    is_llm=True,
    apply_billm=True,
    billm_model_class='LlamaForCausalLM',
    torch_dtype=torch.float16
).cuda()

# Encode with custom prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt='The representative word for sentence {text} is:"')

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

4️⃣ Espresso/Matryoshka Models

Open In Colab

Truncate layers and embedding dimensions for flexible model compression.

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()

# Truncate to specific layer
angle = angle.truncate_layer(layer_index=22)

# Encode with truncated embedding size
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], embedding_size=768)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

5️⃣ Third-party Models

Load any transformer-based model (e.g., sentence-transformers models, BAAI/bge, etc.) with AnglE.

from angle_emb import AnglE

# Load third-party model
model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()

# Encode text
vec = model.encode('hello world', to_numpy=True)
print(vec)

⚡ Batch Inference

Speed up inference with the batched library (recommended for large-scale processing).

uv pip install batched
import batched
from angle_emb import AnglE

# Load model
model = AnglE.from_pretrained("WhereIsAI/UAE-Large-V1", pooling_strategy='cls').cuda()

# Enable dynamic batching
model.encode = batched.dynamically(model.encode, batch_size=64)

# Encode large batch
vecs = model.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
] * 50)

🕸️ Custom Training

💡 For complete details, see the official training documentation.


🗂️ Step 1: Prepare Your Dataset

AnglE supports three dataset formats. Choose one based on your task (a construction sketch follows the notes below):

| Format | Columns | Description | Use Case |
|---|---|---|---|
| Format A | text1, text2, label | Paired texts with similarity scores (0-1) | Similarity scoring |
| Format B | query, positive | Query-document pairs | Retrieval without hard negatives |
| Format C | query, positive, negative | Query with positive and negative samples | Contrastive learning |

Notes:

  • All formats use HuggingFace datasets.Dataset
  • text1, text2, query, positive, and negative can be str or List[str] (random sampling for lists)
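
For illustration, here is a minimal sketch of building each format as an in-memory datasets.Dataset (the sentences and scores are made up; in practice, load your own data):

from datasets import Dataset

# Format A: paired texts with a similarity label in [0, 1]
ds_a = Dataset.from_dict({
    'text1': ['The weather is great!', 'i am going to bed'],
    'text2': ['The weather is very good!', 'it is rainy today.'],
    'label': [0.9, 0.1],
})

# Format B: query-positive pairs (no hard negatives)
ds_b = Dataset.from_dict({
    'query': ['what is the weather?'],
    'positive': ['The weather is great!'],
})

# Format C: query with positive and negative samples
ds_c = Dataset.from_dict({
    'query': ['what is the weather?'],
    'positive': ['The weather is great!'],
    'negative': ['i am going to bed'],
})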

🚂 Step 2: Training Methods

Option A: CLI Training (Recommended)

Single GPU:

CUDA_VISIBLE_DEVICES=0 angle-trainer --help

Multi-GPU with FSDP:

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
    --multi_gpu \
    --num_processes 4 \
    --main_process_port 2345 \
    --config_file examples/FSDP/fsdp_config.yaml \
    -m angle_emb.angle_trainer \
    --gradient_checkpointing 1 \
    --use_reentrant 0 \
    ...

Multi-GPU (Standard):

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
    --multi_gpu \
    --num_processes 4 \
    --main_process_port 2345 \
    -m angle_emb.angle_trainer \
    --model_name_or_path YOUR_MODEL \
    --train_name_or_path YOUR_DATASET \
    ...

📁 More examples: examples/Training


Option B: Python API Training

Open In Colab

from datasets import load_dataset
from angle_emb import AnglE

# Step 1: Load pretrained model
angle = AnglE.from_pretrained(
    'SeanLee97/angle-bert-base-uncased-nli-en-v1',
    max_length=128,
    pooling_strategy='cls'
).cuda()

# Step 2: Prepare dataset (Format A example)
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {
    "text1": str(obj["sentence1"]),
    "text2": str(obj['sentence2']),
    "label": obj['score']
})
ds = ds.select_columns(["text1", "text2", "label"])

# Step 3: Train the model
angle.fit(
    train_ds=ds['train'].shuffle(),
    valid_ds=ds['validation'],
    output_dir='ckpts/sts-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    save_steps=100,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=1,
    loss_kwargs={
        'cosine_w': 1.0,
        'ibn_w': 1.0,
        'angle_w': 0.02,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 20
    },
    fp16=True,
    logging_steps=100
)

# Step 4: Evaluate
corrcoef = angle.evaluate(ds['test'])
print('Spearman\'s corrcoef:', corrcoef)

⚙️ Advanced Configuration

Training Special Models

| Model Type | CLI Flags | Description |
|---|---|---|
| LLM | --is_llm 1 + LoRA params | Must manually enable LLM mode |
| BiLLM | --apply_billm 1 --billm_model_class LlamaForCausalLM | Bidirectional LLMs (guide) |
| Espresso (ESE) | --apply_ese 1 --ese_kl_temperature 1.0 --ese_compression_size 256 | Matryoshka-style embeddings |

Applying Prompts

| Format | Flag | Applies To |
|---|---|---|
| Format A | --text_prompt "text:{text}" | Both text1 and text2 |
| Format B/C | --query_prompt "query:{text}" | query field |
| Format B/C | --doc_prompt "document:{text}" | positive and negative fields |

Column Mapping (Legacy Compatibility)

Adapt old datasets without modification:

# CLI
--column_rename_mapping "text:query"

# Python
column_rename_mapping={"text": "query"}
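
If you prefer to rename the columns yourself before training, the same effect can be achieved with the datasets API; a minimal sketch (the example data is made up):

from datasets import Dataset

# A legacy dataset that uses 'text' instead of 'query'
ds = Dataset.from_dict({
    'text': ['what is the weather?'],
    'positive': ['The weather is great!'],
})

# Rename the column so it matches Format B (query, positive)
ds = ds.rename_column('text', 'query')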

Model Conversion

Convert trained models to sentence-transformers format:

python scripts/convert_to_sentence_transformers.py --help

💡 Fine-tuning Tips

📖 Full documentation

| Format | Recommendation |
|---|---|
| Format A | Increase cosine_w or decrease ibn_w |
| Format B | Only tune ibn_w and ibn_tau |
| Format C | Set cosine_w=0, angle_w=0.02, and configure cln_w + ibn_w |
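
As a concrete illustration of the Format C recommendation, here is a hedged sketch of the corresponding loss_kwargs; the ibn_w and cln_w values are placeholders to tune, and the exact keyword names should be checked against your angle_emb version:

# Loss weights for a Format C dataset, following the tips table above
format_c_loss_kwargs = {
    'cosine_w': 0.0,   # disable the cosine objective
    'angle_w': 0.02,   # small angle-loss weight
    'ibn_w': 1.0,      # in-batch negative loss weight (tune for your data)
    'cln_w': 1.0,      # weight for the loss over explicit negatives (tune for your data)
}
# Pass it to training, e.g. angle.fit(train_ds=..., loss_kwargs=format_c_loss_kwargs, ...)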

Prevent Catastrophic Forgetting:

  • Set teacher_name_or_path for knowledge distillation
  • Use the same model path (teacher = student) for self-distillation
  • ⚠️ Ensure teacher and student use the same tokenizer

🔄 Integration with sentence-transformers

| Task | Status | Notes |
|---|---|---|
| Training | ⚠️ Partial | sentence-transformers provides an AnglE loss, but use the official angle_emb for best results |
| Inference | ✅ Full | Convert trained models with examples/convert_to_sentence_transformers.py |
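
For inference through sentence-transformers, a minimal sketch (assuming the converted model, or an already-released AnglE model such as WhereIsAI/UAE-Large-V1, is available in sentence-transformers format):

from sentence_transformers import SentenceTransformer

# Load an AnglE-trained model via sentence-transformers and encode a few sentences
model = SentenceTransformer('WhereIsAI/UAE-Large-V1')
vecs = model.encode(['The weather is great!', 'The weather is very good!'])
print(vecs.shape)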

🫡 Citation

If you use our code and pre-trained models, please support us by citing our work as follows:

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}

📜 ChangeLogs

Date | Description
2025 Jan | v0.6.0 - Major refactoring 🎉:
  • Removed AngleDataTokenizer - no need to pre-tokenize datasets!
  • Removed DatasetFormats class - use string literals ('A', 'B', 'C')
  • Removed auto-detection of LLM models - set is_llm manually
  • Renamed --prompt_template to --text_prompt (Format A only)
  • Added --query_prompt and --doc_prompt for Format B/C
  • Added --column_rename_mapping to adapt old datasets without modification
  • Updated data formats: Format B/C now use query, positive, negative fields
  • Support list-based sampling in Format B/C
  • Updated examples to use accelerate launch
  • See MIGRATION_GUIDE.md for upgrade instructions
2024 May 21 | Support Espresso Sentence Embeddings
2024 Feb 7 | Support training with only positive pairs (Format C: query, positive)
2023 Dec 4 | Release a universal English sentence embedding model: WhereIsAI/UAE-Large-V1
2023 Nov 2 | Release an English pretrained model: SeanLee97/angle-llama-13b-nli
2023 Oct 28 | Release two Chinese pretrained models: SeanLee97/angle-roberta-wwm-base-zhnli-v1 and SeanLee97/angle-llama-7b-zhnli-v1; add a Chinese README.md

📧 Contact

If you have any questions or suggestions, please feel free to contact us via email: [email protected]

© License

This project is licensed under the MIT License. For the pretrained models, please refer to the corresponding license of the models.