
EN | 简体中文

AnglE 📐

Sponsored by Mixedbread

For more detailed usage, please read the 📘 documentation: https://angle.readthedocs.io/en/latest/index.html

Paper: https://arxiv.org/abs/2309.12871 · PyPI: angle-emb · Docs: https://angle.readthedocs.io/en/latest/index.html

📢 Train and infer powerful sentence embeddings with AnglE. This library accompanies the paper AnglE: Angle-optimized Text Embeddings. It lets you train state-of-the-art BERT- and LLM-based sentence embeddings with just a few lines of code. AnglE is also a general sentence embedding inference framework that supports inference with a variety of transformer-based sentence embedding models.

✨ Features

Loss:

  • 📐 AnglE loss (ACL24)
  • ⚖ Contrastive loss
  • 📏 CoSENT loss
  • ☕️ Espresso loss (ICLR 2025, a.k.a. 2DMSE; details: README_ESE)

Backbones:

  • BERT-based models (BERT, RoBERTa, ModernBERT, etc.)
  • LLM-based models (LLaMA, Mistral, Qwen, etc.)
  • Bi-directional LLM-based models (LLaMA, Mistral, Qwen, OpenELMo, etc.; refer to: https://github.com/WhereIsAI/BiLLM)

Training:

  • Single-GPU training
  • Multi-GPU training

More features will be added in the future. Pull requests are welcome: http://makeapullrequest.com

🏆 Achievements

📅 May 16, 2024 | Paper "AnglE: Angle-optimized Text Embeddings" is accepted by ACL 2024 Main Conference.

📅 Mar 13, 2024 | Paper "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings" is accepted by NAACL 2024 Main Conference.

📅 Mar 8, 2024 | 🍞 mixedbread's embedding model (mixedbread-ai/mxbai-embed-large-v1) achieves SOTA on the MTEB Leaderboard with an average score of 64.68! The model was trained using AnglE. Congrats, mixedbread!

📅 Dec 4, 2023 | Our universal sentence embedding model WhereIsAI/UAE-Large-V1 achieves SOTA on the MTEB Leaderboard with an average score of 64.64! The model was trained using AnglE.

📅 Dec, 2023 | AnglE achieves SOTA performance on the STS (Semantic Textual Similarity) Benchmark!

🤗 Official Pretrained Models

BERT-based models:

| 🤗 HF | Max Tokens | Pooling Strategy | Scenario |
|---|---|---|---|
| WhereIsAI/UAE-Large-V1 | 512 | cls | English, General-purpose |
| WhereIsAI/UAE-Code-Large-V1 | 512 | cls | Code Similarity |
| WhereIsAI/pubmed-angle-base-en | 512 | cls | Medical Similarity |
| WhereIsAI/pubmed-angle-large-en | 512 | cls | Medical Similarity |

LLM-based models:

| 🤗 HF (LoRA weight) | Backbone | Max Tokens | Prompts | Pooling Strategy | Scenario |
|---|---|---|---|---|---|
| SeanLee97/angle-llama-13b-nli | NousResearch/Llama-2-13b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |
| SeanLee97/angle-llama-7b-nli-v2 | NousResearch/Llama-2-7b-hf | 4096 | Prompts.A | last token | English, Similarity Measurement |

💡 You can find more third-party embeddings trained with AnglE in the HuggingFace Collection.

🚀 Quick Start

⬇️ Installation

Using uv:

uv pip install -U angle-emb

Or using pip:

pip install -U angle-emb

🔍 Inference

1️⃣ BERT-based Models

Open In Colab

Option A: With Prompts (for Retrieval Tasks)

Use a prompt with {text} as the placeholder for the input text. The available prompts can be listed via Prompts.list_prompts() (see the snippet after the example below).

from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode query with prompt, documents without prompt
qv = angle.encode(['what is the weather?'], to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode([
    'The weather is great!',
    'it is rainy today.',
    'i am going to bed'
], to_numpy=True)

# Calculate similarity
for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))
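
As mentioned above, the built-in prompt templates can be listed before choosing one. A minimal sketch:

from angle_emb import Prompts

# List the built-in prompt templates (e.g. Prompts.A and Prompts.C used in this README)
Prompts.list_prompts()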

Option B: Without Prompts (for Similarity Tasks)

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1', pooling_strategy='cls').cuda()

# Encode documents
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
])

# Calculate pairwise similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

2️⃣ LLM-based Models

Open In Colab

For LoRA-based models, specify both the backbone model and LoRA weights. Always set is_llm=True for LLM models.

import torch
from angle_emb import AnglE, Prompts
from angle_emb.utils import cosine_similarity

# Load LLM with LoRA weights
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/angle-llama-7b-nli-v2',
    pooling_strategy='last',
    is_llm=True,
    torch_dtype=torch.float16
).cuda()

# Encode with prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt=Prompts.A)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

3️⃣ BiLLM-based Models

Open In Colab

Enable bidirectional LLMs with apply_billm=True and specify the model class.

import os
import torch
from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Set BiLLM environment variable
os.environ['BiLLM_START_INDEX'] = '31'

# Load BiLLM model
angle = AnglE.from_pretrained(
    'NousResearch/Llama-2-7b-hf',
    pretrained_lora_path='SeanLee97/bellm-llama-7b-nli',
    pooling_strategy='last',
    is_llm=True,
    apply_billm=True,
    billm_model_class='LlamaForCausalLM',
    torch_dtype=torch.float16
).cuda()

# Encode with custom prompt
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], prompt='The representative word for sentence {text} is:"')

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

4️⃣ Espresso/Matryoshka Models

Open In Colab

Truncate layers and embedding dimensions for flexible model compression.

from angle_emb import AnglE
from angle_emb.utils import cosine_similarity

# Load model
angle = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-2d-large-v1', pooling_strategy='cls').cuda()

# Truncate to specific layer
angle = angle.truncate_layer(layer_index=22)

# Encode with truncated embedding size
doc_vecs = angle.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
], embedding_size=768)

# Calculate similarity
for i, dv1 in enumerate(doc_vecs):
    for dv2 in doc_vecs[i+1:]:
        print(cosine_similarity(dv1, dv2))

5️⃣ Third-party Models

Load any transformer-based model (e.g., sentence-transformers models, BAAI/bge, etc.) with AnglE.

from angle_emb import AnglE

# Load third-party model
model = AnglE.from_pretrained('mixedbread-ai/mxbai-embed-large-v1', pooling_strategy='cls').cuda()

# Encode text
vec = model.encode('hello world', to_numpy=True)
print(vec)

⚡ Batch Inference

Speed up inference with the batched library (recommended for large-scale processing).

uv pip install batched
import batched
from angle_emb import AnglE

# Load model
model = AnglE.from_pretrained("WhereIsAI/UAE-Large-V1", pooling_strategy='cls').cuda()

# Enable dynamic batching
model.encode = batched.dynamically(model.encode, batch_size=64)

# Encode large batch
vecs = model.encode([
    'The weather is great!',
    'The weather is very good!',
    'i am going to bed'
] * 50)

🕸️ Custom Training

💡 For complete details, see the official training documentation.


🗂️ Step 1: Prepare Your Dataset

AnglE supports three dataset formats. Choose one based on your task (a construction sketch follows the notes below):

| Format | Columns | Description | Use Case |
|---|---|---|---|
| Format A | text1, text2, label | Paired texts with similarity scores (0-1) | Similarity scoring |
| Format B | query, positive | Query-document pairs | Retrieval without hard negatives |
| Format C | query, positive, negative | Query with positive and negative samples | Contrastive learning |

Notes:

  • All formats use HuggingFace datasets.Dataset
  • text1, text2, query, positive, and negative can be str or List[str] (random sampling for lists)
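
For illustration, here is a minimal sketch of building each format as an in-memory datasets.Dataset (the sentences and scores are made up; in practice, load your own data):

from datasets import Dataset

# Format A: paired texts with a similarity label in [0, 1]
ds_a = Dataset.from_dict({
    'text1': ['The weather is great!', 'i am going to bed'],
    'text2': ['The weather is very good!', 'it is rainy today.'],
    'label': [0.9, 0.1],
})

# Format B: query-positive pairs (no hard negatives)
ds_b = Dataset.from_dict({
    'query': ['what is the weather?'],
    'positive': ['The weather is great!'],
})

# Format C: query with positive and negative samples
ds_c = Dataset.from_dict({
    'query': ['what is the weather?'],
    'positive': ['The weather is great!'],
    'negative': ['i am going to bed'],
})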

🚂 Step 2: Training Methods

Option A: CLI Training (Recommended)

Single GPU:

CUDA_VISIBLE_DEVICES=0 angle-trainer --help

Multi-GPU with FSDP:

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
    --multi_gpu \
    --num_processes 4 \
    --main_process_port 2345 \
    --config_file examples/FSDP/fsdp_config.yaml \
    -m angle_emb.angle_trainer \
    --gradient_checkpointing 1 \
    --use_reentrant 0 \
    ...

Multi-GPU (Standard):

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled accelerate launch \
    --multi_gpu \
    --num_processes 4 \
    --main_process_port 2345 \
    -m angle_emb.angle_trainer \
    --model_name_or_path YOUR_MODEL \
    --train_name_or_path YOUR_DATASET \
    ...

📁 More examples: examples/Training


Option B: Python API Training

Open In Colab

from datasets import load_dataset
from angle_emb import AnglE

# Step 1: Load pretrained model
angle = AnglE.from_pretrained(
    'SeanLee97/angle-bert-base-uncased-nli-en-v1',
    max_length=128,
    pooling_strategy='cls'
).cuda()

# Step 2: Prepare dataset (Format A example)
ds = load_dataset('mteb/stsbenchmark-sts')
ds = ds.map(lambda obj: {
    "text1": str(obj["sentence1"]),
    "text2": str(obj['sentence2']),
    "label": obj['score']
})
ds = ds.select_columns(["text1", "text2", "label"])

# Step 3: Train the model
angle.fit(
    train_ds=ds['train'].shuffle(),
    valid_ds=ds['validation'],
    output_dir='ckpts/sts-b',
    batch_size=32,
    epochs=5,
    learning_rate=2e-5,
    save_steps=100,
    eval_steps=1000,
    warmup_steps=0,
    gradient_accumulation_steps=1,
    loss_kwargs={
        'cosine_w': 1.0,
        'ibn_w': 1.0,
        'angle_w': 0.02,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 20
    },
    fp16=True,
    logging_steps=100
)

# Step 4: Evaluate
corrcoef = angle.evaluate(ds['test'])
print('Spearman\'s corrcoef:', corrcoef)

⚙️ Advanced Configuration

Training Special Models

| Model Type | CLI Flags | Description |
|---|---|---|
| LLM | --is_llm 1 + LoRA params | Must manually enable LLM mode |
| BiLLM | --apply_billm 1 --billm_model_class LlamaForCausalLM | Bidirectional LLMs (guide) |
| Espresso (ESE) | --apply_ese 1 --ese_kl_temperature 1.0 --ese_compression_size 256 | Matryoshka-style embeddings |

Applying Prompts

| Format | Flag | Applies To |
|---|---|---|
| Format A | --text_prompt "text:{text}" | Both text1 and text2 |
| Format B/C | --query_prompt "query:{text}" | query field |
| Format B/C | --doc_prompt "document:{text}" | positive and negative fields |

Column Mapping (Legacy Compatibility)

Adapt old datasets without modification:

# CLI
--column_rename_mapping "text:query"

# Python
column_rename_mapping={"text": "query"}
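
If you prefer to rename the columns yourself before training, the same effect can be achieved with the datasets API; a minimal sketch (the example data is made up):

from datasets import Dataset

# A legacy dataset that uses 'text' instead of 'query'
ds = Dataset.from_dict({
    'text': ['what is the weather?'],
    'positive': ['The weather is great!'],
})

# Rename the column so it matches Format B (query, positive)
ds = ds.rename_column('text', 'query')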

Model Conversion

Convert trained models to sentence-transformers format:

python scripts/convert_to_sentence_transformers.py --help

💡 Fine-tuning Tips

📖 Full documentation

| Format | Recommendation |
|---|---|
| Format A | Increase cosine_w or decrease ibn_w |
| Format B | Only tune ibn_w and ibn_tau |
| Format C | Set cosine_w=0, angle_w=0.02, and configure cln_w + ibn_w |
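
As a concrete illustration of the Format C recommendation, here is a hedged sketch of the corresponding loss_kwargs; the ibn_w and cln_w values are placeholders to tune, and the exact keyword names should be checked against your angle_emb version:

# Loss weights for a Format C dataset, following the tips table above
format_c_loss_kwargs = {
    'cosine_w': 0.0,   # disable the cosine objective
    'angle_w': 0.02,   # small angle-loss weight
    'ibn_w': 1.0,      # in-batch negative loss weight (tune for your data)
    'cln_w': 1.0,      # weight for the loss over explicit negatives (tune for your data)
}
# Pass it to training, e.g. angle.fit(train_ds=..., loss_kwargs=format_c_loss_kwargs, ...)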

Prevent Catastrophic Forgetting:

  • Set teacher_name_or_path for knowledge distillation
  • Use the same model path (teacher = student) for self-distillation
  • ⚠️ Ensure teacher and student use the same tokenizer

🔄 Integration with sentence-transformers

| Task | Status | Notes |
|---|---|---|
| Training | ⚠️ Partial | sentence-transformers provides an AnglE loss, but use the official angle_emb for best results |
| Inference | ✅ Full | Convert trained models with examples/convert_to_sentence_transformers.py |
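
For inference through sentence-transformers, a minimal sketch (assuming the converted model, or an already-released AnglE model such as WhereIsAI/UAE-Large-V1, is available in sentence-transformers format):

from sentence_transformers import SentenceTransformer

# Load an AnglE-trained model via sentence-transformers and encode a few sentences
model = SentenceTransformer('WhereIsAI/UAE-Large-V1')
vecs = model.encode(['The weather is great!', 'The weather is very good!'])
print(vecs.shape)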

🫡 Citation

If you use our code and pre-trained models, please support us by citing our work as follows:

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}

📜 ChangeLogs

Date | Description
2025 Jan | v0.6.0 - Major refactoring 🎉:
  • Removed AngleDataTokenizer - no need to pre-tokenize datasets!
  • Removed DatasetFormats class - use string literals ('A', 'B', 'C')
  • Removed auto-detection of LLM models - set is_llm manually
  • Renamed --prompt_template to --text_prompt (Format A only)
  • Added --query_prompt and --doc_prompt for Format B/C
  • Added --column_rename_mapping to adapt old datasets without modification
  • Updated data formats: Format B/C now use query, positive, negative fields
  • Support list-based sampling in Format B/C
  • Updated examples to use accelerate launch
  • See MIGRATION_GUIDE.md for upgrade instructions
2024 May 21 | Support Espresso Sentence Embeddings
2024 Feb 7 | Support training with only positive pairs (Format C: query, positive)
2023 Dec 4 | Release a universal English sentence embedding model: WhereIsAI/UAE-Large-V1
2023 Nov 2 | Release an English pretrained model: SeanLee97/angle-llama-13b-nli
2023 Oct 28 | Release two Chinese pretrained models: SeanLee97/angle-roberta-wwm-base-zhnli-v1 and SeanLee97/angle-llama-7b-zhnli-v1; add a Chinese README.md

📧 Contact

If you have any questions or suggestions, please feel free to contact us via email: [email protected]

© License

This project is licensed under the MIT License. For the pretrained models, please refer to the corresponding license of the models.