	---
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	license: mit
	datasets:
	- avemio/German-RAG-EMBEDDING-TRIPLES-HESSIAN-AI
	language:
	- de
	- en
	base_model:
	- avemio/German-RAG-UAE-LARGE-V1-TRIPLES-HESSIAN-AI
	- WhereIsAI/UAE-Large-V1
	base_model_relation: merge
	---

	# German-RAG-UAE-LARGE-V1-TRIPLES-MERGED-HESSIAN-AI

	This is a [sentence-transformers](https://www.SBERT.net) model trained on roughly 300k triple samples from the [German-RAG-EMBEDDING-TRIPLES-HESSIAN-AI dataset](https://huggingface.co/datasets/avemio/German-RAG-Embedding-Triples-Hessian-AI). It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
	After fine-tuning, it was merged back with the base model [WhereIsAI/UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) to maintain performance on other languages.
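
The card does not state how the merge was performed. As a rough illustration only, one common approach is a linear interpolation of matching parameters. The helper name `merge_state_dicts`, the weight `alpha`, and the toy values below are assumptions made for this sketch, not the authors' recipe.

```python
def merge_state_dicts(finetuned, base, alpha=0.5):
    """Linearly interpolate two parameter dicts with identical keys.

    alpha=1.0 keeps the fine-tuned weights, alpha=0.0 keeps the base weights.
    Hypothetical helper for illustration; not the authors' merge procedure.
    """
    if finetuned.keys() != base.keys():
        raise ValueError("both models must expose the same parameter names")
    return {
        name: [alpha * f + (1.0 - alpha) * b
               for f, b in zip(finetuned[name], base[name])]
        for name in finetuned
    }


# Toy parameters standing in for real model weights.
finetuned = {"layer.weight": [1.0, 3.0]}
base = {"layer.weight": [3.0, 5.0]}
merged = merge_state_dicts(finetuned, base, alpha=0.5)
print(merged["layer.weight"])
```

With `alpha=0.5` each merged parameter is the plain average of the two inputs. Real merges operate on full tensors and may use other schemes.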

	## Model Details

	### Model Description
	- Model Type: Sentence Transformer
	- Maximum Sequence Length: 512 tokens
	- Output Dimensionality: 1024 dimensions
	- Similarity Function: Cosine Similarity

	### Model Sources

	- Documentation: [Sentence Transformers Documentation](https://sbert.net)
	- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
	- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

	### Full Model Architecture

	```
	SentenceTransformer(
	  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
	  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
	  (2): Normalize()
	)
	```
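
The pipeline above takes the `[CLS]` token vector as the sentence embedding and scales it to unit length, so cosine similarity reduces to a dot product. The following is a minimal sketch of those two steps on toy data. The function names are illustrative and are not part of the Sentence Transformers API.

```python
import math


def cls_pool(token_embeddings):
    """Use the first token's vector ([CLS]) as the sentence embedding."""
    return token_embeddings[0]


def l2_normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def cosine_of_unit_vectors(a, b):
    """For unit-length inputs, cosine similarity equals the dot product."""
    return sum(x * y for x, y in zip(a, b))


# Toy output with 2 tokens and 2 dimensions; the real model emits 1024 dimensions.
tokens = [[3.0, 4.0], [10.0, -2.0]]
embedding = l2_normalize(cls_pool(tokens))
print(embedding)
print(cosine_of_unit_vectors(embedding, embedding))
```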

	## Evaluation on MTEB Tasks

	### Classification
	- AmazonCounterfactualClassification
	- AmazonReviewsClassification
	- MassiveIntentClassification
	- MassiveScenarioClassification
	- MTOPDomainClassification
	- MTOPIntentClassification

	### Pair Classification
	- FalseFriendsGermanEnglish
	- PawsXPairClassification

	### Retrieval
	- GermanQuAD-Retrieval
	- GermanDPR

	### STS (Semantic Textual Similarity)
	- GermanSTSBenchmark

	Scores per task are listed below. The last two columns give the absolute difference to UAE-Large-V1 in percentage points.

	| Task | [UAE](https://huggingface.co/WhereIsAI/UAE-Large-V1/) | [German-RAG-UAE](https://huggingface.co/avemio/German-RAG-UAE-LARGE-V1-TRIPLES-HESSIAN-AI/) | Merged-UAE | German-RAG vs. UAE | Merged vs. UAE |
	|-------------------------------------|-------|----------|------------|--------------|----------------|
	| AmazonCounterfactualClassification | 0.5650 | 0.5449 | 0.5401 | -2.01% | -2.48% |
	| AmazonReviewsClassification | 0.2738 | 0.2745 | 0.2782 | 0.08% | 0.44% |
	| FalseFriendsGermanEnglish | 0.4808 | 0.4777 | 0.4703 | -0.32% | -1.05% |
	| GermanQuAD-Retrieval | 0.7811 | 0.8353 | 0.8628 | 5.42% | 8.18% |
	| GermanSTSBenchmark | 0.6421 | 0.6568 | 0.6754 | 1.47% | 3.33% |
	| MassiveIntentClassification | 0.5139 | 0.4884 | 0.4714 | -2.55% | -4.25% |
	| MassiveScenarioClassification | 0.6062 | 0.5837 | 0.6111 | -2.25% | 0.49% |
	| GermanDPR | 0.6750 | 0.7210 | 0.7507 | 4.60% | 7.57% |
	| MTOPDomainClassification | 0.7625 | 0.7450 | 0.7686 | -1.75% | 0.61% |
	| MTOPIntentClassification | 0.4994 | 0.4516 | 0.4413 | -4.77% | -5.80% |
	| PawsXPairClassification | 0.5452 | 0.5077 | 0.5162 | -3.76% | -2.90% |
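
The comparison columns appear to be absolute score differences expressed in percentage points, not relative changes. A quick check against the GermanDPR row is sketched below. The helper name `delta_pp` is an assumption for this example. A few other rows differ by 0.01, presumably because the table was computed from unrounded scores.

```python
def delta_pp(score, baseline):
    """Absolute difference between two scores, in percentage points."""
    return round((score - baseline) * 100, 2)


# Scores copied from the GermanDPR row of the table above.
uae, german_rag, merged = 0.6750, 0.7210, 0.7507
print(delta_pp(german_rag, uae))
print(delta_pp(merged, uae))
```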


	## Evaluation on German-RAG-EMBEDDING-BENCHMARK

	Accuracy is the share of queries for which the relevant context receives the highest similarity score among all candidate contexts.
	The evaluation dataset and evaluation code are available [here](https://huggingface.co/datasets/avemio/German-RAG-EMBEDDING-BENCHMARK).

	| Model Name | Accuracy |
	|-------------------------------------------------|-----------|
	| [bge-m3](https://huggingface.co/BAAI/bge-m3) | 0.8806 |
	| [UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) | 0.8393 |
	| [German-RAG-BGE-M3-TRIPLES-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-BGE-M3-TRIPLES-HESSIAN-AI) | 0.8857 |
	| [German-RAG-BGE-M3-TRIPLES-MERGED-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-BGE-M3-TRIPLES-MERGED-HESSIAN-AI) | 0.8866 |
	| [German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI) | 0.8866 |
	| [German-RAG-UAE-LARGE-V1-TRIPLES-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-UAE-LARGE-V1-TRIPLES-HESSIAN-AI) | 0.8763 |
	| [German-RAG-UAE-LARGE-V1-TRIPLES-MERGED-HESSIAN-AI](https://huggingface.co/avemio/German-RAG-UAE-LARGE-V1-TRIPLES-MERGED-HESSIAN-AI) | 0.8771 |
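
The accuracy metric described above can be sketched as a top-1 hit rate. The example below assumes each sample consists of a query embedding, a list of candidate context embeddings, and the index of the relevant context, all already unit-normalized. The function names and toy vectors are illustrative, and the linked evaluation code may differ in detail.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top1_accuracy(samples):
    """samples: iterable of (query_vec, context_vecs, relevant_index)."""
    hits = 0
    total = 0
    for query, contexts, relevant_index in samples:
        scores = [dot(query, context) for context in contexts]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == relevant_index)
        total += 1
    return hits / total if total else 0.0


samples = [
    ([1.0, 0.0], [[0.6, 0.8], [0.0, 1.0]], 0),  # hit: context 0 scores highest
    ([0.0, 1.0], [[0.8, 0.6], [0.6, 0.8]], 0),  # miss: context 1 scores highest
]
print(top1_accuracy(samples))
```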


	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("avemio/German-RAG-UAE-LARGE-V1-TRIPLES-MERGED-HESSIAN-AI")

	# Encode a few example sentences
	sentences = [
	    "The weather is lovely today.",
	    "It's so sunny outside!",
	    "He drove to the stadium.",
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# (3, 1024)

	# Compute pairwise similarity scores between the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities.shape)
	# torch.Size([3, 3])
	```
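
For retrieval-style use, a query can be ranked against candidate contexts by similarity. The sketch below assumes the encoder returns unit-length vectors, as the `Normalize()` layer in the architecture suggests. `rank_contexts` and `demo` are hypothetical helpers written for this example, and `demo` expects the same repository ID as the loading snippet above.

```python
def rank_contexts(model, query, contexts):
    """Return (context, score) pairs sorted by descending similarity.

    `model` is any object with an `encode(list_of_str)` method that returns
    unit-length vectors, for example a loaded SentenceTransformer.
    """
    vectors = model.encode([query] + contexts)
    query_vec, context_vecs = vectors[0], vectors[1:]
    scores = [
        float(sum(q * c for q, c in zip(query_vec, vec)))
        for vec in context_vecs
    ]
    return sorted(zip(contexts, scores), key=lambda pair: pair[1], reverse=True)


def demo(model_id):
    """Load a model from the Hub and print ranked contexts (downloads weights)."""
    # Imported lazily so that rank_contexts stays usable without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    query = "How is the weather today?"
    contexts = ["He drove to the stadium.", "It's so sunny outside!"]
    for context, score in rank_contexts(model, query, contexts):
        print(f"{score:.3f}  {context}")
```

Calling `demo(...)` with the model's repository ID prints the contexts from most to least similar.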


	## Training Details

	### Framework Versions
	- Python: 3.10.12
	- Sentence Transformers: 3.2.1
	- Transformers: 4.44.2
	- PyTorch: 2.5.0+cu121
	- Accelerate: 0.34.2
	- Datasets: 2.19.0
	- Tokenizers: 0.19.1

	## Citation

	```
	@article{li2023angle,
	  title={AnglE-optimized Text Embeddings},
	  author={Li, Xianming and Li, Jing},
	  journal={arXiv preprint arXiv:2309.12871},
	  year={2023}
	}
	```



	## The German-RAG AI Team
	- [Marcel Rosiak](https://de.linkedin.com/in/marcel-rosiak)
	- [Soumya Paul](https://de.linkedin.com/in/soumya-paul-1636a68a)
	- [Siavash Mollaebrahim](https://de.linkedin.com/in/siavash-mollaebrahim-4084b5153?trk=people-guest_people_search-card)
	- [Zain ul Haq](https://de.linkedin.com/in/zain-ul-haq-31ba35196)