Model Card for LuxEmbedder

Model Summary

LuxEmbedder is a sentence-transformers model that transforms sentences and paragraphs into 768-dimensional dense vectors, enabling tasks like clustering and semantic search, with a primary focus on Luxembourgish. Leveraging a cross-lingual approach, LuxEmbedder effectively handles Luxembourgish text while also mapping input from 108 other languages into a shared embedding space. For the full list of supported languages, refer to the sentence-transformers/LaBSE documentation, as LaBSE served as the foundation for LuxEmbedder.

This model was introduced in LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings (Philippy et al., 2024). The paper addresses the scarcity of parallel data for Luxembourgish by creating LuxAlign, a high-quality, human-generated parallel dataset, which forms the basis for LuxEmbedder’s competitive performance on cross-lingual and monolingual tasks for Luxembourgish.

With the release of LuxEmbedder, we also provide ParaLux, a Luxembourgish paraphrase detection benchmark, to encourage further exploration and development in NLP for Luxembourgish.

Example Usage

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer, util
import numpy as np
import pandas as pd

# Load the model
model = SentenceTransformer('fredxlpy/LuxEmbedder')

# Example sentences
data = pd.DataFrame({
    "id": ["lb1", "lb2", "lb3", "en1", "en2", "en3", "zh1", "zh2", "zh3"],
    "text": [
        "Moien, wéi geet et?",         # Luxembourgish: Hello, how are you?
        "D'Wieder ass haut schéin.",   # Luxembourgish: The weather is beautiful today.
        "Ech schaffen am Büro.",       # Luxembourgish: I work in the office.
        "Hello, how are you?",         
        "The weather is great today.", 
        "I work in an office.",        
        "你好, 你怎么样?",               # Chinese: Hello, how are you?
        "今天天气很好.",                 # Chinese: The weather is very good today.
        "我在办公室工作."                # Chinese: I work in an office.
    ]
})

# Encode the sentences to obtain sentence embeddings
embeddings = model.encode(data["text"].tolist(), convert_to_tensor=True)

# Compute the cosine similarity matrix
cosine_similarity_matrix = util.cos_sim(embeddings, embeddings).cpu().numpy()

# Create a DataFrame for the similarity matrix with "id" as row and column labels
similarity_df = pd.DataFrame(
    np.round(cosine_similarity_matrix, 2),
    index=data["id"],
    columns=data["id"]
)

# Display the similarity matrix
print("Cosine Similarity Matrix:")
print(similarity_df)

# Cosine Similarity Matrix:
# id    lb1   lb2   lb3   en1   en2   en3   zh1   zh2   zh3
# id                                                       
# lb1  1.00  0.60  0.42  0.96  0.59  0.40  0.95  0.62  0.43
# lb2  0.60  1.00  0.41  0.56  0.99  0.39  0.56  0.99  0.42
# lb3  0.42  0.41  1.00  0.44  0.42  0.99  0.46  0.43  0.99
# en1  0.96  0.56  0.44  1.00  0.55  0.43  0.99  0.58  0.46
# en2  0.59  0.99  0.42  0.55  1.00  0.40  0.55  0.99  0.43
# en3  0.40  0.39  0.99  0.43  0.40  1.00  0.44  0.41  0.99
# zh1  0.95  0.56  0.46  0.99  0.55  0.44  1.00  0.58  0.47
# zh2  0.62  0.99  0.43  0.58  0.99  0.41  0.58  1.00  0.44
# zh3  0.43  0.42  0.99  0.46  0.43  0.99  0.47  0.44  1.00
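
The matrix above can also be read as a small cross-lingual retrieval result. The following is a minimal sketch, not part of the original example: it copies the Luxembourgish-to-English block of the printed matrix into a NumPy array (so it runs without downloading the model) and picks, for each Luxembourgish sentence, the English sentence with the highest cosine similarity. The variable names are illustrative.

```python
import numpy as np

lb_ids = ["lb1", "lb2", "lb3"]
en_ids = ["en1", "en2", "en3"]

# Rows: Luxembourgish sentences, columns: English sentences.
# Values are copied from the similarity matrix printed above.
lb_en_sim = np.array([
    [0.96, 0.59, 0.40],
    [0.56, 0.99, 0.39],
    [0.44, 0.42, 0.99],
])

# For each Luxembourgish sentence, take the index of the most similar English sentence.
best = lb_en_sim.argmax(axis=1)
matches = {lb: en_ids[j] for lb, j in zip(lb_ids, best)}
print(matches)
# {'lb1': 'en1', 'lb2': 'en2', 'lb3': 'en3'}
```

Each Luxembourgish sentence is matched to its English translation, which is the behavior one would hope for from a shared embedding space. With real embeddings, the same argmax would be applied to the relevant slice of cosine_similarity_matrix.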

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 768, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
  (3): Normalize()
)
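
To make the module list above more concrete, here is a rough NumPy sketch of what stages (1) to (3) do to the Transformer output: take the first ([CLS]) token, apply a 768-to-768 linear layer with Tanh, then L2-normalize. The token embeddings and the dense weights below are random placeholders, not the model's actual parameters, so only the shapes and the unit norm of the result are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the BertModel output: (batch, sequence length, hidden size).
token_embeddings = rng.normal(size=(2, 5, 768))

# (1) Pooling with pooling_mode_cls_token=True: keep only the first token.
cls_embeddings = token_embeddings[:, 0, :]

# (2) Dense 768 -> 768 with bias and Tanh activation (random placeholder weights).
weight = rng.normal(scale=0.02, size=(768, 768))
bias = np.zeros(768)
dense_out = np.tanh(cls_embeddings @ weight.T + bias)

# (3) Normalize: scale every vector to unit L2 norm.
norms = np.linalg.norm(dense_out, axis=1, keepdims=True)
sentence_embeddings = dense_out / norms

print(sentence_embeddings.shape)  # (2, 768)
```

Because of the final Normalize step, every output vector has length 1, so cosine similarity between two embeddings equals their dot product.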

Citation

@misc{philippy2024luxembedder,
      title={LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings}, 
      author={Fred Philippy and Siwen Guo and Jacques Klein and Tegawendé F. Bissyandé},
      year={2024},
      eprint={2412.03331},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.03331}, 
}

Model size: 471M params (Safetensors, tensor type F32)