
shlm-grc-en

Sentence embeddings for English and Ancient Greek

The HLM model architecture is based on Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers, but uses a simpler design with rotary position embeddings instead of DeBERTa as the base architecture. This character-aware hierarchical architecture produces superior results compared to the vanilla BERT architecture for low-resource languages such as Ancient Greek. The model is trained to produce sentence embeddings using the method described in Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation.

This model was distilled from BAAI/bge-base-en-v1.5 for embedding English and Ancient Greek text.

Usage (Sentence-Transformers)

This model is currently incompatible with the latest version of the sentence-transformers library. For now, you must use this fork: https://github.com/kevinkrahn/sentence-transformers
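
For example, the fork can be installed directly from GitHub (a sketch assuming a standard pip and git setup; the fork's default branch is used here):

pip install git+https://github.com/kevinkrahn/sentence-transformers.git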

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('kevinkrahn/shlm-grc-en')
embeddings = model.encode(sentences)
print(embeddings)
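
Because English and Ancient Greek sentences are embedded in a shared vector space, the embeddings can be compared across languages, for example with cosine similarity. A minimal sketch using the util helpers from sentence-transformers (assuming util.cos_sim is available in the fork; the sentence pair is chosen only for illustration):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('kevinkrahn/shlm-grc-en')

# An English sentence and its Ancient Greek counterpart (John 1:1)
embeddings = model.encode([
    'In the beginning was the Word',
    'ἐν ἀρχῇ ἦν ὁ λόγος',
])

# Cosine similarity between the cross-lingual pair
print(util.cos_sim(embeddings[0], embeddings[1]))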

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model as follows: first pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized token embeddings. This model uses CLS pooling, i.e. the embedding of the first token.

from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output):
    # Use the embedding of the first ([CLS]) token as the sentence embedding
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
# The model repo contains custom code, so trust_remote_code is needed
tokenizer = AutoTokenizer.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)
model = AutoModel.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output)

print("Sentence embeddings:")
print(sentence_embeddings)
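
To compare the resulting embeddings, e.g. for semantic similarity, you can use cosine similarity. A minimal sketch with plain torch, continuing from the code above (normalization is shown only to turn the dot product into cosine similarity, not because the model requires normalized embeddings):

import torch.nn.functional as F

# L2-normalize so the dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity = normalized @ normalized.T
print(similarity)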

Citing & Authors

@inproceedings{riemenschneider-krahn-2024-heidelberg,
    title = "Heidelberg-Boston @ {SIGTYP} 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers",
    author = "Riemenschneider, Frederick  and
      Krahn, Kevin",
    editor = "Hahn, Michael  and
      Sorokin, Alexey  and
      Kumar, Ritesh  and
      Shcherbakov, Andreas  and
      Otmakhova, Yulia  and
      Yang, Jinrui  and
      Serikov, Oleg  and
      Rani, Priya  and
      Ponti, Edoardo M.  and
      Murado{\u{g}}lu, Saliha  and
      Gao, Rena  and
      Cotterell, Ryan  and
      Vylomova, Ekaterina",
    booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
    month = mar,
    year = "2024",
    address = "St. Julian's, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sigtyp-1.16",
    pages = "131--141",
}
@inproceedings{krahn-etal-2023-sentence,
    title = "Sentence Embedding Models for {A}ncient {G}reek Using Multilingual Knowledge Distillation",
    author = "Krahn, Kevin  and
      Tate, Derrick  and
      Lamicela, Andrew C.",
    editor = "Anderson, Adam  and
      Gordin, Shai  and
      Li, Bin  and
      Liu, Yudong  and
      Passarotti, Marco C.",
    booktitle = "Proceedings of the Ancient Language Processing Workshop",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.alp-1.2",
    pages = "13--22",
}