SyllaMoBert-grc-v1: A Syllable-Based ModernBERT for Ancient Greek
SyllaMoBert-grc-v1 is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts, tokenized at the syllable level.
It is specifically designed to tackle tasks involving prosody, meter, and rhyme.
Input needs to be preprocessed and syllabified with syllagreek_utils==0.1.0:

!pip install syllagreek_utils==0.1.0

from syllagreek_utils import preprocess_greek_line, syllabify_joined

tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
This will convert a line such as

Κατέβην χθὲς εἰς Πειραιᾶ

into

κα τέ βην χθὲ σεἰσ πει ραι ᾶ

Note that syllabification crosses word boundaries: adjacent words are fused at the syllable level.
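For reference, here is a minimal, self-contained sketch of that preprocessing step applied to the example line above; the expected output simply restates the conversion shown above as the list the function returns:

from syllagreek_utils import preprocess_greek_line, syllabify_joined

line = "Κατέβην χθὲς εἰς Πειραιᾶ"
tokens = preprocess_greek_line(line)   # tokenize and normalize the raw line
syllables = syllabify_joined(tokens)   # syllabify, fusing adjacent words
print(syllables)
# Expected, based on the conversion shown above:
# ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']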
Load and test the model like this:
# First install the pretokenizer that syllabifies Ancient Greek according to the principles the model adheres to
!pip install syllagreek_utils==0.1.0
# Import what's needed
import random
import torch
from transformers import AutoTokenizer, ModernBertForMaskedLM
from syllagreek_utils import preprocess_greek_line, syllabify_joined # this is the custom preprocessor & syllabifier
# Set the computation device: GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load pretrained model and tokenizer from Hugging Face
checkpoint = "Ericu950/SyllaMoBert-grc-v1"
model = ModernBertForMaskedLM.from_pretrained(checkpoint).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Input Greek text line
line = 'φυήν τ ἄγχιστα ἴσως τὸ ἐξ εἴδους καὶ ψυχῆς φυὴν καλεῖ'
# Apply custom preprocessing: tokenization and normalization
tokens = preprocess_greek_line(line)
# Apply syllabification to tokens, joining them into syllables
syllables = syllabify_joined(tokens)
# Randomly select a syllable index to mask
mask_idx = random.randint(0, len(syllables) - 1)
# Replace the selected syllable with the tokenizer's mask token (e.g., [MASK])
syllables[mask_idx] = tokenizer.mask_token
print("Masked syllables:", syllables)
# Tokenize the masked syllables and prepare inputs for the model
# is_split_into_words=True tells the tokenizer that the input is already split into syllable units
inputs = tokenizer(syllables, is_split_into_words=True, return_tensors="pt").to(device)
# Identify the index of the mask token in the input tensor
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
# Disable gradient calculation since we're just doing inference
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # raw prediction scores for each token
# Extract the logits corresponding to the masked position
mask_logits = logits[0, mask_token_index[0]]
# Get the top 5 predicted token IDs for the masked position
top_tokens = torch.topk(mask_logits, 5, dim=-1).indices
# Decode and print the top 5 predicted tokens for the masked syllable
print("Top predictions for [MASK]:")
for token_id in top_tokens:
    print("→", tokenizer.decode([token_id.item()]))
This should print:
Masked syllables: ['φυ', '[MASK]', 'τἄγ', 'χισ', 'τα', 'ἴ', 'σωσ', 'τὸ', 'ἐκ', 'σεἴ', 'δουσ', 'καὶπ', 'συ', 'χῆσ', 'φυ', 'ὴν', 'κα', 'λεῖ']
Top predictions for [MASK]:
→ ήν
→ ῆσ
→ ῇ
→ ὴν
→ ῆ
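As a follow-up, the top-ranked prediction can be written back into the syllable sequence. This is a minimal sketch reusing the variables (syllables, mask_idx, top_tokens, tokenizer) from the example above; it is plain string handling on top of the model output, not part of the model's API:

# Replace the masked position with the highest-scoring syllable and display the result
best_syllable = tokenizer.decode([top_tokens[0].item()]).strip()
filled = list(syllables)
filled[mask_idx] = best_syllable
print("Filled syllables:", filled)
print("Joined line:", "".join(filled))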
License
MIT License.
Authors
This work is part of ongoing research by Eric Cullhed (Uppsala University) and Albin Thörn Cleland (Lund University).
Acknowledgements
The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.