SyllaMoBert-grc-macronizer-v1
SyllaMoBert-grc-macronizer-v1 is a token classification model designed for the macronization of Ancient Greek. It predicts the syllabic quantity (long or short) of open syllables containing dichrona, i.e. the vowels α, ι, and υ, whose quantity is not marked in standard orthography and must be inferred from morphological or phonological context.
The model was evaluated on an 80/10/10 train/dev/test split and achieved the following test-set accuracies:
- 97.9% on open syllables with short dichrona
- 99.0% on open syllables with long dichrona
- 99.8% on the (trivially predictable) class of heavy syllables
This makes SyllaMoBert-grc-macronizer-v1 a useful tool for tasks involving prosody and metrical analysis.
The model was trained on data generated by Albin Thörn Cleland's rule-based macronizer. It is a fine-tuned version of Ericu950/SyllaMoBert-grc-v1, a ModernBERT model trained from scratch on syllabified Ancient Greek texts.
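For further fine-tuning on other syllable-level tasks, the base model can be loaded with the same classes used in the Quick Start below. This is only a minimal sketch of such a setup (the `num_labels=3` head mirrors this model's label scheme; it is not the training script used to produce this checkpoint):
```python
from transformers import PreTrainedTokenizerFast, ModernBertForTokenClassification

# Load the syllable-level base model with a fresh token classification head
base_model_path = "Ericu950/SyllaMoBert-grc-v1"
tokenizer = PreTrainedTokenizerFast.from_pretrained(base_model_path)
model = ModernBertForTokenClassification.from_pretrained(base_model_path, num_labels=3)
```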
Quick Start
First, install the syllabification utility:
```bash
pip install syllagreek_utils==0.1.0
```
Then run the following code:
```python
import torch
from transformers import PreTrainedTokenizerFast, ModernBertForTokenClassification
from syllagreek_utils import preprocess_greek_line, syllabify_joined
from torch.nn.functional import softmax

# Load model and tokenizer
model_path = "Ericu950/SyllaMoBert-grc-macronizer-v1"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
model = ModernBertForTokenClassification.from_pretrained(model_path, torch_dtype=torch.bfloat16)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Input line
line = "φάσγανον Ἀσσυρίοιο παρήορον ἐκ τελαμῶνος"

# Preprocess and syllabify
tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
print("Syllables:", syllables)

# Tokenize the syllables as pre-split words
inputs = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=2048,
    padding="max_length",
)
inputs.pop("token_type_ids", None)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Predict
with torch.no_grad():
    logits = model(**inputs).logits
probs = softmax(logits, dim=-1)
predictions = torch.argmax(probs, dim=-1).squeeze().cpu().numpy()

# Align predictions with syllables: predictions are indexed by token position,
# so skip special tokens but keep the token index when looking up a prediction
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
aligned_preds = []
syll_idx = 0
for tok_idx, tok in enumerate(tokens):
    if tok in tokenizer.all_special_tokens:
        continue
    if syll_idx >= len(syllables):
        break
    aligned_preds.append((syllables[syll_idx], predictions[tok_idx]))
    syll_idx += 1

# Print results
label_map = {0: "clear", 1: "ambiguous → long", 2: "ambiguous → short"}
print("\nMacronization Predictions:")
for syll, label in aligned_preds:
    print(f"{syll:>10} → {label_map[label]}")
```
Example Output:
```
Syllables: ['φάσ', 'γα', 'νο', 'νἀσ', 'συ', 'ρί', 'οι', 'ο', 'πα', 'ρή', 'ο', 'ρο', 'νἐκ', 'τε', 'λα', 'μῶ', 'νοσ']

Macronization Predictions:
       φάσ → clear
        γα → ambiguous → short
        νο → clear
       νἀσ → clear
        συ → ambiguous → short
        ρί → ambiguous → short
        οι → clear
         ο → clear
        πα → ambiguous → short
        ρή → clear
         ο → clear
        ρο → clear
       νἐκ → clear
        τε → clear
        λα → ambiguous → short
        μῶ → clear
       νοσ → clear
```
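To turn these per-syllable labels into a macronized string, one option is to add a combining macron to the dichronic vowel of every syllable predicted as long and rejoin the syllables. The snippet below is a minimal sketch, assuming the `aligned_preds` and `label_map` from the Quick Start code; the `mark_long` helper, the `DICHRONA` set, and the choice of the combining macron (U+0304) are illustrative and not part of the model or of syllagreek_utils.
```python
import unicodedata

# Dichronic vowels whose quantity the model disambiguates (illustrative set)
DICHRONA = set("αιυ")

def mark_long(syllable: str) -> str:
    """Append a combining macron (U+0304) to each dichronic vowel in the syllable."""
    out = []
    for ch in syllable:
        out.append(ch)
        # Decompose to inspect the base letter underneath accents and breathings
        base = unicodedata.normalize("NFD", ch)[0].lower()
        if base in DICHRONA:
            out.append("\u0304")
    return unicodedata.normalize("NFC", "".join(out))

# Label 1 corresponds to "ambiguous → long" in the label_map above
macronized = "".join(
    mark_long(syll) if pred == 1 else syll
    for syll, pred in aligned_preds
)
print(macronized)
```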
⸻
📝 License
This project is released under the MIT License.
⸻
👥 Authors
This work is part of ongoing research by:
• Albin Thörn Cleland (Lund University)
• Eric Cullhed (Uppsala University)
⸻
💻 Acknowledgements
The computations were made possible by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council (grant agreement no. 2022-06725).