SyllaMoBert-grc-macronizer-v1

SyllaMoBert-grc-macronizer-v1 is a token classification model designed for the macronization of Ancient Greek. It predicts the syllabic quantity (long or short) of open syllables containing a dichronon, that is, a vowel (α, ι, υ) whose length is not marked in writing and depends on morphological or phonological context.

The model was evaluated using an 80/10/10 train/dev/test split and achieved the following accuracy:

  • 97.9% on open syllables with short dichrona
  • 99.0% on open syllables with long dichrona
  • 99.8% on the (trivially predictable) class of heavy syllables

This makes SyllaMoBert-grc-macronizer-v1 a useful tool for tasks involving prosody and metrical analysis.
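
For reference, the per-class figures above are plain ratios of correct predictions to class size. Below is a minimal, self-contained sketch of that computation; the label lists are dummy data, not the actual evaluation set:

from collections import Counter

# Illustrative only: per-class accuracy from parallel lists of gold and
# predicted syllable labels (0 = clear, 1 = ambiguous → long, 2 = ambiguous → short).
gold = [0, 2, 0, 1, 2, 0]   # dummy gold labels
pred = [0, 2, 0, 1, 1, 0]   # dummy predicted labels

correct, total = Counter(), Counter()
for g, p in zip(gold, pred):
    total[g] += 1
    correct[g] += int(g == p)

for label in sorted(total):
    print(f"class {label}: {correct[label] / total[label]:.1%}")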

The model is trained on data generated by Albin Thörn Cleland’s rule-based macronizer. It is a fine-tuned version of Ericu950/SyllaMoBert-grc-v1, a ModernBERT model trained from scratch on syllabified Ancient Greek texts.
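
For readers who want to replicate a similar setup, the following is a minimal sketch of how a three-way token-classification head can be attached to the base model. The actual training corpus, label alignment and hyperparameters behind this release are not documented here, so treat everything below as illustrative:

import torch
from transformers import PreTrainedTokenizerFast, ModernBertForTokenClassification

# Illustrative only: load the syllable-level base model and attach a freshly
# initialised 3-way token-classification head (0 = clear, 1 = ambiguous → long,
# 2 = ambiguous → short), mirroring the label set used by this model.
base = "Ericu950/SyllaMoBert-grc-v1"
tokenizer = PreTrainedTokenizerFast.from_pretrained(base)
model = ModernBertForTokenClassification.from_pretrained(
    base,
    num_labels=3,
    torch_dtype=torch.bfloat16,
)
# Training itself (e.g. with transformers.Trainer on syllable-level labels
# produced by the rule-based macronizer) is omitted here.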


Quick Start

First, install the syllabification utility:

pip install syllagreek_utils==0.1.0

Then run the following code:

import torch
from transformers import PreTrainedTokenizerFast, ModernBertForTokenClassification
from syllagreek_utils import preprocess_greek_line, syllabify_joined
from torch.nn.functional import softmax

# Load model and tokenizer
model_path = "Ericu950/SyllaMoBert-grc-macronizer-v1"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
model = ModernBertForTokenClassification.from_pretrained(model_path, torch_dtype=torch.bfloat16)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Input line
line = "φάσγανον Ἀσσυρίοιο παρήορον ἐκ τελαμῶνος"

# Preprocess and syllabify
tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
print("Syllables:", syllables)

# Tokenize
inputs = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=2048,
    padding="max_length"
)
inputs.pop("token_type_ids", None)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Predict
with torch.no_grad():
    logits = model(**inputs).logits
    probs = softmax(logits, dim=-1)
    predictions = torch.argmax(probs, dim=-1).squeeze().cpu().numpy()

# Align predictions with syllables: skip special tokens and index the
# prediction array by token position, not by syllable count
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
aligned_preds = []
syll_idx = 0
for tok_idx, tok in enumerate(tokens):
    if tok in tokenizer.all_special_tokens:
        continue
    if syll_idx >= len(syllables):
        break
    aligned_preds.append((syllables[syll_idx], predictions[tok_idx]))
    syll_idx += 1

# Print results
label_map = {0: "clear", 1: "ambiguous → long", 2: "ambiguous → short"}
print("\nMacronization Predictions:")
for syll, label in aligned_preds:
    print(f"{syll:>10} → {label_map[label]}")

Example Output:

Syllables: ['φάσ', 'γα', 'νο', 'νἀσ', 'συ', 'ρί', 'οι', 'ο', 'πα', 'ρή', 'ο', 'ρο', 'νἐκ', 'τε', 'λα', 'μῶ', 'νοσ']

Macronization Predictions:
       φάσ → clear
        γα → ambiguous → short
        νο → clear
       νἀσ → clear
        συ → ambiguous → short
        ρί → ambiguous → short
        οι → clear
         ο → clear
        πα → ambiguous → short
        ρή → clear
         ο → clear
        ρο → clear
       νἐκ → clear
        τε → clear
        λα → ambiguous → short
        μῶ → clear
       νοσ → clear
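
Continuing from the aligned_preds list above, the predictions can be turned back into marked-up text by adding a combining macron or breve to the relevant vowel. The diacritic choice and the "last vowel in the syllable" heuristic below are illustrative assumptions, not part of the package:

import unicodedata

DIACRITIC = {1: "\u0304", 2: "\u0306"}  # combining macron / combining breve

def is_vowel(ch):
    # Strip accents and breathings, then compare against the plain vowel letters.
    return unicodedata.normalize("NFD", ch)[0] in "αεηιουωΑΕΗΙΟΥΩ"

def mark_syllable(syll, label):
    mark = DIACRITIC.get(int(label))
    if mark is None:
        return syll
    # Attach the mark to the last vowel of the syllable (simple heuristic).
    for i in range(len(syll) - 1, -1, -1):
        if is_vowel(syll[i]):
            return syll[:i + 1] + mark + syll[i + 1:]
    return syll

print("".join(mark_syllable(s, l) for s, l in aligned_preds))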



⸻

📝 License

This project is released under the MIT License.

⸻

👥 Authors

This work is part of ongoing research by:
    •	Albin Thörn Cleland (Lund University)
    •	Eric Cullhed (Uppsala University)

⸻

💻 Acknowledgements

The computations were made possible by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council (grant agreement no. 2022-06725).
