About

The Vikhr base model was instruction-tuned to generate labels that attribute provenance components to snippets of medieval Church Slavic text.

Depending on the prompt (see below), the labels specify the following provenance information about the focus text:

- its Church Slavic language stage (early, middle, late)
- its Church Slavic dialect (south, east)
- the geographical region where the text originated.
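
Each label in the prompt under Usage packs these three components into one comma-separated string. As a minimal sketch, assuming the labels always follow the pattern "STAGE Church Slavic, DIALECT dialect, REGION region", a label can be split back into its parts like this. The names `Provenance` and `parse_label` are illustrative and not part of the model or its repository.

```python
from typing import NamedTuple


class Provenance(NamedTuple):
    stage: str    # e.g. "Early"
    dialect: str  # e.g. "Eastern"
    region: str   # e.g. "Kyivan Rus'"


def _strip_suffix(text: str, suffix: str) -> str:
    """Remove `suffix` from the end of `text` if present."""
    return text[: -len(suffix)] if text.endswith(suffix) else text


def parse_label(label: str) -> Provenance:
    """Split a provenance label into stage, dialect, and region.

    Assumes the three-part, comma-separated pattern used in the prompt.
    """
    parts = [part.strip() for part in label.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated parts, got {len(parts)}: {label!r}")
    stage_part, dialect_part, region_part = parts
    return Provenance(
        stage=_strip_suffix(stage_part, " Church Slavic"),
        dialect=_strip_suffix(dialect_part, " dialect"),
        region=_strip_suffix(region_part, " region"),
    )
```

For example, `parse_label("Middle Church Slavic, Eastern dialect, Novgorod region")` yields the stage "Middle", the dialect "Eastern", and the region "Novgorod".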

Usage

import re

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "pirolen/Vikhr-HistoricalChurchSlavic"

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    max_length=1024,
    padding_side="left"
)

# add pad token
tokenizer.pad_token = tokenizer.bos_token
tokenizer.pad_token_id = tokenizer.bos_token_id

# load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# system prompt
system_prompt = "You are a historical linguist who can differentiate " \
    "three stages of Church Slavic: Early, Middle, Late, "\
    "and their respective regional dialects. You can reproduce " \
    "the type of orthographic, grammatical, and lexical variation " \
    "that is characteristic for specific cultural-geographical areas " \
    "for all these variations of Church Slavic."

# user prompt template
user_prompt_template = "You will see a text. " \
    "Identify its Church Slavic language stage, regional dialect, " \
    "and the historical geographic-cultural area where the text was " \
    "written. Attribute one of the following labels to identify " \
    "the language and regional origin of the text: " \
    "'Early Church Slavic, Eastern dialect, Kyivan Rus' region', " \
    "'Early Church Slavic, Southern dialect, Bulgaria region', " \
    "'Early Church Slavic, Southern dialect, Macedonia region', " \
    "'Early Church Slavic, Western dialect, Moravia region', " \
    "'Late Church Slavic, Eastern dialect, Muscovy region', " \
    "'Late Church Slavic, Southern dialect, Serbia region', " \
    "'Middle Church Slavic, Eastern dialect, Kyivan Rus' region', " \
    "'Middle Church Slavic, Eastern dialect, Muscovy region', " \
    "'Middle Church Slavic, Eastern dialect, Novgorod region', " \
    "'Middle Church Slavic, Eastern dialect, South of Rus' region', " \
    "'Middle Church Slavic, Eastern dialect, Suzdal region', " \
    "'Middle Church Slavic, Southern dialect, Macedonia/Serbia region'. " \
    "This is the sentence to be annotated: #sentence#"

# insert sentence to be classified
user_prompt = user_prompt_template.replace(
    "#sentence#",
    "и͑зʼми насъ г͆и о͑тъ напасті и͗ о͑тъ съблазн҄ъ творⱕштиихъ безакониѥ"
)

# apply chat template
chat_template = "<s>{role}\n{content}</s>"
generation_prompt = "bot"

prompts = [{"role": "system", "content": system_prompt},
           {"role": "user", "content": user_prompt}]
for i in range(len(prompts)):
    prompts[i] = chat_template.format(**prompts[i])

prompts.append(generation_prompt)
prompt = "\n".join(prompts)

# print(prompt)

# inference
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=25
)

original_text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
# extract user input from the original text
original_text = re.search(r"This is the sentence to be annotated:\s*(.*?)\s*</s>", original_text, re.DOTALL)
if original_text:
    original_text = original_text.group(1).strip()
print("\n- Text to be classified:", original_text)

# Extract the generated text between the "bot" tags
text_classification = tokenizer.decode(output[0], skip_special_tokens=True)
provenance = re.search(r"bot\s*(.*?)\s*</s>", text_classification, re.DOTALL)
if provenance:
    provenance = provenance.group(1).strip()
else:
    provenance = "No classification found in the generated text."
print("\n- Classified provenance:", provenance)
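
The generated string may carry stray quotes or whitespace, or be cut short by the token limit, so it can help to map it back onto the closed set of labels from the prompt. The following is a minimal sketch of one way to do that with fuzzy string matching from the standard library. The helper `normalize_prediction` and the `cutoff` value of 0.8 are assumptions for illustration, not part of the released code.

```python
import difflib

# The twelve labels listed in the user prompt template.
LABELS = [
    "Early Church Slavic, Eastern dialect, Kyivan Rus' region",
    "Early Church Slavic, Southern dialect, Bulgaria region",
    "Early Church Slavic, Southern dialect, Macedonia region",
    "Early Church Slavic, Western dialect, Moravia region",
    "Late Church Slavic, Eastern dialect, Muscovy region",
    "Late Church Slavic, Southern dialect, Serbia region",
    "Middle Church Slavic, Eastern dialect, Kyivan Rus' region",
    "Middle Church Slavic, Eastern dialect, Muscovy region",
    "Middle Church Slavic, Eastern dialect, Novgorod region",
    "Middle Church Slavic, Eastern dialect, South of Rus' region",
    "Middle Church Slavic, Eastern dialect, Suzdal region",
    "Middle Church Slavic, Southern dialect, Macedonia/Serbia region",
]


def normalize_prediction(generated, labels=LABELS, cutoff=0.8):
    """Return the known label closest to `generated`, or None if nothing is close."""
    # Drop surrounding whitespace and quote characters.
    cleaned = generated.strip().strip("'\"")
    if cleaned in labels:
        return cleaned
    # Fall back to the most similar label above the (assumed) similarity cutoff.
    matches = difflib.get_close_matches(cleaned, labels, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Calling `normalize_prediction(provenance)` after the inference step returns one of the twelve labels, or `None` when the output does not resemble any of them.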

Validation

$ cd validation
$ python evaluate.py -cd CACHE_DIR

Arguments
- CACHE_DIR: cache directory containing the inference.csv file produced by minex_inference.py

Validation results are written to CACHE_DIR/val.json.
Confusion matrices are written to CACHE_DIR/cm_TARGET.png.
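
To illustrate what a confusion matrix over predicted labels captures, here is a minimal, self-contained sketch that counts (gold, predicted) pairs and derives accuracy from them. It does not reproduce evaluate.py, whose implementation and input columns are not shown here. The function names and the in-memory lists are assumptions for illustration only.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count how often each gold label was paired with each predicted label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def accuracy(gold, predicted):
    """Share of items whose predicted label equals the gold label."""
    counts = confusion_counts(gold, predicted)
    total = sum(counts.values())
    # Pairs with identical gold and predicted labels form the matrix diagonal.
    correct = sum(n for (g, p), n in counts.items() if g == p)
    return correct / total if total else 0.0


# Hypothetical language-stage labels for four snippets.
gold_stages = ["Early", "Early", "Middle", "Late"]
predicted_stages = ["Early", "Middle", "Middle", "Late"]
stage_counts = confusion_counts(gold_stages, predicted_stages)
stage_accuracy = accuracy(gold_stages, predicted_stages)
```

Off-diagonal entries such as `("Early", "Middle")` show which categories get confused with each other.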

Publication

https://aclanthology.org/2025.ranlp-1.76.pdf

Citation

If you use this work, please cite it as follows:

@inproceedings{lendvai-etal-2025-instruction,
    title = "Instruction Finetuning to Attribute Language Stage, Dialect, and Provenance Region to Historical Church Slavic Texts",
    author = "Lendvai, Piroska and Reichel, Uwe D. and Jouravel, Anna and Rabus, Achim and Renje, Elena",
    booktitle = "Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing (RANLP 2025)",
    month = sep,
    year = "2025",
    address = "Varna, Bulgaria",
    url = "https://aclanthology.org/2025.ranlp-1.76/",
    pages = "654--662",
}

Project materials

https://gitlab.lrz.de/badw-it/quantislav-project-public

Safetensors

Model size: 7B params
Tensor type: F16

Model tree for pirolen/Vikhr-HistoricalChurchSlavic

Base model: Vikhrmodels/Vikhr-7b-0.2
Finetuned: this model