ChangeIsKey/llama3-janus-pos

Janus (PoS)

(Built with Meta Llama 3)

For the version without PoS tags, see Janus.

Model Details

Model Name: Janus
Version: 1.0
Developers: Pierluigi Cassotti, Nina Tahmasebi
Affiliation: University of Gothenburg
License: MIT
GitHub Repository: Historical Word Usage Generation
Paper: Sense-specific Historical Word Usage Generation
Contact: pierluigi.cassotti@gu.se

Model Description

Janus is a fine-tuned Llama 3 8B model designed to generate historically and semantically accurate word usages. It takes as input a word, its sense definition, a year, and (in this PoS variant) a part-of-speech tag, and produces example sentences that reflect linguistic usage from the specified period. The model is particularly useful for semantic change detection, historical NLP, and linguistic research.

Intended Use

Semantic Change Detection: Investigating how word meanings evolve over time.
Historical Text Processing: Enhancing the understanding and modeling of historical texts.
Corpus Expansion: Generating sense-annotated corpora for linguistic studies.

Training Data

Dataset: Extracted from the Oxford English Dictionary (OED)
Size: Over 1.2 million sense-annotated historical usages
Time Span: 1700 - 2020

Data Format:

<year><|t|><lemma><|t|><definition><|s|><historical usage sentence><|end|>

Janus (PoS) Format:

<year><|t|><lemma><|t|><definition><|p|><PoS><|p|><|s|><historical usage sentence><|end|>
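
The two layouts above can be assembled programmatically. The following is a minimal sketch, not code from the Janus repository: the function name `build_prompt` and its parameters are hypothetical, and it assumes that a generation prompt stops at `<|s|>` so that the model continues with the usage sentence.

```python
from typing import Optional


def build_prompt(year: int, lemma: str, definition: str, pos: Optional[str] = None) -> str:
    """Assemble a generation prompt in the layouts described above.

    Hypothetical helper: with `pos` set it follows the Janus (PoS) layout,
    otherwise the plain layout. The prompt ends at <|s|>, leaving the
    historical usage sentence for the model to generate.
    """
    prompt = f"{year}<|t|>{lemma}<|t|>{definition}"
    if pos is not None:
        # The PoS tag is wrapped in a pair of <|p|> markers.
        prompt += f"<|p|>{pos}<|p|>"
    return prompt + "<|s|>"


print(build_prompt(1800, "awful", "Used to emphasize something unpleasant or negative.", pos="jj"))
```

Keeping the markers in one place avoids hand-typed prompts drifting from the training layout.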

Training Procedure

Base Model: meta-llama/Llama-3-8B
Optimization: QLoRA (Quantized Low-Rank Adaptation)
Batch Size: 4
Learning Rate: 2e-4
Epochs: 1

Model Performance

Temporal Accuracy: Root mean squared error (RMSE) of ~52.7 years (close to OED ground truth)
Semantic Accuracy: Comparable to OED test data on human evaluations
Context Variability: Low lexical repetition, preserving natural linguistic diversity
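
For readers unfamiliar with the temporal metric, RMSE over years can be computed as in the sketch below. The year values are invented for illustration, and this card does not describe how a year is assigned to each generated sentence. `rmse_years` is a hypothetical helper, not the authors' evaluation code.

```python
import math
from typing import Sequence


def rmse_years(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Root mean squared error between predicted and target years."""
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented example values, not results from the paper.
print(rmse_years([1800, 1850], [1790, 1900]))
```

Because errors are squared before averaging, a few sentences dated far from the target year raise the RMSE more than many small deviations do.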

Usage Example

Generating Historical Usages

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ChangeIsKey/llama3-janus-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Prompt layout: <year><|t|><lemma><|t|><definition><|p|><PoS><|p|><|s|>
input_text = "1800<|t|>awful<|t|>Used to emphasize something unpleasant or negative; ‘such a’, ‘an absolute’.<|p|>jj<|p|><|s|>"
# Move the inputs to the device chosen by device_map="auto" instead of assuming CUDA.
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Set do_sample=True explicitly so temperature and top_p are applied; greedy decoding ignores them.
output = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
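
The decoded string contains the prompt followed by the generated sentence, so a small post-processing step can isolate the sentence. This is a sketch under assumptions: it presumes the `<|s|>` and `<|end|>` markers are still present in the decoded text (for instance when decoding with `skip_special_tokens=False`), and `extract_usage` is a hypothetical helper name.

```python
def extract_usage(decoded: str, start: str = "<|s|>", end: str = "<|end|>") -> str:
    """Return the text between the first start marker and the next end marker."""
    # The prompt ends at the first <|s|>; everything after it is generated text.
    _, found, tail = decoded.partition(start)
    if not found:
        raise ValueError(f"start marker {start!r} not found")
    # Drop the end marker and anything generated after it.
    sentence, _, _ = tail.partition(end)
    return sentence.strip()


# Made-up string for illustration, not actual model output.
example = "1800<|t|>awful<|t|>...<|p|>jj<|p|><|s|>It was an awful mess.<|end|>"
print(extract_usage(example))
```

If generation stops at `max_new_tokens` before `<|end|>` is produced, the helper returns the truncated text after `<|s|>`.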

For more examples, see the GitHub repository Historical Word Usage Generation.

Limitations & Ethical Considerations

Historical Bias: The model may reflect biases present in historical texts.
Time Granularity: Temporal resolution is approximate (RMSE of ~52.7 years).
Modern Influence: Despite fine-tuning, the model may still generate modern phrases in older contexts.
Not Trained for Fairness: The model has not been explicitly trained to be fair or unbiased. It may produce sensitive, outdated, or culturally inappropriate content.

Citation

If you use Janus, please cite:

@article{Cassotti2024Janus,
  author = {Pierluigi Cassotti and Nina Tahmasebi},
  title = {Sense-specific Historical Word Usage Generation},
  journal = {TACL},
  year = {2025}
}
