dant5-large
language:
- da
language_bcp47:
- da
- da-bornholm
- da-synnejyl
tags:
- t5
license: cc-by-4.0
datasets:
- dagw
widget:
- text: "Aarhus er Danmarks ."
co2_eq_emissions:
  training_type: "pretraining"
  geographical_location: "Copenhagen, Denmark"
  hardware_used: "4 A100 GPUs, 508 training hours"
  emissions: 132080
dant5-large is a 770M-parameter model with an architecture identical to t5-large. Training details are given in the paper Training a T5 Using Lab-sized Resources. The model was trained for 10 epochs on the Danish Gigaword Corpus (official website, paper).
To use the model:
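from transformers import AutoTokenizer, T5ForConditionalGeneration
# Load the tokenizer and the pretrained weights from the Hugging Face Hub.
model_name = "strombergnlp/dant5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
# Masked input: each <extra_id_N> sentinel stands in for a removed span.
# Label: the removed spans, each introduced by its sentinel.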
original_text = "Aarhus er Danmarks <extra_id_0> landets ældste. Under navnet Aros, som betyder å-munding, optræder den i skriftlige kilder i 900-tallet, men <extra_id_1> historie tilbage til 700-tallet.<extra_id_2>"
original_label = "<extra_id_0> næststørste by og en af <extra_id_1> arkæologiske fund fører dens <extra_id_2>"
# Tokenize the input and the label, then compute the loss of the label under the model.
input_ids = tokenizer(original_text, return_tensors="pt").input_ids
labels = tokenizer(original_label, return_tensors="pt").input_ids
loss = model(input_ids=input_ids, labels=labels).loss
print(f"Original text: {original_text}")
print(f"Original label: {original_label}")
print(f"Loss for the original label is {loss.item()}")
# Let the model generate its own fill-ins for the sentinels, using default generation settings.
sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)
print("A sample generated continuation:")
print(sequences[0])
You should see output similar to:
Original text: Aarhus er Danmarks <extra_id_0> landets ældste. Under navnet Aros, som betyder å-munding, optræder den i skriftlige kilder i 900-tallet, men <extra_id_1> historie tilbage til 700-tallet.<extra_id_2>
Original label: <extra_id_0> næststørste by og en af <extra_id_1> arkæologiske fund fører dens <extra_id_2>
Loss for the original label is 4.174272537231445
A sample generated continuation:
<pad><extra_id_0> ældste by og<extra_id_1> har sin<extra_id_2> Se også<extra_id_3></s>
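The input and label strings in the example follow the T5 sentinel-token format: each masked span in the input is replaced by an `<extra_id_N>` token, and the label lists the removed spans, each preceded by its sentinel and closed by one final sentinel. The following is a minimal sketch of how such strings could be built and how a decoded generation could be split back into per-sentinel fill-ins. The helper names `build_span_corruption_pair` and `parse_sentinel_output` are hypothetical and not part of transformers or of this model's tooling. The spacing around sentinels is an assumption chosen to resemble the example, and, unlike the example input, the sketch does not append a trailing sentinel to the input.

```python
import re


def build_span_corruption_pair(text, masked_spans):
    """Replace each span (in order of appearance) with a sentinel token.

    Hypothetical helper: returns (masked_input, label) as plain strings.
    """
    input_parts = []
    label_parts = []
    cursor = 0
    for i, span in enumerate(masked_spans):
        start = text.find(span, cursor)
        if start == -1:
            raise ValueError(f"span not found after position {cursor}: {span!r}")
        sentinel = f"<extra_id_{i}>"
        # Keep the text before the span, then put the sentinel in its place.
        input_parts.append(text[cursor:start] + sentinel)
        # The label pairs each sentinel with the text it replaced.
        label_parts.append(f"{sentinel} {span}")
        cursor = start + len(span)
    input_parts.append(text[cursor:])
    # A final sentinel marks the end of the label.
    label_parts.append(f"<extra_id_{len(masked_spans)}>")
    return "".join(input_parts), " ".join(label_parts)


def parse_sentinel_output(decoded):
    """Split a decoded generation into a {sentinel: fill-in text} mapping."""
    cleaned = decoded.replace("<pad>", "").replace("</s>", "")
    # The capture group keeps the sentinels in the split result.
    pieces = re.split(r"(<extra_id_\d+>)", cleaned)
    fills = {}
    current = None
    for piece in pieces:
        if re.fullmatch(r"<extra_id_\d+>", piece):
            current = piece
            fills[current] = ""
        elif current is not None:
            fills[current] += piece
    return {sentinel: text.strip() for sentinel, text in fills.items()}


masked, label = build_span_corruption_pair(
    "Aarhus er Danmarks næststørste by og en af landets ældste.",
    ["næststørste by og en af"],
)
print(masked)
print(label)
print(parse_sentinel_output(
    "<pad><extra_id_0> ældste by og<extra_id_1> har sin<extra_id_2> Se også<extra_id_3></s>"
))
```

The strings returned by `build_span_corruption_pair` can be passed to the tokenizer in the same way as `original_text` and `original_label`, and `parse_sentinel_output` can be applied to `sequences[0]`.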