PULI LlumiX 32K base (6.74 billion parameters)

For further details, or to test our instruct model, see our demo site.

Trained with OpenChatKit (GitHub)
The LLaMA-2-7B-32K model was continually pretrained on a Hungarian dataset
The model has been extended to a context length of 32K with position interpolation
Checkpoint: 100 000 steps
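
Position interpolation, mentioned in the list above, can be illustrated with a minimal sketch. It assumes the standard rotary position embedding (RoPE) formulation and an original context of 4,096 tokens for LLaMA-2; neither value is stated in this card, and the `rope_angles` helper and its `head_dim` and `base` defaults are hypothetical names chosen for illustration. The idea is that positions are scaled down linearly so the extended range maps back into the range seen during the original pretraining.

```python
ORIGINAL_CONTEXT = 4096    # assumption: context length of the original LLaMA-2 pretraining
EXTENDED_CONTEXT = 32768   # from this model card (32K)
SCALE = ORIGINAL_CONTEXT / EXTENDED_CONTEXT  # linear interpolation factor


def rope_angles(position, head_dim=8, base=10000.0, scale=1.0):
    """Return the rotary angles for one token position (illustrative helper).

    With scale < 1 the position is compressed before the angles are computed,
    which is the core of linear position interpolation.
    """
    scaled_position = position * scale
    return [scaled_position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


# A position near the end of the extended window is mapped onto a position
# inside the original window, so its angles match those of the smaller index.
far = rope_angles(32760, scale=SCALE)
near = rope_angles(4095)
print(SCALE, far == near)
```

This is only a sketch of the scaling step, not the training code used for this model.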

Dataset for continued pretraining

Hungarian: 7.9 billion words, documents (763K) that exceed 5000 words in length
English: Long Context QA (2 billion words), BookSum (78 million words)

Limitations

max_seq_length = 32 768
float16
vocab size: 32 000
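
The limits above translate into two practical checks, sketched here under stated assumptions. The helper names `truncate_prompt` and `fp16_weight_gb` are hypothetical, and keeping the most recent tokens when a prompt is too long is one possible policy, not something this card prescribes. The memory figure covers weights only (2 bytes per parameter in float16) and ignores activations and the key-value cache.

```python
MAX_SEQ_LENGTH = 32_768  # from this model card


def truncate_prompt(token_ids, max_new_tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Keep the last tokens of a prompt so prompt + generation fits the context.

    Assumption: dropping the oldest tokens is acceptable for the use case.
    """
    budget = max_seq_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return token_ids[-budget:]


def fp16_weight_gb(num_parameters):
    """Estimate weight memory in gigabytes (10^9 bytes) at 2 bytes per parameter."""
    return num_parameters * 2 / 1e9


# Example with dummy token ids standing in for a tokenized long document.
dummy_ids = list(range(40_000))
kept = truncate_prompt(dummy_ids, max_new_tokens=30)
print(len(kept))
print(fp16_weight_gb(6.74e9))
```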

Usage with pipeline

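from transformers import pipeline, LlamaForCausalLM, LlamaTokenizer

# Load the model weights and the matching tokenizer from the Hugging Face Hub
model = LlamaForCausalLM.from_pretrained("NYTK/PULI-LlumiX-32K")
tokenizer = LlamaTokenizer.from_pretrained("NYTK/PULI-LlumiX-32K")
# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

# Generate up to 30 new tokens and print the prompt together with its continuation
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])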
