Llara

Llara is a 91.4M-parameter autoregressive language model trained from scratch on English web text. It follows the GPT-2 Small architecture and is trained entirely from random initialisation: no pretrained weights, no distillation, no fine-tuning of an existing model. It does, however, reuse the GPT-2 BPE tokeniser.

The name Llara is original and unrelated to LLaMA or LoRA.


Model Details

Property          Value
Architecture      GPT-2 (decoder-only transformer)
Parameters        91.4M
Context length    256 tokens
Embedding dim     768
Layers            12
Attention heads   12
Vocabulary        50,257 (GPT-2 BPE)
Training data     FineWeb (HuggingFaceFW/fineweb), custom dataset
Training docs     1,000,000 documents
Epochs            1
Precision         fp16

Training

Llara was trained on 1 million documents sampled from FineWeb, a large-scale curated English web dataset. Documents were tokenised with the GPT-2 BPE tokeniser and packed into non-overlapping 256-token blocks, matching the model's context length.
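The packing step described above can be sketched in plain Python. This is a minimal illustration, not the actual preprocessing script: the function name `pack_into_blocks`, the use of GPT-2's end-of-text id as a document separator, and the choice to drop the incomplete final block are all assumptions.

```python
from typing import Iterable, List

# GPT-2's <|endoftext|> id (the last entry of the 50,257-token vocabulary).
# Using it as a document separator is an assumption, not confirmed by the card.
EOS_TOKEN_ID = 50256


def pack_into_blocks(docs: Iterable[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate tokenised documents and cut the stream into fixed-size blocks."""
    stream: List[int] = []
    for ids in docs:
        stream.extend(ids)
        stream.append(EOS_TOKEN_ID)  # mark the document boundary
    # Drop the incomplete tail so every block has exactly block_size tokens.
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Toy example with made-up token ids and a tiny block size.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_into_blocks(docs, block_size=4)
print(blocks)
```

Because the blocks do not overlap, each token appears in exactly one training example, and a single block may span the end of one document and the start of the next.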

Training configuration:

Hyperparameter          Value
Optimiser               AdamW
Learning rate           3e-4
LR schedule             Cosine decay
Warmup steps            2,000
Weight decay            0.1
Effective batch size    32
Gradient accumulation   8 steps
Dropout                 0.1 (residual, embedding, attention)

Gradient checkpointing was enabled throughout training to reduce memory usage.
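To make the learning-rate rows of the configuration concrete, here is a small sketch of a warmup-then-cosine schedule using the peak rate and warmup length listed above. It assumes the warmup is linear and that the rate decays to zero at the final step; the total step count `TOTAL_STEPS` is a hypothetical value, since the card does not state it.

```python
import math

PEAK_LR = 3e-4        # learning rate from the configuration above
WARMUP_STEPS = 2_000  # warmup steps from the configuration above


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimiser step (linear warmup, cosine decay to 0)."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


TOTAL_STEPS = 30_000  # hypothetical; the real number of steps is not given
for step in (0, 1_000, 2_000, 16_000, 30_000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```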


Usage

from transformers import GPT2LMHeadModel, AutoTokenizer, pipeline

# Load the model weights and the GPT-2 BPE tokeniser from the Hub.
model = GPT2LMHeadModel.from_pretrained("helloadhavan/llara1.0-100M-base")
tokenizer = AutoTokenizer.from_pretrained("helloadhavan/llara1.0-100M-base")

gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Sample a continuation; prompt plus new tokens should fit in the 256-token context.
output = gen(
    "The history of artificial intelligence",
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.1,
)

print(output[0]["generated_text"])
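Since the context length is 256 tokens, the prompt and the generated tokens together should stay within that window; GPT-2 style models use learned position embeddings and cannot address positions beyond it. The helper below is a hypothetical convenience (not part of Transformers) for clamping `max_new_tokens`, assuming the prompt length has already been measured, for example with `len(tokenizer(prompt)["input_ids"])`.

```python
CONTEXT_LENGTH = 256  # context length from the Model Details table


def max_safe_new_tokens(prompt_tokens: int, requested: int,
                        context_length: int = CONTEXT_LENGTH) -> int:
    """Clamp a generation budget so prompt + output fits in the context window."""
    if prompt_tokens >= context_length:
        raise ValueError("prompt already fills the context window; truncate it first")
    return min(requested, context_length - prompt_tokens)


print(max_safe_new_tokens(prompt_tokens=6, requested=200))    # 200
print(max_safe_new_tokens(prompt_tokens=100, requested=200))  # 156
```

The returned value can be passed as `max_new_tokens` in the generation call above.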

Limitations

  • Llara is trained on English web text only and performs poorly on other languages.
  • Like all autoregressive LMs trained on web data, it may reproduce biases, factual errors, or inappropriate content present in the training corpus.
  • It is a research model trained from scratch and is not instruction-tuned or aligned, so it should not be used in production or user-facing applications without further fine-tuning and safety work.
  • At 91.4M parameters and 1 million training documents, it is significantly smaller and less trained than models like GPT-2 (which saw 40GB of text). Outputs may be incoherent on complex prompts.

Intended Use

Llara is intended for:

  • Research and experimentation with small language models
  • Learning how GPT-style models are trained from scratch
  • A base for fine-tuning on downstream tasks

Training Framework

Trained using Hugging Face Transformers Trainer on a single GPU.
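As a rough guide, the hyperparameters listed under Training could map onto Trainer arguments as sketched below. This is a reconstruction under assumptions, not the actual training script: the per-device batch size of 4 is inferred from an effective batch of 32 with 8 accumulation steps on one GPU, the output path is hypothetical, and AdamW is left implicit because it is the Trainer's default optimiser.

```python
PER_DEVICE_BATCH = 4  # assumption: 32 effective / 8 accumulation steps / 1 GPU
GRAD_ACCUM_STEPS = 8  # gradient accumulation from the Training section

training_kwargs = {
    "output_dir": "llara-checkpoints",  # hypothetical path
    "num_train_epochs": 1,
    "per_device_train_batch_size": PER_DEVICE_BATCH,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "learning_rate": 3e-4,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 2000,
    "weight_decay": 0.1,
    "fp16": True,                    # mixed-precision training
    "gradient_checkpointing": True,  # trade compute for lower memory use
}

# Sanity check: sequences consumed per optimiser step on a single GPU.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
print(effective_batch)


def build_training_arguments():
    """Build TrainingArguments; needs `transformers` installed and a GPU for fp16."""
    from transformers import TrainingArguments
    return TrainingArguments(**training_kwargs)
```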


License

Apache 2.0

Note: I am an AI hobbyist, not an AI engineer.