Archaea-74M

Archaea-74M is a decoder-only causal language model with approximately 74 million parameters, pretrained from scratch on BetterDataset-2M. It uses a LLaMA-style architecture with Grouped Query Attention (GQA) and was trained in BF16 (bfloat16) mixed precision with torch.compile enabled.

Training was stopped at 18,800 of a planned 25,000 steps to allow for architecture and data pipeline iteration in future runs.


Model Card

| Attribute | Value |
| --- | --- |
| Model ID | GODELEV/Archaea-74M |
| Parameters | ~74 million |
| Architecture | Decoder-only Transformer (LLaMA-style) |
| Attention | Grouped Query Attention (GQA) |
| Context Length | 1024 tokens |
| Tokenizer | GPT-2 (50,257-token vocabulary) |
| Training Precision | BF16 |
| License | Apache-2.0 |
| Framework | PyTorch + Hugging Face Transformers |

Architecture

Archaea-74M implements LlamaForCausalLM with the following configuration:

Transformer Configuration

| Parameter | Value |
| --- | --- |
| Hidden Size | 512 |
| Intermediate Size | 1408 |
| Number of Layers | 8 |
| Attention Heads | 8 |
| KV Heads (GQA) | 2 |
| GQA Ratio | 4:1 |
| Activation Function | SiLU |
| Normalization | RMSNorm (eps = 1e-5) |
| Context Length | 1024 tokens |
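
As a rough cross-check, the configuration above is enough to reproduce the stated parameter count. The sketch below assumes the usual LlamaForCausalLM layout (bias-free projections, a gated MLP with three weight matrices, two RMSNorm weights per layer plus a final norm), a 50,257-token vocabulary, and untied input and output embeddings. The untied-embedding choice is an assumption, not something the card states; llama_param_count is a hypothetical helper name.

```python
def llama_param_count(vocab, hidden, intermediate, layers, heads, kv_heads,
                      tie_embeddings=False):
    """Count weights in a bias-free LLaMA-style decoder (illustrative helper)."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim
    # gate_proj, up_proj, down_proj.
    mlp = 3 * hidden * intermediate
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden
    per_layer = attn + mlp + norms
    embed = vocab * hidden
    lm_head = 0 if tie_embeddings else vocab * hidden
    # The extra `hidden` term is the final RMSNorm.
    return embed + layers * per_layer + hidden + lm_head


total = llama_param_count(50257, 512, 1408, 8, 8, 2)
print(f"{total:,}")  # 74,016,256
```

Under these assumptions the total is about 74.0 million, matching the headline figure. With tied embeddings the same configuration would give roughly 48.3 million, which is why the untied assumption seems the more plausible reading.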

Grouped Query Attention

The model uses Grouped Query Attention with 8 query heads mapped to 2 key/value heads (a 4:1 ratio). Because only 2 key/value heads are cached instead of 8, the KV cache at inference time is a quarter of the size that standard multi-head attention would need, while the query projection keeps all 8 heads.
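
The cache saving can be made concrete with a small calculation. This is a minimal sketch, assuming a head dimension of 512 / 8 = 64, 2-byte (float16 or BF16) cache entries, and a batch of one; kv_cache_bytes is a hypothetical helper, not part of any library.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim,
                   bytes_per_value=2, batch=1):
    """Size of the key/value cache for one forward context."""
    # Two tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim].
    return 2 * layers * batch * kv_heads * seq_len * head_dim * bytes_per_value


HEAD_DIM = 512 // 8  # hidden size / attention heads

gqa = kv_cache_bytes(seq_len=1024, layers=8, kv_heads=2, head_dim=HEAD_DIM)
mha = kv_cache_bytes(seq_len=1024, layers=8, kv_heads=8, head_dim=HEAD_DIM)

print(gqa / 2**20, mha / 2**20, mha // gqa)  # 4.0 16.0 4
```

At the full 1024-token context this works out to about 4 MiB per sequence with GQA versus 16 MiB for an otherwise identical multi-head model.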


Training

Dataset

Archaea-74M was trained on BetterDataset-2M, a multi-source corpus assembled from:

  • General web text
  • Conversational data
  • Instruction-oriented samples
  • Knowledge-focused content
  • Technical and code-related text

Samples were tokenized using the GPT-2 tokenizer and packed into contiguous 1024-token sequences. The dataset contains approximately 1.6 billion tokens in total. Over 18,800 training steps with an effective batch size of 64 sequences × 1024 tokens (65,536 tokens per step), the model saw approximately 1.23 billion tokens, roughly 0.77 passes through the dataset, so training ended before a full epoch was completed.
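
The token budget above follows directly from the step count and batch shape. The sketch below only restates that arithmetic; the 1.6 billion dataset size is the approximate figure quoted in the text.

```python
STEPS_TRAINED = 18_800
STEPS_PLANNED = 25_000
SEQUENCES_PER_STEP = 64      # effective batch size
TOKENS_PER_SEQUENCE = 1024   # packed sequence length
DATASET_TOKENS = 1.6e9       # approximate, as quoted in the text

tokens_per_step = SEQUENCES_PER_STEP * TOKENS_PER_SEQUENCE
tokens_seen = STEPS_TRAINED * tokens_per_step
tokens_planned = STEPS_PLANNED * tokens_per_step

print(f"tokens per step : {tokens_per_step:,}")
print(f"tokens seen     : {tokens_seen:,}")
print(f"epochs seen     : {tokens_seen / DATASET_TOKENS:.2f}")
print(f"epochs planned  : {tokens_planned / DATASET_TOKENS:.2f}")
```

By the same arithmetic, the full 25,000-step plan would have covered about 1.64 billion tokens, or just over one pass through the dataset.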

Optimization

| Parameter | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning Rate Scheduler | OneCycleLR |
| Peak Learning Rate | 6e-4 |
| Weight Decay | 0.1 |
| Gradient Clipping | 1.0 |
| Sequence Length | 1024 |
| Micro Batch Size | 32 |
| Gradient Accumulation Steps | 2 |
| Effective Batch Size | 64 |
| Compilation | torch.compile |
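
To show how these settings fit together, here is a minimal PyTorch sketch of one accumulated optimizer step. It is not the author's training script: the helper names are hypothetical, OneCycleLR is left at its library defaults (the card does not list pct_start, div_factor, or similar), and BF16 autocast and torch.compile are omitted for brevity.

```python
import torch
from torch import nn

# Values taken from the Optimization table.
PEAK_LR = 6e-4
WEIGHT_DECAY = 0.1
GRAD_CLIP = 1.0
MICRO_BATCH = 32
GRAD_ACCUM_STEPS = 2
PLANNED_STEPS = 25_000


def build_optimizer_and_scheduler(model, total_steps=PLANNED_STEPS):
    """AdamW plus OneCycleLR; scheduler defaults are an assumption."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY
    )
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=PEAK_LR, total_steps=total_steps
    )
    return optimizer, scheduler


def train_step(model, micro_batches, loss_fn, optimizer, scheduler):
    """Run one optimizer step accumulated over a list of micro-batches."""
    optimizer.zero_grad(set_to_none=True)
    total_loss = 0.0
    for inputs, targets in micro_batches:
        # Scale so the accumulated gradient equals the mean over micro-batches.
        loss = loss_fn(model(inputs), targets) / len(micro_batches)
        loss.backward()
        total_loss += loss.item()
    # Clip the global gradient norm before the parameter update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per optimizer step
    return total_loss
```

Two micro-batches of 32 sequences per call give the effective batch size of 64 listed in the table.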

Training Statistics

| Metric | Value |
| --- | --- |
| Total Steps Trained | 18,800 / 25,000 |
| Initial Loss | 10.9223 |
| Final Loss | 2.9488 |
| Best Loss | 2.8071 |
| Final Perplexity | 19.08 |
| Best Perplexity | 16.56 |
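
The perplexity figures are the exponential of the corresponding losses. The sketch below assumes the reported losses are mean per-token cross-entropy in nats, which is the usual convention for PyTorch's cross-entropy loss.

```python
import math

FINAL_LOSS = 2.9488
BEST_LOSS = 2.8071
VOCAB_SIZE = 50257  # GPT-2 tokenizer

final_ppl = math.exp(FINAL_LOSS)
best_ppl = math.exp(BEST_LOSS)
# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(VOCAB_SIZE)

print(f"final perplexity : {final_ppl:.2f}")
print(f"best perplexity  : {best_ppl:.2f}")
print(f"uniform baseline : {uniform_loss:.4f}")
```

The uniform baseline of about 10.82 is close to the reported initial loss of 10.9223, which is what one would expect from a freshly initialized model.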

Training Loss Curve

The curve shows the raw per-step loss alongside a smoothed moving average. The loss decrease is consistent throughout training with no notable instability or divergence events.

Learning Rate Schedule

OneCycleLR applies a linear warmup phase followed by cosine annealing decay to a minimum learning rate. The warmup phase stabilizes early training before the peak learning rate is reached.
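
The shape described above can be sketched in a few lines of plain Python. This is an illustration of a linear-warmup, cosine-decay curve, not a reimplementation of PyTorch's OneCycleLR. Only the peak learning rate comes from the card; the warmup length, starting rate, and minimum rate are hypothetical placeholders.

```python
import math


def lr_at(step, total_steps, peak_lr=6e-4, warmup_steps=1_000,
          start_lr=6e-5, min_lr=6e-5):
    """Learning rate at `step`; all defaults except peak_lr are hypothetical."""
    if step < warmup_steps:
        # Linear ramp from start_lr up to peak_lr.
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 1_000, 18_800, 25_000):
    print(step, f"{lr_at(step, total_steps=25_000):.2e}")
```

If the schedule was laid out over the planned 25,000 steps (an assumption), stopping at step 18,800 would mean the learning rate had not yet decayed to its minimum.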


Evaluation

Archaea-74M was evaluated using EleutherAI lm-evaluation-harness on an NVIDIA L4 GPU (24 GB VRAM) in float16 precision. Full datasets were used with no sample limits. Evaluation was conducted on 2026-06-01.

Per-Task Results

| Benchmark | Few-Shot | Metric | Score | Stderr |
| --- | --- | --- | --- | --- |
| HellaSwag | 10 | acc_norm | 27.16% | ±0.44% |
| PIQA | 0 | acc_norm | 58.60% | ±1.15% |
| WinoGrande | 5 | acc | 51.14% | ±1.41% |
| BoolQ | 0 | acc | 56.30% | ±0.87% |
| ARC-Easy | 25 | acc_norm | 40.11% | ±1.01% |
| ARC-Challenge | 25 | acc_norm | 23.04% | ±1.23% |
| OpenBookQA | 0 | acc_norm | 26.00% | ±1.96% |
| CommonsenseQA | 7 | acc | 18.84% | ±1.12% |
| LAMBADA (OpenAI) | 0 | acc | 18.05% | ±0.54% |
| BLiMP | 0 | acc | 74.89% | ±0.14% |
| MMLU | 5 | acc | 25.07% | ±0.36% |

Category Averages

| Category | Benchmarks Included | Average Score |
| --- | --- | --- |
| Commonsense / NLI | HellaSwag, PIQA, WinoGrande, BoolQ, ARC-Easy, ARC-Challenge, OpenBookQA, CommonsenseQA | 37.65% |
| Language Modelling | LAMBADA (OpenAI) | 18.05% |
| Linguistic | BLiMP | 74.89% |
| Knowledge | MMLU | 25.07% |
| Overall Average | All above | 38.11% |
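
The averages are unweighted means of the per-task scores, so the overall figure is the mean of all 11 tasks, not the mean of the four category averages. A short sketch that reproduces them from the per-task results:

```python
from statistics import mean

# Per-task scores in percent, copied from the Per-Task Results table.
scores = {
    "HellaSwag": 27.16, "PIQA": 58.60, "WinoGrande": 51.14, "BoolQ": 56.30,
    "ARC-Easy": 40.11, "ARC-Challenge": 23.04, "OpenBookQA": 26.00,
    "CommonsenseQA": 18.84, "LAMBADA (OpenAI)": 18.05, "BLiMP": 74.89,
    "MMLU": 25.07,
}

commonsense_tasks = [
    "HellaSwag", "PIQA", "WinoGrande", "BoolQ",
    "ARC-Easy", "ARC-Challenge", "OpenBookQA", "CommonsenseQA",
]

commonsense_avg = mean(scores[name] for name in commonsense_tasks)
overall_avg = mean(scores.values())

print(f"Commonsense / NLI : {commonsense_avg:.2f}%")
print(f"Overall           : {overall_avg:.2f}%")
```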

Benchmark Score Distribution

HellaSwag      [=======·····················]  27.16%
PIQA           [================············]  58.60%
WinoGrande     [==============··············]  51.14%
BoolQ          [===============·············]  56.30%
ARC-Easy       [===========·················]  40.11%
ARC-Challenge  [======······················]  23.04%
OpenBookQA     [=======·····················]  26.00%
CommonsenseQA  [=====·······················]  18.84%
LAMBADA        [=====·······················]  18.05%
BLiMP          [====================········]  74.89%
MMLU           [=======·····················]  25.07%
               0%        25%       50%       75%      100%

Notes on Failed Tasks

Two tasks could not be evaluated due to infrastructure issues unrelated to the model:

  • social_iqa: Hugging Face deprecated the social_i_qa.py dataset loading script, so the task could not be loaded. Re-evaluation pending.
  • arithmetic_2digit: The task has been renamed in the current version of lm-eval. Re-evaluation pending with the updated task identifier.

Evaluation Environment

Framework   : EleutherAI lm-evaluation-harness
Hardware    : NVIDIA L4 (24 GB VRAM), Google Colab
Precision   : float16
Batch Size  : 8
Limit       : None (full datasets)
Runtime     : ~36 minutes
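
For anyone wanting to rerun a single benchmark, the settings above map onto the harness's Python entry point roughly as follows. This is a sketch, not the exact invocation used for the card: the task identifiers are the harness's usual names and are assumed here, and build_eval_kwargs and run_task are hypothetical helpers.

```python
# Few-shot counts from the Per-Task Results table, keyed by assumed task names.
FEWSHOT = {
    "hellaswag": 10, "piqa": 0, "winogrande": 5, "boolq": 0,
    "arc_easy": 25, "arc_challenge": 25, "openbookqa": 0,
    "commonsense_qa": 7, "lambada_openai": 0, "blimp": 0, "mmlu": 5,
}


def build_eval_kwargs(task, model_id="GODELEV/Archaea-74M"):
    """Keyword arguments for lm_eval.simple_evaluate for one task."""
    return {
        "model": "hf",
        "model_args": f"pretrained={model_id},dtype=float16",
        "tasks": [task],
        "num_fewshot": FEWSHOT[task],
        "batch_size": 8,
        "limit": None,  # full dataset, no sample cap
    }


def run_task(task):
    """Evaluate one task; downloads the model and dataset when called."""
    import lm_eval  # lazy import; requires `pip install lm-eval`
    results = lm_eval.simple_evaluate(**build_eval_kwargs(task))
    return results["results"]
```

Calling run_task("hellaswag") on a GPU machine should return a dictionary of metrics for that task. Exact numbers may vary slightly with harness version and hardware.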

Usage

Installation

pip install torch transformers accelerate

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

Text Generation

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "The future of artificial intelligence"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.8,
        do_sample=True,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Repository Structure

Archaea-74M/
├── config.json
├── generation_config.json
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
├── Archaea74M_Learning_Rate_Schedule.png
└── Archaea74M_Training_Loss_Curve.png

Limitations

Archaea-74M is a base pretrained model. It has not undergone instruction tuning, RLHF, preference optimization, or any alignment procedure. The following limitations apply:

  • Outputs may contain hallucinated or factually incorrect content
  • Reasoning capability is constrained by the model's size and training duration (18,800 steps of a planned 25,000)
  • Factual accuracy is inconsistent, particularly on knowledge-intensive tasks (MMLU: 25.07%)
  • The model is sensitive to prompt phrasing; small changes in input can produce substantially different outputs
  • Context length is fixed at 1024 tokens; longer inputs fall outside the range seen in training and should be truncated before generation
  • The model should not be used for medical, legal, financial, or safety-critical applications
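
Regarding the context limit, one simple way to stay within 1024 tokens is to keep only the most recent prompt tokens and reserve room for the tokens to be generated. The helper below is a hypothetical sketch operating on a plain list of token IDs; keeping the tail of the prompt is one design choice among several.

```python
def fit_to_context(token_ids, max_new_tokens, context_length=1024):
    """Keep the last prompt tokens so prompt + generation fits the context."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Slicing from the end keeps the most recent tokens; short prompts pass through.
    return list(token_ids[-budget:])


prompt_ids = list(range(2000))  # stand-in for tokenizer(prompt)["input_ids"]
trimmed = fit_to_context(prompt_ids, max_new_tokens=200)
print(len(trimmed))  # 824
```

With max_new_tokens=200, as in the generation example, the prompt is capped at 824 tokens.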

Future Work

  • Instruction fine-tuning on a curated supervised dataset
  • Re-evaluation on social_iqa and arithmetic_2digit once dataset/task compatibility is resolved
  • Context length extension beyond 1024 tokens
  • Evaluation on additional benchmarks relevant to small language models (SLMs)

Citation

@misc{archaea74m,
  title   = {Archaea-74M},
  author  = {Akshit Kumar},
  year    = {2026},
  publisher = {Hugging Face},
  url     = {https://huggingface.co/GODELEV/Archaea-74M}
}