Archaea-74M

Archaea-74M is a decoder-only causal language model with approximately 74 million parameters, pretrained from scratch on BetterDataset-2M. The model uses a LLaMA-style architecture with Grouped Query Attention (GQA) and was trained in BF16 mixed precision with torch.compile enabled.

Training was stopped at 18,800 of a planned 25,000 steps to allow for architecture and data pipeline iteration in future runs.

Model Card

Attribute	Value
Model ID	`GODELEV/Archaea-74M`
Parameters	~74 million
Architecture	Decoder-only Transformer (LLaMA-style)
Attention	Grouped Query Attention (GQA)
Context Length	1024 tokens
Tokenizer	GPT-2 (50,257-token vocabulary)
Training Precision	BF16
License	Apache-2.0
Framework	PyTorch + Hugging Face Transformers

Architecture

Archaea-74M implements LlamaForCausalLM with the following configuration:

Transformer Configuration

Parameter	Value
Hidden Size	512
Intermediate Size	1408
Number of Layers	8
Attention Heads	8
KV Heads (GQA)	2
GQA Ratio	4:1
Activation Function	SiLU
Normalization	RMSNorm (eps = 1e-5)
Context Length	1024 tokens
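
The configuration above can be sanity-checked against the stated parameter count. The sketch below assumes details the model card does not state: no bias terms, untied input and output embeddings, and a head dimension of hidden size divided by attention heads. Under those assumptions the total comes out close to 74 million.

```python
# Rough parameter count for a LLaMA-style decoder built from the table above.
# Assumptions (not stated in the model card): no bias terms, untied input and
# output embeddings, head_dim = hidden_size // num_heads.
VOCAB_SIZE = 50_257
HIDDEN = 512
INTERMEDIATE = 1408
LAYERS = 8
HEADS = 8
KV_HEADS = 2


def count_parameters(tie_embeddings: bool = False) -> int:
    head_dim = HIDDEN // HEADS
    kv_dim = KV_HEADS * head_dim
    # q_proj and o_proj are HIDDEN x HIDDEN; k_proj and v_proj are HIDDEN x kv_dim.
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
    # gate_proj, up_proj and down_proj each hold HIDDEN x INTERMEDIATE weights.
    mlp = 3 * HIDDEN * INTERMEDIATE
    # Two RMSNorm weight vectors per layer.
    norms = 2 * HIDDEN
    per_layer = attention + mlp + norms
    embeddings = VOCAB_SIZE * HIDDEN
    lm_head = 0 if tie_embeddings else VOCAB_SIZE * HIDDEN
    # The trailing HIDDEN term is the final RMSNorm before the output head.
    return LAYERS * per_layer + embeddings + lm_head + HIDDEN


print(f"{count_parameters():,}")  # 74,016,256
```

With tied embeddings the same formula gives about 48 million, so the untied assumption is the one that matches the reported size.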

Grouped Query Attention

The model uses Grouped Query Attention with 8 query heads mapped to 2 key/value heads (a 4:1 ratio). This reduces KV cache memory footprint at inference time relative to standard multi-head attention while preserving representational capacity in the query projection.
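
To make the memory saving concrete, the following sketch estimates the KV cache size for one full-length sequence. It assumes a head dimension of 64 (hidden size 512 divided by 8 heads) and 2 bytes per value (float16); the function name is illustrative.

```python
def kv_cache_bytes(num_kv_heads: int, head_dim: int = 64, num_layers: int = 8,
                   seq_len: int = 1024, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


gqa = kv_cache_bytes(num_kv_heads=2)  # this model: 2 KV heads
mha = kv_cache_bytes(num_kv_heads=8)  # hypothetical MHA variant: 8 KV heads

print(f"GQA: {gqa / 2**20:.1f} MiB")  # GQA: 4.0 MiB
print(f"MHA: {mha / 2**20:.1f} MiB")  # MHA: 16.0 MiB
```

The cache shrinks by the same 4:1 factor as the head ratio, since cache size scales linearly with the number of key/value heads.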

Training

Dataset

Archaea-74M was trained on BetterDataset-2M, a multi-source corpus assembled from:

General web text
Conversational data
Instruction-oriented samples
Knowledge-focused content
Technical and code-related text

Samples were tokenized using the GPT-2 tokenizer and packed into contiguous 1024-token sequences. The dataset contains approximately 1.6 billion tokens in total. Over 18,800 training steps with an effective batch size of 64 sequences x 1024 tokens (65,536 tokens per step), the model was trained on approximately 1.23 billion tokens. That is roughly 0.77 passes through the dataset, so training concluded before a full epoch was completed.
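
The token budget can be reproduced with a few lines of arithmetic. The dataset size is the approximate 1.6 billion figure quoted above, so the epoch fractions are estimates.

```python
STEPS_TRAINED = 18_800
STEPS_PLANNED = 25_000
EFFECTIVE_BATCH = 64      # sequences per optimizer step
SEQUENCE_LENGTH = 1024    # tokens per sequence
DATASET_TOKENS = 1.6e9    # approximate corpus size

tokens_per_step = EFFECTIVE_BATCH * SEQUENCE_LENGTH
tokens_seen = STEPS_TRAINED * tokens_per_step
tokens_planned = STEPS_PLANNED * tokens_per_step

print(f"tokens per step: {tokens_per_step:,}")                   # 65,536
print(f"tokens seen:     {tokens_seen:,}")                       # 1,232,076,800
print(f"epochs seen:     {tokens_seen / DATASET_TOKENS:.2f}")     # 0.77
print(f"epochs planned:  {tokens_planned / DATASET_TOKENS:.2f}")  # 1.02
```

By the same estimate, the full 25,000-step run would have covered roughly one pass over the corpus.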

Optimization

Parameter	Value
Optimizer	AdamW
Learning Rate Scheduler	OneCycleLR
Peak Learning Rate	6e-4
Weight Decay	0.1
Gradient Clipping	1.0
Sequence Length	1024
Micro Batch Size	32
Gradient Accumulation Steps	2
Effective Batch Size	64
Compilation	`torch.compile`

Training Statistics

Metric	Value
Total Steps Trained	18,800 / 25,000
Initial Loss	10.9223
Final Loss	2.9488
Best Loss	2.8071
Final Perplexity	19.08
Best Perplexity	16.56
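
The perplexity values follow from the loss values, assuming the reported loss is the mean cross-entropy per token in nats (the usual convention for PyTorch's cross-entropy loss). A minimal check:

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(loss_nats)


print(f"{perplexity(2.9488):.2f}")  # 19.08 (final)
print(f"{perplexity(2.8071):.2f}")  # 16.56 (best)

# For reference: a uniform guess over the 50,257-token vocabulary has a
# loss of ln(50257), close to the initial loss of 10.9223.
print(f"{math.log(50_257):.4f}")    # 10.8249
```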

Training Loss Curve

The training loss curve (Archaea74M_Training_Loss_Curve.png) plots the raw per-step loss alongside a smoothed moving average. Loss decreases consistently throughout training, with no notable instability or divergence events.

Learning Rate Schedule

The learning rate schedule (Archaea74M_Learning_Rate_Schedule.png) uses OneCycleLR: a linear warmup to the peak learning rate of 6e-4, followed by cosine annealing down to a minimum learning rate. The warmup phase stabilizes early training before the peak is reached.
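
The shape described above can be sketched as a plain function of the step index. This is a hand-written approximation of the described curve, not the scheduler's exact formula. The warmup fraction and minimum learning rate are hypothetical values chosen for illustration, since the model card does not state them.

```python
import math

PEAK_LR = 6e-4          # from the Optimization table
TOTAL_STEPS = 25_000    # planned run length
WARMUP_FRACTION = 0.1   # hypothetical: not stated in the model card
MIN_LR = 6e-5           # hypothetical: not stated in the model card


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    warmup_steps = int(TOTAL_STEPS * WARMUP_FRACTION)
    if step < warmup_steps:
        # Linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


for step in (0, 2_499, 18_800, 25_000):
    print(step, f"{learning_rate(step):.2e}")
```

Because training stopped at step 18,800, the final learning rate under such a schedule would still be above its planned minimum.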

Evaluation

Archaea-74M was evaluated using EleutherAI lm-evaluation-harness on an NVIDIA L4 GPU (24 GB VRAM) in float16 precision. Full datasets were used with no sample limits. Evaluation was conducted on 2026-06-01.

Per-Task Results

Benchmark	Few-Shot	Metric	Score	Stderr
HellaSwag	10	acc_norm	27.16%	±0.44%
PIQA	0	acc_norm	58.60%	±1.15%
WinoGrande	5	acc	51.14%	±1.41%
BoolQ	0	acc	56.30%	±0.87%
ARC-Easy	25	acc_norm	40.11%	±1.01%
ARC-Challenge	25	acc_norm	23.04%	±1.23%
OpenBookQA	0	acc_norm	26.00%	±1.96%
CommonsenseQA	7	acc	18.84%	±1.12%
LAMBADA (OpenAI)	0	acc	18.05%	±0.54%
BLiMP	0	acc	74.89%	±0.14%
MMLU	5	acc	25.07%	±0.36%
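
The Stderr column can be turned into an approximate confidence interval. The sketch below assumes a normal approximation (score plus or minus 1.96 standard errors for a 95% interval) and uses the MMLU row as an example; MMLU questions have four answer options, so 25% corresponds to random guessing.

```python
def confidence_interval(score: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval under a normal approximation."""
    return score - z * stderr, score + z * stderr


low, high = confidence_interval(score=25.07, stderr=0.36)  # MMLU row
print(f"MMLU 95% CI: [{low:.2f}%, {high:.2f}%]")  # MMLU 95% CI: [24.36%, 25.78%]
print("includes 25% chance level:", low < 25.0 < high)  # True
```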

Category Averages

Category	Benchmarks Included	Average Score
Commonsense / NLI	HellaSwag, PIQA, WinoGrande, BoolQ, ARC-Easy, ARC-Challenge, OpenBookQA, CommonsenseQA	37.65%
Language Modelling	LAMBADA (OpenAI)	18.05%
Linguistic	BLiMP	74.89%
Knowledge	MMLU	25.07%
Overall Average	All 11 benchmarks (unweighted mean)	38.11%
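
The averages can be recomputed from the per-task scores. In this sketch the overall figure is taken as the unweighted mean over all 11 benchmarks rather than the mean of the four category averages, which is the reading that reproduces 38.11%.

```python
SCORES = {
    "HellaSwag": 27.16, "PIQA": 58.60, "WinoGrande": 51.14, "BoolQ": 56.30,
    "ARC-Easy": 40.11, "ARC-Challenge": 23.04, "OpenBookQA": 26.00,
    "CommonsenseQA": 18.84, "LAMBADA (OpenAI)": 18.05, "BLiMP": 74.89,
    "MMLU": 25.07,
}

CATEGORIES = {
    "Commonsense / NLI": ["HellaSwag", "PIQA", "WinoGrande", "BoolQ", "ARC-Easy",
                          "ARC-Challenge", "OpenBookQA", "CommonsenseQA"],
    "Language Modelling": ["LAMBADA (OpenAI)"],
    "Linguistic": ["BLiMP"],
    "Knowledge": ["MMLU"],
}


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


category_averages = {
    name: mean(SCORES[benchmark] for benchmark in benchmarks)
    for name, benchmarks in CATEGORIES.items()
}
overall_average = mean(SCORES.values())

for name, value in category_averages.items():
    print(f"{name}: {value:.2f}%")
print(f"Overall: {overall_average:.2f}%")  # Overall: 38.11%
```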

Benchmark Score Distribution

HellaSwag      [=======·····················]  27.16%
PIQA           [================············]  58.60%
WinoGrande     [==============··············]  51.14%
BoolQ          [===============·············]  56.30%
ARC-Easy       [===========·················]  40.11%
ARC-Challenge  [======······················]  23.04%
OpenBookQA     [=======·····················]  26.00%
CommonsenseQA  [=====·······················]  18.84%
LAMBADA        [=====·······················]  18.05%
BLiMP          [====================········]  74.89%
MMLU           [=======·····················]  25.07%
               0%        25%       50%       75%      100%

Notes on Failed Tasks

Two tasks could not be evaluated due to infrastructure issues unrelated to the model:

social_iqa: Hugging Face deprecated the social_i_qa.py dataset loading script. Re-evaluation pending.
arithmetic_2digit: The task was renamed in the current version of lm-eval. Re-evaluation pending with the updated task identifier.

Evaluation Environment

Framework   : EleutherAI lm-evaluation-harness
Hardware    : NVIDIA L4 (24 GB VRAM), Google Colab
Precision   : float16
Batch Size  : 8
Limit       : None (full datasets)
Runtime     : ~36 minutes
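
A run with these settings could be scripted through the harness's Python entry point, lm_eval.simple_evaluate. The sketch below is an assumption-laden outline, not the exact command used for the results above: the task identifiers are guesses at the harness names for each benchmark (the model card does not list them, and names can change between releases), and the helper function names are illustrative.

```python
# Few-shot settings taken from the Per-Task Results table. The task
# identifiers are assumptions and may differ between lm-eval releases.
FEWSHOT_BY_TASK = {
    "hellaswag": 10,
    "piqa": 0,
    "winogrande": 5,
    "boolq": 0,
    "arc_easy": 25,
    "arc_challenge": 25,
    "openbookqa": 0,
    "commonsense_qa": 7,
    "lambada_openai": 0,
    "blimp": 0,
    "mmlu": 5,
}


def group_tasks_by_fewshot(fewshot_by_task: dict[str, int]) -> dict[int, list[str]]:
    """Group tasks that share a few-shot count so each group needs one call."""
    groups: dict[int, list[str]] = {}
    for task, shots in fewshot_by_task.items():
        groups.setdefault(shots, []).append(task)
    return groups


def run_evaluation(model_id: str = "GODELEV/Archaea-74M") -> dict:
    """Evaluate every task group and merge the per-task results."""
    import lm_eval  # imported lazily; install with: pip install lm-eval

    results = {}
    for shots, tasks in group_tasks_by_fewshot(FEWSHOT_BY_TASK).items():
        output = lm_eval.simple_evaluate(
            model="hf",
            model_args=f"pretrained={model_id},dtype=float16",
            tasks=tasks,
            num_fewshot=shots,
            batch_size=8,
            limit=None,  # full datasets
        )
        results.update(output["results"])
    return results


if __name__ == "__main__":
    print(group_tasks_by_fewshot(FEWSHOT_BY_TASK))
```

Calling run_evaluation() downloads the model and datasets and needs a GPU to finish in reasonable time, so it is left out of the default entry point.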

Usage

Installation

pip install torch transformers accelerate

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"

tokenizer = AutoTokenizer.from_pretrained(model_id)

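# Use float16 on GPU to halve memory use; fall back to float32 on CPU.
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)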

Text Generation

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

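prompt = "The future of artificial intelligence"

# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample up to 200 new tokens; gradients are not needed for inference.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.8,
        do_sample=True,
        repetition_penalty=1.2,  # discourage verbatim repetition
        pad_token_id=tokenizer.eos_token_id,  # the GPT-2 tokenizer defines no pad token
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))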

Repository Structure

Archaea-74M/
├── config.json
├── generation_config.json
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
├── Archaea74M_Learning_Rate_Schedule.png
└── Archaea74M_Training_Loss_Curve.png

Limitations

Archaea-74M is a base pretrained model. It has not undergone instruction tuning, RLHF, preference optimization, or any alignment procedure. The following limitations apply:

Outputs may contain hallucinated or factually incorrect content
Reasoning capability is constrained by the model's size and training duration (18,800 steps of a planned 25,000)
Factual accuracy is inconsistent, particularly on knowledge-intensive tasks (MMLU: 25.07%)
The model is sensitive to prompt phrasing; small changes in input can produce substantially different outputs
Context length is fixed at 1024 tokens; inputs longer than this need to be truncated, because the model was trained only on 1024-token sequences
The model should not be used for medical, legal, financial, or safety-critical applications
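
For the context-length limit above, one way to keep a prompt plus its generated continuation within 1024 tokens is to trim the prompt before generation. The helper below is an illustrative sketch (the function name is hypothetical) that works on a plain list of token IDs and keeps the most recent tokens.

```python
CONTEXT_LENGTH = 1024  # from the model card


def fit_to_context(token_ids: list[int], max_new_tokens: int,
                   context_length: int = CONTEXT_LENGTH) -> list[int]:
    """Keep the most recent prompt tokens so prompt + generation fits."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Slicing from the end keeps the text closest to the generation point.
    return token_ids[-budget:]


long_prompt = list(range(2000))  # stand-in for real token IDs
trimmed = fit_to_context(long_prompt, max_new_tokens=200)
print(len(trimmed))  # 824
```

With a real tokenizer, the ID list would come from tokenizer(prompt)["input_ids"]; keeping the tail rather than the head is a design choice that suits continuation-style prompts.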

Future Work

Instruction fine-tuning on a curated supervised dataset
Re-evaluation on social_iqa and arithmetic once dataset/task compatibility is resolved
Context length extension beyond 1024 tokens
Evaluation on additional benchmarks relevant to the SLM category

Citation

@misc{archaea74m,
  title   = {Archaea-74M},
  author  = {Akshit Kumar},
  year    = {2026},
  publisher = {Hugging Face},
  url     = {https://huggingface.co/GODELEV/Archaea-74M}
}
