Archaea-74M
Archaea-74M is a decoder-only causal language model with approximately 74 million parameters, pretrained from scratch on BetterDataset-2M. The model uses a LLaMA-style architecture with Grouped Query Attention (GQA) and was trained in bfloat16 (BF16) mixed precision with torch.compile enabled.
Training was stopped at 18,800 of a planned 25,000 steps to allow for architecture and data pipeline iteration in future runs.
Model Card
| Attribute | Value |
|---|---|
| Model ID | GODELEV/Archaea-74M |
| Parameters | ~74 million |
| Architecture | Decoder-only Transformer (LLaMA-style) |
| Attention | Grouped Query Attention (GQA) |
| Context Length | 1024 tokens |
| Tokenizer | GPT-2 (50,257-token vocabulary) |
| Training Precision | BF16 mixed precision |
| License | Apache-2.0 |
| Framework | PyTorch + HuggingFace Transformers |
Architecture
Archaea-74M implements LlamaForCausalLM with the following configuration:
Transformer Configuration
| Parameter | Value |
|---|---|
| Hidden Size | 512 |
| Intermediate Size | 1408 |
| Number of Layers | 8 |
| Attention Heads | 8 |
| KV Heads (GQA) | 2 |
| GQA Ratio | 4:1 |
| Activation Function | SiLU |
| Normalization | RMSNorm (eps = 1e-5) |
| Context Length | 1024 tokens |
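
As a rough cross-check on the ~74 million figure, the parameter count can be estimated from the table above. This is a sketch under assumptions the card does not state: a vocabulary of exactly 50,257 entries with no padding, untied input and output embeddings, and no bias terms in the linear layers (the usual LLaMA layout). The function name `llama_param_count` is illustrative.

```python
def llama_param_count(vocab, hidden, intermediate, layers, heads, kv_heads,
                      tied_embeddings=False):
    """Estimate the parameter count of a LLaMA-style decoder (no biases assumed)."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim
    # gate_proj, up_proj and down_proj each hold hidden x intermediate weights.
    mlp = 3 * hidden * intermediate
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden
    per_layer = attention + mlp + norms
    embeddings = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    final_norm = hidden
    return embeddings + layers * per_layer + final_norm + lm_head


untied = llama_param_count(50257, 512, 1408, 8, 8, 2)
tied = llama_param_count(50257, 512, 1408, 8, 8, 2, tied_embeddings=True)
print(f"untied: {untied:,}  tied: {tied:,}")
```

Under these assumptions the untied estimate is about 74.0 million, in line with the stated size, while tying the embeddings would give about 48.3 million. That suggests, but does not prove, that the input and output embeddings are separate matrices.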
Grouped Query Attention
The model uses Grouped Query Attention with 8 query heads mapped to 2 key/value heads (a 4:1 ratio). This reduces KV cache memory footprint at inference time relative to standard multi-head attention while preserving representational capacity in the query projection.
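To make the memory saving concrete, the sketch below compares the KV cache size for 8 key/value heads (standard multi-head attention) against the 2 used here. It assumes a head dimension of 512 / 8 = 64, a full 1024-token context, batch size 1, and 2 bytes per cached value (float16 or bfloat16). The helper name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch=1):
    """Size of the key/value cache in bytes for one forward context."""
    # Factor 2: each layer caches one key tensor and one value tensor.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch


head_dim = 512 // 8  # hidden size divided by the number of query heads

mha_bytes = kv_cache_bytes(layers=8, kv_heads=8, head_dim=head_dim, seq_len=1024)
gqa_bytes = kv_cache_bytes(layers=8, kv_heads=2, head_dim=head_dim, seq_len=1024)

print(f"MHA: {mha_bytes / 2**20:.0f} MiB, GQA: {gqa_bytes / 2**20:.0f} MiB")
```

With these numbers the cache shrinks from 16 MiB to 4 MiB per sequence, matching the 4:1 head ratio.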
Training
Dataset
Archaea-74M was trained on BetterDataset-2M, a multi-source corpus assembled from:
- General web text
- Conversational data
- Instruction-oriented samples
- Knowledge-focused content
- Technical and code-related text
Samples were tokenized with the GPT-2 tokenizer and packed into contiguous 1024-token sequences. The dataset contains approximately 1.6 billion tokens in total. Over 18,800 training steps with an effective batch size of 64 sequences × 1024 tokens (65,536 tokens per step), the model saw approximately 1.23 billion tokens, or roughly 0.77 passes through the dataset. Training therefore ended before a full epoch was completed.
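The token budget above can be reproduced with a few lines of arithmetic. The micro batch size and gradient accumulation steps are taken from the Optimization table; the dataset size is the approximate 1.6 billion figure, so the epoch fractions are approximate too.

```python
micro_batch = 32          # sequences per forward/backward pass
grad_accum = 2            # gradient accumulation steps
seq_len = 1024            # tokens per packed sequence
steps_trained = 18_800
steps_planned = 25_000
dataset_tokens = 1.6e9    # approximate corpus size

effective_batch = micro_batch * grad_accum      # sequences per optimizer step
tokens_per_step = effective_batch * seq_len     # tokens per optimizer step
tokens_seen = steps_trained * tokens_per_step

epochs_seen = tokens_seen / dataset_tokens
epochs_planned = steps_planned * tokens_per_step / dataset_tokens

print(f"tokens seen: {tokens_seen:,} ({epochs_seen:.2f} epochs)")
print(f"planned run: {epochs_planned:.3f} epochs")
```

By the same arithmetic, the planned 25,000 steps would have covered about 1.02 passes over the corpus.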
Optimization
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate Scheduler | OneCycleLR |
| Peak Learning Rate | 6e-4 |
| Weight Decay | 0.1 |
| Gradient Clipping | 1.0 |
| Sequence Length | 1024 |
| Micro Batch Size | 32 |
| Gradient Accumulation Steps | 2 |
| Effective Batch Size | 64 |
| Compilation | torch.compile |
Training Statistics
| Metric | Value |
|---|---|
| Total Steps Trained | 18,800 / 25,000 |
| Initial Loss | 10.9223 |
| Final Loss | 2.9488 |
| Best Loss | 2.8071 |
| Final Perplexity | 19.08 |
| Best Perplexity | 16.56 |
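
The perplexity rows follow from the loss rows if the loss is the mean cross-entropy per token in nats, which is an assumption here since the card does not say so explicitly. The sketch also computes the loss of a uniform guess over an assumed 50,257-token vocabulary for comparison with the initial loss.

```python
import math

final_loss = 2.9488
best_loss = 2.8071

# Perplexity is the exponential of the mean per-token cross-entropy.
final_ppl = math.exp(final_loss)
best_ppl = math.exp(best_loss)

# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(50257)

print(f"final ppl {final_ppl:.2f}, best ppl {best_ppl:.2f}")
print(f"uniform-guess loss {uniform_loss:.2f}")
```

The uniform-guess loss of about 10.82 is close to the reported initial loss of 10.9223, as expected for a freshly initialized model.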
Training Loss Curve
The curve shows the raw per-step loss alongside a smoothed moving average. The loss decrease is consistent throughout training with no notable instability or divergence events.
Learning Rate Schedule
OneCycleLR applies a linear warmup phase followed by cosine annealing decay to a minimum learning rate. The warmup phase stabilizes early training before the peak learning rate is reached.
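The schedule described above can be sketched in plain Python. The card gives only the peak learning rate, so every other value is an assumption: the schedule length (the planned 25,000 steps), the warmup fraction (0.3), and the starting and minimum rates (peak / 25 and peak / 25 / 1e4), which are borrowed from the defaults of PyTorch's `OneCycleLR`. Note that PyTorch's default `anneal_strategy` is cosine for both phases; this sketch follows the linear-warmup description in the text instead. The function name `lr_at` is illustrative.

```python
import math

PEAK_LR = 6e-4


def lr_at(step, total_steps=25_000, peak_lr=PEAK_LR, warmup_frac=0.3,
          initial_lr=PEAK_LR / 25, min_lr=PEAK_LR / 25 / 1e4):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from initial_lr up to peak_lr.
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 7_500, 18_800, 25_000):
    print(step, f"{lr_at(step):.3e}")
```

If the schedule was indeed sized for 25,000 steps, stopping at step 18,800 means the learning rate had not yet decayed to its minimum.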
Evaluation
Archaea-74M was evaluated using EleutherAI lm-evaluation-harness on an NVIDIA L4 GPU (24 GB VRAM) in float16 precision. Full datasets were used with no sample limits. Evaluation was conducted on 2026-06-01.
Per-Task Results
| Benchmark | Few-Shot | Metric | Score | Stderr |
|---|---|---|---|---|
| HellaSwag | 10 | acc_norm | 27.16% | ±0.44% |
| PIQA | 0 | acc_norm | 58.60% | ±1.15% |
| WinoGrande | 5 | acc | 51.14% | ±1.41% |
| BoolQ | 0 | acc | 56.30% | ±0.87% |
| ARC-Easy | 25 | acc_norm | 40.11% | ±1.01% |
| ARC-Challenge | 25 | acc_norm | 23.04% | ±1.23% |
| OpenBookQA | 0 | acc_norm | 26.00% | ±1.96% |
| CommonsenseQA | 7 | acc | 18.84% | ±1.12% |
| LAMBADA (OpenAI) | 0 | acc | 18.05% | ±0.54% |
| BLiMP | 0 | acc | 74.89% | ±0.14% |
| MMLU | 5 | acc | 25.07% | ±0.36% |
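
One way to read these numbers is to compare each score with the random-guess accuracy for its task, scaled by the reported standard error. The chance levels below are assumptions derived from the usual number of answer options per task (ARC items mostly have four options), and the threshold of 2 standard errors is only a rule of thumb. BoolQ, LAMBADA, and BLiMP are left out because a simple fixed chance level is less meaningful for them.

```python
# task: (score %, stderr %, assumed chance-level accuracy %)
results = {
    "HellaSwag": (27.16, 0.44, 25.0),
    "PIQA": (58.60, 1.15, 50.0),
    "WinoGrande": (51.14, 1.41, 50.0),
    "ARC-Easy": (40.11, 1.01, 25.0),
    "ARC-Challenge": (23.04, 1.23, 25.0),
    "OpenBookQA": (26.00, 1.96, 25.0),
    "CommonsenseQA": (18.84, 1.12, 20.0),
    "MMLU": (25.07, 0.36, 25.0),
}


def z_vs_chance(score, stderr, chance):
    """Distance from chance accuracy, in units of the standard error."""
    return (score - chance) / stderr


z_scores = {task: z_vs_chance(*vals) for task, vals in results.items()}
above_chance = {task for task, z in z_scores.items() if z > 2.0}

for task, z in z_scores.items():
    print(f"{task:<14} z = {z:+.2f}")
```

By this rough measure, HellaSwag, PIQA, and ARC-Easy sit clearly above chance, while the remaining tasks in the dictionary are within about two standard errors of random guessing.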
Category Averages
| Category | Benchmarks Included | Average Score |
|---|---|---|
| Commonsense / NLI | HellaSwag, PIQA, WinoGrande, BoolQ, ARC-Easy, ARC-Challenge, OpenBookQA, CommonsenseQA | 37.65% |
| Language Modelling | LAMBADA (OpenAI) | 18.05% |
| Linguistic | BLiMP | 74.89% |
| Knowledge | MMLU | 25.07% |
| Overall Average | All above | 38.11% |
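
The averages in this table can be recomputed from the per-task scores. The sketch shows that the overall figure is the unweighted mean over all 11 benchmarks, not the mean of the four category averages.

```python
scores = {
    "HellaSwag": 27.16, "PIQA": 58.60, "WinoGrande": 51.14, "BoolQ": 56.30,
    "ARC-Easy": 40.11, "ARC-Challenge": 23.04, "OpenBookQA": 26.00,
    "CommonsenseQA": 18.84, "LAMBADA (OpenAI)": 18.05, "BLiMP": 74.89,
    "MMLU": 25.07,
}

categories = {
    "Commonsense / NLI": ["HellaSwag", "PIQA", "WinoGrande", "BoolQ", "ARC-Easy",
                          "ARC-Challenge", "OpenBookQA", "CommonsenseQA"],
    "Language Modelling": ["LAMBADA (OpenAI)"],
    "Linguistic": ["BLiMP"],
    "Knowledge": ["MMLU"],
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


category_avg = {name: mean(scores[t] for t in tasks)
                for name, tasks in categories.items()}
overall_avg = mean(scores.values())              # mean over the 11 tasks
mean_of_categories = mean(category_avg.values())  # a different quantity

print({name: round(avg, 2) for name, avg in category_avg.items()})
print(round(overall_avg, 2), round(mean_of_categories, 2))
```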
Benchmark Score Distribution
HellaSwag [=======·····················] 27.16%
PIQA [================············] 58.60%
WinoGrande [==============··············] 51.14%
BoolQ [===============·············] 56.30%
ARC-Easy [===========·················] 40.11%
ARC-Challenge [======······················] 23.04%
OpenBookQA [=======·····················] 26.00%
CommonsenseQA [=====·······················] 18.84%
LAMBADA [=====·······················] 18.05%
BLiMP [======================······] 74.89%
MMLU [=======·····················] 25.07%
0% 25% 50% 75% 100%
Notes on Failed Tasks
Two tasks could not be evaluated due to infrastructure issues unrelated to the model:
- social_iqa: Hugging Face deprecated the social_i_qa.py dataset loading script. Re-evaluation pending.
- arithmetic_2digit: The task has been renamed in the current version of lm-eval. Re-evaluation pending with the updated task identifier.
Evaluation Environment
Framework : EleutherAI lm-evaluation-harness
Hardware : NVIDIA L4 (24 GB VRAM), Google Colab
Precision : float16
Batch Size : 8
Limit : None (full datasets)
Runtime : ~36 minutes
Usage
Installation
# accelerate is needed because the examples below pass device_map="auto"
pip install torch transformers accelerate
Loading the Model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in float16 on GPU and fall back to float32 on CPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",  # requires the accelerate package
)
Text Generation
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "The future of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample up to 200 new tokens; gradients are not needed for inference.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.8,
        do_sample=True,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding token
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
Repository Structure
Archaea-74M/
├── config.json
├── generation_config.json
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
├── Archaea74M_Learning_Rate_Schedule.png
└── Archaea74M_Training_Loss_Curve.png
Limitations
Archaea-74M is a base pretrained model. It has not undergone instruction tuning, RLHF, preference optimization, or any alignment procedure. The following limitations apply:
- Outputs may contain hallucinated or factually incorrect content
- Reasoning capability is constrained by the model's size and training duration (18,800 steps of a planned 25,000)
- Factual accuracy is inconsistent, particularly on knowledge-intensive tasks (MMLU: 25.07%)
- The model is sensitive to prompt phrasing; small changes in input can produce substantially different outputs
- Context length is limited to 1024 tokens, the sequence length used in training; longer inputs must be truncated before generation
- The model should not be used for medical, legal, financial, or safety-critical applications
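
For the context-length limitation, the prompt and the generated tokens share the same 1024-token window. A minimal sketch of one way to budget for that is shown below: it keeps only the most recent prompt tokens so that the prompt plus `max_new_tokens` fits in the window. The helper `fit_prompt` is hypothetical and operates on a plain list of token IDs; keeping the tail rather than the head of the prompt is a design choice, not something the card prescribes.

```python
def fit_prompt(token_ids, max_new_tokens=200, context_len=1024):
    """Trim a prompt so that prompt + generated tokens fit in the context."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Keep the most recent tokens; earlier context is dropped.
    return token_ids[-budget:]


long_prompt = list(range(2000))   # stand-in for 2000 token IDs
trimmed = fit_prompt(long_prompt)
print(len(trimmed))
```

With the generation settings from the usage example (200 new tokens), this leaves 824 tokens for the prompt.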
Future Work
- Instruction fine-tuning on a curated supervised dataset
- Re-evaluation on social_iqa and arithmetic once dataset/task compatibility is resolved
- Context length extension beyond 1024 tokens
- Evaluation on additional benchmarks relevant to the SLM category
Citation
@misc{archaea74m,
title = {Archaea-74M},
author = {Akshit Kumar},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/GODELEV/Archaea-74M}
}