palindromon-0.116M
This is dumb and should not be taken seriously.
This model card contains intentionally fake, unserious benchmark comparisons. Do not cite this as an evaluation, do not use it for decisions, and do not confuse it with a real SOTA claim.
palindromon-0.116M is a tiny reinforcement-learning policy that performs one
extremely important task: deciding whether short strings are palindromes by
walking two pointers inward.
Try the demo Space: nicolasmelo/palindromon-0.116M-space.
It has 116,101 parameters, which rounds to 0.116M if you are feeling generous. It is a decoder-only transformer with a policy head, a value head, and the confidence of a much larger model.
Model Details
- Architecture: tiny decoder-only transformer policy/value network
- Parameters: 116,101
- Task: procedural palindrome classification
- Training: PPO in a custom Gymnasium environment
- Default checkpoint:
checkpoints/policy.pt - Action space:
COMPAREMOVE_INWARDANSWER_PALINDROMEANSWER_NOT_PALINDROME
Completely Fake Benchmarks
These numbers are jokes. They were not produced by a real benchmark suite.
| Model | Params | Palindrome Arena Elo | Vibes / Watt | Strategic Pointer Depth | Notes |
|---|---|---|---|---|---|
| palindromon-0.116M | 0.116M | 9001 | extremely high | 2 pointers | Knows what it is here to do. |
| GPT-5.5 | undisclosed | 8999 | strong | overthinks | May write a sonnet before answering. |
| Claude Opus 4.7 | undisclosed | 8998 | elegant | reflective | Politely asks whether symmetry has meaning. |
| Qwen-3.6 | undisclosed | 8997 | efficient | multilingual | Strong, but fewer dumb branding points. |
| Random baseline | 0 | 50 | unbeatable | none | Sometimes correct, often with conviction. |
Intended Use
Use this to test the local palindrome RL scaffold, upload a tiny model to Hugging Face, or make a toy demo that should absolutely not be confused for a serious language model.
Limitations
- It only knows the palindrome environment.
- It can fail out of distribution.
- It is not a general-purpose model.
- The benchmark table above is fake.
Original Project
Tiny scaffold for a palindrome RL agent with:
- char-based observation encoding
- tiny decoder-only transformer policy/value network
- train/play CLI commands
Setup
uv sync
1) Train command scaffold
uv run palindrl-train --steps 5000 --batch-size 2048
By default it:
- uses
mpswhen available, elsecpu - trains with PPO on
RandomPalindromeEnv - logs to
runs/palindrl - saves checkpoint to
checkpoints/policy.pt
View logs:
uv run tensorboard --logdir runs/palindrl
Phased/curriculum training (warm-start between phases):
# Phase 1: shorter strings
uv run palindrl-train \
--steps 800 \
--env-max-len 16 \
--max-seq-len 128 \
--save-path checkpoints/policy_phase1.pt
# Phase 2: medium strings (continue from phase 1)
uv run palindrl-train \
--steps 800 \
--env-max-len 32 \
--max-seq-len 192 \
--init-checkpoint checkpoints/policy_phase1.pt \
--save-path checkpoints/policy_phase2.pt
# Phase 3: longer strings (continue from phase 2)
uv run palindrl-train \
--steps 1000 \
--env-max-len 64 \
--max-seq-len 320 \
--init-checkpoint checkpoints/policy_phase2.pt \
--save-path checkpoints/policy_phase3.pt
For char-mode training/inference, palindrome logic:
- ignores punctuation and spaces by default (
" " + string.punctuation) - compares letters case-insensitively by default (
Aandaare treated as equal)
So strings like A-b,c.a are normalized before pointer logic.
2) Play full episode
uv run palindrl-play --checkpoint checkpoints/policy.pt --text "Ola, sou-o nicolas."
This runs a full env loop until terminal answer and prints:
- per-step action trace (
COMPARE/MOVE_INWARD/ANSWER_*) - final predicted answer (
palindromeornot palindrome) - ground-truth label from the environment
Custom environment
The new environment lives in:
palindrl/environment/palindrome_env.py
This environment is used by palindrl-train and:
- generates random lowercase strings
- generates both classes by default (
--env-balanced-sampling, alternating palindrome/non-palindrome episodes) - optionally inserts spaces in the generated text
- exposes pointer state (
left/right,left_char/right_char) in the observation text - expects procedural actions:
0=COMPARE1=MOVE_INWARD2=ANSWER_PALINDROME3=ANSWER_NOT_PALINDROME
- uses shaped rewards (
compare,move,final answer,step penalty,timeout) - penalizes repeated
COMPAREon the same pointer pair to prevent reward farming - provides an
action_maskso PPO samples only structurally valid actions - keeps the same sampled word through the full episode; only
left/rightstate changes over timesteps
Quick example:
.venv/bin/python - <<'PY'
from palindrl.environment import RandomPalindromeEnv
env = RandomPalindromeEnv(space_probability=0.5)
obs, info = env.reset()
print(info["text"], info["is_palindrome"])
obs, reward, terminated, truncated, info = env.step(0) # COMPARE
print(info["action_name"], reward, terminated, truncated)
if not terminated and not truncated:
obs, reward, terminated, truncated, info = env.step(1) # MOVE_INWARD
print(info["action_name"], reward, terminated, truncated)
PY
Project files
palindrl/model.pytiny decoder-only transformer with policy/value headspalindrl/train.pyPPO training loop wired to the custom environmentpalindrl/play.pyfull-episode inference commandpalindrl/environment/palindrome_env.pyrandom-string Gymnasium environment
