Moonlit-SummaryStories-45M
Moonlit-SummaryStories-45M is a 45M-parameter TinyStories model specialized for Summary → Story generation. It starts from the pretrained checkpoint of razor5050/TinyStories-45M and is then fine-tuned with supervision to take a short summary prompt and generate a complete TinyStories-style story.
What this model does
Input format:
Summary: A little fox is afraid of the dark until a glowing jar helps him find his way home.
Story:
The model continues with a full short story.
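The input layout above can be produced with a small helper. This is only a sketch: `build_prompt` is a hypothetical name that is not part of the repository, and collapsing whitespace so the summary stays on one line is an assumption about what the model expects.

```python
def build_prompt(summary: str) -> str:
    """Wrap a summary in the 'Summary: ... / Story:' layout shown above."""
    # Assumption: the summary should occupy a single line, so internal
    # whitespace (including newlines) is collapsed to single spaces.
    one_line = " ".join(summary.split())
    return f"Summary: {one_line}\nStory:"


prompt = build_prompt(
    "A little fox is afraid of the dark until a glowing jar helps him find his way home."
)
print(prompt)
```

The returned string ends with `Story:` so that generation begins at the start of the story text.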
Model details
- Architecture: LLaMA-style decoder-only transformer
- Parameters: 45.46M
- Hidden size: 512
- Layers: 13
- Attention heads: 8
- KV heads (GQA): 4
- Intermediate size: 1344
- Vocabulary size: 16384
- Context length: 512
- Tokenizer: SentencePiece unigram
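As a rough cross-check, the parameter count can be recomputed from the dimensions listed above. The sketch assumes a conventional LLaMA-style layout that the card does not spell out: no bias terms, a three-projection gated MLP, two weight-only norms per layer plus a final norm, and input and output embeddings that share one matrix.

```python
# Dimensions taken from the model details list.
hidden = 512
layers = 13
heads = 8
kv_heads = 4
intermediate = 1344
vocab = 16384

head_dim = hidden // heads        # 64
kv_dim = kv_heads * head_dim      # 256, smaller than hidden because of GQA

# Attention: query and output projections are hidden x hidden,
# key and value projections are hidden x kv_dim.
attention = 2 * hidden * hidden + 2 * hidden * kv_dim

# Assumption: gated MLP with gate, up, and down projections, no biases.
mlp = 3 * hidden * intermediate

# Assumption: two weight-only norms per layer.
norms = 2 * hidden

per_layer = attention + mlp + norms

# Assumption: tied input/output embeddings, so the matrix is counted once.
embeddings = vocab * hidden
final_norm = hidden

total = layers * per_layer + embeddings + final_norm
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

Under these assumptions the total is 45,463,040, which rounds to the 45.46M listed above. An untied output head would add another 16384 × 512 weights, so the match suggests, but does not confirm, that the embeddings are tied.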
Training recipe
Pretraining base
- Base model: razor5050/TinyStories-45M
- Original pretraining dataset: roneneldan/TinyStories
- Original pretraining epochs: 3
Fine-tuning task
- Dataset source: roneneldan/TinyStoriesInstruct
- Finetuning format: Summary: ... Story: → full story
- Loss masking: prompt masked, loss only on story tokens
- No truncation policy: only samples that fully fit 512 total tokens were kept
- Usable SFT examples: 1702072
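The loss masking and no-truncation rules can be illustrated on token-id lists. This is a minimal sketch rather than the training code used for this model: `fits_context` and `build_example` are hypothetical helpers, and `-100` is the label value that PyTorch's cross-entropy loss ignores by default, assumed here as the masking mechanism. Whether an end-of-sequence token counted toward the 512-token limit is not stated, so it is left out.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default
MAX_LEN = 512        # total token budget from the no-truncation policy


def fits_context(prompt_ids, story_ids, max_len=MAX_LEN):
    """Keep a sample only if prompt and story fit together without truncation."""
    return len(prompt_ids) + len(story_ids) <= max_len


def build_example(prompt_ids, story_ids):
    """Concatenate prompt and story; mask the prompt so only story tokens are scored."""
    input_ids = list(prompt_ids) + list(story_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(story_ids)
    return {"input_ids": input_ids, "labels": labels}


# Toy token ids standing in for a tokenized "Summary: ... Story:" prompt and story.
prompt_ids = [5, 6, 7]
story_ids = [8, 9]

if fits_context(prompt_ids, story_ids):
    example = build_example(prompt_ids, story_ids)
    print(example)
```

Samples for which `fits_context` returns `False` would be dropped rather than cut short, which is how the usable example count falls below the size of the source dataset.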
Exact usable dataset size under 512-token no-truncation rule
- Train: 1685116
- Validation: 16956
- Total: 1702072
Fine-tuning hyperparameters
- Epochs: 1
- Effective batch size: 64
- Micro-batch size: 8
- Learning rate: 8e-5
- Scheduler: cosine decay
- Precision: FP16
- Max sequence length: 512
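The hyperparameters above imply a gradient accumulation factor and a step count, sketched below. Several details are assumptions because the card does not state them: training on a single device (so accumulation is effective batch divided by micro-batch), keeping the final partial batch, no warmup, and decay to a floor of zero.

```python
import math

train_examples = 1_685_116  # train split size from this card
effective_batch = 64
micro_batch = 8
peak_lr = 8e-5

# Assumption: one device, so accumulation covers the whole gap.
accumulation_steps = effective_batch // micro_batch

# Assumption: the last, partially filled batch is kept.
total_steps = math.ceil(train_examples / effective_batch)


def cosine_lr(step, total=total_steps, peak=peak_lr, floor=0.0):
    """Cosine decay from peak to floor over `total` optimizer steps (no warmup assumed)."""
    progress = min(max(step / total, 0.0), 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


print("accumulation steps:", accumulation_steps)
print("optimizer steps per epoch:", total_steps)
print("lr at start, midpoint, end:",
      cosine_lr(0), cosine_lr(total_steps // 2), cosine_lr(total_steps))
```

With one epoch, the schedule spans roughly 26 thousand optimizer steps under these assumptions.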
Evaluation
- Validation loss: 1.238696612096066
- Perplexity: 3.4511122703552246
- Example generations: see evaluation/40_prompts.json
- Evaluation report: see evaluation/report.md
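The two metrics are related: perplexity is the exponential of the loss, assuming the validation loss is a mean per-token cross-entropy measured in nats (the card does not say so explicitly). A short check:

```python
import math

val_loss = 1.238696612096066             # reported validation loss
reported_perplexity = 3.4511122703552246  # reported perplexity

# Perplexity as exp(mean cross-entropy in nats).
perplexity = math.exp(val_loss)

print(f"exp(loss) = {perplexity:.6f}")
print(f"difference from reported value = {abs(perplexity - reported_perplexity):.2e}")
```

The recomputed value agrees with the reported one to several significant digits; a small residual is what one would expect if the original figure was computed in lower precision.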
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "razor5050/Moonlit-SummaryStories-45M"
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# The prompt is the summary line followed by "Story:" on its own line.
prompt = "Summary: A shy rabbit learns to sing with the help of fireflies.\nStory:"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a story continuation after the prompt.
outputs = model.generate(
    **inputs,
    max_new_tokens=220,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    top_k=50,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Files in this repo
- Root: final finetuned model
- checkpoints/base_pretrain_final/: pretrained base checkpoint used for finetuning
- checkpoints/sft/: intermediate SFT checkpoints and final SFT export
- evaluation/: metrics, prompt generations, and report
Hardware
- Training GPU: NVIDIA RTX 3060 12GB
- Intended deployment class: small creative story model
Notes
This model is optimized for TinyStories-style English story generation from a short summary prompt. Because the model context window is 512 tokens total, longer prompts reduce the available generation budget.
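The trade-off between prompt length and generation budget is simple subtraction, sketched here. `generation_budget` is a hypothetical helper, not part of the repository; in practice the prompt token count would come from the model's tokenizer.

```python
CONTEXT_LENGTH = 512  # total context window stated in this card


def generation_budget(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Tokens left for the story once the prompt has been placed in the context."""
    return max(context_length - prompt_tokens, 0)


# Illustrative prompt lengths (not measured from real prompts).
for prompt_tokens in (30, 120, 400):
    print(prompt_tokens, "prompt tokens ->", generation_budget(prompt_tokens), "tokens left")
```

One way to use this is to pass the smaller of the desired story length and the remaining budget as `max_new_tokens`, so that generation does not run past the 512-token window.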
Generated on 2026-05-23 11:04:08