Text Generation · Transformers · Safetensors · GGUF · Korean · English
Tags: llama · 3b · korean · from-scratch · orpo · instruction-tuned · preference-aligned · fp8 · b200 · text-generation-inference
Instructions for using pathcosmos/frankenstallm with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use pathcosmos/frankenstallm with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="pathcosmos/frankenstallm")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pathcosmos/frankenstallm")
model = AutoModelForCausalLM.from_pretrained("pathcosmos/frankenstallm")
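The tags above mark the model as instruction-tuned (ORPO), so chat-style prompting through the tokenizer's chat template may work better than a raw completion prompt. A minimal sketch, assuming the repository ships a chat template; the message text and the sampling values (borrowed from the defaults in the repository's generate.py) are only illustrative:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pathcosmos/frankenstallm")
model = AutoModelForCausalLM.from_pretrained(
    "pathcosmos/frankenstallm", torch_dtype=torch.float16, device_map="auto"
)

# Example message only; assumes tokenizer_config.json defines a chat template.
messages = [{"role": "user", "content": "Introduce yourself briefly in Korean."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids, max_new_tokens=200, do_sample=True, temperature=0.8, top_p=0.9
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))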
- llama-cpp-python
How to use pathcosmos/frankenstallm with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="pathcosmos/frankenstallm",
    filename="gguf/frankenstallm-3b-Q4_K_M.gguf",
)

output = llm(
    "Once upon a time,",
    max_tokens=512,
    echo=True,
)
print(output)
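llama-cpp-python also exposes an OpenAI-style chat interface, which suits the instruction-tuned variant. A minimal sketch, assuming the GGUF file embeds a chat template; the message content is just an example:

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="pathcosmos/frankenstallm",
    filename="gguf/frankenstallm-3b-Q4_K_M.gguf",
)
# Chat-style call; the prompt formatting is taken from the GGUF's chat template.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Introduce yourself briefly in Korean."}],
    max_tokens=256,
    temperature=0.8,
)
print(response["choices"][0]["message"]["content"])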
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use pathcosmos/frankenstallm with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pathcosmos/frankenstallm:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf pathcosmos/frankenstallm:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pathcosmos/frankenstallm:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf pathcosmos/frankenstallm:Q4_K_M
Use pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf pathcosmos/frankenstallm:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf pathcosmos/frankenstallm:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf pathcosmos/frankenstallm:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf pathcosmos/frankenstallm:Q4_K_M
Use Docker
docker model run hf.co/pathcosmos/frankenstallm:Q4_K_M
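Whichever install route you choose, llama-server exposes an OpenAI-compatible HTTP API that can also be called from Python. A rough sketch using the requests library, assuming the server was started as shown above and is listening on its default port 8080; the chat-style payload is only an example:

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Once upon a time,"}],
        "max_tokens": 512,
        "temperature": 0.5,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])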
- LM Studio
- Jan
- vLLM
How to use pathcosmos/frankenstallm with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "pathcosmos/frankenstallm"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pathcosmos/frankenstallm",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
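Because vLLM serves an OpenAI-compatible API, the official openai Python client can be used instead of curl. A minimal sketch mirroring the curl request above (requires pip install openai; the api_key value is a placeholder, since the local server does not check it by default):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="pathcosmos/frankenstallm",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)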
Use Docker
docker model run hf.co/pathcosmos/frankenstallm:Q4_K_M
- SGLang
How to use pathcosmos/frankenstallm with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "pathcosmos/frankenstallm" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pathcosmos/frankenstallm",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "pathcosmos/frankenstallm" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "pathcosmos/frankenstallm",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

- Ollama
How to use pathcosmos/frankenstallm with Ollama:
ollama run hf.co/pathcosmos/frankenstallm:Q4_K_M
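Ollama also exposes a local REST API (port 11434 by default), so the pulled model can be called from Python. A minimal sketch using requests; the model name assumes the hf.co pull shown above, and the prompt is only an example:

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "hf.co/pathcosmos/frankenstallm:Q4_K_M",
        "prompt": "Once upon a time,",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])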
- Unsloth Studio
How to use pathcosmos/frankenstallm with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for pathcosmos/frankenstallm to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for pathcosmos/frankenstallm to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for pathcosmos/frankenstallm to start chatting
- Docker Model Runner
How to use pathcosmos/frankenstallm with Docker Model Runner:
docker model run hf.co/pathcosmos/frankenstallm:Q4_K_M
- Lemonade
How to use pathcosmos/frankenstallm with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull pathcosmos/frankenstallm:Q4_K_M
Run and chat with the model
lemonade run user.frankenstallm-Q4_K_M
List all available models
lemonade list
| """ | |
| Text generation (inference) script with temperature + top-p / top-k sampling. | |
| Usage: | |
| python eval/generate.py \ | |
| --checkpoint checkpoints/checkpoint-0100000 \ | |
| --prompt "Once upon a time" \ | |
| --max_new_tokens 200 \ | |
| --temperature 0.8 \ | |
| --top_p 0.9 \ | |
| --top_k 50 \ | |
| --device cuda:0 | |
| """ | |
| from __future__ import annotations | |
| import argparse | |
| import sys | |
| from pathlib import Path | |
| from typing import Generator | |
| import torch | |
| import torch.nn.functional as F | |
| from model.transformer import LLM | |
| from tokenizers import Tokenizer | |
| # --------------------------------------------------------------------------- | |
| # Sampling utilities | |
| # --------------------------------------------------------------------------- | |
| def top_p_filtering( | |
| logits: torch.Tensor, | |
| top_p: float = 0.9, | |
| top_k: int = 0, | |
| filter_value: float = float("-inf"), | |
| ) -> torch.Tensor: | |
| """ | |
| Apply top-k and / or top-p (nucleus) filtering to a logits tensor. | |
| Args: | |
| logits: 1-D or 2-D tensor of raw (un-normalised) logits. | |
| Shape: [vocab_size] or [batch, vocab_size]. | |
| top_k: Keep only the top-k tokens (0 = disabled). | |
| top_p: Keep the smallest set of tokens whose cumulative | |
| probability is >= top_p (1.0 = disabled). | |
| filter_value: Value assigned to filtered positions (−inf by default). | |
| Returns: | |
| Filtered logits with the same shape as input. | |
| """ | |
| # Work on a 2-D tensor [batch, vocab]. | |
| if logits.dim() == 1: | |
| logits = logits.unsqueeze(0) | |
| squeeze_output = True | |
| else: | |
| squeeze_output = False | |
| # --- Top-K --- | |
| if top_k > 0: | |
| k = min(top_k, logits.size(-1)) | |
| # Find the k-th largest value for each row. | |
| kth_values = torch.topk(logits, k, dim=-1).values[:, -1, None] | |
| logits = logits.masked_fill(logits < kth_values, filter_value) | |
| # --- Top-P (nucleus) --- | |
| if 0.0 < top_p < 1.0: | |
| sorted_logits, sorted_indices = torch.sort(logits, dim=-1, descending=True) | |
| cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1) | |
| # Remove tokens once cumulative probability exceeds top_p. | |
| # Shift right by one so that the token that *pushes* the cumulative | |
| # probability over the threshold is kept. | |
| sorted_indices_to_remove = cumulative_probs - F.softmax( | |
| sorted_logits, dim=-1 | |
| ) >= top_p | |
| sorted_logits = sorted_logits.masked_fill( | |
| sorted_indices_to_remove, filter_value | |
| ) | |
| # Scatter filtered sorted_logits back to the original ordering. | |
| logits = torch.zeros_like(logits).scatter_( | |
| -1, sorted_indices, sorted_logits | |
| ) | |
| if squeeze_output: | |
| logits = logits.squeeze(0) | |
| return logits | |
| # --------------------------------------------------------------------------- | |
| # Generation | |
| # --------------------------------------------------------------------------- | |
| def generate( | |
| model: torch.nn.Module, | |
| tokenizer: Tokenizer, | |
| prompt: str, | |
| max_new_tokens: int = 200, | |
| temperature: float = 0.8, | |
| top_p: float = 0.9, | |
| top_k: int = 50, | |
| device: str = "cuda:0", | |
| ) -> Generator[str, None, None]: | |
| """ | |
| Auto-regressive token generation with streaming output. | |
| Yields decoded string fragments (one token at a time) so callers can | |
| stream output to stdout without waiting for the full sequence. | |
| Args: | |
| model: A causal LM whose forward pass returns logits | |
| (last dim = vocab_size). | |
| tokenizer: Matching tokenizer; must expose encode / decode. | |
| prompt: Text prompt to condition on. | |
| max_new_tokens: Maximum number of new tokens to generate. | |
| temperature: Softmax temperature (1.0 = neutral, <1 = sharper). | |
| top_p: Nucleus sampling probability threshold. | |
| top_k: Top-K token candidates (0 = disabled). | |
| device: Torch device string. | |
| Yields: | |
| Decoded string for each newly generated token. | |
| """ | |
| model.eval() | |
| # Encode prompt. | |
| input_ids = torch.tensor([tokenizer.encode(prompt).ids], dtype=torch.long, device=device) | |
| eos_token_id: int | None = tokenizer.token_to_id("</s>") | |
| # Incremental generation. | |
| generated_ids = input_ids | |
| for _ in range(max_new_tokens): | |
| # Full-sequence forward (no KV cache) — each step re-runs all tokens. | |
| logits_all, _ = model(generated_ids) | |
| logits: torch.Tensor = logits_all[:, -1, :] # [1, vocab] | |
| # --- Temperature scaling --- | |
| if temperature != 1.0: | |
| logits = logits / max(temperature, 1e-8) | |
| # --- Top-k / Top-p filtering --- | |
| logits = top_p_filtering(logits, top_p=top_p, top_k=top_k) | |
| # --- Sample --- | |
| probs = F.softmax(logits, dim=-1) | |
| next_token_id = torch.multinomial(probs, num_samples=1) # [1, 1] | |
| generated_ids = torch.cat([generated_ids, next_token_id], dim=-1) | |
| # Decode and yield the new token. | |
| token_str: str = tokenizer.decode([next_token_id.item()]) | |
| yield token_str | |
| # Stop at EOS. | |
| if eos_token_id is not None and next_token_id.item() == eos_token_id: | |
| break | |
| # --------------------------------------------------------------------------- | |
| # Checkpoint loading | |
| # --------------------------------------------------------------------------- | |
| def load_model_and_tokenizer( | |
| checkpoint_dir: str, device: str | |
| ) -> tuple[torch.nn.Module, Tokenizer]: | |
| """ | |
| Load a model and tokenizer from a checkpoint directory. | |
| Expects: | |
| - <checkpoint_dir>/model.pt — model weights | |
| - <checkpoint_dir>/config.yaml — LMConfig | |
| - <checkpoint_dir>/tokenizer.json — HuggingFace tokenizers format | |
| """ | |
| ckpt_path = Path(checkpoint_dir) | |
| if not ckpt_path.exists(): | |
| raise FileNotFoundError(f"Checkpoint directory not found: {ckpt_path}") | |
| print(f"Loading model from: {ckpt_path}") | |
| model = LLM.from_pretrained(str(ckpt_path)).to(device=device, dtype=torch.float16) | |
| model.eval() | |
| tokenizer_path = ckpt_path / "tokenizer.json" | |
| if not tokenizer_path.exists(): | |
| # Fallback: try project-level tokenizer | |
| tokenizer_path = Path("tokenizer/korean_sp/tokenizer.json") | |
| print(f"Loading tokenizer from: {tokenizer_path}") | |
| tokenizer = Tokenizer.from_file(str(tokenizer_path)) | |
| return model, tokenizer | |
| # --------------------------------------------------------------------------- | |
| # Argument parsing | |
| # --------------------------------------------------------------------------- | |
| def parse_args() -> argparse.Namespace: | |
| parser = argparse.ArgumentParser( | |
| description="Generate text from a trained LLM checkpoint." | |
| ) | |
| parser.add_argument( | |
| "--checkpoint", | |
| required=True, | |
| help="Path to the checkpoint directory.", | |
| ) | |
| parser.add_argument( | |
| "--prompt", | |
| required=True, | |
| help="Input prompt text.", | |
| ) | |
| parser.add_argument( | |
| "--max_new_tokens", | |
| type=int, | |
| default=200, | |
| help="Maximum number of new tokens to generate (default: 200).", | |
| ) | |
| parser.add_argument( | |
| "--temperature", | |
| type=float, | |
| default=0.8, | |
| help="Sampling temperature (default: 0.8).", | |
| ) | |
| parser.add_argument( | |
| "--top_p", | |
| type=float, | |
| default=0.9, | |
| help="Top-p nucleus sampling threshold (default: 0.9).", | |
| ) | |
| parser.add_argument( | |
| "--top_k", | |
| type=int, | |
| default=50, | |
| help="Top-k token candidates; 0 disables top-k (default: 50).", | |
| ) | |
| parser.add_argument( | |
| "--device", | |
| default="cuda:0", | |
| help="Torch device to run inference on (default: cuda:0).", | |
| ) | |
| return parser.parse_args() | |
| # --------------------------------------------------------------------------- | |
| # Entry point | |
| # --------------------------------------------------------------------------- | |
| def main() -> None: | |
| args = parse_args() | |
| model, tokenizer = load_model_and_tokenizer(args.checkpoint, args.device) | |
| num_params = sum(p.numel() for p in model.parameters()) | |
| print(f"Model parameters: {num_params / 1e6:.1f}M") | |
| print(f"\nPrompt: {args.prompt!r}") | |
| print("-" * 60) | |
| print(args.prompt, end="", flush=True) | |
| generated_tokens = 0 | |
| for token_str in generate( | |
| model=model, | |
| tokenizer=tokenizer, | |
| prompt=args.prompt, | |
| max_new_tokens=args.max_new_tokens, | |
| temperature=args.temperature, | |
| top_p=args.top_p, | |
| top_k=args.top_k, | |
| device=args.device, | |
| ): | |
| print(token_str, end="", flush=True) | |
| generated_tokens += 1 | |
| print() # newline after generation | |
| print("-" * 60) | |
| print(f"Generated {generated_tokens} token(s).") | |
| if __name__ == "__main__": | |
| main() | |