ConeML 348M Beta
ConeML 348M Beta is the second public release in the ConeML research series: a 348M-parameter small language model trained from scratch. It succeeds ConeML 348M Alpha (polish900) and scores higher than it on most of the held-out reasoning, code, and arithmetic evaluations reported below; calibration is reported for Beta only. It is a research artifact and beta candidate, not a polished general assistant.
Why ConeML Exists
ConeML is an independent research effort exploring how much capability a compact language model can reach through deliberate data and curriculum design rather than scale alone. The clearest capability carried forward from Alpha is held-out transitive relation binding; Beta extends that capability while improving code generation and arithmetic in the same model.
Evaluations vs Alpha
All numbers are held-out probes (fresh entities disjoint from training), measured with the same protocol on both models.
Transitive inference, chat surface, first-choice accuracy, depths 1–5:
| Suite | Alpha | Beta |
|---|---|---|
| older / younger relation | 79 / 89 / 88 / 77 / 71 | 93 / 91 / 93 / 86 / 82 |
| unseen query phrasing | 56 / 73 / 59 / 48 / 34 | 69 / 66 / 67 / 72 / 76 |
| non-name entities (colored cards) | 51 / 50 / 41 / 31 / 28 | 62 / 63 / 37 / 34 / 23 (still weak in both models) |
Other capabilities:
| Metric | Alpha | Beta |
|---|---|---|
| Code strict-exec (held-out functions) | 16.7% | 45% |
| Arithmetic, held-out 10-bucket (sympy-checked) | ~21% | 33% |
| Aggregate held-out perplexity | 9.17 | 6.24 |
| Calibration ECE (reasoning / code / agentic) | — | 0.037 / 0.032 / 0.015 |
| Output format | indentation unstable | clean first-token answers |
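The card does not say how the calibration ECE figures were computed. As a rough illustration only, the sketch below uses the common equal-width-bin definition of Expected Calibration Error; the bin count, the function name, and the toy data are assumptions, not the authors' protocol.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1] (assumed definition)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Per-bin running sums: [sum of confidence, sum of correctness, count]
    bins = [[0.0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx][0] += conf
        bins[idx][1] += 1.0 if ok else 0.0
        bins[idx][2] += 1

    # Weighted average of |accuracy - mean confidence| over non-empty bins
    ece = 0.0
    for conf_sum, acc_sum, count in bins:
        if count:
            ece += (count / n) * abs(acc_sum / count - conf_sum / count)
    return ece


# Toy data, invented for illustration only
toy_conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
toy_correct = [True, True, True, False, True, False]
print(round(expected_calibration_error(toy_conf, toy_correct), 3))
```

A value near zero means stated confidence tracks observed accuracy within each bin; it says nothing about how high that accuracy is.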
Standard public benchmarks (zero-shot, chat format) — reported for comparability, and modest as expected at this scale:
| Benchmark | Beta |
|---|---|
| GSM8K (300-item sample, exact-match) | 5.0% |
| HumanEval (pass@1, 164) | 0% |
These two numbers measure full multi-step / algorithmic problem-solving, which is beyond a 348M model: GSM8K reflects the unsolved multi-digit arithmetic, and HumanEval requires complete algorithmic solutions (the 45% code figure above is held-out simple function-body completion — a different and easier task). They are published for transparency, not as strengths.
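The card does not define "strict-exec". One plausible reading, sketched below, is that a generated function counts as correct only if it executes without error and returns exactly the expected value on every test case. The helper name, the case format, and the sample function are hypothetical.

```python
def passes_strict_exec(source, func_name, cases):
    """Return True only if `source` defines `func_name` and every case matches exactly."""
    namespace = {}
    try:
        # Runs untrusted code: a real harness should sandbox this step.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing names, and runtime errors all count as failures.
        return False


# Hypothetical model output and test cases
candidate = "def add(a, b):\n    return a + b\n"
print(passes_strict_exec(candidate, "add", [((1, 2), 3), ((0, 0), 0)]))
```

Under this reading there is no partial credit, which is one reason such a score can differ sharply from HumanEval pass@1 on harder problems.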
On these evaluations Beta improves over Alpha on held-out reasoning, code execution, arithmetic, perplexity, and output formatting. On the older/younger relation suite it is higher at every depth; on unseen query phrasing it is higher at depths 1, 3, 4, and 5, while Alpha is higher at depth 2 (73 vs. 66). On the non-name-entity suite the picture is mixed: Beta is higher at depths 1, 2, and 4 and lower at depths 3 and 5. Alpha's internal fixed-template probe saturated at 100% (depths 1–3); Beta's held-out template accuracy is 99 / 97 / 95, which is effectively equal on a harder probe.
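The per-depth comparison can be checked directly from the transitive-inference table. The short script below copies those numbers and prints Beta minus Alpha at each depth; the variable and function names are illustrative.

```python
# Accuracy (%) at depths 1-5, copied from the transitive-inference table: (Alpha, Beta)
RESULTS = {
    "older / younger relation": ([79, 89, 88, 77, 71], [93, 91, 93, 86, 82]),
    "unseen query phrasing": ([56, 73, 59, 48, 34], [69, 66, 67, 72, 76]),
    "non-name entities (colored cards)": ([51, 50, 41, 31, 28], [62, 63, 37, 34, 23]),
}


def depth_deltas(alpha, beta):
    """Beta minus Alpha at each depth, in percentage points."""
    return [b - a for a, b in zip(alpha, beta)]


for suite, (alpha, beta) in RESULTS.items():
    deltas = depth_deltas(alpha, beta)
    alpha_ahead = [depth for depth, d in enumerate(deltas, start=1) if d < 0]
    print(f"{suite}: deltas={deltas}, Alpha ahead at depths {alpha_ahead}")
```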
Intended Format
Prompt the model in the chat format below, using the exact `User:` / `Assistant:` markers. Raw completion (without the markers) produces degraded output. The template also ships in `chat_template.jinja` / `tokenizer_config.json`, so `tokenizer.apply_chat_template(...)` works directly.
```
User:
<instruction>
Assistant:
```
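For callers who build the prompt string by hand, a minimal helper might look like the sketch below. The newline layout follows the prompt string used in the Loading example; the function name is hypothetical, and only the single-turn case is covered because the card does not describe multi-turn formatting.

```python
def build_prompt(instruction: str) -> str:
    """Wrap one instruction in the User: / Assistant: markers (single turn)."""
    return f"User:\n{instruction.strip()}\nAssistant:\n"


print(build_prompt("Mia is taller than Ben. Ben is taller than Zoe. Who is tallest?"))
```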
Loading
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ConeML/coneml-348m-beta"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.float32, device_map="auto")

# The prompt uses the exact User: / Assistant: markers.
prompt = "User:\nMia is taller than Ben. Ben is taller than Zoe. Who is tallest? Return only the name.\nAssistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; a one-word answer needs only a few new tokens.
out = model.generate(**inputs, max_new_tokens=16, do_sample=False, eos_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
Architecture
- Family: Llama-style decoder · Parameters: ~348M · Layers: 30 · Hidden: 1024
- Attention heads: 8 · KV heads: 2 · Vocab: 32768 · Context length: 512
- Tokenizer: custom 32K
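The figures above are enough for a back-of-envelope KV-cache estimate. The sketch below assumes the head dimension equals hidden size divided by attention heads (the usual Llama-style default, not stated in the card) and float32 storage, matching the dtype in the Loading example.

```python
# Values from the Architecture list
LAYERS = 30
HIDDEN = 1024
ATTENTION_HEADS = 8
KV_HEADS = 2
CONTEXT_LENGTH = 512

# Assumption: head_dim = hidden / attention heads (not stated in the card).
HEAD_DIM = HIDDEN // ATTENTION_HEADS


def kv_cache_bytes(seq_len, bytes_per_value=4, batch_size=1):
    """Bytes for keys plus values across all layers (float32 by default)."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * seq_len * batch_size


print(ATTENTION_HEADS // KV_HEADS)             # query heads sharing each KV head
print(kv_cache_bytes(CONTEXT_LENGTH) / 2**20)  # MiB at full context, batch size 1
```

With 2 KV heads instead of 8, the cache is a quarter of what full multi-head attention would need at the same head dimension.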
Strengths
- Scratch-trained 348M model that improves on its Alpha predecessor across most of the held-out evaluations reported here.
- Held-out transitive binding that generalizes across new names, new relations, and unseen query phrasing: higher than Alpha at every depth on the older/younger suite, and at most depths under unseen query phrasing.
- Usable Python function-body generation with stable formatting (45% strict execution on the held-out evaluation reported here).
- Materially improved held-out arithmetic over Alpha (33% vs. ~21%).
- Well calibrated on reasoning/code/agentic (ECE ≤ 0.04), which is uncommon for models this size.
Known Limitations
- Multi-digit arithmetic is weak. Held-out 10-bucket arithmetic is 33% overall; reliable 3-digit and multiplication computation is not solved.
- Context length is 512 tokens; longer inputs are out of scope for this release.
- Transitive binding for non-name entities (e.g., colored cards) is near chance at depths 3 to 5 (Beta: 37 / 34 / 23), so binding still depends partly on surface form.
- All figures are research results from held-out probes and the standard benchmarks above — not production guarantees.
- Research release, not a replacement for larger general assistants.
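Because the context window is 512 tokens, the prompt and the generated output have to share that budget. A minimal sketch of a guard is shown below; the function name is hypothetical, and the prompt token count is assumed to come from something like `inputs["input_ids"].shape[-1]` in the Loading example.

```python
CONTEXT_LENGTH = 512  # from the Architecture section


def max_new_tokens_for(prompt_token_count, requested, context_length=CONTEXT_LENGTH):
    """Clamp a generation request so prompt + output stays within the context window."""
    if prompt_token_count >= context_length:
        raise ValueError(
            f"prompt uses {prompt_token_count} tokens; the window is {context_length}"
        )
    return min(requested, context_length - prompt_token_count)


print(max_new_tokens_for(prompt_token_count=500, requested=40))  # prints 12
```

Raising on an over-long prompt, rather than truncating it, avoids silently cutting off the `Assistant:` marker that the model expects at the end.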
License
Released for non-commercial use under CC BY-NC 4.0. Commercial use is not granted by this release.