DIEGETIC: Epistemic Reasoning Language Model

Model Name: diegetic-v2-qwen2.5-1.5b-lora
Base Model: Qwen/Qwen2.5-1.5B
Method: LoRA Fine-tuning
License: Apache 2.0

Model Description

DIEGETIC (Dynamically-grounded Inference Engine for Generative Epistemic Tracking In Conversation) is a language model fine-tuned for epistemic reasoning: tracking what it knows, what it does not know, and how much certainty each claim warrants.

Key Capabilities

Source Attribution: Properly cites information sources with appropriate confidence
Appropriate Refusal: Declines to answer unknowable questions instead of hallucinating an answer
Graduated Uncertainty: Uses calibrated confidence levels (high/medium/low/n/a)
Evidence Citation: Backs claims with specific observations
Unknown Tracking: Explicitly lists what information is missing or unknowable

Use Cases

Question answering from observations/context
Information verification and fact-checking
Legal/compliance reasoning with source tracking
Medical reasoning with uncertainty quantification
Scientific reasoning with explicit unknowns
Any application requiring transparent reasoning with proper confidence levels

Model Details

Architecture

Base Model: Qwen2.5-1.5B (1.5 billion parameters)
Fine-tuning Method: LoRA (Low-Rank Adaptation)
- Rank (r): 16
- Alpha: 32
- Dropout: 0.05
- Target modules: All attention and MLP layers
Trainable Parameters: 18.5M (1.18% of total)
Total Parameters: 1.56B
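
The trainable-parameter figure above can be sanity-checked with a little arithmetic. The sketch below assumes the layer shapes and base parameter count of the public Qwen2.5-1.5B configuration (hidden size 1536, MLP size 8960, 28 layers, 2 key/value heads of dimension 128) and assumes that "all attention and MLP layers" means the seven projection modules listed; none of these values are stated in this card.

```python
# Layer shapes below are ASSUMPTIONS taken from the public Qwen2.5-1.5B config,
# not from this model card.
HIDDEN = 1536                # hidden_size
INTERMEDIATE = 8960          # intermediate_size (MLP)
KV_DIM = 256                 # num_key_value_heads (2) * head_dim (128)
NUM_LAYERS = 28
BASE_PARAMS = 1_543_714_304  # assumed base parameter count (~1.54B)

RANK = 16  # LoRA rank from the card

# (in_features, out_features) of each adapted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank): rank * (in + out)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


trainable = lora_param_count(RANK)
total = BASE_PARAMS + trainable
print(f"trainable: {trainable:,}")               # 18,464,768
print(f"total: {total / 1e9:.2f}B")              # 1.56B
print(f"share: {100 * trainable / total:.2f}%")  # 1.18%
```

Under these assumptions the result agrees with the 18.5M, 1.56B, and 1.18% figures reported above.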

Training Data

Dataset Size: 20,000 synthetic examples
Training Format: Natural language observations + JSON responses
Categories:
- Observation Use (25%)
- Source Attribution (20%)
- Complete Refusal (15%)
- Graduated Uncertainty (15%)
- Reasoning from Knowledge (12.5%)
- Multi-step Reasoning (7.5%)
- Partial Information (5%)

Training Configuration

Epochs: 3
Batch Size: 4 (per device)
Gradient Accumulation: 4
Effective Batch Size: 16
Learning Rate: 2e-5
Max Sequence Length: 512 tokens
Optimizer: AdamW
Training Time: 1.73 hours (NVIDIA A100)
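
As a rough illustration of how these settings combine, the sketch below derives the effective batch size and an upper bound on the number of optimizer steps. It assumes a single device and that all 20,000 examples are used for training; a held-out evaluation split would lower the step count.

```python
import math

dataset_size = 20_000   # from the Training Data section
per_device_batch = 4
grad_accum_steps = 4
num_devices = 1         # ASSUMPTION: one A100, as the training time suggests
epochs = 3

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Upper bound: assumes no examples were held out for evaluation.
steps_per_epoch = math.ceil(dataset_size / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch)   # 16
print(steps_per_epoch)   # 1250
print(total_steps)       # 3750
```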

Performance Metrics

Final Training Loss: 0.0196
Evaluation Loss: 0.0196
Token Accuracy: 99.24%
Valid JSON Rate: 100% (20/20 tests)
Refusal Accuracy: 100% (5/5 unknowable scenarios)
Overall Test Pass Rate: 100% (20/20 diverse scenarios)

Usage

Installation

pip install transformers peft torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
import json

# Load model
model_path = "howellx/diegetic-v2-qwen2.5-1.5b-lora"
base_model_name = "Qwen/Qwen2.5-1.5B"

# Load tokenizer from base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Add custom special tokens
special_tokens = {
    "additional_special_tokens": [
        "<OBS>", "</OBS>",
        "<BELIEF>", "</BELIEF>",
        "<MEM>", "</MEM>",
        "<TASK>", "</TASK>",
        "<OUTPUT_JSON>", "</OUTPUT_JSON>",
        "<EPISTEMIC>",
        "<REFUSE_DIEGETIC>"
    ]
}
tokenizer.add_special_tokens(special_tokens)

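# Load base model (float16 on GPU; float32 on CPU, where half precision is poorly supported)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    low_cpu_mem_usage=True
)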

# Resize token embeddings
base_model.resize_token_embeddings(len(tokenizer))

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, model_path, is_trainable=False)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Format prompt
prompt = """You are a helpful assistant that answers questions based solely on provided observations.

What you observed:
Sarah told you that the meeting was cancelled. You have not verified this independently.

Question: Is the meeting cancelled?

Please provide your response as JSON with the following structure:
{
  "claims": ["list of claims you are making"],
  "confidence": "high/medium/low/n/a",
  "evidence": ["list of evidence from observations"],
  "unknown": ["list of things you don't know"]
}

Response:"""

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

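# Parse JSON: take the span from the first "{" to the last "}"
json_start = response.find("{")
json_end = response.rfind("}") + 1
if json_start == -1 or json_end <= json_start:
    raise ValueError(f"No JSON object found in model output: {response!r}")
result = json.loads(response[json_start:json_end])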

print(json.dumps(result, indent=2))
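
Slicing from the first "{" to the last "}" breaks if the model emits extra text containing braces after the JSON object. A more tolerant option, sketched here as a suggestion rather than as part of the released code, scans for the first position where the standard library decoder can parse a complete object. The helper name `extract_first_json_object` is hypothetical.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first decodable JSON object embedded in `text`."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores anything that follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in model response")


sample = 'Sure. {"claims": [], "confidence": "n/a"} Trailing text with a stray }'
print(extract_first_json_object(sample))
```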

Example Output

{
  "claims": [
    "Sarah said the meeting was cancelled"
  ],
  "confidence": "low",
  "evidence": [
    "Sarah stated cancellation"
  ],
  "unknown": [
    "Meeting actual status"
  ]
}

Performance Benchmarks

Test Suite Results (20 Diverse Scenarios)

Category	Tests	Pass Rate	Quality Score
Refusal (Unknowable)	5	100%	4.00/4.0
Multi-step Reasoning	3	100%	4.00/4.0
Partial Information	3	100%	4.00/4.0
Source Attribution	3	100%	4.00/4.0
Graduated Uncertainty	3	100%	4.00/4.0
Direct Observation	3	100%	4.00/4.0
OVERALL	20	100%	4.00/4.0

Prompt Format

The model expects a specific prompt format:

You are a helpful assistant that answers questions based solely on provided observations.

What you observed:
[Your observations or context here]

Question: [Your question here]

Please provide your response as JSON with the following structure:
{
  "claims": ["list of claims you are making"],
  "confidence": "high/medium/low/n/a",
  "evidence": ["list of evidence from observations"],
  "unknown": ["list of things you don't know"]
}

Response:
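
A small helper can fill this template so the wording stays identical across calls. This is a minimal sketch; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, not part of the released code.

```python
# The literal braces of the JSON skeleton are doubled so str.format leaves them alone.
PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based solely on provided observations.

What you observed:
{observations}

Question: {question}

Please provide your response as JSON with the following structure:
{{
  "claims": ["list of claims you are making"],
  "confidence": "high/medium/low/n/a",
  "evidence": ["list of evidence from observations"],
  "unknown": ["list of things you don't know"]
}}

Response:"""


def build_prompt(observations: str, question: str) -> str:
    """Insert the observations and question into the expected prompt format."""
    return PROMPT_TEMPLATE.format(
        observations=observations.strip(),
        question=question.strip(),
    )


prompt = build_prompt(
    "Sarah told you that the meeting was cancelled. You have not verified this independently.",
    "Is the meeting cancelled?",
)
print(prompt)
```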

Response Structure

claims: Array of factual claims being made
confidence: Overall confidence level
- high: Direct observation, verified facts
- medium: Reasonable inference, needs verification
- low: Uncertain, conflicting info, second-hand
- n/a: Unknowable, insufficient information
evidence: Array of evidence supporting claims
unknown: Array of information that is missing or unknowable
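
Because downstream code depends on this structure, it can help to check a parsed response before using it. The validator below is an illustrative sketch based only on the field descriptions above; `validate_response` is a hypothetical helper, and treating every array element as a string is an assumption.

```python
from typing import List

ALLOWED_CONFIDENCE = {"high", "medium", "low", "n/a"}
LIST_FIELDS = ("claims", "evidence", "unknown")


def validate_response(result: object) -> List[str]:
    """Return a list of problems; an empty list means the structure looks valid."""
    if not isinstance(result, dict):
        return ["response must be a JSON object"]
    problems = []
    for field in LIST_FIELDS:
        value = result.get(field)
        # ASSUMPTION: every array element is a plain string.
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field!r} must be a list of strings")
    confidence = result.get("confidence")
    if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
        problems.append(f"'confidence' must be one of {sorted(ALLOWED_CONFIDENCE)}")
    return problems


example = {
    "claims": ["Sarah said the meeting was cancelled"],
    "confidence": "low",
    "evidence": ["Sarah stated cancellation"],
    "unknown": ["Meeting actual status"],
}
print(validate_response(example))  # []
```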

Limitations

Known Limitations

Arithmetic: May have minor calculation errors
Context Length: Limited to 512 tokens, the maximum sequence length used in training
Language: Primarily English
Reasoning Depth: Complex multi-hop reasoning may need additional prompting

When NOT to Use

Critical decision-making without human oversight
Medical diagnosis (use only as decision support with expert review)
Legal advice (use only for research with professional review)
Financial advice (requires expert validation)
Safety-critical systems without extensive validation

Recommended Use

Research and analysis with human oversight
Information organization and structuring
Question answering from provided context
Uncertainty quantification in reasoning tasks
Source attribution and fact verification assistance

Ethical Considerations

Intended Use

This model is designed to improve epistemic reasoning in AI systems by:

Reducing hallucinations through appropriate refusal
Improving source attribution and transparency
Calibrating confidence levels appropriately
Explicitly tracking unknowns

Potential Misuse

Over-reliance: Users may trust model outputs without verification
Context Manipulation: Malicious actors could provide false observations
Confidence Miscalibration: A reported confidence level is the model's own estimate, not a guarantee of correctness

Mitigation

Always verify critical claims against authoritative sources
Use model as decision support, not final arbiter
Review evidence and unknown fields for completeness
Combine with human expertise for important decisions
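
One way to put these mitigations into practice is to route uncertain answers to a person before they are acted on. The policy below is a hypothetical example, not something the model or its training enforces: it flags any response whose confidence is not "high" or "medium", or whose unknown list is non-empty.

```python
def needs_human_review(result: dict) -> bool:
    """Hypothetical triage policy for parsed DIEGETIC responses."""
    confidence = result.get("confidence")
    has_unknowns = bool(result.get("unknown"))
    # Anything below "medium" (including missing or malformed values) is escalated.
    return confidence not in ("high", "medium") or has_unknowns


second_hand = {
    "claims": ["Sarah said the meeting was cancelled"],
    "confidence": "low",
    "evidence": ["Sarah stated cancellation"],
    "unknown": ["Meeting actual status"],
}
print(needs_human_review(second_hand))  # True
```

The thresholds are a design choice; a stricter deployment might escalate "medium" answers as well.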

Training Details

Dataset Generation

The training dataset consists of 20,000 synthetic examples across 7 categories:

Observation Use (5,000): Extract and use information from observations
Source Attribution (4,000): Properly cite sources with confidence
Complete Refusal (3,000): Refuse unknowable questions appropriately
Graduated Uncertainty (3,000): Use calibrated confidence levels
Reasoning from Knowledge (2,500): Apply general knowledge appropriately
Multi-step Reasoning (1,500): Perform multi-hop inference
Partial Information (1,000): Handle incomplete information

Training Procedure

Base model: Qwen/Qwen2.5-1.5B
LoRA adapters added to all attention and MLP layers
Trained for 3 epochs on 20K examples
Observations are written in natural language rather than nested JSON, to reduce format errors
Chat format support for flexible input styles

Key Innovations

Natural language observations: Uses the "What you observed:" format instead of structured JSON input
Explicit unknown tracking: The model is trained to list what it doesn't know
Confidence calibration: Varied examples with appropriate confidence levels
Refusal training: 15% of dataset focused on unknowable scenarios

Citation

If you use this model in your research, please cite:

@misc{diegetic2026,
  title={DIEGETIC: Epistemic Reasoning Language Model},
  author={Howell, Justin},
  year={2026},
  url={https://huggingface.co/howellx/diegetic-v2-qwen2.5-1.5b-lora},
  note={Production Release}
}

License

This model is released under the Apache 2.0 License.

The base model (Qwen/Qwen2.5-1.5B) is also released under the Apache 2.0 License.

Acknowledgments

Base Model: Qwen Team for Qwen2.5-1.5B
Training Framework: HuggingFace Transformers, TRL, PEFT

Model Status: Production Ready
Last Updated: February 5, 2026
Test Coverage: 20 scenarios, 100% pass rate
