🐉 Geryon-9B-v1 (Experimental Merge)

Geryon-9B-v1 is an experimental DARE-TIES merge built upon the Qwen/Qwen3.5-9B dense architecture.

The primary objective of this project is to explore the viability of layering multiple highly specialized fine-tunes onto a single base model without incurring catastrophic forgetting.

GGUF quantizations are available here:
https://huggingface.co/Tivaphraen/Geryon-9B-v1-GGUF

Model Description

Geryon-9B-v1 is designed primarily for:

  • general reasoning
  • coding assistance
  • agentic workflows
  • long-form text generation

It is a text-generation model built from a merge of multiple Qwen 3.5 9B fine-tunes with complementary specializations.

Model Components

Geryon-9B-v1 combines the following fine-tunes, all of which share the Qwen 3.5 9B base architecture:

  1. empero-ai/Qwythos-9B-Claude-Mythos-5-1M
    A reasoning model fine-tuned on more than 500M tokens of Claude Mythos / Fable traces, with a strong emphasis on Chain-of-Thought (CoT).

  2. empero-ai/Qwable-9B-Claude-Fable-5
    A model fine-tuned on agentic coding and reasoning traces, designed to imitate the terminal-agent tool-use style of Claude Fable 5 and GPT-5.5.

  3. Tesslate/OmniCoder-9B
    A coding-agent model fine-tuned via LoRA on 425K agentic coding trajectories.

Note
The base Qwen 3.5 architecture is inherently multimodal. However, vision-related components were explicitly removed or disabled in this merge's config.json to ensure clean execution in text-only environments such as llama.cpp.
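The note above does not list which entries were removed or disabled. As a hypothetical aid for checking a downloaded copy, the sketch below scans a config.json for keys whose names mention vision, image or video (for instance a vision_config block of the kind used in earlier Qwen-VL configs). The keyword list and the function names are our own assumptions. A match only flags an entry for manual review; it does not prove that a vision tower is active.

```python
import json


def find_vision_keys(config: dict, needles=("vision", "image", "video"), prefix: str = "") -> list:
    """Return dotted paths of config keys whose names mention any needle (hypothetical heuristic)."""
    hits = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if any(needle in key.lower() for needle in needles):
            hits.append(path)  # report the matching key itself; no need to descend further
        elif isinstance(value, dict):
            hits.extend(find_vision_keys(value, needles, path))
    return hits


def check_config_file(path: str) -> list:
    """Load a local config.json and list vision-looking entries (empty list = none found)."""
    with open(path, encoding="utf-8") as handle:
        return find_vision_keys(json.load(handle))
```

For example, `check_config_file("config.json")` run on a locally downloaded copy should return an empty list if no vision-related keys remain.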

Merge Configuration

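The merge was performed using mergekit with the dare_ties method, which combines DARE sparsification with TIES sign consensus. A uniform density of 0.5 was applied. For each fine-tune, roughly half of the delta parameters (the differences from the base weights) are dropped at random rather than by magnitude, and the survivors are rescaled by 1/density. TIES-style sign election then resolves conflicts between the models.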

models:
  - model: empero-ai/Qwythos-9B-Claude-Mythos-5-1M
    parameters:
      density: 0.5
      weight: 0.4

  - model: empero-ai/Qwable-9B-Claude-Fable-5
    parameters:
      density: 0.5
      weight: 0.3

  - model: Tesslate/OmniCoder-9B
    parameters:
      density: 0.5
      weight: 0.3

merge_method: dare_ties
base_model: Qwen/Qwen3.5-9B
dtype: bfloat16
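To make the density and weight parameters concrete, here is a simplified pure-Python sketch of DARE-TIES on flat toy vectors. It is our own illustration of the technique, not mergekit's implementation, which works per tensor and exposes further options. The steps are: random dropping with 1/density rescaling, then TIES sign election, then a disjoint merge normalized by the surviving weights. The final normalization mirrors what we understand to be mergekit's default behavior.

```python
import random


def dare_ties_merge(base, finetunes, weights, density, seed=0):
    """Simplified DARE-TIES merge of flat parameter lists (illustration only)."""
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    rng = random.Random(seed)
    # 1) Task vectors: how far each fine-tune moved away from the base.
    deltas = [[f - b for f, b in zip(ft, base)] for ft in finetunes]
    # 2) DARE: keep each entry with probability `density` (chosen at random,
    #    not by magnitude) and rescale survivors by 1/density.
    sparse = [
        [d / density if rng.random() < density else 0.0 for d in delta]
        for delta in deltas
    ]
    merged = []
    for j, b in enumerate(base):
        contrib = [w * s[j] for w, s in zip(weights, sparse)]
        # 3) TIES sign election: the direction with the larger summed weighted mass wins.
        elected = 1.0 if sum(contrib) >= 0 else -1.0
        # 4) Disjoint merge: drop contributions that disagree with the elected sign.
        kept = [(c, w) for c, w in zip(contrib, weights) if c * elected > 0]
        # 5) Normalize by the total weight of the models that survived at this entry.
        total_w = sum(w for _, w in kept)
        merged.append(b + (sum(c for c, _ in kept) / total_w if total_w else 0.0))
    return merged
```

With three toy fine-tune vectors, a call such as `dare_ties_merge(base, [qwythos, qwable, omnicoder], weights=[0.4, 0.3, 0.3], density=0.5)` mirrors the shape of the configuration above. The variable names are placeholders.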

Evaluation Results

This model has not yet undergone comprehensive benchmarking.

Initial local testing using EleutherAI's lm-evaluation-harness (via the llama.cpp backend on Windows) yielded the following results on the Q8_0 GGUF quantization:

| Benchmark | Setting | Metric | Value | Stderr |
|---|---|---|---|---|
| GSM8K | 5-shot, flexible | exact_match | 0.8514 | (not reported) |
| GSM8K | 5-shot, strict | exact_match | 0.8506 | ± 0.0098 |

Important note: these scores were measured on the Q8_0 GGUF quantization during local inference, not on the raw Safetensors checkpoint. They should therefore be interpreted as practical deployment results rather than pure full-precision reference numbers.

For context, one of the main parent models, empero-ai/Qwythos-9B-Claude-Mythos-5-1M, reports the GSM8K results below. The local Geryon-9B-v1 results are added alongside them for direct comparison:

| Task | Metric | Base Qwen3.5-9B | Qwythos-9B | Geryon-9B-v1 (Q8_0 GGUF, local) | Δ vs Base | Δ vs Qwythos |
|---|---|---|---|---|---|---|
| gsm8k | exact_match (flexible) | 0.6700 | 0.8600 | 0.8514 | +0.1814 | -0.0086 |
| gsm8k | exact_match (strict) | 0.5100 | 0.8100 | 0.8506 | +0.3406 | +0.0406 |

These results suggest that Geryon-9B-v1 stays very close to Qwythos on flexible GSM8K scoring, while its strict-match score from the current local run slightly exceeds the published Qwythos figure. However, these comparisons should be interpreted cautiously because:

  • the current Geryon score was obtained on a quantized GGUF
  • backend/runtime differences (llama.cpp vs other stacks) may affect outputs
  • strict and flexible GSM8K scoring are not directly interchangeable

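Taken together, the merge appears to preserve most of the GSM8K reasoning strength inherited from Qwythos. GSM8K does not measure the coding and agentic capabilities contributed by the other parents. For now, those capabilities are supported only by the qualitative examples below.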
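One rough way to read these deltas is to compare them with sampling noise. The sketch below assumes the full GSM8K test split (1,319 problems) and uses the sample standard error sqrt(p(1-p)/(n-1)). That formula reproduces the reported ±0.0098 for the strict score. The sample size behind the published parent numbers is not given here, so n2 is an explicit assumption. Treating the two runs as independent is also a simplification: both evaluate the same questions, so a paired test on per-item results would be more precise.

```python
import math

GSM8K_TEST_SIZE = 1319  # number of problems in the GSM8K test split


def binomial_stderr(p: float, n: int) -> float:
    """Sample standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1.0 - p) / (n - 1))


def diff_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """Approximate z-score of p1 - p2, treating the two runs as independent samples."""
    se = math.sqrt(binomial_stderr(p1, n1) ** 2 + binomial_stderr(p2, n2) ** 2)
    return (p1 - p2) / se
```

`binomial_stderr(0.8514, 1319)` is also about 0.0098, a plausible estimate for the missing flexible stderr. `diff_z(0.8506, 1319, 0.81, 1319)` is roughly 2.8. However, the published parent figures are given only to two decimals and their sample size is unknown. If they came from, say, a 100-question subset (a purely hypothetical figure), `diff_z(0.8506, 1319, 0.81, 100)` drops to about 1.0, which is within noise.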

Further standardized testing (such as MMLU, HumanEval, and GPQA) could not yet be completed reliably in the author's local environment due to API logprob format incompatibilities and sandbox restrictions.

Call to the community:
Users with Linux environments or robust evaluation pipelines are invited to benchmark this model (MMLU, HumanEval, GPQA, etc.) and share results in the Community tab.

🔥 Generation Examples (Vibe Check)

To complement the benchmark section, here are unedited qualitative outputs from local testing using the Q8_0 GGUF quantization.

These examples are intended as a capability snapshot, not as a substitute for standardized evaluation.

1. Multilingual & Instruction Following (French / English)

Prompt:
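Je suis un touriste qui visite Paris. Rédige-moi un poème de 4 lignes sur la Tour Eiffel. Le poème doit être en français parfait, mais tu dois expliquer tes choix de rimes en anglais juste en dessous.
(English translation: I am a tourist visiting Paris. Write me a 4-line poem about the Eiffel Tower. The poem must be in perfect French, but you must explain your rhyme choices in English just below it.)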

Output:

Au sommet où les oiseaux s'envolent haut,
L'éclat doré danse sous le ciel d'hiver,
Paris respire un air doux et si profond,
Où la nuit devient douce comme un rêve.

**Rhyme Scheme Explanation:**
This poem follows an **AABB** rhyme scheme:
- *Lines 1 and 2 share the rhyme* *"haut"* (high) and *"hiver"* (winter). While they are not perfect rhymes in strict classical French, "haut" and "hiver" create a near-rhyme where the final vowel sounds /o/ and /e/ differ slightly but maintain a rhythmic flow typical of modern free verse.
- *Lines 3 and 4 share the rhyme* *"profond"* (deep) and *"rêve"* (dream). These are slant rhymes; "fond" ends with an open 'on' sound, while "rêve" ends with a closed 'ev' sound. In French poetry, this type of imperfect or assonantal rhyme is often used to create a softer, more lyrical atmosphere without the rigidity of perfect consonance.

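Analysis: Strong bilingual separation and format compliance: the poem is in French and the rhyme explanation follows in English, as requested. The Qwen multilingual base appears well preserved through the merge. The poetry itself is weaker than the explanation claims. "haut"/"hiver" and "profond"/"rêve" do not rhyme in standard French, so the stated AABB scheme is not actually realized. The explanation rationalizes this rather than acknowledging it.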

2. Logical Reasoning (Mythos Heritage)

Prompt:
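Dans une corbeille de fruits, il y a 3 pommes, 2 bananes, et 1 orange. Je retire toutes les pommes. Ensuite, j'ajoute un nombre de bananes égal au double des oranges restantes. Combien de fruits y a-t-il dans la corbeille au total maintenant ? Explique ta réflexion étape par étape avant de donner la réponse.
(English translation: A fruit basket contains 3 apples, 2 bananas and 1 orange. I remove all the apples. Then I add a number of bananas equal to twice the remaining oranges. How many fruits are in the basket in total now? Explain your reasoning step by step before giving the answer.)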

Output:

Voici le raisonnement étape par étape pour résoudre ce problème :

1. Analyse du contenu initial : La corbeille contient initialement 3 pommes, 2 bananes et 1 orange.
2. Première action (Retirer les pommes) : Tu retires toutes les pommes. Il ne reste donc plus aucune pomme dans la corbeille.
   - Pommes restantes : 0
   - Bananes restantes : 2
   - Oranges restantes : 1
3. Deuxième action (Ajout de bananes) : Tu ajoutes un nombre de bananes égal au double des oranges restantes.
   - Nombre d'oranges restantes = 1.
   - Le double de ce nombre est 1 × 2 = 2.
   - Donc, tu ajoutes 2 nouvelles bananes à celles qui étaient déjà présentes.
4. Calcul du total final : Maintenant, additionnons tous les fruits présents dans la corbeille :
   - Pommes : 0
   - Bananes : 2 (celles d'origine) + 2 (celles ajoutées) = 4 bananes au total.
   - Oranges : 1 (aucune n'a été retirée).
   - Total = 0 + 4 + 1 = 5.

Il y a maintenant 5 fruits dans la corbeille au total.

Analysis: Clean step-by-step reasoning with a correct final answer, suggesting good retention of structured CoT-style behavior.
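The arithmetic in this prompt is simple enough to replay directly. This small sketch reproduces the model's steps and its final count. The English key names are chosen for illustration.

```python
def apply_steps(basket: dict) -> dict:
    """Replay the two operations from the prompt on a copy of the basket."""
    result = dict(basket)
    result["apples"] = 0                        # step 1: remove every apple
    result["bananas"] += 2 * result["oranges"]  # step 2: add twice the remaining oranges
    return result


start = {"apples": 3, "bananas": 2, "oranges": 1}
end = apply_steps(start)
print(end, sum(end.values()))  # {'apples': 0, 'bananas': 4, 'oranges': 1} 5
```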

3. Robust Python Coding (OmniCoder Heritage)

Prompt:
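Écris une petite fonction Python appelée fetch_bitcoin_price. Elle doit interroger l'API publique CoinGecko pour obtenir le prix du Bitcoin en USD. Gère les erreurs de connexion proprement avec un bloc try/except, et ajoute des commentaires comme un développeur senior.
(English translation: Write a small Python function called fetch_bitcoin_price. It should query the public CoinGecko API to get the price of Bitcoin in USD. Handle connection errors cleanly with a try/except block, and add comments like a senior developer would.)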

Output:

import requests

def fetch_bitcoin_price():
    """
    Récupère le prix actuel du Bitcoin en USD depuis l'API publique CoinGecko.
    Utilise la fonctionnalité 'simple' de l'API pour obtenir uniquement
    les données JSON essentielles, réduisant ainsi la latence et le volume des données.
    
    Retourne un dictionnaire contenant :
      - "price_usd": le prix du Bitcoin en dollars américains (float)
      - "last_updated": timestamp ISO 8601 de la dernière mise à jour
    
    En cas d'échec (connexion perdue, erreur HTTP ou JSON invalide),
    renvoie un dictionnaire avec une valeur None et une clé "error" contenant
    le message d'erreur.
    Ce pattern évite que l'appel principal ne plante silencieusement en cas de problème réseau.
    """
    
    # URL publique stable pour le prix simple du Bitcoin (USD) via CoinGecko API v3
    url = "https://api.coingecko.com/api/v3/simple/price"
    params = {
        "ids": "bitcoin",          # identifier l'actif : bitcoin
        "vs_currencies": "usd"     # devise de référence : dollars américains
    }
    headers = {
        "Accept": "application/json",   # préciser le format attendu (best practice)
        "User-Agent": "Python-Bitcoin-Price/1.0"  # respecter les politiques d'usage de l'API
    }

    try:
        # Exécute la requête GET avec un timeout raisonnable (5 secondes).
        response = requests.get(url, params=params, headers=headers, timeout=5)
        
        # Vérifie le statut HTTP : 200 OK signifie que la réponse est valide.
        response.raise_for_status()
        
        # Décode le JSON contenu dans la réponse.
        data = response.json()
        
        # L'API renvoie un dict dont les clés sont les IDs et les valeurs sont des dicts
        # avec les devises en tant que clés. On extrait donc data['bitcoin']['usd'].
        price_usd = float(data["bitcoin"]["usd"])
        
        return {
            "price_usd": price_usd,
            "last_updated": response.headers.get("Last-Modified", 
                                                # Fallback si le header n'est pas présent
                                                data.get("timestamp"))  # CoinGecko ne fournit pas toujours ce champ dans simple/price
        }

    except requests.exceptions.RequestException as e:
        # Capture toute exception liée aux requêtes HTTP (connexion, timeout, DNS, etc.)
        return {
            "price_usd": None,
            "error": f"Erreur de connexion : {type(e).__name__} - {e}"
        }
    except KeyError as e:
        # Gère le cas où la structure du JSON attendu n'est pas présente (API changée ou réponse vide)
        return {
            "price_usd": None,
            "error": f"Données inattendues dans la réponse API : clé manquante '{e.args[0]}'"
        }
    except ValueError as e:
        # Gère le cas où response.json() échoue (réponse non-JSON ou corrompue)
        return {
            "price_usd": None,
            "error": f"Réponse API invalide (pas du JSON) : {e}"
        }
    except Exception as e:
        # Catch-all pour tout autre imprévu (probablement très rare ici).
        # On le loggue implicitement via le message d'erreur.
        return {
            "price_usd": None,
            "error": f"Exception inattendue : {type(e).__name__} - {e}"
        }

# ==================== Exemple d'utilisation ====================
if __name__ == "__main__":
    result = fetch_bitcoin_price()
    if result["price_usd"] is not None:
        print(f"Le prix du Bitcoin est de ${result['price_usd']:.2f} USD")
    else:
        print(f"Impossible d'obtenir le prix : {result['error']}")

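Analysis: Clear structure, detailed comments, and deliberate exception handling, consistent with an agentic coding fine-tune. Some details do not hold up on close reading. The docstring promises an ISO 8601 timestamp, but the Last-Modified header uses the HTTP-date format, and the simple/price payload has no top-level timestamp field. In addition, with requests 2.27 or later, a failing response.json() raises requests.exceptions.JSONDecodeError, which subclasses RequestException. That error is therefore reported by the connection-error branch instead of the ValueError branch.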
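For comparison, here is a hedged alternative sketch; it is not a replacement for the quoted model output. It swaps requests for the standard-library urllib so it has no dependencies, and it keeps JSON errors separate from network errors. It requests the update time through CoinGecko's include_last_updated_at query parameter, which we assume returns a per-coin last_updated_at UNIX timestamp as described in CoinGecko's API documentation. The error-dict return convention is kept from the original, and parse_simple_price is a helper name of our own.

```python
import json
import urllib.parse
import urllib.request

# Public CoinGecko v3 endpoint used in the quoted output.
API_URL = "https://api.coingecko.com/api/v3/simple/price"


def parse_simple_price(payload: object, coin: str = "bitcoin", currency: str = "usd") -> dict:
    """Extract the price (and optional update time) from a /simple/price payload."""
    entry = payload.get(coin) if isinstance(payload, dict) else None
    if not isinstance(entry, dict) or currency not in entry:
        return {"price_usd": None, "error": f"missing '{coin}.{currency}' in response"}
    try:
        price = float(entry[currency])
    except (TypeError, ValueError):
        return {"price_usd": None, "error": f"non-numeric price: {entry[currency]!r}"}
    # Assumed to be present only when include_last_updated_at=true was requested.
    return {"price_usd": price, "last_updated_at": entry.get("last_updated_at")}


def fetch_bitcoin_price(timeout: float = 5.0) -> dict:
    """Fetch BTC/USD, returning an error dict instead of raising."""
    query = urllib.parse.urlencode(
        {"ids": "bitcoin", "vs_currencies": "usd", "include_last_updated_at": "true"}
    )
    request = urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"Accept": "application/json", "User-Agent": "Python-Bitcoin-Price/1.0"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
    except OSError as exc:
        # URLError, HTTPError (non-2xx status) and socket timeouts all derive from OSError.
        return {"price_usd": None, "error": f"network error: {type(exc).__name__}: {exc}"}
    try:
        payload = json.loads(body)
    except ValueError as exc:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return {"price_usd": None, "error": f"invalid JSON: {exc}"}
    return parse_simple_price(payload)
```

Calling fetch_bitcoin_price() performs a live request. The parsing step can be checked offline: `parse_simple_price({"bitcoin": {"usd": 65000}})` returns `{'price_usd': 65000.0, 'last_updated_at': None}`.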

4. Strict JSON / Agentic Formatting (Fable Heritage)

Prompt:
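Tu es un agent d'analyse de texte. Lis cette phrase : 'La Tesla Model 3 de Jean, achetée en 2021, a coûté 45000 euros.' Extrais les informations sous forme de JSON strict. Les clés doivent être : marque, modele, annee_achat, prix, devise, proprietaire. Ne génère aucun autre texte que le bloc JSON.
(English translation: You are a text-analysis agent. Read this sentence: 'Jean's Tesla Model 3, bought in 2021, cost 45000 euros.' Extract the information as strict JSON. The keys must be: marque, modele, annee_achat, prix, devise, proprietaire (brand, model, purchase year, price, currency, owner). Do not generate any text other than the JSON block.)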

Output:

{"marque": "Tesla", "modele": "Model 3", "annee_achat": 2021, "prix": 45000, "devises": "euros", "proprietaire": "Jean"}

Analysis: Excellent JSON-only discipline with zero conversational bleed, but not perfect schema adherence: the model returned devises instead of the requested devise.
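A key mismatch like devises/devise is easy to catch automatically. Below is a minimal sketch of such a check; the function name and result format are our own. It relies on json.loads, which rejects any non-whitespace text before or after the JSON value, so it also enforces the "JSON only" requirement.

```python
import json

REQUIRED_KEYS = {"marque", "modele", "annee_achat", "prix", "devise", "proprietaire"}


def check_keys(raw: str, required=REQUIRED_KEYS) -> dict:
    """Parse a model reply as strict JSON and compare its keys to the requested schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"valid_json": False, "error": str(exc)}
    if not isinstance(obj, dict):
        # Valid JSON, but not an object: every requested key is missing.
        return {"valid_json": True, "missing": sorted(required), "extra": []}
    keys = set(obj)
    return {"valid_json": True, "missing": sorted(required - keys), "extra": sorted(keys - required)}
```

Applied to the output above, check_keys reports `{'valid_json': True, 'missing': ['devise'], 'extra': ['devises']}`.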

Intended Use

This model is intended for:

  • chat and assistant-style interaction
  • coding support
  • reasoning-heavy tasks
  • experimentation with merged specialist models
  • local inference and research workflows

Out-of-Scope Use

This model is not recommended for:

  • safety-critical applications
  • autonomous execution of high-risk instructions
  • legal, medical, or financial advice without human review
  • production environments requiring stable refusal behavior or rigorous alignment guarantees

Prompt Template (ChatML)

Like its Qwen 3.5 parents, Geryon-9B-v1 uses the ChatML prompt format.

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Write a python script to fetch Bitcoin prices.<|im_end|>
<|im_start|>assistant
<think>
Reasoning process goes here...
</think>

Final answer goes here.<|im_end|>
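For readers assembling prompts by hand, the sketch below builds a ChatML string and splits a completion's think block from the final answer. It is illustrative only. In practice, prefer the chat template bundled with the tokenizer (for example via transformers' tokenizer.apply_chat_template), because that template defines details this sketch does not reproduce, such as how the think block is opened.

```python
def build_chatml_prompt(messages: list, add_generation_prompt: bool = True) -> str:
    """Render [{'role': ..., 'content': ...}, ...] as ChatML turns."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)


def split_reasoning(completion: str) -> tuple:
    """Split '<think>...</think>answer<|im_end|>' into (reasoning, answer)."""
    head, sep, tail = completion.partition("</think>")
    if not sep:
        # No reasoning block: the whole completion is the answer.
        return "", completion.split("<|im_end|>", 1)[0].strip()
    reasoning = head.replace("<think>", "", 1).strip()
    answer = tail.split("<|im_end|>", 1)[0].strip()
    return reasoning, answer
```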

Recommended Inference Settings

To better leverage the reasoning traces inherited from the parent models, the following sampling parameters are generally recommended:

  • Temperature: 0.6
  • Top-P: 0.95
  • Top-K: 20
  • Repetition Penalty: 1.05 to 1.10
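These settings can be passed to common Python runtimes roughly as follows. This is a hedged sketch. The keyword names (repetition_penalty for transformers' generate(), repeat_penalty for llama-cpp-python's create_chat_completion()) reflect our reading of those libraries' documentation and should be checked against the installed versions. The helper functions are our own.

```python
RECOMMENDED_SAMPLING = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repetition_penalty": 1.05,  # recommended range: 1.05 to 1.10
}


def with_repetition_penalty(settings: dict, value: float) -> dict:
    """Return a copy using a repetition penalty inside the recommended 1.05-1.10 band."""
    if not 1.05 <= value <= 1.10:
        raise ValueError(f"repetition penalty {value} is outside the recommended 1.05-1.10 range")
    return {**settings, "repetition_penalty": value}


def for_transformers(settings: dict) -> dict:
    """Keyword arguments for transformers' model.generate(); sampling must be enabled."""
    return {"do_sample": True, **settings}


def for_llama_cpp_python(settings: dict) -> dict:
    """Keyword arguments for llama-cpp-python's Llama.create_chat_completion()."""
    out = {k: v for k, v in settings.items() if k != "repetition_penalty"}
    out["repeat_penalty"] = settings["repetition_penalty"]
    return out
```

Typical calls would be `model.generate(**inputs, max_new_tokens=1024, **for_transformers(RECOMMENDED_SAMPLING))` or `llm.create_chat_completion(messages=messages, **for_llama_cpp_python(RECOMMENDED_SAMPLING))`.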

Limitations, Risks, and Biases

  • Experimental nature:
    Because Geryon-9B-v1 is a DARE-TIES merge of multiple domain-specific models, it may behave unexpectedly, for example through formatting issues, prompt bleeding, or inconsistent style transfer between merged capabilities.

  • Uncensored heritage:
    At least one parent model (Qwythos) was trained on uncensored data. As a result, Geryon-9B-v1 may not reliably refuse instructions related to sensitive or unsafe topics, including offensive cybersecurity content.

  • Hallucinations:
    Like other models at the 9B scale, it may confidently generate inaccurate or fabricated information, especially outside its stronger domains such as coding, logic, and structured reasoning.

  • Alignment instability from merging:
    Merge-based models can exhibit uneven alignment behavior across prompts, languages, and tasks.

Downstream users are responsible for implementing appropriate safety filters, moderation layers, and domain-specific guardrails.

License and Provenance

The model weights are released under the Apache-2.0 license, inherited from the Qwen3.5-9B base.

However, users should be aware that the fine-tuning data of the parent models may originate from generated traces of proprietary assistants (including Claude 5, GPT-5.5, and Gemini 3.1). If you plan to build on this model commercially, you should verify that your use aligns with the respective providers' Terms of Service regarding distillation, synthetic traces, and derivative training data.
