+---
+license: other
+license_name: aigency-commercial
+license_link: https://aigency.dev/license
+language:
+- tr
+- en
+library_name: aigency-api
+pipeline_tag: text-generation
+tags:
+- turkish
+- multimodal
+- sovereign
+- frontier-adjacent
+- aigency
+- ecloud
+- production
+inference: false
+extra_gated_heading: AIGENCY V4 is offered via API
+extra_gated_description: |
+  Model weights are not distributed on HuggingFace. AIGENCY V4 is accessible
+  via the eCloud production API at https://aigency.dev. This page is a
+  reference card describing architecture, evaluation methodology, and
+  benchmark results, and links to a live demo Space.
+model-index:
+- name: AIGENCY V4
+  results:
+  - task:
+      type: text-generation
+      name: Code generation
+    dataset:
+      type: openai_humaneval
+      name: HumanEval (pass@1)
+    metrics:
+    - type: pass@1
+      value: 84.15
+      name: pass@1
+      verified: false
+  - task:
+      type: text-generation
+      name: Code generation extended
+    dataset:
+      type: humaneval-plus
+      name: HumanEval+ (pass@1)
+    metrics:
+    - type: pass@1
+      value: 79.88
+      name: pass@1
+      verified: false
+  - task:
+      type: text-generation
+      name: Code generation
+    dataset:
+      type: mbpp
+      name: MBPP (sanitized)
+    metrics:
+    - type: pass@1
+      value: 84.82
+      name: pass@1
+      verified: false
+  - task:
+      type: text-generation
+      name: Code generation extended
+    dataset:
+      type: mbpp-plus
+      name: MBPP+
+    metrics:
+    - type: pass@1
+      value: 78.04
+      name: pass@1
+      verified: false
+  - task:
+      type: text-generation
+      name: Mathematical reasoning
+    dataset:
+      type: gsm8k
+      name: GSM8K
+    metrics:
+    - type: accuracy
+      value: 94.62
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Multitask language understanding
+    dataset:
+      type: cais/mmlu
+      name: MMLU (stratified n=1000)
+    metrics:
+    - type: accuracy
+      value: 80.10
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Multitask language understanding (Pro)
+    dataset:
+      type: TIGER-Lab/MMLU-Pro
+      name: MMLU-Pro (n=1000)
+    metrics:
+    - type: accuracy
+      value: 50.20
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Scientific reasoning
+    dataset:
+      type: ai2_arc
+      name: ARC-Challenge
+    metrics:
+    - type: accuracy
+      value: 94.88
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Graduate-level QA
+    dataset:
+      type: idavidrein/gpqa
+      name: GPQA Diamond
+    metrics:
+    - type: accuracy
+      value: 37.88
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Truthfulness
+    dataset:
+      type: truthful_qa
+      name: TruthfulQA MC1
+    metrics:
+    - type: accuracy
+      value: 76.38
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Instruction following
+    dataset:
+      type: google/IFEval
+      name: IFEval (strict)
+    metrics:
+    - type: accuracy
+      value: 80.22
+      name: strict-prompt-level
+      verified: false
+  - task:
+      type: text-generation
+      name: Commonsense reasoning
+    dataset:
+      type: hellaswag
+      name: HellaSwag (n=1000)
+    metrics:
+    - type: accuracy
+      value: 88.60
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Coreference reasoning
+    dataset:
+      type: winogrande
+      name: WinoGrande XL
+    metrics:
+    - type: accuracy
+      value: 74.66
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Turkish reading comprehension
+    dataset:
+      type: facebook/belebele
+      name: Belebele-TR (Turkish)
+    metrics:
+    - type: accuracy
+      value: 87.33
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Turkish extractive QA
+    dataset:
+      type: tquad
+      name: TQuAD (F1 ≥ 0.5)
+    metrics:
+    - type: f1
+      value: 82.40
+      name: F1 ≥ 0.5
+      verified: false
+  - task:
+      type: text-generation
+      name: Turkish multitask understanding
+    dataset:
+      type: tr-mmlu
+      name: TR-MMLU
+    metrics:
+    - type: accuracy
+      value: 70.80
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Turkish natural-language inference
+    dataset:
+      type: xnli
+      name: XNLI-TR
+    metrics:
+    - type: accuracy
+      value: 73.40
+      name: accuracy
+      verified: false
+  - task:
+      type: text-generation
+      name: Turkish grammar
+    dataset:
+      type: tr-grammar-synthetic
+      name: TR Grammar (synthetic 50/50)
+    metrics:
+    - type: accuracy
+      value: 79.00
+      name: accuracy
+      verified: false
+  - task:
+      type: image-text-to-text
+      name: Multimodal QA
+    dataset:
+      type: MMMU
+      name: MMMU (val, n=30)
+    metrics:
+    - type: accuracy
+      value: 53.33
+      name: accuracy
+      verified: false
+  - task:
+      type: image-text-to-text
+      name: Chart QA
+    dataset:
+      type: HuggingFaceM4/ChartQA
+      name: ChartQA (relaxed)
+    metrics:
+    - type: accuracy
+      value: 67.68
+      name: relaxed accuracy
+      verified: false
+  - task:
+      type: image-text-to-text
+      name: Document QA
+    dataset:
+      type: lmms-lab/DocVQA
+      name: DocVQA (ANLS ≥ 0.5)
+    metrics:
+    - type: accuracy
+      value: 79.17
+      name: ANLS ≥ 0.5
+      verified: false
+  - task:
+      type: image-text-to-text
+      name: Visual mathematical reasoning
+    dataset:
+      type: AI4Math/MathVista
+      name: MathVista (testmini)
+    metrics:
+    - type: accuracy
+      value: 34.13
+      name: accuracy
+      verified: false
+---
+# AIGENCY V4
+> **Sovereign, fully independent, multimodal — 128B parameters.**
+> A globally competitive Turkish-first AI model: world-leading on Turkish
+> reading comprehension and natural-language inference, frontier-level on
+> grade-school math and scientific reasoning, KVKK-resident.
+[**🇹🇷 Türkçe README**](#türkçe) · [**🇬🇧 English README**](#english) · [**📄 Whitepaper (EN)**](https://github.com/ecloud-bh/aigency-v4-whitepaper/blob/main/AIGENCY-V4-Whitepaper-EN.pdf) · [**📄 Whitepaper (TR)**](https://github.com/ecloud-bh/aigency-v4-whitepaper/blob/main/AIGENCY-V4-Whitepaper-TR.pdf) · [**🌐 Try the demo**](https://huggingface.co/spaces/aigencydev/AIGENCY-V4-Demo) · [**🔗 API**](https://aigency.dev)
+---
+## English
+### Model summary
+**AIGENCY V4** is the multimodal successor to AIGENCY V3, developed by
+**eCloud Yazılım Teknolojileri** and released to production in Q2 2026.
+The model retains V3's four sovereignty principles — zero external parameter
+dependency, sovereign data residency, transparent architectural documentation,
+and Turkish morphological context fidelity — and adds a sovereign 8B-parameter
+vision encoder for image, document, chart, and visual-math understanding.
+| | |
+|---|---|
+| **Total parameters** | 128B (120B core + 8B vision encoder) |
+| **Architecture** | Sovereign decoder-only transformer + side vision encoder |
+| **Optimisations** | Adaptive LoRA+, Selective Layer Collapse, Localised MoE, 4-bit block quantization, chunked attention |
+| **Context window** | 278K tokens (HBM 3-tier: STM 4k / ITM 64k / LTM 278k) |
+| **Active inference memory** | ~6.5 GB GPU under 4-bit quant |
+| **Languages** | Turkish (primary), English |
+| **Modalities** | Text, image (one image per request, 30 MB max, image/* MIME) |
+| **Release version** | 1.0 production |
+| **Release date** | April 2026 |
+| **Licence** | API-only commercial — see https://aigency.dev/license |
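The modality row above states three request constraints: one image per request, 30 MB maximum, and an `image/*` MIME type. A minimal client-side pre-check might look like the sketch below. The names `check_image`, `check_request`, and `MAX_IMAGE_BYTES` are hypothetical and not part of any AIGENCY SDK, and reading "30 MB" as 30 × 1024 × 1024 bytes is an assumption, since the card does not say whether the limit is decimal or binary.

```python
import mimetypes
import os

# Assumption: "30 MB" means 30 * 1024 * 1024 bytes (not stated in the card).
MAX_IMAGE_BYTES = 30 * 1024 * 1024


def check_image(path: str) -> tuple[bool, str]:
    """Return (ok, reason) for the MIME and size constraints on one file."""
    # MIME type is guessed from the file extension only, not the content.
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        return False, f"not an image/* MIME type: {mime}"
    size = os.path.getsize(path)
    if size > MAX_IMAGE_BYTES:
        return False, f"file is {size} bytes, above the 30 MB limit"
    return True, "ok"


def check_request(paths: list[str]) -> tuple[bool, str]:
    """Enforce one image per request, then validate that image if present."""
    if len(paths) > 1:
        return False, "only one image per request is allowed"
    if not paths:
        return True, "text-only request"
    return check_image(paths[0])
```

The check is extension-based, so it only catches obvious mistakes before a request is sent; whatever validation the API itself performs remains authoritative.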
+### Distribution
+**Weights are not distributed.** AIGENCY V4 is accessed exclusively through
+the eCloud production API at `https://aigency.dev/api/v2`. This page provides
+the architectural specification, the evaluation methodology, and the full
+benchmark results. To try the model interactively, use the
+[demo Space](https://huggingface.co/spaces/aigencydev/AIGENCY-V4-Demo). For
+production access, see [aigency.dev](https://aigency.dev).
+### Evaluation
+A comprehensive single-session evaluation was conducted on **27 April 2026**
+against the production API: **13,344 real API calls** across **22 distinct
+benchmarks**. Every result is reported with a Wilson 95% confidence interval
+and an open dataset identifier; subsampled benchmarks use deterministic
+subsampling (seed=42).
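For readers who want to check the intervals in the tables, the Wilson score interval can be computed in a few lines. This is a sketch using the standard formula with z = 1.96; the function name `wilson_interval` is illustrative and is not taken from the benchmark harness. The count of 138 passes is inferred from the reported HumanEval accuracy of 0.8415 on n = 164.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# HumanEval: 138 passes out of 164 problems (138 / 164 = 0.8415).
low, high = wilson_interval(138, 164)
print(f"[{low:.3f}, {high:.3f}]")  # prints "[0.778, 0.889]"
```

The result agrees with the HumanEval interval reported in the Tier 1 table. Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small n, which matters for the smaller multimodal subsamples.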
+#### Tier 1 — Critical benchmarks (full set)
+| Benchmark | Accuracy | Wilson 95% CI | n | Errors |
+|---|---|---|---|---|
+| HumanEval (pass@1) | **0.8415** | [0.778, 0.889] | 164/164 | 0 |
+| IFEval (strict) | **0.8022** | [0.767, 0.834] | 541/541 | 1 |
+| GPQA Diamond | 0.3788 | [0.314, 0.448] | 198/198 | 0 |
+| Belebele-TR | **0.8733** | [0.850, 0.893] | 900/900 | 0 |
+| ARC-Challenge | **0.9488** | [0.935, 0.960] | 1172/1172 | 0 |
+| TruthfulQA MC1 | **0.7638** | [0.734, 0.792] | 817/817 | 0 |
+| GSM8K | **0.9462** | [0.933, 0.957] | 1319/1319 | 0 |
+#### Tier 2 — Mid-volume
+| Benchmark | Accuracy | Wilson 95% CI | n |
+|---|---|---|---|
+| MMLU (stratified) | **0.8010** | [0.775, 0.825] | 1000/1000 |
+| MMLU-Pro | 0.5020 | [0.471, 0.533] | 1000/1000 |
+| HellaSwag | **0.8860** | [0.865, 0.904] | 1000/1000 |
+| WinoGrande XL | 0.7466 | [0.722, 0.770] | 1267/1267 |
+| HumanEval+ (extended) | **0.7988** | [0.731, 0.853] | 164/164 |
+| MBPP (sanitized) | **0.8482** | [0.799, 0.887] | 257/257 |
+| MBPP+ | **0.7804** | [0.736, 0.819] | 378/378 |
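Several rows above are subsamples (for example MMLU, stratified, n=1000), and the evaluation text mentions a fixed seed of 42. One way such a reproducible stratified subsample could be drawn is sketched below. This is an assumption about the general technique, not the harness's actual code; `stratified_subsample` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_subsample(items, key, n, seed=42):
    """Pick about n items, proportionally per stratum, reproducibly."""
    rng = random.Random(seed)  # local RNG so global random state is untouched
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    total = len(items)
    picked = []
    for name in sorted(strata):  # sorted so iteration order is stable
        group = strata[name]
        # Proportional allocation; rounding means the total may differ
        # slightly from n when strata sizes do not divide evenly.
        k = min(len(group), round(n * len(group) / total))
        picked.extend(rng.sample(group, k))
    return picked


# Toy data: 400 items spread evenly over 4 hypothetical subjects.
questions = [(i, f"subject-{i % 4}") for i in range(400)]
sample = stratified_subsample(questions, key=lambda q: q[1], n=100)
print(len(sample))
```

Using a dedicated `random.Random(seed)` instance and a sorted stratum order means the same input always yields the same subsample, which is what makes per-item results comparable across runs.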
+#### Tier 3-A — Turkish (V4 is the de-facto global reference)
+| Benchmark | Accuracy | Wilson 95% CI | n |
+|---|---|---|---|
+| Belebele-TR | **0.8733** | [0.850, 0.893] | 900/900 |
+| TQuAD (F1 ≥ 0.5) | **0.8240** | [0.788, 0.855] | 500/500 |
+| TR-MMLU | **0.7080** | [0.667, 0.746] | 500/500 |
+| XNLI-TR | **0.7340** | [0.694, 0.771] | 500/500 |
+| TR Grammar (synthetic) | **0.7900** | [0.700, 0.858] | 100/100 |
+> Frontier models do not consistently publish Turkish-specific scores, so a
+> like-for-like comparison is not available. Among publicly reported results
+> on these benchmarks, AIGENCY V4 is positioned as the **Turkish reference**.
+#### Tier 3-B — Multimodal (first production release)
+| Benchmark | Accuracy | Wilson 95% CI | n |
+|---|---|---|---|
+| MMMU (val) | 0.5333 | [0.361, 0.698] | 30/30 |
+| ChartQA (relaxed) | 0.6768 | [0.634, 0.717] | 492/500 |
+| DocVQA (ANLS ≥ 0.5) | 0.7917 | [0.595, 0.908] | 24 |
+| MathVista (testmini) | 0.3413 | [0.280, 0.408] | 208 |
+### Comparison with frontier (April 2026)
+| Benchmark | AIGENCY V4 | GPT-5 | Claude 4.6/4.7 | Gemini 3 Pro |
+|---|---|---|---|---|
+| GSM8K | **94.62** | 96.8 | ~96 | ~94 |
+| ARC-Challenge | **94.88** | ~96 | ~96 | ~95 |
+| HumanEval | 84.15 | 94.0 | 95.0 | 89.7 |
+| MMLU | 80.10 | 94.2 | 88-93 | 92.4 |
+| MMLU-Pro | 50.20 | ~85 | ~84 | ~81 |
+| GPQA Diamond | 37.88 | 88-94 | 91.3-94.2 | 91.9 |
+| MMMU | 53.33 | 79.1 | 84.1 | — |
+V4 is **at frontier level on grade-school math and scientific reasoning**,
+**upper-mid frontier on code generation**, **lower-mid frontier on general
+academic knowledge and instruction following**, and **in active development
+on graduate-level expert knowledge and multimodal reasoning**. The V4.1
+roadmap (Q4 2026) targets MMLU-Pro 0.65, GPQA Diamond 0.55, and an average
+latency of ≤ 4 s.
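The size of the gap on the two weakest benchmarks can be read directly off the comparison table. The sketch below does that arithmetic in percentage points. The way approximate entries are encoded is an assumption made for illustration: "~85" is treated as 85.0 and "88-94" as the pair (88.0, 94.0). `gap_range` is a hypothetical helper name.

```python
# Scores copied from the comparison table, in percent.
V4 = {"MMLU-Pro": 50.20, "GPQA Diamond": 37.88}

# One (low, high) pair per frontier column; single values repeat the number.
FRONTIER = {
    "MMLU-Pro": [(85.0, 85.0), (84.0, 84.0), (81.0, 81.0)],
    "GPQA Diamond": [(88.0, 94.0), (91.3, 94.2), (91.9, 91.9)],
}


def gap_range(benchmark: str) -> tuple[float, float]:
    """Smallest and largest gap to any listed frontier figure, in pp."""
    lows = [low for low, _ in FRONTIER[benchmark]]
    highs = [high for _, high in FRONTIER[benchmark]]
    own = V4[benchmark]
    return round(min(lows) - own, 2), round(max(highs) - own, 2)


for name in V4:
    print(name, gap_range(name))
```

Because several frontier figures in the table are approximate, the resulting ranges should be read as rough bounds rather than precise differences.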
+### Operational performance (single-session, 27 April 2026)
+- Total API calls: 13,344
+- Persistent error rate: 0.3%
+- Average latency: 9.55 s · p50 4.39 s · p95 32.77 s · p99 33.59 s
+- V4.1 latency target: average ≤ 4 s · p95 ≤ 15 s
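The latency figures above are a mean plus three percentiles. Given the per-call latencies from a run, the same summary can be reproduced with the standard library, as in the sketch below. `latency_summary` is a hypothetical name, and the choice of the "inclusive" interpolation method is an assumption, since the card does not say how its percentiles were computed.

```python
import statistics


def latency_summary(latencies_s: list[float]) -> dict[str, float]:
    """Mean plus p50/p95/p99 of per-call latencies, in seconds."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_s),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Toy data only: 101 evenly spaced latencies from 1 s to 101 s.
summary = latency_summary([float(x) for x in range(1, 102)])
print(summary)
```

A mean well above the median, as in the reported 9.55 s versus 4.39 s, indicates a right-skewed distribution in which a minority of slow calls dominates the average, so the percentiles are the more informative figures for capacity planning.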
+### Reproducibility
+Full evaluation harness, raw responses, scored items, summary JSON, and the
+deterministic subsample seed are available at:
+- **Benchmark code**: https://github.com/ecloud-bh/aigency-benchmarks
+- **Evaluation results dataset**: https://huggingface.co/datasets/aigencydev/aigency-v4-evaluation
+- **Whitepaper (EN/TR)**: https://github.com/ecloud-bh/aigency-v4-whitepaper
+### Intended use
+**Primary deployment domains:**
+1. Public-sector and government workloads requiring KVKK residency
+2. Legal and legal-tech (statute search, contract analysis — Tural model integration)
+3. Education and higher education (Turkish academic, exam prep, course assistants)
+4. Banking, finance and insurance (Turkish-heavy KYC/AML)
+5. Healthcare administrative workloads (KVKK-compliant document handling)
+6. Media, publishing and editorial (Turkish grammar precision)
+7. Defence and critical infrastructure (sovereign architecture)
+8. Software, R&D and engineering (code generation, large-codebase analysis)
+**Out-of-scope or non-recommended:**
+- Clinical diagnosis or medical advice (administrative use only)
+- Autonomous critical decisions without human review
+- Graduate-level scientific research where GPQA-Diamond–class accuracy is required (use a hybrid of a frontier model and V4)
+- High-fidelity multimodal reasoning where MMMU > 75 is required (await V4.1)
+### Safety and compliance
+- KVKK §5 / §12 (Turkish PDPA) compliant — KVKK-resident hosting (TR DC)
+- ISO/IEC 27001 — IT-ISMS, risk and control matrix
+- NIST SP 800-207 (Zero-Trust) — mTLS, least privilege, continuous monitoring
+- EU AI Act (Regulation (EU) 2024/1689, in force since August 2024): high-risk classification with model card
+- Memory encryption: AES-256-XTS (RAM), ChaCha20-Poly1305 (LTM disk)
+- Image cache: AES-256-GCM, 30 MB limit, 24h TTL
+- Pre-encoding visual safety filter + post-encoding output check
+### Known limitations
+1. **GPQA Diamond / MMLU-Pro gap** — 35-50pp behind frontier; graduate-level expert knowledge is a V4.1 target.
+2. **First-generation multimodal** — vision encoder is 8B; V4.1 plans to scale to 16B.
+3. **Latency 2-3× frontier** — vision-encoder overhead, multimodal safety filter; V4.1 targets ≤ 4 s avg.
+4. **Multimodal subsample size** — DocVQA n=24, MMMU n=30 (HF cache constraints); CIs are wide.
+5. **Multilingual non-TR evaluation not published** — global-scale claim is currently Turkish-anchored.
+### Citation
+```bibtex
+@techreport{aigency-v4-2026,
+  title  = {AIGENCY V4: Sovereign, Fully Independent and Multimodal 128B-Parameter AI Architecture},
+  author = {{eCloud Yaz{\i}l{\i}m Teknolojileri}},
+  year   = {2026},
+  month  = apr,
+  institution = {eCloud Yaz{\i}l{\i}m Teknolojileri},
+  url    = {https://github.com/ecloud-bh/aigency-v4-whitepaper},
+  note   = {Whitepaper v1.0, April 2026}
+}
+```
+---
+## Türkçe
+### Model summary
+**AIGENCY V4** is a 128-billion-parameter sovereign AI model developed by
+eCloud Yazılım Teknolojileri as the multimodal successor to V3. It entered
+production in Q2 2026. It retains V3's four independence principles (zero
+external parameter dependency, local data sovereignty, transparent
+documentation, Turkish context fidelity) and extends them with a sovereign
+8B-parameter vision encoder that adds image understanding, document question
+answering, chart interpretation, and visual mathematics capabilities.
+| | |
+|---|---|
+| **Total parameters** | 128B (120B core + 8B vision encoder) |
+| **Architecture** | Sovereign decoder-only transformer + side vision encoder |
+| **Optimisations** | Adaptive LoRA+, Selective Layer Collapse, L-MoE, 4-bit block quantization, chunked attention |
+| **Context window** | 278K tokens (HBM 3-tier: STM 4k / ITM 64k / LTM 278k) |
+| **Active inference memory** | ~6.5 GB GPU under 4-bit quantization |
+| **Languages** | Turkish (primary), English |
+| **Modalities** | Text, image (one image per request, max 30 MB, image/* MIME) |
+| **Version** | 1.0 production |
+| **Release date** | April 2026 |
+| **Licence** | API-only commercial: https://aigency.dev/license |
+### Distribution
+**Weights are not shared on HuggingFace.** AIGENCY V4 is accessible only
+through `https://aigency.dev/api/v2`. This page presents the architectural
+specification, the evaluation methodology, and the full benchmark results.
+To try the model interactively, use the
+[demo Space](https://huggingface.co/spaces/aigencydev/AIGENCY-V4-Demo).
+For production access, see [aigency.dev](https://aigency.dev).
+### Positioning in one sentence
+AIGENCY V4 is a fully independent, KVKK-resident sovereign AI model that is
+a world leader in Turkish reading comprehension and natural-language
+inference, at global frontier level in scientific reasoning and grade-school
+mathematics, in the upper-mid frontier segment for code generation, and in
+active development for multimodal and graduate-level expert knowledge.
+### Intended use
+1. Public sector and government institutions (KVKK requirement)
+2. Legal and legal technology (statute search, contract analysis)
+3. Education and higher education (Turkish academic, exam preparation)
+4. Banking, finance and insurance (Turkish-heavy KYC/AML)
+5. Healthcare administrative workloads (KVKK-compliant document handling)
+6. Media, publishing and editorial (Turkish grammar precision)
+7. Defence and critical infrastructure (sovereign architecture)
+8. Software, R&D and engineering
+### Known limitations
+1. GPQA Diamond / MMLU-Pro are 35-50pp behind frontier; closing this gap is a V4.1 target.
+2. First production multimodal release; a 16B vision encoder is planned for V4.1.
+3. Latency is 2-3× frontier; the V4.1 target is ≤ 4 s on average.
+4. Multimodal subsample sizes are small (DocVQA n=24, MMMU n=30); CIs are wide.
+5. No non-Turkish multilingual profile has been published; the global claim is currently Turkish-centred.
+### Citation
+```bibtex
+@techreport{aigency-v4-2026,
+  title  = {AIGENCY V4: Yerli, Tam Ba{\u g}{\i}ms{\i}z ve Multimodal 128B Parametreli Yapay Zek\^a Mimarisi},
+  author = {{eCloud Yaz{\i}l{\i}m Teknolojileri}},
+  year   = {2026},
+  month  = apr,
+  institution = {eCloud Yaz{\i}l{\i}m Teknolojileri},
+  url    = {https://github.com/ecloud-bh/aigency-v4-whitepaper}
+}
+```
+---
+## License
+AIGENCY V4 is offered under the **eCloud AIGENCY Commercial Licence** (API-only).
+Model weights are not redistributed. The accompanying whitepaper is licensed
+under **CC BY-ND 4.0**, and the benchmark code is licensed under **MIT**.
+For commercial use, partnership, or research collaboration:
+**info@e-cloud.web.tr · ai@aigency.dev** · https://aigency.dev
+© 2026 eCloud Yazılım Teknolojileri.