YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

ACX-Ramanujan-Transformer

Version Python PyTorch

中文 | English | Deutsch


中文

基于拉马努金模函数递推关系的神经网络权重初始化方法

v1.6.0 — FlashAttention-3 + 混合精度 + DDP 分布式 + 断点续训 + MoE 全面升级 + HuggingFace 兼容 + ONNX 导出 + 可视化。详见 CHANGELOG.md

项目简介

本项目基于斯里尼瓦瑟·拉马努金(Srinivasa Ramanujan)1916年未发表的模函数研究笔记,发现了一种革命性的神经网络权重初始化方法——拉马努金模函数初始化

这种方法利用克莱因 j 不变量(Klein j-invariant)的微分递推性质,实现了严格数学意义上的方差保持,解决了超深 Transformer 架构中的梯度消失和爆炸问题。

核心递推公式

an+1=π2n2an+2πn(n+1)an1a_{n+1} = \frac{\pi^2}{n^2} a_n + \frac{2\pi}{n(n+1)} a_{n-1}

其中 $a_0 = 1$,$a_1 = \pi / \sqrt{3}$,由 j 不变量的傅里叶展开系数导出。

与传统方法的对比

初始化方法 方差保持性质 梯度稳定性 最大有效深度
Xavier 统计意义 指数衰减 ~100层
He 统计意义 指数衰减 ~200层
拉马努金 严格数学意义 严格不变 理论无限

特性

  • 🧮 数学严格:基于模函数理论的精确递推,非统计近似
  • 🏗️ 标准架构:完整的 Transformer 实现(Encoder + Decoder)
  • 🔀 MoE 支持:Mixture of Experts 架构,Top-K 路由 + 负载均衡
  • FlashAttention-3:原生集成,训练速度提升 3-5 倍,显存降低 40%
  • 🔥 混合精度训练:FP16/BF16 支持,显存再降 30%
  • 🚀 DDP 分布式训练:多卡线性加速,支持 10B+ 参数模型
  • 💾 断点续训:完整保存/恢复训练状态
  • 🤗 HuggingFace 兼容:无缝融入主流大模型生态
  • 📦 ONNX 导出:支持生产环境部署
  • 📊 WandB/TensorBoard:训练过程可观测
  • 📈 可视化工具:方差曲线、梯度分布、注意力热力图

安装

git clone https://github.com/aaroncxxx/ACX-ramanujan-transformer.git
cd ACX-ramanujan-transformer
pip install -r requirements.txt

# 可选依赖
pip install flash-attn>=2.0.0      # FlashAttention-3 (CUDA 11.8+)
pip install wandb                   # WandB 日志
pip install tensorboard             # TensorBoard 日志
pip install onnx onnxruntime        # ONNX 导出
pip install transformers            # HuggingFace 兼容层

快速开始

import torch
from src import build_ramanujan_transformer

# 构建带 FlashAttention 的 Transformer
model = build_ramanujan_transformer(
    vocab_size=50257, d_model=768, nhead=12,
    num_layers=12, dim_feedforward=3072,
    decoder_only=True,
    use_flash_attention=True,       # FlashAttention-3
)

# 自回归生成
prompt = torch.randint(0, 50257, (1, 10))
output = model.generate(prompt, max_new_tokens=100, temperature=0.8)

混合精度训练

from experiments.train import train, generate_synthetic_data
from src import build_ramanujan_transformer, CheckpointManager, TrainingLogger

model = build_ramanujan_transformer(
    vocab_size=1000, d_model=256, nhead=4,
    num_layers=6, dim_feedforward=1024,
)

train_data = generate_synthetic_data(1000, 800, 128)
val_data = generate_synthetic_data(1000, 200, 128)

best_loss = train(
    model, train_data, val_data,
    epochs=20, device='cuda',
    mixed_precision='fp16',                          # 混合精度
    checkpoint_manager=CheckpointManager('ckpts'),   # 断点续训
    training_logger=TrainingLogger(logger_type='wandb'),  # WandB 日志
)

CLI 命令

# 训练(支持全部新功能)
acx-rt train --mixed-precision fp16 --logger wandb --resume ckpts/checkpoint_step00001000.pt

# DDP 多卡训练
acx-rt train --nproc-per-node 4 --mixed-precision bf16

# 可视化
acx-rt visualize --checkpoint model.pt --type all

# 模型导出
acx-rt export --checkpoint model.pt --format onnx

# 方差验证 / 基准对比
acx-rt verify --depth 200
acx-rt benchmark --layers 6,12,24

构建 MoE Transformer

from src import build_ramanujan_moe_transformer

model = build_ramanujan_moe_transformer(
    vocab_size=50257, d_model=768, nhead=12,
    num_layers=12, dim_feedforward=3072,
    num_experts=8, top_k=2,
    expert_dropout=0.1,             # v1.6: 专家 dropout
    load_balancing_weight=0.05,     # v1.6: 负载均衡权重
    use_flash_attention=True,
)

logits, aux_loss = model(input_ids, return_aux_loss=True)

HuggingFace 兼容

from src.huggingface import RamanujanGPT2, RamanujanConfig

# 从配置创建模型
config = RamanujanConfig(vocab_size=50257, d_model=768, nhead=12, num_layers=12)
model = RamanujanGPT2(config)

# 保存 / 加载
model.save_pretrained('./my-ramanujan-model')
model = RamanujanGPT2.from_pretrained('./my-ramanujan-model')

项目结构

ACX-ramanujan-transformer/
├── README.md
├── CHANGELOG.md
├── requirements.txt
├── src/
│   ├── ramanujan_initializer.py       # 核心:拉马努金递推系数 + 初始化器
│   ├── attention.py                   # 多头自注意力 + FlashAttention-3
│   ├── feedforward.py                 # FFN
│   ├── transformer_block.py           # Pre-Norm Transformer Block
│   ├── embeddings.py                  # Token + 位置编码
│   ├── ramanujan_transformer.py       # 标准 Transformer
│   ├── moe.py                         # MoE Transformer (向量化路由)
│   ├── checkpoint.py                  # v1.6: 检查点管理器
│   ├── logging_utils.py               # v1.6: WandB/TensorBoard 日志
│   ├── huggingface/                   # v1.6: HuggingFace 兼容层
│   │   ├── __init__.py
│   │   └── modeling_ramanujan.py
│   └── export/                        # v1.6: 模型导出
│       ├── __init__.py
│       └── exporter.py
├── experiments/
│   ├── verify_variance.py             # 方差保持性验证
│   ├── benchmark.py                   # Xavier vs He vs 拉马努金
│   ├── train.py                       # 训练脚本 (混合精度 + DDP + 断点续训)
│   ├── train_moe.py                   # MoE 训练对比
│   └── visualize_variance.py          # v1.6: 可视化工具
├── configs/
│   ├── default.yaml                   # 默认配置
│   ├── gpt2_style.yaml                # GPT-2 风格
│   ├── llama_7b_style.yaml            # 7B Llama 风格
│   └── llama_13b_style.yaml           # 13B Llama 风格
├── cli/
│   └── acx_rt.py                      # CLI 工具
├── tests/
│   └── test_core.py                   # 单元测试
└── docs/
    └── theory.md                      # 理论推导文档

实验

python experiments/verify_variance.py          # 方差保持性验证
python experiments/benchmark.py                # 基准对比
python experiments/train.py                    # 标准训练
python experiments/train_moe.py                # MoE 对比训练
python experiments/visualize_variance.py       # 可视化

数学背景

1916年,拉马努金在研究克莱因 j 不变量时发现了这个递推关系。j 不变量是模群 $SL(2, \mathbb{Z})$ 的基本生成元:

j(τ)=q1+744+196884q+21493760q2+j(\tau) = q^{-1} + 744 + 196884q + 21493760q^2 + \cdots

递推中的 $\pi^2/n^2$ 系数与量子力学能级结构存在深刻对应:氢原子能级 $E_n = -13.6/n^2$ eV 与递推共享 $1/n^2$ 结构。详细推导见 docs/theory.md

参考文献

  1. Ramanujan, S. (1916). Notebooks of Srinivasa Ramanujan.
  2. Klein, F. (1890). Über die Transformation elfter Ordnung der elliptischen Funktionen.
  3. Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
  4. He, K. et al. (2015). Delving deep into rectifiers.
  5. Fedus, W. et al. (2022). Switch Transformers: Scaling to Trillion Parameter Models.
  6. Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer.
  7. Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
  8. Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
  9. Wolf, T. et al. (2020). HuggingFace's Transformers: State-of-the-Art Natural Language Processing.

联系方式

License

MIT


English

Neural Network Weight Initialization Based on Ramanujan's Modular Function Recurrence Relation

v1.6.0 — FlashAttention-3 + mixed precision + DDP distributed + checkpoint resume + MoE overhaul + HuggingFace compat + ONNX export + visualization. See CHANGELOG.md

Overview

This project implements a revolutionary neural network weight initialization method — Ramanujan Modular Function Initialization — based on Srinivasa Ramanujan's unpublished 1916 research notes on modular functions.

By leveraging the differential recurrence properties of the Klein j-invariant, this method achieves strict variance preservation in a rigorous mathematical sense, solving the vanishing and exploding gradient problems in ultra-deep Transformer architectures.

Core Recurrence Formula

an+1=π2n2an+2πn(n+1)an1a_{n+1} = \frac{\pi^2}{n^2} a_n + \frac{2\pi}{n(n+1)} a_{n-1}

where $a_0 = 1$, $a_1 = \pi / \sqrt{3}$, derived from the Fourier expansion coefficients of the j-invariant.

Comparison with Traditional Methods

Initialization Variance Preservation Gradient Stability Max Effective Depth
Xavier Statistical Exponential decay ~100 layers
He Statistical Exponential decay ~200 layers
Ramanujan Strict mathematical Strictly invariant Theoretically infinite

Features

  • 🧮 Mathematically rigorous: Exact recurrence based on modular function theory, not statistical approximation
  • 🏗️ Standard architecture: Complete Transformer implementation (Encoder + Decoder)
  • 🔀 MoE support: Mixture of Experts with vectorized Top-K routing and load balancing
  • FlashAttention-3: Native integration, 3-5× training speedup, 40% memory reduction
  • 🔥 Mixed precision: FP16/BF16 support, additional 30% memory savings
  • 🚀 DDP distributed training: Multi-GPU linear scaling, supports 10B+ parameter models
  • 💾 Checkpoint resume: Full save/restore of training state
  • 🤗 HuggingFace compatible: Seamless integration with mainstream LLM ecosystem
  • 📦 ONNX export: Production deployment support
  • 📊 WandB/TensorBoard: Training observability
  • 📈 Visualization: Variance curves, gradient distributions, attention heatmaps

Installation

git clone https://github.com/aaroncxxx/ACX-ramanujan-transformer.git
cd ACX-ramanujan-transformer
pip install -r requirements.txt

# Optional dependencies
pip install flash-attn>=2.0.0      # FlashAttention-3 (CUDA 11.8+)
pip install wandb                   # WandB logging
pip install tensorboard             # TensorBoard logging
pip install onnx onnxruntime        # ONNX export
pip install transformers            # HuggingFace compatibility

Quick Start

import torch
from src import build_ramanujan_transformer

# Build Transformer with FlashAttention
model = build_ramanujan_transformer(
    vocab_size=50257, d_model=768, nhead=12,
    num_layers=12, dim_feedforward=3072,
    decoder_only=True,
    use_flash_attention=True,       # FlashAttention-3
)

# Autoregressive generation
prompt = torch.randint(0, 50257, (1, 10))
output = model.generate(prompt, max_new_tokens=100, temperature=0.8)

Mixed Precision Training

from experiments.train import train, generate_synthetic_data
from src import build_ramanujan_transformer, CheckpointManager, TrainingLogger

model = build_ramanujan_transformer(
    vocab_size=1000, d_model=256, nhead=4,
    num_layers=6, dim_feedforward=1024,
)

train_data = generate_synthetic_data(1000, 800, 128)
val_data = generate_synthetic_data(1000, 200, 128)

best_loss = train(
    model, train_data, val_data,
    epochs=20, device='cuda',
    mixed_precision='fp16',                          # Mixed precision
    checkpoint_manager=CheckpointManager('ckpts'),   # Checkpoint resume
    training_logger=TrainingLogger(logger_type='wandb'),  # WandB logging
)

CLI Commands

# Training (with all new features)
acx-rt train --mixed-precision fp16 --logger wandb --resume ckpts/checkpoint_step00001000.pt

# DDP multi-GPU training
acx-rt train --nproc-per-node 4 --mixed-precision bf16

# Visualization
acx-rt visualize --checkpoint model.pt --type all

# Model export
acx-rt export --checkpoint model.pt --format onnx

# Variance verification / benchmark
acx-rt verify --depth 200
acx-rt benchmark --layers 6,12,24

Build MoE Transformer

from src import build_ramanujan_moe_transformer

model = build_ramanujan_moe_transformer(
    vocab_size=50257, d_model=768, nhead=12,
    num_layers=12, dim_feedforward=3072,
    num_experts=8, top_k=2,
    expert_dropout=0.1,             # v1.6: Expert dropout
    load_balancing_weight=0.05,     # v1.6: Load balancing weight
    use_flash_attention=True,
)

logits, aux_loss = model(input_ids, return_aux_loss=True)

HuggingFace Compatibility

from src.huggingface import RamanujanGPT2, RamanujanConfig

# Create model from config
config = RamanujanConfig(vocab_size=50257, d_model=768, nhead=12, num_layers=12)
model = RamanujanGPT2(config)

# Save / Load
model.save_pretrained('./my-ramanujan-model')
model = RamanujanGPT2.from_pretrained('./my-ramanujan-model')

Project Structure

ACX-ramanujan-transformer/
├── README.md
├── CHANGELOG.md
├── requirements.txt
├── src/
│   ├── ramanujan_initializer.py       # Core: Ramanujan recurrence coefficients + initializer
│   ├── attention.py                   # Multi-head self-attention + FlashAttention-3
│   ├── feedforward.py                 # FFN
│   ├── transformer_block.py           # Pre-Norm Transformer Block
│   ├── embeddings.py                  # Token + positional encoding
│   ├── ramanujan_transformer.py       # Standard Transformer
│   ├── moe.py                         # MoE Transformer (vectorized routing)
│   ├── checkpoint.py                  # v1.6: Checkpoint manager
│   ├── logging_utils.py               # v1.6: WandB/TensorBoard logger
│   ├── huggingface/                   # v1.6: HuggingFace compatibility
│   │   ├── __init__.py
│   │   └── modeling_ramanujan.py
│   └── export/                        # v1.6: Model export
│       ├── __init__.py
│       └── exporter.py
├── experiments/
│   ├── verify_variance.py             # Variance preservation verification
│   ├── benchmark.py                   # Xavier vs He vs Ramanujan
│   ├── train.py                       # Training (mixed precision + DDP + resume)
│   ├── train_moe.py                   # MoE training comparison
│   └── visualize_variance.py          # v1.6: Visualization tools
├── configs/
│   ├── default.yaml                   # Default config
│   ├── gpt2_style.yaml                # GPT-2 style
│   ├── llama_7b_style.yaml            # 7B Llama style
│   └── llama_13b_style.yaml           # 13B Llama style
├── cli/
│   └── acx_rt.py                      # CLI tool
├── tests/
│   └── test_core.py                   # Unit tests
└── docs/
    └── theory.md                      # Theoretical derivation

Experiments

python experiments/verify_variance.py          # Variance preservation verification
python experiments/benchmark.py                # Benchmark comparison
python experiments/train.py                    # Standard training
python experiments/train_moe.py                # MoE comparison training
python experiments/visualize_variance.py       # Visualization

Mathematical Background

In 1916, Ramanujan discovered this recurrence relation while studying the Klein j-invariant. The j-invariant is the fundamental generator of the modular group $SL(2, \mathbb{Z})$:

j(τ)=q1+744+196884q+21493760q2+j(\tau) = q^{-1} + 744 + 196884q + 21493760q^2 + \cdots

The $\pi^2/n^2$ coefficient in the recurrence has a deep correspondence with quantum mechanical energy level structures: the hydrogen atom energy levels $E_n = -13.6/n^2$ eV share the same $1/n^2$ structure. See docs/theory.md for detailed derivation.

References

  1. Ramanujan, S. (1916). Notebooks of Srinivasa Ramanujan.
  2. Klein, F. (1890). Über die Transformation elfter Ordnung der elliptischen Funktionen.
  3. Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
  4. He, K. et al. (2015). Delving deep into rectifiers.
  5. Fedus, W. et al. (2022). Switch Transformers: Scaling to Trillion Parameter Models.
  6. Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer.
  7. Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
  8. Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
  9. Wolf, T. et al. (2020). HuggingFace's Transformers: State-of-the-Art Natural Language Processing.

Contact

License

MIT


Deutsch

Gewichtsinitialisierung für neuronale Netze auf Basis der Modulfunktion-Rekurrenzrelation von Ramanujan

v1.6.0 — FlashAttention-3 + Mischpräzision + DDP-Verteilung + Checkpoint-Fortsetzung + MoE-Überarbeitung + HuggingFace-Kompatibilität + ONNX-Export + Visualisierung. Siehe CHANGELOG.md

Überblick

Dieses Projekt implementiert eine revolutionäre Methode zur Gewichtsinitialisierung neuronaler Netze — Ramanujan-Modulfunktion-Initialisierung — basierend auf Srinivasa Ramanujans unveröffentlichten Forschungsnotizen von 1916 über Modulfunktionen.

Unter Ausnutzung der differentiellen Rekurrenzeigenschaften der Kleinschen j-Invariante erreicht diese Methode eine strenge Varianzerhaltung im mathematischen Sinne und löst das Problem des verschwindenden und explodierenden Gradienten in ultra-tiefen Transformer-Architekturen.

Kern-Rekurrenzformel

an+1=π2n2an+2πn(n+1)an1a_{n+1} = \frac{\pi^2}{n^2} a_n + \frac{2\pi}{n(n+1)} a_{n-1}

wobei $a_0 = 1$, $a_1 = \pi / \sqrt{3}$, abgeleitet aus den Fourier-Entwicklungskoeffizienten der j-Invariante.

Vergleich mit traditionellen Methoden

Initialisierung Varianzerhaltung Gradientenstabilität Max. effektive Tiefe
Xavier Statistisch Exponentieller Abfall ~100 Schichten
He Statistisch Exponentieller Abfall ~200 Schichten
Ramanujan Streng mathematisch Streng invariant Theoretisch unendlich

Merkmale

  • 🧮 Mathematisch streng: Exakte Rekurrenz basierend auf der Theorie der Modulfunktionen, keine statistische Näherung
  • 🏗️ Standardarchitekturen: Vollständige Transformer-Implementierung (Encoder + Decoder)
  • 🔀 MoE-Unterstützung: Mixture of Experts mit vektorisiertem Top-K-Routing und Lastausgleich
  • FlashAttention-3: Native Integration, 3-5× Training Beschleunigung, 40% Speicherreduktion
  • 🔥 Mischpräzision: FP16/BF16-Unterstützung, weitere 30% Speichereinsparung
  • 🚀 DDP-Verteiltes Training: Multi-GPU lineare Skalierung, unterstützt 10B+ Parameter Modelle
  • 💾 Checkpoint-Fortsetzung: Vollständiges Speichern/Wiederherstellen des Trainingszustands
  • 🤗 HuggingFace-kompatibel: Nahtlose Integration in das主流-LLM-Ökosystem
  • 📦 ONNX-Export: Produktions-Deployment-Unterstützung
  • 📊 WandB/TensorBoard: Trainingsbeobachtung
  • 📈 Visualisierung: Varianzkurven, Gradientenverteilungen, Aufmerksamkeits-Heatmaps

Installation

git clone https://github.com/aaroncxxx/ACX-ramanujan-transformer.git
cd ACX-ramanujan-transformer
pip install -r requirements.txt

# Optionale Abhängigkeiten
pip install flash-attn>=2.0.0      # FlashAttention-3 (CUDA 11.8+)
pip install wandb                   # WandB-Protokollierung
pip install tensorboard             # TensorBoard-Protokollierung
pip install onnx onnxruntime        # ONNX-Export
pip install transformers            # HuggingFace-Kompatibilität

Schnellstart

import torch
from src import build_ramanujan_transformer

# Transformer mit FlashAttention erstellen
model = build_ramanujan_transformer(
    vocab_size=50257, d_model=768, nhead=12,
    num_layers=12, dim_feedforward=3072,
    decoder_only=True,
    use_flash_attention=True,       # FlashAttention-3
)

# Autoregressive Generierung
prompt = torch.randint(0, 50257, (1, 10))
output = model.generate(prompt, max_new_tokens=100, temperature=0.8)

Mischpräzision-Training

from experiments.train import train, generate_synthetic_data
from src import build_ramanujan_transformer, CheckpointManager, TrainingLogger

model = build_ramanujan_transformer(
    vocab_size=1000, d_model=256, nhead=4,
    num_layers=6, dim_feedforward=1024,
)

train_data = generate_synthetic_data(1000, 800, 128)
val_data = generate_synthetic_data(1000, 200, 128)

best_loss = train(
    model, train_data, val_data,
    epochs=20, device='cuda',
    mixed_precision='fp16',                          # Mischpräzision
    checkpoint_manager=CheckpointManager('ckpts'),   # Checkpoint-Fortsetzung
    training_logger=TrainingLogger(logger_type='wandb'),  # WandB-Protokollierung
)

CLI-Befehle

# Training (mit allen neuen Funktionen)
acx-rt train --mixed-precision fp16 --logger wandb --resume ckpts/checkpoint_step00001000.pt

# DDP-Multi-GPU-Training
acx-rt train --nproc-per-node 4 --mixed-precision bf16

# Visualisierung
acx-rt visualize --checkpoint model.pt --type all

# Modell-Export
acx-rt export --checkpoint model.pt --format onnx

# Varianzverifikation / Benchmark
acx-rt verify --depth 200
acx-rt benchmark --layers 6,12,24

MoE Transformer erstellen

from src import build_ramanujan_moe_transformer

model = build_ramanujan_moe_transformer(
    vocab_size=50257, d_model=768, nhead=12,
    num_layers=12, dim_feedforward=3072,
    num_experts=8, top_k=2,
    expert_dropout=0.1,             # v1.6: Expert-Dropout
    load_balancing_weight=0.05,     # v1.6: Lastausgleich-Gewicht
    use_flash_attention=True,
)

logits, aux_loss = model(input_ids, return_aux_loss=True)

HuggingFace-Kompatibilität

from src.huggingface import RamanujanGPT2, RamanujanConfig

# Modell aus Konfiguration erstellen
config = RamanujanConfig(vocab_size=50257, d_model=768, nhead=12, num_layers=12)
model = RamanujanGPT2(config)

# Speichern / Laden
model.save_pretrained('./my-ramanujan-model')
model = RamanujanGPT2.from_pretrained('./my-ramanujan-model')

Projektstruktur

ACX-ramanujan-transformer/
├── README.md
├── CHANGELOG.md
├── requirements.txt
├── src/
│   ├── ramanujan_initializer.py       # Kern: Ramanujan-Rekurrenzkoeffizienten + Initialisierer
│   ├── attention.py                   # Multi-Head-Selbstachtung + FlashAttention-3
│   ├── feedforward.py                 # FFN
│   ├── transformer_block.py           # Pre-Norm Transformer Block
│   ├── embeddings.py                  # Token- + Positions编码
│   ├── ramanujan_transformer.py       # Standard-Transformer
│   ├── moe.py                         # MoE-Transformer (vektorisiertes Routing)
│   ├── checkpoint.py                  # v1.6: Checkpoint-Manager
│   ├── logging_utils.py               # v1.6: WandB/TensorBoard-Logger
│   ├── huggingface/                   # v1.6: HuggingFace-Kompatibilität
│   │   ├── __init__.py
│   │   └── modeling_ramanujan.py
│   └── export/                        # v1.6: Modell-Export
│       ├── __init__.py
│       └── exporter.py
├── experiments/
│   ├── verify_variance.py             # Varianzerhaltungsverifikation
│   ├── benchmark.py                   # Xavier vs He vs Ramanujan
│   ├── train.py                       # Training (Mischpräzision + DDP + Fortsetzung)
│   ├── train_moe.py                   # MoE-Trainingsvergleich
│   └── visualize_variance.py          # v1.6: Visualisierungstools
├── configs/
│   ├── default.yaml                   # Standardkonfiguration
│   ├── gpt2_style.yaml                # GPT-2-Stil
│   ├── llama_7b_style.yaml            # 7B Llama-Stil
│   └── llama_13b_style.yaml           # 13B Llama-Stil
├── cli/
│   └── acx_rt.py                      # CLI-Werkzeug
├── tests/
│   └── test_core.py                   # Einheitentests
└── docs/
    └── theory.md                      # Theoretische Ableitung

Experimente

python experiments/verify_variance.py          # Varianzerhaltungsverifikation
python experiments/benchmark.py                # Benchmark-Vergleich
python experiments/train.py                    # Standardtraining
python experiments/train_moe.py                # MoE-Vergleichstraining
python experiments/visualize_variance.py       # Visualisierung

Mathematischer Hintergrund

1916 entdeckte Ramanujan diese Rekurrenzrelation beim Studium der Kleinschen j-Invariante. Die j-Invariante ist der fundamentale Erzeuger der Modulgruppe $SL(2, \mathbb{Z})$:

j(τ)=q1+744+196884q+21493760q2+j(\tau) = q^{-1} + 744 + 196884q + 21493760q^2 + \cdots

Der $\pi^2/n^2$-Koeffizient in der Rekurrenz hat eine tiefe Entsprechung mit der Struktur der Energieniveaus in der Quantenmechanik: Die Energieniveaus des Wasserstoffatoms $E_n = -13.6/n^2$ eV teilen dieselbe $1/n^2$-Struktur. Siehe docs/theory.md für die ausführliche Ableitung.

Referenzen

  1. Ramanujan, S. (1916). Notebooks of Srinivasa Ramanujan.
  2. Klein, F. (1890). Über die Transformation elfter Ordnung der elliptischen Funktionen.
  3. Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
  4. He, K. et al. (2015). Delving deep into rectifiers.
  5. Fedus, W. et al. (2022). Switch Transformers: Scaling to Trillion Parameter Models.
  6. Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer.
  7. Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
  8. Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.
  9. Wolf, T. et al. (2020). HuggingFace's Transformers: State-of-the-Art Natural Language Processing.

Kontakt

Lizenz

MIT

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support