---
language:
- en
license: mit
tags:
- vortex
- science
- physics
- chemistry
- biology
- mathematics
- ssm
- mamba
- hybrid-architecture
- custom-tokenizer
- from-scratch
- matrix-corp
pipeline_tag: text-generation
library_name: transformers
model_type: vortex
---
# Vortex Scientific

Vortex Scientific is a from-scratch AI model family designed for deep scientific reasoning. It is built from the ground up with a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia 4060 laptop GPUs).
## 🌟 Features
- Novel Architecture: Hybrid State-Space Model (SSM) + Local Attention blocks
- Science-Specialized: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
- Hardware Optimized: Runs smoothly on 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
- Two Model Sizes:
  - Vortex-7B: 7 billion parameters, fits in 8GB VRAM
  - Vortex-13B: 13 billion parameters, fits in 16GB VRAM with quantization
- HuggingFace Compatible: Full integration with the `transformers` library
- From Scratch: No base model; everything is built bottom-up, including the tokenizer and weights
## 🏗️ Architecture
Vortex uses a two-block hybrid architecture:
- SSM-Only Blocks: State-space layers for efficient long-context processing (O(n) complexity)
- Attention+Science Blocks: Local windowed attention + science modules + SciGate FFN
Layer ratios:
- 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
- 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
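The alternating layouts above can be sketched by tiling a repeating block unit across the layer stack. This is a hypothetical helper for illustration; the actual stack construction lives in `models/vortex_model.py` and may differ:

```python
def build_block_pattern(num_layers, unit):
    """Tile a repeating block unit (e.g. SSM, SSM, Attn) across the stack."""
    return [unit[i % len(unit)] for i in range(num_layers)]

# 7B: roughly 60% SSM with a (SSM, SSM, Attn) unit over 32 layers
pattern_7b = build_block_pattern(32, ("ssm", "ssm", "attn"))

# 13B: 50% SSM with a strict (SSM, Attn) alternation over 40 layers
pattern_13b = build_block_pattern(40, ("ssm", "attn"))
```
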
### Science Modules
- EquationModule: LaTeX equation detection and structural understanding
- NumericalReasoningModule: Digit-level encoding, scientific notation, unit awareness
- CitationModule: Citation span detection, provenance tracking, confidence scoring
- MolecularModule: Element embeddings, SMILES understanding, amino acid sequences
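As a toy illustration of the kind of span detection the EquationModule performs, a regex sketch is shown below. The real module operates on token representations, not raw text; the pattern and function name here are invented for illustration:

```python
import re

# Match inline $...$ and display \[...\] LaTeX spans (simplified).
EQUATION_RE = re.compile(r"\$[^$]+\$|\\\[[^\]]*?\\\]")

def find_equation_spans(text):
    """Return the LaTeX equation spans found in a passage of text."""
    return EQUATION_RE.findall(text)
```
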
## 📦 Project Structure
```
Vortex/
├── configs/
│   ├── vortex_7b_config.py      # 7B model configuration
│   ├── vortex_13b_config.py     # 13B model configuration
│   └── training_config.py       # Training hyperparameters
├── models/
│   ├── ssm_layer.py             # State-space layer
│   ├── attention_layer.py       # Local windowed attention
│   ├── scigate_ffn.py           # Science-gated feed-forward
│   ├── vortex_model.py          # Main model class
│   └── science_modules/         # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py      # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py        # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py        # Multi-stage quality filtering
│   ├── domain_classifier.py     # 7-domain classifier
│   ├── deduplication.py         # MinHash LSH deduplication
│   └── scraper.py               # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py               # Main training loop
│   ├── losses.py                # Science-aware loss functions
│   └── curriculum.py            # Curriculum learning scheduler
├── inference/
│   ├── cuda_optimize.py         # CUDA optimizations (Flash Attention, INT8)
│   ├── mps_optimize.py          # MPS optimizations for Apple Silicon
│   └── inference.py             # Inference entry point
├── evaluation/                  # Science benchmarks (coming soon)
├── configuration_vortex.py      # HF config class
├── tokenization_vortex.py       # HF tokenizer wrapper
├── modeling_vortex.py           # HF model integration
├── train.py                     # Training entry point
└── requirements.txt
```
## 🚀 Quick Start

### Installation

```bash
# Clone and setup
cd Vortex
pip install -r requirements.txt

# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes
```
### Training

```bash
# Train 7B model on CUDA
python train.py \
  --model_size 7b \
  --device cuda \
  --data_dir ./data/processed \
  --output_dir ./checkpoints \
  --max_steps 100000

# Train 13B model with INT8 quantization (for 8GB VRAM)
python train.py \
  --model_size 13b \
  --device cuda \
  --quantization int8 \
  --data_dir ./data/processed \
  --output_dir ./checkpoints_13b
```
### Inference

```bash
# Generate text with 7B model
python inference/inference.py \
  --model_path ./checkpoints/latest.pt \
  --model_size 7b \
  --device cuda \
  --prompt "The equation E = mc^2 describes" \
  --max_new_tokens 100

# Interactive mode
python inference/inference.py \
  --model_path ./checkpoints/latest.pt \
  --model_size 7b \
  --device cuda \
  --interactive

# On Apple Silicon (MPS)
python inference/inference.py \
  --model_path ./checkpoints/latest.pt \
  --model_size 7b \
  --use_mps \
  --prompt "Explain quantum mechanics"
```
### HuggingFace Integration

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (trust_remote_code is needed because
# model_type "vortex" is a custom architecture, not built into transformers)
model = AutoModelForCausalLM.from_pretrained("./checkpoints", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./checkpoints", trust_remote_code=True)

# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
## 📊 Data Pipeline
- Open Datasets: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
- Quality Filtering: Multi-stage checks (length, language, equations, repetition, citations)
- Deduplication: MinHash LSH for near-duplicate detection
- Domain Classification: Classify into 7 science domains
- Tokenization: Custom science-aware BPE tokenizer
- Sharding: Write to Parquet with statistics
```python
from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH

# Load and process data
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()

# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
    if quality_filter.filter(sample["text"]):
        lsh.add_document(sample["id"], sample["text"])
        # Tokenize and save
```
## 🎯 Training Strategy

### Curriculum Learning
Training progresses through 4 stages:
- Foundation (0-20%): Basic science text, simple equations, definitions
- Domain (20-50%): Domain-specific deep content per science area
- Reasoning (50-80%): Scientific problem solving, multi-step derivations
- Integration (80-100%): Cross-domain science, full dataset
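The stage boundaries above can be expressed as a small scheduler. This is an illustrative sketch following the percentages in the list, not necessarily the actual `training/curriculum.py` implementation:

```python
def curriculum_stage(step, max_steps):
    """Map training progress to one of the four curriculum stages."""
    progress = step / max_steps
    if progress < 0.20:
        return "foundation"   # basic science text, simple equations
    if progress < 0.50:
        return "domain"       # domain-specific deep content
    if progress < 0.80:
        return "reasoning"    # multi-step derivations
    return "integration"      # cross-domain, full dataset
```
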
### Science-Aware Loss

```python
total_loss = (
    lm_loss * 1.0           # Standard next-token prediction
    + equation_loss * 0.3   # Equation reconstruction accuracy
    + domain_loss * 0.1     # Domain classification head
    + citation_loss * 0.1   # Citation detection accuracy
    + numerical_loss * 0.2  # Numerical reasoning accuracy
)
```
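Equivalently, as a small runnable helper (the weights are the ones quoted above; the actual implementation in `training/losses.py` may differ):

```python
# Weights for each auxiliary objective, as quoted in this README.
LOSS_WEIGHTS = {
    "lm": 1.0,         # standard next-token prediction
    "equation": 0.3,   # equation reconstruction accuracy
    "domain": 0.1,     # domain classification head
    "citation": 0.1,   # citation detection accuracy
    "numerical": 0.2,  # numerical reasoning accuracy
}

def science_aware_loss(losses):
    """Combine per-objective loss values into the weighted total."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())
```
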
## ⚙️ Configuration

### 7B Config (`VORTEX_7B_CONFIG`)

```yaml
d_model: 4096
num_layers: 32
num_heads: 32
d_state: 16
ssm_ratio: 0.6
vocab_size: 50000
max_seq_len: 16384
```

### 13B Config (`VORTEX_13B_CONFIG`)

```yaml
d_model: 5120
num_layers: 40
num_heads: 40
d_state: 32
ssm_ratio: 0.5
vocab_size: 50000
max_seq_len: 16384
```
## 🔧 Hardware Targets

### Nvidia 4060 Laptop (8GB VRAM)

- 7B: BF16, no quantization, Flash Attention 2, `torch.compile`
- 13B: INT8 quantization, Flash Attention 2, `torch.compile`
- Target throughput: 25-40 tok/s (7B), 15-25 tok/s (13B)

### Apple Silicon (M2/M3)

- 7B on M3: BF16 weights cast to float16, SDPA, no `torch.compile`
- 13B on M3 Max: BF16, unified memory, SDPA
- Target throughput: 20-35 tok/s (7B), 12-20 tok/s (13B)
## 🧪 Science Domains

- Physics (`[PHYS]`)
- Mathematics (`[MATH]`)
- Chemistry (`[CHEM]`)
- Biology (`[BIO]`)
- Earth Science (`[EARTH]`)
- Space Science (`[SPACE]`)
- Zoology (`[ZOO]`)

Domain tags can be included in training data to guide the SciGate FFN routing.
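For example, tagging a training sample might look like this (`tag_sample` is a hypothetical helper for illustration, not part of the repository):

```python
# The seven domain tags listed above.
DOMAIN_TAGS = {"[PHYS]", "[MATH]", "[CHEM]", "[BIO]", "[EARTH]", "[SPACE]", "[ZOO]"}

def tag_sample(text, domain):
    """Prepend a domain tag so SciGate routing can condition on it."""
    tag = f"[{domain.upper()}]"
    if tag not in DOMAIN_TAGS:
        raise ValueError(f"unknown domain: {domain}")
    return f"{tag} {text}"
```
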
## 📝 Tokenizer

Custom BPE tokenizer with:

- 40,000 base BPE tokens trained on a scientific corpus
- 10,000 science-specific tokens:
  - 500 LaTeX math symbols (`\alpha`, `\sum`, `\int`, etc.)
  - 118 chemical element symbols
  - 200 SI and derived units
  - 300 scientific abbreviations (DNA, RNA, ATP, etc.)
  - 500 mathematical operators
  - Amino acid codes
  - Greek alphabet (α, β, γ, etc.)
- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, domain tags
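The token budget lines up with the `vocab_size: 50000` quoted in both configs (assuming special tokens are counted within the 10,000 science-specific slots):

```python
# Science vocabulary sits on top of the base BPE merges.
BASE_BPE_TOKENS = 40_000
SCIENCE_TOKENS = 10_000
VOCAB_SIZE = BASE_BPE_TOKENS + SCIENCE_TOKENS
```
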
## 🧪 Evaluation
Science benchmarks across all 7 domains will be added. Planned benchmarks:
- Physics: Feynman Questions, Physics GRE
- Math: MATH dataset, GSM8K
- Chemistry: Chemistry problem-solving, molecular property prediction
- Biology: PubMed QA, bioinformatics tasks
- Earth Science: Climate modeling questions
- Space Science: Astronomy problem sets
- Zoology: Species classification, ecological reasoning
## 📄 License
This is a school science project. Code is provided for educational purposes.
## 🙏 Acknowledgments
- Mamba (Gu et al.) for SSM architecture inspiration
- Flash Attention (Dao et al.) for efficient attention
- HuggingFace for transformers library
- All open scientific data sources: arXiv, PubMed, S2ORC, etc.
## 📧 Contact
For questions or issues, please open an issue on GitHub.
Built with ❤️ for scientific AI research