AstroLLaVA Stage-2 (connector + LoRA instruction tuning)
A LLaVA-style vision–language model that lets Qwen2.5-1.5B-Instruct answer questions about
astronomy images encoded by CLIP ViT-L/14. This is the Stage-2 model: it warm-starts the
Stage-1 connector and continues training it
jointly with LoRA adapters on the Qwen LLM, on the caption + GPT-4 QA records of
UniverseTBD/AstroLLaVA_convos.
The CLIP vision tower stays frozen. A disjoint test split is held out of training so the model can be
evaluated on unseen images.
Stage 1 aligned the connector with the LLM frozen: it grounds coarse visual structure but hallucinates fine specifics. Stage 2 opens up the LLM (via LoRA) so the model learns to use the visual evidence when committing to answers. This is the instruction-tuning step of the LLaVA recipe.
⚠️ This bundle ships the connector + LoRA adapter only (not full LLM weights). It is not a standalone
transformers model: it needs the custom VLM code from the astronomy-vlm repo, the two base models (auto-downloaded from the Hub), and peft to run.
Download
A single bundle holds the final checkpoint and everything needed to run / reproduce it:
| Bundle | Contents |
|---|---|
| astrollava-stage2.zip | checkpoint-2526/ (connector.safetensors + lora/), predictions_test_stage2.jsonl, finetune_astrollava_stage2.yaml, test.json, REPRODUCE.md |
checkpoint-2526/ contains the continued-trained connector (connector.safetensors), the trained
LoRA adapter (lora/adapter_model.safetensors + adapter_config.json), optimizer/scheduler state
(training_state.pt), and meta.json (step + final loss). Both the connector and the LoRA are
required at inference.
Architecture
image ─► CLIP ViT-L/14 (FROZEN) ─► MLP connector (TRAINED, init from Stage-1) ─► Qwen2.5-1.5B + LoRA (base FROZEN, LoRA TRAINED) ─► text
1024 → 1536 → 1536
- Vision: openai/clip-vit-large-patch14, penultimate-layer patch features (frozen)
- Connector: 2-layer MLP with GELU, 1024→1536→1536; warm-started from Stage-1 checkpoint-3789 and kept trainable
- LLM: Qwen/Qwen2.5-1.5B-Instruct, base frozen + LoRA adapters (r=16, α=32, dropout 0.05) on q/k/v/o/gate/up/down_proj across all 28 layers
- Trainable / total: 22,400,000 / 1,868,879,360 (1.20%): connector 3,935,232 + LoRA 18,464,768
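The trainable-parameter figures above can be re-derived from layer shapes alone. The following is a back-of-the-envelope sketch, not the repo's own accounting. It assumes the publicly documented Qwen2.5-1.5B dimensions (hidden size 1536, MLP size 8960, 12 query heads, 2 key/value heads), biases on both connector layers, and the standard LoRA layout of two rank-r matrices per adapted projection. The total of 1,868,879,360 is taken from the card as given.

```python
# Assumed Qwen2.5-1.5B dimensions (from its public config, not stated in this card).
HIDDEN = 1536
INTERMEDIATE = 8960
NUM_HEADS = 12
NUM_KV_HEADS = 2                      # grouped-query attention
HEAD_DIM = HIDDEN // NUM_HEADS        # 128
NUM_LAYERS = 28                       # stated in the card

CLIP_DIM = 1024                       # connector input width, stated in the card
LORA_RANK = 16                        # stated in the card
TOTAL_PARAMS = 1_868_879_360          # stated in the card, not recomputed here


def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def lora_params(n_in, n_out, rank):
    """LoRA adds A (rank x n_in) and B (n_out x rank) next to a frozen weight."""
    return rank * (n_in + n_out)


def connector_params():
    # 2-layer MLP 1024 -> 1536 -> 1536 (GELU has no parameters); biases assumed.
    return linear_params(CLIP_DIM, HIDDEN) + linear_params(HIDDEN, HIDDEN)


def lora_total():
    kv_dim = NUM_KV_HEADS * HEAD_DIM  # 256
    shapes = {
        "q_proj": (HIDDEN, HIDDEN),
        "k_proj": (HIDDEN, kv_dim),
        "v_proj": (HIDDEN, kv_dim),
        "o_proj": (HIDDEN, HIDDEN),
        "gate_proj": (HIDDEN, INTERMEDIATE),
        "up_proj": (HIDDEN, INTERMEDIATE),
        "down_proj": (INTERMEDIATE, HIDDEN),
    }
    per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in shapes.values())
    return per_layer * NUM_LAYERS


trainable = connector_params() + lora_total()
print(f"connector: {connector_params():,}")
print(f"LoRA:      {lora_total():,}")
print(f"trainable: {trainable:,} ({100 * trainable / TOTAL_PARAMS:.2f}% of total)")
```

Under these assumptions the script reproduces the connector, LoRA, and combined counts listed in the bullet above.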
Training
| Setting | Value |
|---|---|
| Data | UniverseTBD/AstroLLaVA_convos, same per-image held-out split as Stage-1: train 161,653 recs / 29,151 imgs, test 591 imgs / 3,271 recs |
| Initialization | connector ← Stage-1 checkpoint-3789 (epoch 3); LoRA ← fresh (no-op init) |
| Objective | next-token cross-entropy on answer tokens only (connector + LoRA trainable) |
| Epochs / steps | 1 epoch, 2,526 update steps |
| Effective batch | 64 (per-device 4 × grad-accum 16) |
| LR / schedule | 2e-4, cosine with 3% warmup (75 steps) |
| Max length | 512 (+256 image tokens) |
| Precision | bf16 (autocast) + gradient checkpointing |
| Hardware | 1× RTX 6000 Ada (48 GB) |
| Loss | began ~1.47 (equal to the Stage-1 warm-start, since LoRA initializes as a no-op); final value in checkpoint-2526/meta.json |
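
The "answer tokens only" objective in the table is commonly implemented by masking prompt positions out of the labels. The sketch below shows that idea in plain Python. It is not the repo's training code: the helper names are hypothetical, and the ignore value of -100 is an assumption borrowed from the usual PyTorch / Hugging Face convention.

```python
import math

IGNORE_INDEX = -100  # ASSUMPTION: the common "skip this position" label value


def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer; only answer positions keep a real label."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


def masked_next_token_loss(logits, labels):
    """Mean cross-entropy over non-ignored targets.

    logits[t] is the score vector used to predict the token at position t + 1,
    so the target for row t is labels[t + 1].
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue  # prompt (or image) position: contributes nothing to the loss
        row = logits[t]
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / max(count, 1)


# Toy example: 3 prompt tokens, 2 answer tokens, vocabulary of 4, uniform logits.
input_ids, labels = build_labels([5, 6, 7], [1, 2])
uniform_logits = [[0.0] * 4 for _ in input_ids]
print(labels)
print(masked_next_token_loss(uniform_logits, labels))  # equals log(4) for uniform logits
```

Only the two answer tokens are scored here, so the prompt can be arbitrarily long without changing the loss.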
The full-LLM backward pass (absent in Stage-1) is the memory driver, hence per-device batch 4 + gradient checkpointing to fit ~48 GB. One epoch is the LLaVA instruction-tuning convention — the model only needs to learn to use the already-aligned visual features, not to align them from scratch.
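The schedule numbers in the training table follow from a few lines of arithmetic. The sketch below is a hedged cross-check, not the training script. The rounding choices (ceiling for the last partial batch, floor for warmup) are assumptions that happen to reproduce the table, and the linear-warmup, cosine-to-zero shape is one common form that the repo's scheduler may not match exactly.

```python
import math

TRAIN_RECORDS = 161_653      # from the Data row
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 16
PEAK_LR = 2e-4
WARMUP_RATIO = 0.03
CLIP_IMAGE_SIZE = 224        # CLIP ViT-L/14 input resolution
CLIP_PATCH_SIZE = 14

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM             # 4 x 16
update_steps = math.ceil(TRAIN_RECORDS / effective_batch)   # one epoch
warmup_steps = int(WARMUP_RATIO * update_steps)             # floored
image_tokens = (CLIP_IMAGE_SIZE // CLIP_PATCH_SIZE) ** 2    # patches per image


def lr_at(step):
    """ASSUMED schedule shape: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (update_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch:", effective_batch)
print("update steps:   ", update_steps)
print("warmup steps:   ", warmup_steps)
print("image tokens:   ", image_tokens)
print("lr at end of warmup:", lr_at(warmup_steps))
```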
Usage
# 1. get the code
git clone https://github.com/crimsonKn1ght/astronomy-vlm && cd astronomy-vlm
pip install -r requirements.txt # includes peft
# 2. download + unzip the bundle
hf download grKnight/astrollava-stage2 astrollava-stage2.zip --local-dir .
unzip astrollava-stage2.zip -d ckpt2
# 3. answer a question about an image (CLIP + Qwen auto-download; peft loads the LoRA)
python inference.py \
--config ckpt2/finetune_astrollava_stage2.yaml \
--checkpoint ckpt2/checkpoint-2526 \
--image your_astro_image.jpg \
--prompt "What type of object is this and what is notable about it?" \
--temperature 0
Pass the Stage-2 config so the LoRA modules are built before the adapter weights load; the loader
then restores both the connector and the LoRA automatically. The bundled
predictions_test_stage2.jsonl holds the held-out outputs with their reference captions.
Capabilities & limitations
Stage 2 fine-tunes the LLM (LoRA) jointly with the connector, so — unlike Stage-1 — the language
model itself learns from the QA pairs rather than improvising specifics from its frozen prior. The
intended effect is fewer hallucinated fine details (catalog numbers, instruments, dates) on
question-answering prompts, on top of Stage-1's coarse visual grounding. Compare the bundled
predictions_test_stage2.jsonl with Stage-1's predictions_test_ep3.jsonl (held out, same images)
to see the difference.
Limitations carried over from the design: CLIP's 224×224 input discards fine astronomical detail; the base LLM is small (1.5B); and LoRA is a low-rank adaptation, not a full fine-tune. Evaluation is a held-out generation set, not a full quantitative benchmark — read results qualitatively.
Reproduction
The bundle's REPRODUCE.md pins the exact code commit, base models, the seeded dataset-build
command, the training command, and package versions (torch, transformers, peft). The split is
seeded, so the build reproduces the exact train/test partition.
prereq: Stage-1 connector checkpoint-3789 (grKnight/astrollava-stage1 ep3 bundle)
build: python scripts/build_astrollava_trainset.py --include-qa --max-image-size 384 --test-fraction 0.02 --seed 42
train: python train.py --config configs/finetune_astrollava_stage2.yaml
eval: python scripts/batch_inference.py --config configs/finetune_astrollava_stage2.yaml --records-json datasets/astrollava_llava/test.json --num-samples 0 ...
License & attribution
- Weights: cc-by-sa-4.0, inherited from the training data.
- Training data: UniverseTBD/AstroLLaVA_convos (CC-BY-SA-4.0); imagery from NASA APOD, ESO, and NASA/ESA Hubble.
- Base models: Qwen2.5-1.5B-Instruct (Apache-2.0), CLIP ViT-L/14 (OpenAI, MIT).
- Builds on: AstroLLaVA Stage-1 and the AstroLLaVA work (arXiv:2504.08583).