Upload folder using huggingface_hub

Files changed (11)

Dockerfile +45 -0
GPU_AND_SLURM_CONFIG.md +70 -0
MANIFEST.txt +21 -0
README.md +63 -0
SHA256SUMS +12 -0
skillzero_best_checkpoints_20260506.tar.part-aa +3 -0
skillzero_best_checkpoints_20260506.tar.part-ab +3 -0
skillzero_best_checkpoints_20260506.tar.part-ac +3 -0
skillzero_best_checkpoints_20260506.tar.part-ad +3 -0
skillzero_best_checkpoints_20260506.tar.part-ae +3 -0
upload_with_hf.sh +22 -0

Dockerfile ADDED

	@@ -0,0 +1,45 @@

+FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
+ENV DEBIAN_FRONTEND=noninteractive \
+    PYTHONUNBUFFERED=1 \
+    PIP_NO_CACHE_DIR=1
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    ca-certificates \
+    curl \
+    git \
+    wget \
+    libgl1 \
+    libglib2.0-0 \
+    libsm6 \
+    libxext6 \
+    libxrender1 \
+    && rm -rf /var/lib/apt/lists/*
+RUN wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh \
+    && bash /tmp/miniconda.sh -b -p /opt/conda \
+    && rm /tmp/miniconda.sh
+ENV PATH=/opt/conda/bin:$PATH
+RUN conda create -y -n skillzero python=3.12 && conda clean -afy
+SHELL ["conda", "run", "-n", "skillzero", "/bin/bash", "-c"]
+WORKDIR /workspace/SkillZero
+# Copy the repository into the image, or bind-mount it at runtime.
+# Example build context should be the SkillZero repo root:
+#   docker build -f export_packages/skillzero_best_checkpoints_20260506/Dockerfile -t skillzero:export .
+COPY . /workspace/SkillZero
+RUN pip install -U pip setuptools wheel \
+    && pip install -r requirements.txt \
+    && pip install -e . \
+    && pip install flash-attn==2.7.4.post1 --no-build-isolation
+ENV HF_HOME=/workspace/hf \
+    HUGGINGFACE_HUB_CACHE=/workspace/hf/hub \
+    HYDRA_FULL_ERROR=1
+CMD ["/bin/bash"]

GPU_AND_SLURM_CONFIG.md ADDED

	@@ -0,0 +1,70 @@

+# GPU and Slurm Configuration
+## ALFWorld Best Checkpoints
+The best ALFWorld checkpoints were trained with:
+```bash
+#SBATCH -p a100
+#SBATCH --gres=gpu:4
+#SBATCH --cpus-per-task=32
+#SBATCH --mem=200G
+#SBATCH -t 2-00:00:00
+```
+Important training overrides:
+```bash
+trainer.n_gpus_per_node=4
+trainer.nnodes=1
+trainer.total_training_steps=180
+trainer.save_freq=10
+trainer.test_freq=10
+env.env_name=alfworld/AlfredTWEnv
+env.rollout.n=4
+data.train_batch_size=8
+data.val_batch_size=16
+actor_rollout_ref.rollout.gpu_memory_utilization=0.4
+actor_rollout_ref.rollout.max_model_len=3072
+```
+## Search Run
+The Search run used one node with 4 A100 GPUs allocated:
+```bash
+#SBATCH -p a100
+#SBATCH --gres=gpu:4
+#SBATCH --cpus-per-task=32
+#SBATCH --mem=220G
+#SBATCH -t 2-00:00:00
+```
+GPU assignment:
+```bash
+CUDA_VISIBLE_DEVICES=3  # local retriever service
+CUDA_VISIBLE_DEVICES=0,1,2  # training
+```
+Important Search fix:
+```bash
+data.max_prompt_length=6144
+actor_rollout_ref.rollout.max_model_len=6144
+```
+This avoids the Qwen2-VL RoPE shape mismatch that was observed when the generated prompt state exceeded 4096 tokens.
+## Docker Runtime
+Suggested runtime command:
+```bash
+docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
+  -v /path/to/SkillZero:/workspace/SkillZero \
+  -v /path/to/checkpoints:/workspace/SkillZero/checkpoints \
+  -it skillzero:export
+```
+For Slurm clusters, prefer running through the provided Slurm scripts rather than plain Docker unless the cluster explicitly supports Docker or Enroot/Singularity.
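
Note on GPU_AND_SLURM_CONFIG.md: the ALFWorld overrides listed in that file can be collected into a single launch command for the entry point named in README.md (`python3 -m verl.trainer.main_ppo`). The sketch below only prints the command so it can be reviewed first. The function name `build_alfworld_cmd` is hypothetical, and the actual job script probably passes further arguments that the file does not list.

```bash
#!/usr/bin/env bash
# Sketch: assemble the ALFWorld training command from the documented overrides.
# build_alfworld_cmd is a hypothetical helper, not part of the SkillZero repo.
set -euo pipefail

build_alfworld_cmd() {
  # Overrides copied from the "Important training overrides" list.
  local overrides=(
    trainer.n_gpus_per_node=4
    trainer.nnodes=1
    trainer.total_training_steps=180
    trainer.save_freq=10
    trainer.test_freq=10
    env.env_name=alfworld/AlfredTWEnv
    env.rollout.n=4
    data.train_batch_size=8
    data.val_batch_size=16
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4
    actor_rollout_ref.rollout.max_model_len=3072
  )
  local cmd=(python3 -m verl.trainer.main_ppo "${overrides[@]}")
  # Print instead of executing: this is a dry run.
  echo "${cmd[*]}"
}
```

Calling `build_alfworld_cmd` prints one line that could be placed below the `#SBATCH` header of a job script, assuming the `skillzero` environment is active on the compute node.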

MANIFEST.txt ADDED

	@@ -0,0 +1,21 @@

+SkillZero best checkpoint export manifest
+Selected by validation success rate:
+1. checkpoints/SkillZero_alfworld/skillzero_alfworld_vl_3b_safe/global_step_160
+   val/success_rate: 0.594
+   included archive path: checkpoints/SkillZero_alfworld/skillzero_alfworld_vl_3b_safe/global_step_160
+2. checkpoints/SkillZero_alfworld/skillzero_alfworld_vl_3b_safe/global_step_150
+   val/success_rate: 0.477
+   included archive path: checkpoints/SkillZero_alfworld/skillzero_alfworld_vl_3b_safe/global_step_150
+Metadata included:
+- README.md
+- Dockerfile
+- GPU_AND_SLURM_CONFIG.md
+- MANIFEST.txt
+Not included by default:
+- Search best checkpoint global_step_180, because it ranks third by overall validation success rate and is about 44GB.
+  Include it separately if you want per-task best checkpoints instead of top-two-overall checkpoints.

README.md ADDED

	@@ -0,0 +1,63 @@

+# SkillZero Best Checkpoints Export
+This package contains the two best checkpoints selected by validation success rate from the completed SkillZero runs.
+## Included Checkpoints
+1. ALFWorld `global_step_160`
+   - Validation metric: `val/success_rate = 0.594`
+   - Archive path:
+     `checkpoints/SkillZero_alfworld/skillzero_alfworld_vl_3b_safe/global_step_160`
+2. ALFWorld `global_step_150`
+   - Validation metric: `val/success_rate = 0.477`
+   - Archive path:
+     `checkpoints/SkillZero_alfworld/skillzero_alfworld_vl_3b_safe/global_step_150`
+## Related Search Checkpoint
+The best Search checkpoint by validation success rate is not included in the "top two overall" package, but is useful for reproducing the Search run:
+- Search `global_step_180`
+- Validation metric: `val/success_rate = 0.356`
+- Test metrics:
+  - `test/full_skill/success_rate = 0.282`
+  - `test/no_skill/success_rate = 0.310`
+- Checkpoint path, if packaged separately:
+  `checkpoints/SkillZero_search/skillzero_search_vl_3b_local_retriever/global_step_180`
+## Hardware Used
+Training was submitted through Slurm on the `a100` partition.
+- ALFWorld:
+  - GPUs: 4 x A100
+  - CPUs per task: 32
+  - Memory: 200GB
+  - Time limit: 2 days
+- Search local retriever:
+  - GPUs: 4 x A100 allocated
+  - Training used GPUs 0,1,2
+  - Local retriever used GPU 3
+  - CPUs per task: 32
+  - Memory: 220GB
+  - Time limit: 2 days
+## Runtime Notes
+- Python environment name used on the cluster: `skillzero`
+- Retriever environment name: `retriever`
+- Main model: `Qwen/Qwen2.5-VL-3B-Instruct`
+- Training entry point: `python3 -m verl.trainer.main_ppo`
+- Original training logs are not required to use the checkpoints.
+## Restore
+After joining the `skillzero_best_checkpoints_20260506.tar.part-*` files in order and extracting the resulting archive, place checkpoint directories under:
+```bash
+checkpoints/SkillZero_alfworld/skillzero_alfworld_vl_3b_safe/
+```
+Then use `trainer.resume_mode=resume_path` and set `trainer.resume_from_path` to the target `global_step_*` directory.
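
Note on the Restore step: the archive ships as five `.tar.part-*` files, so it has to be verified and reassembled before it can be extracted. The sketch below assumes the parts were produced by `split`, so that concatenating them in lexical order reproduces the original tar. The function name `restore_split_archive` is hypothetical. `--ignore-missing` is used because SHA256SUMS lists two files (`GOOGLE_DRIVE_UPLOAD.md`, `upload_with_rclone.sh`) that are not part of this upload.

```bash
#!/usr/bin/env bash
# Sketch: verify, reassemble, and extract a split tar archive.
# Assumes the parts come from `split`, so lexical order is the original order.
set -euo pipefail

restore_split_archive() {
  local src_dir="$1"   # directory holding the .part-* files and SHA256SUMS
  local archive="$2"   # archive name without the .part-* suffix
  local dest_dir="$3"  # directory to extract into

  (
    cd "$src_dir"
    # Check every listed file that is present; skip entries that are absent.
    sha256sum --ignore-missing -c SHA256SUMS
  )
  mkdir -p "$dest_dir"
  # Stream the parts into tar so no full-size joined copy is written to disk.
  cat "$src_dir/$archive".part-* | tar -xf - -C "$dest_dir"
}
```

A possible invocation from the download directory is `restore_split_archive . skillzero_best_checkpoints_20260506.tar /path/to/SkillZero`. MANIFEST.txt gives archive paths that begin with `checkpoints/`, so extracting into the repository root should put the directories where the Restore section expects them. If the checksum step fails on every part, the files may still be Git LFS pointers rather than the downloaded data.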

SHA256SUMS ADDED

	@@ -0,0 +1,12 @@

+f7107194b12d4a0910d2ea372126316220b80b232bd65746748369c0a594f3b8  skillzero_best_checkpoints_20260506.tar.part-aa
+d88b0449da962144344b3ef0a274baf8b25d0494791b37e0ee37fd7fa68662b7  skillzero_best_checkpoints_20260506.tar.part-ab
+d5936d47310a3b83f05535bb864ab14c38c4f852512efa33e4be09f6722c101f  skillzero_best_checkpoints_20260506.tar.part-ac
+1bfb853234689307c2557fb35610be408795f351a85e57dc138bdd73f57e001b  skillzero_best_checkpoints_20260506.tar.part-ad
+5c4b2dce0b6cd71babc55b0ac73982924e737fd148b19f8115d44a361ddcc68c  skillzero_best_checkpoints_20260506.tar.part-ae
+befb32fc0d7eb24186e4accb824a74859e6662082021c8f95cebc3a8bb81b1eb  README.md
+863167a1dd18ac713239896c0d97176a800e161b76bc722e6a01ab7c44f54e8a  Dockerfile
+24e977d62c458f5dfc350b3df115fd0d6092229a15a096f9444bd13a22b802e7  GPU_AND_SLURM_CONFIG.md
+cbc3d95d90685b3f546f67d6aad220ccb49463d3f12d3cbeb0d439574fa03204  MANIFEST.txt
+409d2bd016bb8dd376a5b04b1719d1736b87882e36563c7f6da59e43bfc03ab0  GOOGLE_DRIVE_UPLOAD.md
+1f4b88433555db3c299077b01946c04803c3efca87f915ca04ec253e635d5971  upload_with_rclone.sh
+d3ed06023bd0c2b47612568029e720c8830c0e8c6667f69a145277e5c2d0388b  upload_with_hf.sh

skillzero_best_checkpoints_20260506.tar.part-aa ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f7107194b12d4a0910d2ea372126316220b80b232bd65746748369c0a594f3b8
+size 21474836480

skillzero_best_checkpoints_20260506.tar.part-ab ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d88b0449da962144344b3ef0a274baf8b25d0494791b37e0ee37fd7fa68662b7
+size 21474836480

skillzero_best_checkpoints_20260506.tar.part-ac ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d5936d47310a3b83f05535bb864ab14c38c4f852512efa33e4be09f6722c101f
+size 21474836480

skillzero_best_checkpoints_20260506.tar.part-ad ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1bfb853234689307c2557fb35610be408795f351a85e57dc138bdd73f57e001b
+size 21474836480

skillzero_best_checkpoints_20260506.tar.part-ae ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5c4b2dce0b6cd71babc55b0ac73982924e737fd148b19f8115d44a361ddcc68c
+size 6737141760
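
Note on the LFS pointers: each pointer records the byte size of the real file, so the total download can be estimated before fetching anything. The sketch below sums the `size` fields of the pointer files passed as arguments. The function name `lfs_pointer_total_bytes` is hypothetical, and it assumes the files on disk are still pointers, not the downloaded data.

```bash
#!/usr/bin/env bash
# Sketch: add up the "size" fields of Git LFS pointer files.
set -euo pipefail

lfs_pointer_total_bytes() {
  # %.0f keeps large totals in plain digits on every awk implementation.
  awk '$1 == "size" { total += $2 } END { printf "%.0f\n", total }' "$@"
}
```

For the five pointers in this commit the sum is 4 × 21474836480 + 6737141760 = 92636487680 bytes, or about 86.3 GiB. Extraction needs at least that much additional free space for the checkpoint directories.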

upload_with_hf.sh ADDED

	@@ -0,0 +1,22 @@

+#!/usr/bin/env bash
+set -euo pipefail
+REPO_ID="${1:?Usage: $0 <namespace/repo-name> [repo-type]}"
+REPO_TYPE="${2:-model}"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+if ! command -v hf >/dev/null 2>&1; then
+  echo "hf command is not available. Install with: pip install -U huggingface_hub" >&2
+  exit 1
+fi
+cd "$SCRIPT_DIR"
+hf upload "$REPO_ID" . \
+  --repo-type "$REPO_TYPE" \
+  --include "skillzero_best_checkpoints_20260506.tar.part-*" \
+  --include "SHA256SUMS" \
+  --include "*.md" \
+  --include "Dockerfile" \
+  --include "MANIFEST.txt" \
+  --include "upload_with_hf.sh"