SimToken Setup, Data, Upload, and Download Guide

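This guide covers moving the SimToken workspace between rented servers: environment setup, model and dataset preparation, uploading to the Hugging Face Hub, and restoring on a new machine.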

Assumed paths:

PROJECT_ROOT=/workspace/SimToken
SAM2_ROOT=/workspace/sam2
HF_REPO=yfan07/SimToken
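
The commands in this guide spell these paths out literally. A minimal sketch of exporting them once, so a server with a different layout can override them before running anything; the defaults simply mirror the values above.

```shell
# Keep any value already set in the environment, otherwise fall back to the guide's defaults.
: "${PROJECT_ROOT:=/workspace/SimToken}"
: "${SAM2_ROOT:=/workspace/sam2}"
: "${HF_REPO:=yfan07/SimToken}"
export PROJECT_ROOT SAM2_ROOT HF_REPO

echo "project: $PROJECT_ROOT  sam2: $SAM2_ROOT  repo: $HF_REPO"
```

With these exported, a literal path such as /workspace/SimToken/data can be written as "$PROJECT_ROOT/data" in your own scripts.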

1. Environment Setup

conda create -n simtoken python=3.10 -y
conda activate simtoken

conda install -c conda-forge ffmpeg libsndfile git git-lfs wget -y
git lfs install

pip install --upgrade pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

If CUDA 12.6 wheels are unavailable, use CUDA 12.1 wheels:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
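
The fallback above can be scripted. This is a minimal sketch, assuming that a failed pip install is the signal that the cu126 wheels are unavailable; install_torch and the PIP override are hypothetical names introduced here, not part of SimToken or pip.

```shell
# Hypothetical helper: try each wheel index in order and stop at the first one that installs.
install_torch() {
  local installer="${PIP:-pip}"   # PIP is an illustrative override, e.g. for a dry run
  local tag
  for tag in cu126 cu121; do
    if $installer install torch torchvision torchaudio \
         --index-url "https://download.pytorch.org/whl/$tag"; then
      echo "installed from $tag"
      return 0
    fi
  done
  echo "no wheel index worked" >&2
  return 1
}
```

Run install_torch inside the activated simtoken environment. Note that any pip failure (a network error, for example) also triggers the fallback, so check the output before relying on the result.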

Install SimToken dependencies:

pip install \
  numpy pandas matplotlib opencv-python pillow tqdm einops timm sentencepiece \
  transformers==4.30.2 peft==0.2.0 accelerate safetensors huggingface-hub \
  packaging regex requests psutil gdown

Optional, only needed if regenerating audio features:

pip install towhee towhee.models

2. Repository Download

cd /workspace
huggingface-cli login

huggingface-cli download yfan07/SimToken \
  --repo-type model \
  --local-dir /workspace/SimToken \
  --local-dir-use-symlinks False

3. Model Preparation

Hugging Face Models

mkdir -p /workspace/hf_models

huggingface-cli download openai/clip-vit-large-patch14 \
  --local-dir /workspace/hf_models/clip-vit-large-patch14 \
  --local-dir-use-symlinks False

huggingface-cli download Chat-UniVi/Chat-UniVi-7B-v1.5 \
  --local-dir /workspace/hf_models/Chat-UniVi-7B-v1.5 \
  --local-dir-use-symlinks False

SAM2 for TubeToken Proposals

Put SAM2 under /workspace/sam2:

cd /workspace
git clone https://github.com/facebookresearch/sam2.git
cd /workspace/sam2

pip install -e .

Download SAM2.1 checkpoints:

cd /workspace/sam2/checkpoints
bash download_ckpts.sh

The TubeToken Phase 0 commands use:

/workspace/sam2/checkpoints/sam2.1_hiera_large.pt
/workspace/sam2/sam2/configs/sam2.1/sam2.1_hiera_l.yaml
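
Before launching Phase 0, it may help to confirm that both files are in place. A small sketch; check_sam2 is a hypothetical helper name, and the two relative paths are taken from the lines above.

```shell
# Hypothetical helper: confirm the checkpoint and config named above exist under a SAM2 root.
check_sam2() {
  local root="${1:-/workspace/sam2}"
  local missing=0 f
  for f in \
    "$root/checkpoints/sam2.1_hiera_large.pt" \
    "$root/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

check_sam2 returns a non-zero status and prints the missing paths to stderr if either file is absent, so it can gate a longer script: check_sam2 /workspace/sam2 || exit 1.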

4. Dataset Preparation

Runtime layout:

/workspace/SimToken/data
  metadata.csv
  media/
  gt_mask/
  audio_embed/
  image_embed/

Package the four data directories:

cd /workspace/SimToken/data

tar -cf media.tar media
tar -czf gt_mask.tar.gz gt_mask
tar -czf audio_embed.tar.gz audio_embed
tar -cf image_embed.tar image_embed
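
Large archives can be truncated in transit, so recording checksums right after packaging gives the next server something to verify against. A sketch under stated assumptions: the SHA256SUMS file name and both helper names are hypothetical, and sha256sum from GNU coreutils is assumed to be installed.

```shell
# Hypothetical helpers: record and later verify SHA-256 checksums for the archives in a data directory.
write_checksums() {
  (
    cd "$1" || exit 1
    : > SHA256SUMS
    for f in *.tar *.tar.gz; do
      # Unmatched globs stay literal, so only hash names that are real files.
      if [ -f "$f" ]; then
        sha256sum "$f" >> SHA256SUMS
      fi
    done
    [ -s SHA256SUMS ]   # fail if no archive was found
  )
}

verify_checksums() {
  ( cd "$1" && sha256sum --check --quiet SHA256SUMS )
}
```

Run write_checksums /workspace/SimToken/data after packaging and verify_checksums /workspace/SimToken/data after downloading, before extracting anything.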

Restore the four data directories:

cd /workspace/SimToken/data

tar -xf media.tar
tar -xzf gt_mask.tar.gz
tar -xzf audio_embed.tar.gz
tar -xf image_embed.tar

5. Upload Repository

The remote repo stores the large data directories as tar archives (media.tar, image_embed.tar, etc.), while the local workspace has them extracted as plain directories. Do not upload the extracted directories: skip them with --exclude (Section 5c), otherwise every extracted file would be uploaded individually.

5a. Pack any new data directories before uploading

If data/text_embed/ is new (that is, this is the first upload after running precompute_text_feats.py), pack it as well:

cd /workspace/SimToken/data
tar -cf text_embed.tar text_embed

5b. Login

cd /workspace/SimToken
huggingface-cli login

5c. Upload (excluding extracted data directories)

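Use the hf upload command rather than the deprecated huggingface-cli upload. The deprecated command hashes all files before applying any filter, which is extremely slow with large data directories, whereas hf upload with --exclude drops the matching files before hashing. The hf command ships with recent huggingface_hub releases; if it is missing, upgrade with pip install --upgrade huggingface-hub.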

# Write the log outside the repo so the log file itself is not picked up by the upload.
hf upload yfan07/SimToken . . \
  --repo-type model \
  --exclude "data/media/**" "data/gt_mask/**" "data/audio_embed/**" "data/image_embed/**" "data/text_embed/**" \
  2>&1 | tee /workspace/upload.log

This uploads everything except the four extracted dataset directories and the raw text_embed/ folder. The data/text_embed.tar file (sitting directly under data/) is not matched by data/text_embed/** and will be uploaded normally.
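
The distinction between data/text_embed/ and data/text_embed.tar can be checked locally before uploading. The sketch below approximates the exclude check with bash case patterns, in which * also matches "/"; it is a stand-in for illustration, not the matcher the Hub client actually runs, and is_excluded and the sample file names are hypothetical.

```shell
# Hypothetical helper: succeeds when a repo-relative path falls under an excluded directory.
is_excluded() {
  case "$1" in
    data/media/*|data/gt_mask/*|data/audio_embed/*|data/image_embed/*|data/text_embed/*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Sample paths for illustration only.
for p in data/text_embed/sample.npy data/text_embed.tar data/media.tar data/metadata.csv; do
  if is_excluded "$p"; then
    echo "skip    $p"
  else
    echo "upload  $p"
  fi
done
```

Only the first sample path is skipped. The archives and metadata.csv lack the trailing "/" after the directory name, so no pattern matches them.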

5d. Restore on a new server

After downloading the repo (Section 2), extract all packed data:

cd /workspace/SimToken/data
tar -xf media.tar
tar -xzf gt_mask.tar.gz
tar -xzf audio_embed.tar.gz
tar -xf image_embed.tar
tar -xf text_embed.tar      # if present
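
The same restore can be written as a loop that tolerates absent archives, which covers the "if present" case for text_embed.tar. A minimal sketch; restore_data is a hypothetical helper name, and the archive list is copied from the commands above.

```shell
# Hypothetical helper: extract every known archive found in a data directory, skipping absent ones.
restore_data() {
  local data_dir="${1:-/workspace/SimToken/data}"
  local status=0 a
  for a in media.tar gt_mask.tar.gz audio_embed.tar.gz image_embed.tar text_embed.tar; do
    if [ -f "$data_dir/$a" ]; then
      # tar detects gzip compression when reading, so -xf handles both .tar and .tar.gz.
      if tar -xf "$data_dir/$a" -C "$data_dir"; then
        echo "restored $a"
      else
        echo "FAILED   $a" >&2
        status=1
      fi
    else
      echo "skipped  $a (not found)"
    fi
  done
  return "$status"
}
```

Calling restore_data with no argument targets /workspace/SimToken/data. A non-zero return status means at least one archive failed to extract.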

6. Current Experiment Files to Preserve

Keep these files and directories for continuing TubeToken experiments:

runs/tubetoken_phase_minus1/audit_full
runs/tubetoken_phase_minus1/simtoken_eval
runs/tubetoken_phase0/proposals_stride8_n64_bidir
runs/tubetoken_phase0/eval_stride8_n64_bidir
runs/tubetoken_phase0/miss_videos_r64.txt
TubeToken_Phase0_Experiment_Log.md
TubeToken_Experiment_Plan_v4_Final.md
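
Before releasing a server, a quick check that nothing on this list is missing can save a rerun. The sketch assumes the paths above are relative to the project root, which the list does not state explicitly; missing_preserved is a hypothetical helper name.

```shell
# Hypothetical helper: print each listed path that does not exist under the project root.
missing_preserved() {
  local root="${1:-/workspace/SimToken}"
  local p
  for p in \
    runs/tubetoken_phase_minus1/audit_full \
    runs/tubetoken_phase_minus1/simtoken_eval \
    runs/tubetoken_phase0/proposals_stride8_n64_bidir \
    runs/tubetoken_phase0/eval_stride8_n64_bidir \
    runs/tubetoken_phase0/miss_videos_r64.txt \
    TubeToken_Phase0_Experiment_Log.md \
    TubeToken_Experiment_Plan_v4_Final.md; do
    [ -e "$root/$p" ] || echo "$p"
  done
}
```

Empty output from missing_preserved /workspace/SimToken means every listed file and directory is present.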