We provide the models used in the data curation pipeline of Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings, which were used to construct the Surg-3M dataset. For more details about the Surg-3M dataset and our SurgFM foundation model, please visit our GitHub repository.
If you use our dataset, model, or code in your research, please cite our paper:
@misc{che2025surg3mdatasetfoundationmodel,
title={Surg-3M: A Dataset and Foundation Model for Perception in Surgical Settings},
author={Chengan Che and Chao Wang and Tom Vercauteren and Sophia Tsoka and Luis C. Garcia-Peraza-Herrera},
year={2025},
eprint={2503.19740},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.19740},
}
This Hugging Face repository includes video storyboard classification models, frame classification models, and non-surgical object detection models. The model loader is provided in model_loader.py.
Model | Architecture | Download
---|---|---
Video storyboard classification models | ResNet-18 | Full ckpt
Frame classification models | ResNet-18 | Full ckpt
Non-surgical object detection models | YOLOv8-Nano | Full ckpt
The data curation pipeline leading to the clean videos in the Surg-3M dataset is as follows:

Usage
Video storyboard classification models are used in step 2 of the data curation pipeline to classify a video storyboard as either surgical or non-surgical. They can be used as follows:
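import numpy as np
import torch
from PIL import Image
from model_loader import build_model
# Load the model
net = build_model(mode='classify')
model_path = 'Video storyboard classification models'  # path to the downloaded checkpoint
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Enable multi-GPU support
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net.to(device)
net.eval()
# Load the video storyboard and convert it to a PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')
img = img.resize((224, 224))
# HWC uint8 -> 1xCxHxW float. Scaling to [0, 1] is an assumption; check model_loader.py for the expected normalization
img_tensor = torch.from_numpy(np.array(img)).permute(2, 0, 1).float().div(255).unsqueeze(0).to(device)
# Classify the storyboard as surgical or non-surgical
with torch.no_grad():
    outputs = net(img_tensor)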
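The `outputs` tensor holds raw classifier scores rather than a decision. The sketch below shows one way to turn such scores into a label and a probability. It assumes the classifier returns one logit per class and that index 1 means "surgical"; both the `LABELS` order and the `decide` helper are hypothetical and should be checked against the training code and model_loader.py.

```python
import math

# Hypothetical label order (an assumption, not taken from the repository).
LABELS = ("non-surgical", "surgical")


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, labels=LABELS):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

With a single-image batch, the scores can be passed in as `decide(outputs.squeeze(0).tolist())`.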
Frame classification models are used in step 3 of the data curation pipeline to classify a frame as either surgical or non-surgical. They can be used as follows:
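import numpy as np
import torch
from PIL import Image
from model_loader import build_model
# Load the model
net = build_model(mode='classify')
model_path = 'Frame classification models'  # path to the downloaded checkpoint
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Enable multi-GPU support
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net.to(device)
net.eval()
# Load the frame and convert it to a PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')
img = img.resize((224, 224))
# HWC uint8 -> 1xCxHxW float. Scaling to [0, 1] is an assumption; check model_loader.py for the expected normalization
img_tensor = torch.from_numpy(np.array(img)).permute(2, 0, 1).float().div(255).unsqueeze(0).to(device)
# Classify the frame as surgical or non-surgical
with torch.no_grad():
    outputs = net(img_tensor)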
Non-surgical object detection models are used to mask out non-surgical regions of surgical frames (e.g. user interface information). They can be used as follows:
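import numpy as np
import torch
from PIL import Image
from model_loader import build_model
# Load the model
net = build_model(mode='mask')
model_path = 'Non-surgical object detection models'  # path to the downloaded checkpoint
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Enable multi-GPU support
net = torch.nn.DataParallel(net)
torch.backends.cudnn.benchmark = True
state = torch.load(model_path, map_location=torch.device('cpu'))
net.load_state_dict(state['net'])
net.to(device)
net.eval()
# Load the frame and convert it to a PyTorch tensor
img_path = 'path/to/your/image.jpg'
img = Image.open(img_path).convert('RGB')
img = img.resize((224, 224))
# HWC uint8 -> 1xCxHxW float. Scaling to [0, 1] is an assumption; check model_loader.py for the expected normalization
img_tensor = torch.from_numpy(np.array(img)).permute(2, 0, 1).float().div(255).unsqueeze(0).to(device)
# Detect non-surgical regions in the frame
with torch.no_grad():
    outputs = net(img_tensor)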
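Once non-surgical regions have been detected, they still need to be blanked out in the frame. The following is a minimal sketch of that masking step, assuming the detections have already been converted to `(x1, y1, x2, y2)` pixel boxes in the frame's coordinate system. The `mask_boxes` helper and the box format are hypothetical; the actual output format of the detector is defined by model_loader.py.

```python
import numpy as np


def mask_boxes(frame, boxes, fill=0):
    """Return a copy of an HxWxC frame with each (x1, y1, x2, y2) box filled."""
    masked = frame.copy()
    height, width = masked.shape[:2]
    for x1, y1, x2, y2 in boxes:
        # Clamp to the frame so partially out-of-bounds boxes are still masked.
        left, right = max(0, int(x1)), min(width, int(x2))
        top, bottom = max(0, int(y1)), min(height, int(y2))
        if left < right and top < bottom:
            masked[top:bottom, left:right] = fill
    return masked
```

The input frame is left unchanged, so the original and masked versions can be compared side by side. If detection runs on a resized image, the boxes need to be rescaled to the original resolution before masking.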