TransReID (ViT-Base) — Fine-Tuned for Person Re-Identification

This is our TransReID model, fine-tuned for pedestrian re-identification on the Market-1501 dataset for 120 epochs. It can be used as the appearance encoder in our RetailHeat multi-object tracking pipeline.

Model Details

Property	Value
Architecture	TransReID (ViT-Base, patch16-224)
Backbone type	`vit_base_patch16_224_TransReID`
Stride size	`[12, 12]`
JPM	Enabled
SIE camera embedding	Enabled
Embedding dim	3840-D (L2-normalized)
Input size	256 × 128 (H × W)
Training dataset	Market-1501
Training setup	Fine-tuned for 120 epochs
Task	Person re-identification

Performance

Evaluation on Market-1501 from the final 120-epoch checkpoint:

Metric	Value
Rank-1 Accuracy	95.1%
Rank-5 Accuracy	98.2%
Rank-10 Accuracy	99.0%
mAP	88.6%
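
For readers unfamiliar with these metrics, the following is a minimal sketch of how Rank-k and average precision are typically computed for a single query; the reported figures are averages over all queries. The `ranked_matches` list is a made-up example, not data from this model, and the sketch omits the standard Market-1501 step of discarding gallery images that share both identity and camera with the query.

```python
def rank_k_hit(matches, k):
    """Return True if any of the top-k ranked gallery items shares the query identity."""
    return any(matches[:k])


def average_precision(matches):
    """Average precision for one query, given matches ordered by ascending distance."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at each correct retrieval
    return precision_sum / hits if hits else 0.0


# Hypothetical ranking for one query: True where the gallery item has the same identity.
ranked_matches = [False, True, True, False, True]

rank1 = rank_k_hit(ranked_matches, 1)
rank5 = rank_k_hit(ranked_matches, 5)
ap = average_precision(ranked_matches)
```

Rank-k accuracy is the fraction of queries for which `rank_k_hit` is true, and mAP is the mean of `average_precision` over all queries.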

Intermediate validation results during training:

Epoch	Rank-1	mAP
20	93.2%	84.5%
40	94.4%	86.6%
60	94.8%	87.6%
80	95.0%	88.0%
100	95.0%	88.5%
120	95.1%	88.6%

Architecture

TransReID is a transformer-based person re-identification model built on a ViT-Base backbone and adapted for re-ID with camera-aware side information embeddings (SIE) and the JPM module for stronger local-global feature learning. In this setup, the model uses a stride size of [12, 12], camera-aware training for 6 cameras, and produces a 3840-dimensional embedding that is L2-normalized at inference time.
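
The numbers above can be sanity-checked with a little arithmetic. The sketch below assumes the usual sliding-window patch count, `(size - patch) // stride + 1` per axis, and assumes the 3840-D embedding is one global feature plus four JPM local features of ViT-Base width (768 each); the branch count is an assumption inferred from 3840 / 768 = 5, not something stated in this card.

```python
def num_patches(image_hw, patch_size, stride_hw):
    """Count overlapping patches along height and width for a sliding-window patch embedding."""
    height, width = image_hw
    stride_h, stride_w = stride_hw
    rows = (height - patch_size) // stride_h + 1
    cols = (width - patch_size) // stride_w + 1
    return rows, cols


rows, cols = num_patches(image_hw=(256, 128), patch_size=16, stride_hw=(12, 12))
tokens = rows * cols  # patch tokens, excluding the class token

hidden_dim = 768        # ViT-Base token width
num_local_branches = 4  # assumption: JPM produces four local features
embedding_dim = hidden_dim * (1 + num_local_branches)
```

With a stride of 12 the 256 × 128 input yields a 21 × 10 grid of overlapping patches, compared with 16 × 8 for a non-overlapping stride of 16, which is part of why this configuration costs more compute.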

Compared with lightweight CNN-based encoders, TransReID provides stronger identity discrimination and can improve appearance matching quality in crowded scenes, at the cost of higher compute.

Training

We fine-tuned this model on the Market-1501 person re-identification benchmark.

Dataset: Market-1501
Training split: 12,936 images of 751 identities
Query/Gallery: 3,368 query images and 19,732 gallery images, 750 identities
Input resolution: 256 × 128
Optimizer: SGD
Learning rate: 0.004 in the 2-GPU training run used in the notebook
Batch size: 32 total for training, 128 for evaluation
Epochs: 120
Sampler: softmax_triplet
Loss setup: triplet metric loss
Augmentations: random horizontal flip, random erasing, padding
Pretraining: ImageNet ViT weights (jx_vit_base_p16_224-80ecf9dd.pth)
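
As a rough illustration, the settings listed above could be expressed as yacs-style overrides for the TransReID config. The key names below are assumptions based on the upstream TransReID repository and should be verified against `config/defaults.py` in your checkout; the values are taken from the list above.

```python
# Assumed yacs-style key names; verify them against config/defaults.py in the TransReID checkout.
training_overrides = {
    "INPUT.SIZE_TRAIN": [256, 128],
    "INPUT.SIZE_TEST": [256, 128],
    "MODEL.STRIDE_SIZE": [12, 12],
    "MODEL.JPM": True,
    "MODEL.SIE_CAMERA": True,
    "MODEL.METRIC_LOSS_TYPE": "triplet",
    "DATALOADER.SAMPLER": "softmax_triplet",
    "SOLVER.OPTIMIZER_NAME": "SGD",
    "SOLVER.BASE_LR": 0.004,
    "SOLVER.IMS_PER_BATCH": 32,
    "SOLVER.MAX_EPOCHS": 120,
    "TEST.IMS_PER_BATCH": 128,
}


def to_opts(overrides):
    """Flatten {key: value} into the [key, value, key, value, ...] list used by yacs merge_from_list."""
    opts = []
    for key, value in overrides.items():
        opts.extend([key, value])
    return opts


opts = to_opts(training_overrides)
# With the repo's cfg imported and defrosted, this could be applied via cfg.merge_from_list(opts).
```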

Usage

Download the weights

pip install huggingface_hub
huggingface-cli download MYerassyl/retail-heat-transreid TransReID.pth --local-dir weights/

Load in Python

import sys
from pathlib import Path
import cv2
import numpy as np
import torch

TRANSREID_REPO = Path("TransReID")
sys.path.insert(0, str(TRANSREID_REPO))

from config import cfg
from model import make_model

CONFIG_PATH = str(TRANSREID_REPO / "configs/Market/vit_transreid_stride.yml")  # config ships with the TransReID repo
WEIGHT_PATH = "weights/TransReID.pth"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

cfg = cfg.clone()
cfg.defrost()
cfg.merge_from_file(CONFIG_PATH)
cfg.MODEL.PRETRAIN_CHOICE = "self"
cfg.TEST.WEIGHT = WEIGHT_PATH
cfg.TEST.NECK_FEAT = "before"
cfg.TEST.FEAT_NORM = "yes"
cfg.freeze()

model = make_model(cfg, num_class=751, camera_num=6, view_num=1)
model.load_param(WEIGHT_PATH)
model.to(DEVICE).eval()

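# Example crop -> embedding
img = cv2.imread("person_crop.jpg")
if img is None:  # cv2.imread returns None instead of raising on an unreadable path
    raise FileNotFoundError("person_crop.jpg could not be read")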
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (128, 256))
x = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
x = (x - 0.5) / 0.5
x = x.unsqueeze(0).to(DEVICE)

cam = torch.zeros(1, dtype=torch.long, device=DEVICE)
view = torch.zeros(1, dtype=torch.long, device=DEVICE)

with torch.no_grad():
    feat = model(x, cam_label=cam, view_label=view)
    feat = torch.nn.functional.normalize(feat, dim=1)

embedding = feat[0].cpu().numpy()
print(embedding.shape)  # (3840,)
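
Once embeddings are extracted, crops are usually compared by cosine distance. The following is a minimal sketch of a track-to-detection appearance cost matrix using NumPy; the function name and the random stand-in vectors are hypothetical, and how the RetailHeat tracker actually fuses appearance with motion cues is not shown here.

```python
import numpy as np


def cosine_distance_matrix(track_feats, det_feats):
    """Pairwise cosine distance between two sets of embeddings (one embedding per row)."""
    a = np.asarray(track_feats, dtype=np.float32)
    b = np.asarray(det_feats, dtype=np.float32)
    # Re-normalize defensively so the dot product equals cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


# Random stand-ins for 3840-D embeddings; real inputs would come from the model.
rng = np.random.default_rng(0)
tracks = rng.normal(size=(2, 3840))
detections = np.vstack([
    tracks[1] + 0.05 * rng.normal(size=3840),  # near-copy of track 1
    rng.normal(size=3840),                     # unrelated crop
])

cost = cosine_distance_matrix(tracks, detections)
best_track_for_det0 = int(cost[:, 0].argmin())
```

Lower values in `cost` mean more similar appearance; a tracker would typically pass such a matrix, possibly gated by a distance threshold, to an assignment step.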

Use with the RetailHeat pipeline

This model can be used as the appearance encoder in the BoT-SORT / RetailHeat tracking pipeline.

git clone https://github.com/MYerassyl/retail-heat.git
cd retail-heat
mkdir -p weights
huggingface-cli download MYerassyl/retail-heat-transreid TransReID.pth --local-dir weights/

Then load TransReID.pth inside the TransReID-based encoder wrapper, following the same steps as in "Load in Python" above.

Notes

The exported checkpoint used in the notebook is TransReID.pth.
The evaluation notebook also used this model inside a BoT-SORT tracker with YOLO detections.
Since this is a transformer-based re-ID encoder, inference is heavier than OSNet but typically yields stronger appearance features.

Citation

If you use this model, please cite our RetailHeat project:

@software{retail_heat,
  author = {Yerassyl},
  title = {RetailHeat: Multi-Object Tracking and Heatmap Generation for Retail Analytics},
  url = {https://github.com/MYerassyl/retail-heat}
}

License

This model is released under the MIT License.
