Distilled Data-efficient Image Transformer for Face Mask Detection
This is a distilled Data-efficient Image Transformer (DeiT) model, pre-trained and fine-tuned on the self-curated custom Face-Mask18K dataset (18k images, 2 classes) at a resolution of 224x224. DeiT was first introduced in the paper Training data-efficient image transformers & distillation through attention by Touvron et al.
Model description
This model is a distilled Vision Transformer (ViT). It uses a distillation token, besides the class token, to effectively learn from a teacher (CNN) during both pre-training and fine-tuning. The distillation token is learned through backpropagation, by interacting with the class ([CLS]) and patch tokens through the self-attention layers.
Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded.
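As a rough illustration of the patching described above, the sketch below works out the sequence length for a 224x224 input split into 16x16 patches. The function name `token_count` and its parameters are hypothetical, not part of any library; the two extra tokens are the class and distillation tokens mentioned in the model description.

```python
def token_count(image_size=224, patch_size=16, extra_tokens=2):
    """Return (patches per side, number of patches, total sequence length).

    extra_tokens=2 assumes one class token plus one distillation token,
    as described for the distilled model above.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return patches_per_side, num_patches, num_patches + extra_tokens


per_side, patches, seq_len = token_count()
print(per_side, patches, seq_len)  # 14 196 198
```

Under these assumptions, each image becomes a 14x14 grid of 196 patch embeddings, and the transformer sees 198 tokens once the class and distillation tokens are added.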
Training Metrics
epoch = 2.0
total_flos = 2078245655GF
train_loss = 0.0438
train_runtime = 1:37:16.87
train_samples_per_second = 9.887
train_steps_per_second = 0.309
Evaluation Metrics
epoch = 2.0
eval_accuracy = 0.9922
eval_loss = 0.0271
eval_runtime = 0:03:17.36
eval_samples_per_second = 18.22
eval_steps_per_second = 2.28
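The evaluation figures above can be cross-checked with simple arithmetic. The sketch below assumes that the per-second rates are plain totals divided by the runtime, so multiplying back gives approximate sample and step counts; the helper `runtime_to_seconds` is hypothetical, and the derived numbers are estimates, not values reported by the author.

```python
def runtime_to_seconds(runtime):
    """Convert an 'H:MM:SS.ss' runtime string to seconds."""
    hours, minutes, seconds = runtime.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Values copied from the evaluation metrics above.
eval_seconds = runtime_to_seconds("0:03:17.36")
approx_samples = 18.22 * eval_seconds   # samples/s * s
approx_steps = 2.28 * eval_seconds      # steps/s * s

print(round(eval_seconds, 2))                   # 197.36
print(round(approx_samples))                    # 3596
print(round(approx_steps))                      # 450
print(round(approx_samples / approx_steps))     # 8
```

If the assumption holds, the evaluation set is on the order of 3,600 images processed in roughly 450 steps, which would correspond to about 8 images per step.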