Distilled Data-efficient Image Transformer for Face Mask Detection
A distilled Data-efficient Image Transformer (DeiT) model, pre-trained and fine-tuned on the self-curated custom Face-Mask18K dataset (18k images, 2 classes) at a resolution of 224x224. The DeiT architecture was first introduced in the paper Training data-efficient image transformers & distillation through attention by Touvron et al.
This model is a distilled Vision Transformer (ViT). In addition to the class token, it uses a distillation token to learn from a teacher model (a CNN) during both pre-training and fine-tuning. The distillation token is learned through backpropagation by interacting with the class ([CLS]) token and the patch tokens in the self-attention layers.
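Because the class token and the distillation token each feed their own classification head, a prediction has to combine two sets of logits. The sketch below shows one common way to do this, averaging the two heads, which is the convention used by the Hugging Face DeiT "with teacher" classification model. The function name `fuse_logits` and the logit values are hypothetical and chosen only for illustration; they are not outputs of this model, and the mapping from class index to label is not specified here.

```python
def fuse_logits(cls_logits, dist_logits):
    """Average the logits of the class-token head and the distillation-token head."""
    return [(c + d) / 2 for c, d in zip(cls_logits, dist_logits)]


# Hypothetical logits for the 2 classes (illustrative values, not real model output).
cls_logits = [2.0, -1.0]   # head on top of the [CLS] token
dist_logits = [1.0, 0.0]   # head on top of the distillation token

fused = fuse_logits(cls_logits, dist_logits)

# The predicted class is the index with the highest fused logit.
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

Averaging keeps both heads on the same scale and lets the distillation head, which was trained against the teacher, contribute at inference time.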
Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded.
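The patch layout follows directly from the numbers above. As a minimal sketch, the arithmetic below derives how many tokens the transformer sees for one 224x224 image split into 16x16 patches. The 3-channel (RGB) input is an assumption, and the variable names are illustrative only.

```python
IMAGE_SIZE = 224   # input resolution stated above
PATCH_SIZE = 16    # patch resolution stated above
CHANNELS = 3       # assumption: RGB input

patches_per_side = IMAGE_SIZE // PATCH_SIZE        # patches along one image edge
num_patches = patches_per_side ** 2                # patches per image
patch_dim = PATCH_SIZE * PATCH_SIZE * CHANNELS     # values in one flattened patch

# The sequence holds every patch token plus the class and distillation tokens.
seq_len = num_patches + 2

print(patches_per_side, num_patches, patch_dim, seq_len)
```

Each flattened patch is what the linear embedding layer projects into the model's hidden dimension before the class and distillation tokens are added to the sequence.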
epoch = 2.0
total_flos = 2078245655GF
train_loss = 0.0438
train_runtime = 1:37:16.87
train_samples_per_second = 9.887
train_steps_per_second = 0.309

epoch = 2.0
eval_accuracy = 0.9922
eval_loss = 0.0271
eval_runtime = 0:03:17.36
eval_samples_per_second = 18.22
eval_steps_per_second = 2.28