Vision Transformer (base-sized model)

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.

This repo contains a Core ML version of google/vit-base-patch16-224.

Usage instructions

Create a VNCoreMLRequest that loads the ViT model:

import CoreML
import Vision

lazy var classificationRequest: VNCoreMLRequest = {
  do {
    let config = MLModelConfiguration()
    config.computeUnits = .all
    let coreMLModel = try ViT(configuration: config)
    let visionModel = try VNCoreMLModel(for: coreMLModel.model)

    let request = VNCoreMLRequest(model: visionModel, completionHandler: { [weak self] request, error in
      if let results = request.results as? [VNClassificationObservation] {
        /* do something with the results */
      }
    })

    request.imageCropAndScaleOption = .centerCrop
    return request
  } catch {
    fatalError("Failed to create VNCoreMLModel: \(error)")
  }
}()

Perform the request:

func classify(image: UIImage) {
  guard let ciImage = CIImage(image: image) else {
    print("Unable to create CIImage")
    return
  }

  DispatchQueue.global(qos: .userInitiated).async {
    let handler = VNImageRequestHandler(ciImage: ciImage, orientation: .up)
    do {
      try handler.perform([self.classificationRequest])
    } catch {
      print("Failed to perform classification: \(error)")
    }
  }
}