valhalla committed on
Commit
5975e42
1 Parent(s): 4070d42

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -10,13 +10,13 @@ January 2021

  ### Model Type

- The base model uses a ResNet50 with several modifications as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model where the ResNet image encoder is replaced with a Vision Transformer.
+ The base model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. There is also a variant of the model where the ResNet image encoder is replaced with a Vision Transformer.

  ### Model Version

  Initially, we’ve released one CLIP model based on the Vision Transformer architecture equivalent to ViT-B/32, along with the RN50 model, using the architecture equivalent to ResNet-50.

- As part of the staged release process, we have also released the RN101 model, as well as RN50x4, a RN50 scaled up 4x according to the [EfficientNet](https://arxiv.org/abs/1905.11946) scaling rule.
+ *This port does not include the ResNet model.*

  Please see the paper linked below for further details about their specification.
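
The Model Type text in this diff says the two encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. As a rough illustration only, the sketch below computes a symmetric cross-entropy loss over cosine similarities for a small batch of already-computed embeddings. The function names, the plain-list representation, and the fixed `temperature=0.07` are assumptions made for this example and are not taken from the README or from the model's training code; input vectors are assumed to be nonzero.

```python
import math


def l2_normalize(vec):
    """Scale a nonzero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row k of each input is assumed to be a matching pair.

    The temperature value is an illustrative assumption, not a documented setting.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each image should pick out its own text.
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n

    # Text-to-image direction: each text should pick out its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_texts = sum(cross_entropy(columns[k], k) for k in range(n)) / n

    return (loss_images + loss_texts) / 2


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned pairs:", contrastive_loss(images, aligned_texts))
    print("swapped pairs:", contrastive_loss(images, swapped_texts))
```

With these toy inputs, the loss for correctly aligned pairs comes out far smaller than the loss for swapped pairs, which is the behavior the training objective rewards.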