I'm getting an error while trying this model. Loading it throws the following:
Error(s) in loading state_dict for BertModel: size mismatch for bert.embeddings.word_embeddings.weight: copying a param with shape torch.Size([39628, 256]) from checkpoint, the shape in current model is torch.Size([30000, 768]). size mismatch for bert.embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 256]) from checkpoint, the shape in current model is torch.Size([512, 768]). size mismatch for bert.embeddings.token_type_embeddings.weight: copying a param with shape torch.Size([2, 256]) from checkpoint, the shape in current model is torch.Size([2, 768]). size mismatch for bert.embeddings.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.embeddings.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.0.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.0.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.0.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.0.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.0.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.0.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.0.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.0.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.0.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.0.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.0.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.0.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). 
size mismatch for bert.encoder.layer.0.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.0.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.1.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.1.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.1.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.1.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.1.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.1.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.1.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.1.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.1.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.1.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.1.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.1.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.1.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.1.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.1.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.1.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.2.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). 
size mismatch for bert.encoder.layer.2.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.2.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.2.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.2.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.2.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.2.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.2.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.2.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.2.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.2.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.2.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.2.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.2.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.2.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.2.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.3.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.3.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.3.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.3.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). 
size mismatch for bert.encoder.layer.3.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.3.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.3.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.3.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.3.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.3.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.3.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.3.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.3.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.3.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.3.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.3.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.4.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.4.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.4.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.4.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.4.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.4.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.4.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). 
size mismatch for bert.encoder.layer.4.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.4.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.4.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.4.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.4.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.4.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.4.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.4.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.4.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.5.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.5.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.5.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.5.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.5.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.5.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.5.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.5.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.5.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.5.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). 
size mismatch for bert.encoder.layer.5.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.5.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.5.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.5.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.5.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.5.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.6.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.6.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.6.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.6.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.6.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.6.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.6.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.6.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.6.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.6.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.6.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.6.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.6.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). 
size mismatch for bert.encoder.layer.6.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.6.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.6.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.7.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.7.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.7.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.7.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.7.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.7.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.7.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.7.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.7.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.7.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.7.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.7.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.7.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.7.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.7.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.7.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). 
size mismatch for bert.encoder.layer.8.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.8.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.8.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.8.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.8.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.8.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.8.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.8.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.8.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.8.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.8.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.8.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.8.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.8.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.8.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.8.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.9.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.9.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.9.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). 
size mismatch for bert.encoder.layer.9.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.9.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.9.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.9.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.9.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.9.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.9.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.9.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.9.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.9.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.9.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.9.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.9.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.10.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.10.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.10.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.10.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.10.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.10.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). 
size mismatch for bert.encoder.layer.10.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.10.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.10.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.10.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.10.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.10.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.10.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.10.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.10.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.10.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.11.attention.self.query.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.11.attention.self.query.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.11.attention.self.key.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.11.attention.self.key.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.11.attention.self.value.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.11.attention.self.value.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.11.attention.output.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.encoder.layer.11.attention.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.11.attention.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). 
size mismatch for bert.encoder.layer.11.attention.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.11.intermediate.dense.weight: copying a param with shape torch.Size([1024, 256]) from checkpoint, the shape in current model is torch.Size([3072, 768]). size mismatch for bert.encoder.layer.11.intermediate.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([3072]). size mismatch for bert.encoder.layer.11.output.dense.weight: copying a param with shape torch.Size([256, 1024]) from checkpoint, the shape in current model is torch.Size([768, 3072]). size mismatch for bert.encoder.layer.11.output.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.11.output.LayerNorm.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.encoder.layer.11.output.LayerNorm.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). size mismatch for bert.pooler.dense.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([768, 768]). size mismatch for bert.pooler.dense.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([768]). You may consider adding
ignore_mismatched_sizes=True in the model
from_pretrained method.
BertModel.from_pretrained('/content/pytorch', ignore_mismatched_sizes=True)
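For context, every mismatch above follows the same pattern: the checkpoint was saved from a model with hidden_size=256 (vocab_size 39628, intermediate_size 1024), while the model being instantiated uses 768/30000/3072. Below is a minimal sketch of the two usual ways to resolve this, assuming (on my part) that the checkpoint directory /content/pytorch also contains the config.json it was saved with; note that ignore_mismatched_sizes=True re-initializes the mismatched tensors instead of loading them, so the pretrained values are lost.

from transformers import BertConfig, BertModel

# Option 1: build the model from the config stored alongside the checkpoint,
# so the architecture (hidden_size=256, vocab_size=39628, ...) matches the
# saved weights exactly and nothing needs to be skipped.
config = BertConfig.from_pretrained('/content/pytorch')  # reads /content/pytorch/config.json
model = BertModel.from_pretrained('/content/pytorch', config=config)

# Option 2: keep the current 768-dim architecture and let transformers skip
# the mismatched tensors; they are randomly re-initialized rather than loaded.
model = BertModel.from_pretrained('/content/pytorch', ignore_mismatched_sizes=True)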