---
language:
- hi
- en
tags:
- hi
- en
- codemix
license: "apache-2.0"
datasets:
- SAIL 2017
metrics:
- fscore
- accuracy
---

# BERT codemixed base model for Hinglish (cased)

## Model description

Input for the model: any code-mixed Hinglish text

Output for the model: sentiment class (0 - Negative, 1 - Neutral, 2 - Positive)

I took the [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) model from Hugging Face and fine-tuned it on the [SAIL 2017](http://www.dasdipankar.com/SAILCodeMixed.html) dataset.
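
For a quick end-to-end check, the `pipeline` API can wrap the model. A minimal sketch; note that unless `id2label` is set in the model config, the labels come back as `LABEL_0`/`LABEL_1`/`LABEL_2`, corresponding to the 0/1/2 mapping above:

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="rohanrajpal/bert-base-codemixed-uncased-sentiment",
)

# A code-mixed Hinglish example ("Dude, this movie was amazing!")
print(classifier("Yaar yeh movie toh kamaal ki thi!"))
# e.g. [{'label': 'LABEL_2', 'score': ...}]  -> Positive
```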

Performance of this model on the SAIL 2017 dataset:

| metric     |    score |
|------------|----------|
| acc        | 0.588889 |
| f1         | 0.582678 |
| acc_and_f1 | 0.585783 |
| precision  | 0.586516 |
| recall     | 0.588889 |

Here `acc_and_f1` is the simple average of `acc` and `f1`.

## Intended uses & limitations

#### How to use

Here is how to use this model to classify the sentiment of a given text in *PyTorch*:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("rohanrajpal/bert-base-codemixed-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("rohanrajpal/bert-base-codemixed-uncased-sentiment")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')  # tokenize and return PyTorch tensors
output = model(**encoded_input)                       # output.logits holds the class scores
```
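
`output.logits` contains the raw scores over the three classes. A minimal sketch, assuming the 0/1/2 label order given above, to turn them into a readable label:

```python
import torch

# Assumed label order, matching the 0/1/2 mapping in the model description
labels = ["Negative", "Neutral", "Positive"]
predicted_class = torch.argmax(output.logits, dim=-1).item()
print(labels[predicted_class])
```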

and in *TensorFlow*:

```python
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('rohanrajpal/bert-base-codemixed-uncased-sentiment')
model = TFBertForSequenceClassification.from_pretrained('rohanrajpal/bert-base-codemixed-uncased-sentiment')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')  # tokenize and return TensorFlow tensors
output = model(encoded_input)                         # output.logits holds the class scores
```
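
As in the PyTorch case, the logits can be converted to probabilities and a predicted class; a small sketch assuming the same label order:

```python
import tensorflow as tf

probs = tf.nn.softmax(output.logits, axis=-1)  # shape (1, 3)
print(int(tf.argmax(probs, axis=-1)[0]))       # predicted class index: 0, 1 or 2
```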

#### Limitations and bias

Coming soon!

## Training data

I fine-tuned this [pretrained model](https://huggingface.co/bert-base-multilingual-cased) on the SAIL 2017 dataset ([download](http://amitavadas.com/SAIL/Data/SAIL_2017.zip)).

## Training procedure

No preprocessing was applied.
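
For reference, a minimal fine-tuning sketch using the Hugging Face `Trainer` API. The file name, column names, and hyperparameters below are illustrative assumptions, not the exact setup used:

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical layout: a TSV with a `text` column and a `label` column in {0, 1, 2}
df = pd.read_csv("sail_2017_train.tsv", sep="\t")
dataset = Dataset.from_pandas(df)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-output", num_train_epochs=3),
    train_dataset=dataset,
)
trainer.train()
```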

## Eval results

See the performance table under *Model description* above.

### BibTeX entry and citation info

```bibtex
@inproceedings{khanuja-etal-2020-gluecos,
    title = "{GLUEC}o{S}: An Evaluation Benchmark for Code-Switched {NLP}",
    author = "Khanuja, Simran  and
      Dandapat, Sandipan  and
      Srinivasan, Anirudh  and
      Sitaram, Sunayana  and
      Choudhury, Monojit",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.329",
    pages = "3575--3585"
}
```