JonatanGk committed
Commit 7da7cac
1 Parent(s): f76deec

Initial commit
README.md ADDED
@@ -0,0 +1,101 @@
---
language: ca
tags:
- "catalan"
metrics:
- accuracy
widget:
- text: "Ets més petita que un barrufet!!"
- text: "Ets tan lletja que et donaven de menjar per sota la porta."

---
# roberta-base-ca-finetuned-cyberbullying-catalan

This model is a fine-tuned version of [BSC-TeMU/roberta-base-ca](https://huggingface.co/BSC-TeMU/roberta-base-ca), trained on a dataset built by scraping several social networks (Twitter, YouTube, ...) to detect cyberbullying in Catalan.

It achieves the following results on the evaluation set:
- Loss: 0.1508
- Accuracy: 0.9665
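
The evaluation code is not included in this card; below is a minimal sketch, assuming a standard `Trainer`-style accuracy computation with the `datasets` metric API (the function name and wiring are illustrative, not taken from the original training script):

```python
import numpy as np
from datasets import load_metric

# Illustrative only: how an accuracy figure like the one above is typically
# computed when fine-tuning with the Hugging Face Trainer.
accuracy_metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per example
    return accuracy_metric.compute(predictions=predictions, references=labels)
```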

## Training and evaluation data

I fine-tuned this model on the concatenation of several datasets built by scraping social networks (Twitter, YouTube, Discord, ...), totalling more than 410k sentences. The same approach was used to train [roberta-base-bne-finetuned-cyberbullying-spanish](https://huggingface.co/JonatanGk/roberta-base-bne-finetuned-cyberbullying-spanish).
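
The scraped datasets themselves are not distributed with this card. A hypothetical sketch of how such splits could be concatenated with the `datasets` library (all file names below are placeholders):

```python
from datasets import load_dataset, concatenate_datasets

# Placeholder files: the actual scraped Twitter/YouTube/Discord data is not published.
twitter = load_dataset("csv", data_files="twitter_ca.csv")["train"]
youtube = load_dataset("csv", data_files="youtube_ca.csv")["train"]
discord = load_dataset("csv", data_files="discord_ca.csv")["train"]

# Merge everything, shuffle, and hold out an evaluation split (10% is an assumption).
full = concatenate_datasets([twitter, youtube, discord]).shuffle(seed=42)
dataset = full.train_test_split(test_size=0.1)
```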

## Training procedure

<details>

### Training hyperparameters

The following hyperparameters were used during training (see the `TrainingArguments` sketch after this section):
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 4

</details>
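
These values map directly onto `TrainingArguments`; a minimal sketch, assuming the default `Trainer` optimizer (AdamW with the betas/epsilon listed above) and the linear schedule:

```python
from transformers import TrainingArguments

# Sketch: the hyperparameters listed above expressed as TrainingArguments.
# The Trainer's default optimizer (AdamW, betas=(0.9, 0.999), eps=1e-8) and
# linear learning-rate schedule already match the values in the list.
training_args = TrainingArguments(
    output_dir="roberta-base-ca-finetuned-cyberbullying-catalan",  # assumed name
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    lr_scheduler_type="linear",
    seed=42,
)
# A Trainer would then be assembled roughly as:
#   Trainer(model=model, args=training_args, train_dataset=...,
#           eval_dataset=..., compute_metrics=compute_metrics).train()
```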

### Model in action 🚀

Quick usage with **pipelines**:

```python
from transformers import pipeline

model_path = "JonatanGk/roberta-base-ca-finetuned-ciberbullying-catalan"
bullying_analysis = pipeline("text-classification", model=model_path, tokenizer=model_path)

# "From the moment I saw you, I fell in love with you."
bullying_analysis(
    "Des que et vaig veure m'en vaig enamorar de tu."
)
# Output:
# [{'label': 'Not_bullying', 'score': 0.9996786117553711}]

# "You are so ugly they fed you under the door."
bullying_analysis(
    "Ets tan lletja que et donaven de menjar per sota la porta."
)
# Output:
# [{'label': 'Bullying', 'score': 0.9927878975868225}]
```
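
For batched or lower-level use, the same checkpoint can also be loaded without the pipeline wrapper. A short sketch (the label mapping comes from the model's `config.json`):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "JonatanGk/roberta-base-ca-finetuned-ciberbullying-catalan"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

texts = [
    "Ets més petita que un barrufet!!",                 # "You're smaller than a Smurf!!"
    "Des que et vaig veure m'en vaig enamorar de tu.",  # "From the moment I saw you, I fell in love with you."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and map class indices to label names.
probs = logits.softmax(dim=-1)
preds = probs.argmax(dim=-1)
print([(model.config.id2label[p.item()], probs[i, p].item()) for i, p in enumerate(preds)])
```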

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JonatanGk/Shared-Colab/blob/main/Cyberbullying_detection_(CATALAN).ipynb)

### Framework versions
- Transformers 4.10.3
- Pytorch 1.9.0+cu102
- Datasets 1.12.1
- Tokenizers 0.10.3

## Citation

```bibtex
@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi and
      Carrino, Casimiro Pio and
      Rodriguez-Penagos, Carlos and
      de Gibert Bonet, Ona and
      Armentano-Oller, Carme and
      Gonzalez-Agirre, Aitor and
      Melero, Maite and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}
```

> Special thanks to [Manuel Romero/@mrm8488](https://huggingface.co/mrm8488) as my mentor & R.C.

> Created by [Jonatan Luna](https://JonatanGk.github.io) | [LinkedIn](https://www.linkedin.com/in/JonatanGk/)
config.json ADDED
@@ -0,0 +1,37 @@
{
  "_name_or_path": "BSC-TeMU/roberta-base-ca",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "id2label": {
    "0": "Not_bullying",
    "1": "Bullying"
  },
  "label2id": {
    "Not_bullying": 0,
    "Bullying": 1
  },
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.11.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
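
The `id2label` / `label2id` entries in this config are what the pipeline above uses to turn the classifier's two output indices into the `Not_bullying` / `Bullying` strings. A quick way to inspect them (sketch):

```python
from transformers import AutoConfig

# Loads the configuration shown above from the Hub and prints the label mapping.
config = AutoConfig.from_pretrained("JonatanGk/roberta-base-ca-finetuned-ciberbullying-catalan")
print(config.id2label)   # {0: 'Not_bullying', 1: 'Bullying'}
print(config.label2id)   # {'Not_bullying': 0, 'Bullying': 1}
```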
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:782c9c811f47c511fc6c34a5ef24a92cbf1403883724905637026f9420ad0ee6
size 504004013
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "add_prefix_space": true, "errors": "replace", "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "max_len": 512, "special_tokens_map_file": null, "name_or_path": "BSC-TeMU/roberta-base-ca", "tokenizer_class": "RobertaTokenizer"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:61c9b71fc5afb32500a2283a9acb9a218b484c80a0e5e25484e6d5c91d2aceba
size 2991
vocab.json ADDED
The diff for this file is too large to render. See raw diff