DanielJunior augusnunes commited on
Commit
50fd69b
0 Parent(s):

Duplicate from ulysses-camara/legal-bert-pt-br

Browse files

Co-authored-by: Augusto Nunes <augusnunes@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,27 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ftz filter=lfs diff=lfs merge=lfs -text
6
+ *.gz filter=lfs diff=lfs merge=lfs -text
7
+ *.h5 filter=lfs diff=lfs merge=lfs -text
8
+ *.joblib filter=lfs diff=lfs merge=lfs -text
9
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
10
+ *.model filter=lfs diff=lfs merge=lfs -text
11
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
12
+ *.onnx filter=lfs diff=lfs merge=lfs -text
13
+ *.ot filter=lfs diff=lfs merge=lfs -text
14
+ *.parquet filter=lfs diff=lfs merge=lfs -text
15
+ *.pb filter=lfs diff=lfs merge=lfs -text
16
+ *.pt filter=lfs diff=lfs merge=lfs -text
17
+ *.pth filter=lfs diff=lfs merge=lfs -text
18
+ *.rar filter=lfs diff=lfs merge=lfs -text
19
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
20
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
21
+ *.tflite filter=lfs diff=lfs merge=lfs -text
22
+ *.tgz filter=lfs diff=lfs merge=lfs -text
23
+ *.wasm filter=lfs diff=lfs merge=lfs -text
24
+ *.xz filter=lfs diff=lfs merge=lfs -text
25
+ *.zip filter=lfs diff=lfs merge=lfs -text
26
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
27
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
0_Transformer/config.json ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "neuralmind/bert-base-portuguese-cased",
3
+ "architectures": [
4
+ "BertModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "directionality": "bidi",
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-12,
15
+ "max_position_embeddings": 512,
16
+ "model_type": "bert",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "output_past": true,
20
+ "pad_token_id": 0,
21
+ "pooler_fc_size": 768,
22
+ "pooler_num_attention_heads": 12,
23
+ "pooler_num_fc_layers": 3,
24
+ "pooler_size_per_head": 128,
25
+ "pooler_type": "first_token_transform",
26
+ "position_embedding_type": "absolute",
27
+ "transformers_version": "4.6.1",
28
+ "type_vocab_size": 2,
29
+ "use_cache": true,
30
+ "vocab_size": 29794
31
+ }
0_Transformer/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:37bf2ba5b420399e84e94e109f5ca37f65a464b4fa6d95468a9a616bc5fd04eb
3
+ size 435776311
0_Transformer/sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ {
2
+ "max_seq_length": 32,
3
+ "do_lower_case": false
4
+ }
0_Transformer/special_tokens_map.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
0_Transformer/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
0_Transformer/tokenizer_config.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": "/root/.cache/huggingface/transformers/eecc45187d085a1169eed91017d358cc0e9cbdd5dc236bcd710059dbf0a2f816.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d", "name_or_path": "neuralmind/bert-base-portuguese-cased", "do_basic_tokenize": true, "never_split": null}
0_Transformer/vocab.txt ADDED
The diff for this file is too large to render. See raw diff
 
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ {
2
+ "word_embedding_dimension": 768,
3
+ "pooling_mode_cls_token": false,
4
+ "pooling_mode_mean_tokens": true,
5
+ "pooling_mode_max_tokens": false,
6
+ "pooling_mode_mean_sqrt_len_tokens": false
7
+ }
README.md ADDED
@@ -0,0 +1,41 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language: pt
3
+ license: mit
4
+ tags:
5
+ - sentence-transformers
6
+ duplicated_from: ulysses-camara/legal-bert-pt-br
7
+ ---
8
+
9
+ # LegalBERTPT-br
10
+
11
+ LegalBERTPT-br is a trained sentence embedding using SimCSE, a contrastive learning framework, coupled with the Portuguese pre-trained language model named [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased).
12
+
13
+
14
+ # Corpora
15
+
16
+ – From [this site](https://www2.camara.leg.br/transparencia/servicos-ao-cidadao/participacao-popular), we used the column `Conteudo` with 215,713 comments. We removed the comments from PL 3723/2019, PEC 471/2005, and Hashtag Corpus, in order to avoid bias.
17
+
18
+ – From [this site](https://www2.camara.leg.br/transparencia/servicos-ao-cidadao/participacao-popular), we also used 147,008 bills. From these projects, we used the summary field named `txtEmenta` and the project core text named `txtExplicacaoEmenta`.
19
+
20
+ – From Political Speeches, we used 462,831 texts, specifically, we used the columns: `sumario`, `textodiscurso`, and `indexacao`.
21
+
22
+ These corpora were segmented into sentences and concatenated, producing 2,307,426 sentences.
23
+
24
+
25
+ # Citing and Authors
26
+
27
+ If you find this model helpful, feel free to cite our publication [Evaluating Topic Models in Portuguese Political Comments About Bills from Brazil’s Chamber of Deputies](https://link.springer.com/chapter/10.1007/978-3-030-91699-2_8):
28
+ ```bibtex
29
+ @inproceedings{bracis,
30
+ author = {Nádia Silva and Marília Silva and Fabíola Pereira and João Tarrega and João Beinotti and Márcio Fonseca and Francisco Andrade and André Carvalho},
31
+ title = {Evaluating Topic Models in Portuguese Political Comments About Bills from Brazil’s Chamber of Deputies},
32
+ booktitle = {Anais da X Brazilian Conference on Intelligent Systems},
33
+ location = {Online},
34
+ year = {2021},
35
+ keywords = {},
36
+ issn = {0000-0000},
37
+ publisher = {SBC},
38
+ address = {Porto Alegre, RS, Brasil},
39
+ url = {https://sol.sbc.org.br/index.php/bracis/article/view/19061}
40
+ }
41
+ ```
config.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ {
2
+ "__version__": "1.2.0"
3
+ }
modules.json ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "0_Transformer",
6
+ "type": "sentence_transformers.models.Transformer"
7
+ },
8
+ {
9
+ "idx": 1,
10
+ "name": "1",
11
+ "path": "1_Pooling",
12
+ "type": "sentence_transformers.models.Pooling"
13
+ }
14
+ ]