---
language:
- pt
thumbnail: "Portuguese BERT for the Legal Domain"
tags:
- sentence-transformers
- transformers
license: mit
---

# stjiris/bert-large-portuguese-cased-legal-mlm (Legal BERTimbau)

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for tasks like clustering or semantic search.

stjiris/bert-large-portuguese-cased-legal-mlm derives from [BERTimbau](https://huggingface.co/neuralmind/bert-large-portuguese-cased) large.

It was trained with the masked language modeling (MLM) objective at a learning rate of 1e-5 on [legal sentences from roughly 30000 documents](https://huggingface.co/datasets/stjiris/portuguese-legal-sentences-v1.0) for 5300 training steps, the configuration that gave the best performance for our semantic search system implementation.
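
For reference, this kind of MLM fine-tuning can be reproduced with the Hugging Face `Trainer`. The sketch below is illustrative, not the exact training script: the split name, text column, batch size, masking probability, and sequence length are assumptions; only the base checkpoint, the dataset, the 1e-5 learning rate, and the 5300-step budget come from the description above.

```python
# Illustrative MLM fine-tuning sketch (not the authors' exact script).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-large-portuguese-cased")
model = AutoModelForMaskedLM.from_pretrained("neuralmind/bert-large-portuguese-cased")

# Split and column names are assumptions; adjust to the dataset schema.
dataset = load_dataset("stjiris/portuguese-legal-sentences-v1.0", split="train")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Randomly masks 15% of the tokens in each batch (the standard BERT MLM recipe).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-large-portuguese-cased-legal-mlm",
    learning_rate=1e-5,             # from the model description
    max_steps=5300,                 # from the model description
    per_device_train_batch_size=8,  # assumed
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized).train()
```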

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm')
embeddings = model.encode(sentences)
print(embeddings)
```
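
Since the card mentions semantic search, here is a small follow-up sketch that ranks candidate sentences against a query by cosine similarity. The query and corpus strings are made up for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stjiris/bert-large-portuguese-cased-legal-mlm')

# Hypothetical legal corpus and query, purely for illustration.
corpus = ["O contrato foi declarado nulo.",
          "O tribunal negou provimento ao recurso."]
query = "O recurso foi rejeitado pelo tribunal."

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and each corpus sentence, best match first.
scores = util.cos_sim(query_emb, corpus_emb)[0]
for sentence, score in sorted(zip(corpus, scores), key=lambda x: -float(x[1])):
    print(f"{float(score):.3f}  {sentence}")
```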

## Usage (HuggingFace Transformers)

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm')
model = AutoModel.from_pretrained('stjiris/bert-large-portuguese-cased-legal-mlm')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
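
If you intend to compare these embeddings with dot products, a common practice (an addition here, not part of the original card) is to L2-normalize them first, so that dot products equal cosine similarities:

```python
import torch.nn.functional as F

# Optional: L2-normalize so that dot products equal cosine similarities.
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings @ sentence_embeddings.T)  # pairwise cosine similarities
```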

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 514, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
)
```
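
The same architecture can also be assembled by hand from the checkpoint, which is useful if you want to swap the pooling strategy. A minimal sketch mirroring the dump above (mean pooling over token embeddings):

```python
from sentence_transformers import SentenceTransformer, models

# Wrap the checkpoint as a word-embedding module, then add mean pooling,
# mirroring modules (0) and (1) in the architecture dump above.
word_embedding_model = models.Transformer(
    'stjiris/bert-large-portuguese-cased-legal-mlm', max_seq_length=514)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 1024 for BERT large
    pooling_mode='mean')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```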

## Citing & Authors

If you use this work, please cite:

```bibtex
@inproceedings{MeloSemantic,
    author = {Melo, Rui and Santos, Professor Pedro Alexandre and Dias, Professor Jo{\~a}o},
    title = {A {Semantic} {Search} {System} for {Supremo} {Tribunal} de {Justi}{\c c}a},
}

@inproceedings{souza2020bertimbau,
    author = {F{\'a}bio Souza and Rodrigo Nogueira and Roberto Lotufo},
    title = {{BERT}imbau: pretrained {BERT} models for {B}razilian {P}ortuguese},
    booktitle = {9th Brazilian Conference on Intelligent Systems, {BRACIS}, Rio Grande do Sul, Brazil, October 20-23 (to appear)},
    year = {2020}
}

@inproceedings{fonseca2016assin,
    title = {{ASSIN}: Avaliacao de similaridade semantica e inferencia textual},
    author = {Fonseca, E and Santos, L and Criscuolo, Marcelo and Aluisio, S},
    booktitle = {Computational Processing of the Portuguese Language - 12th International Conference, Tomar, Portugal},
    pages = {13--15},
    year = {2016}
}

@inproceedings{real2020assin,
    title = {The {ASSIN} 2 shared task: a quick overview},
    author = {Real, Livy and Fonseca, Erick and Oliveira, Hugo Goncalo},
    booktitle = {International Conference on Computational Processing of the Portuguese Language},
    pages = {406--412},
    year = {2020},
    organization = {Springer}
}

@inproceedings{huggingface:dataset:stsb_multi_mt,
    title = {Machine translated multilingual {STS} benchmark dataset.},
    author = {Philip May},
    year = {2021},
    url = {https://github.com/PhilipMay/stsb-multi-mt}
}
```