---
language: pt-br
license: mit
tags:
- LegalNLP
- NLP
- legal field
- python
- word2vec
- doc2vec
---


# ***LegalNLP*** - Natural Language Processing Methods for the Brazilian Legal Language ⚖️

### *LegalNLP*, a Natural Language Processing library for the Brazilian legal language, was born from a partnership between Brazilian researchers and the legal tech company [Tikal Tech](https://www.tikal.tech), based in São Paulo, Brazil. Besides pre-trained language models for the Brazilian legal language, ***LegalNLP*** provides functions that facilitate the manipulation of legal texts in Portuguese, along with demonstrations/tutorials to help people in their own work.

You can access our paper by clicking [**here**](https://arxiv.org/abs/2110.15709). 

If you use our library in your academic work, please cite us as follows:

    @article{polo2021legalnlp,
      title={LegalNLP--Natural Language Processing methods for the Brazilian Legal Language},
      author={Polo, Felipe Maia and Mendon{\c{c}}a, Gabriel Caiaffa Floriano and Parreira, Kau{\^e} Capellato J and Gianvechio, Lucka and Cordeiro, Peterson and Ferreira, Jonathan Batista and de Lima, Leticia Maria Paz and Maia, Ant{\^o}nio Carlos do Amaral and Vicente, Renato},
      journal={arXiv preprint arXiv:2110.15709},
      year={2021}
    }

--------------

## Summary

0. [Accessing the Language Models](#0)
1. [Introduction / Installing package](#1)
2. [Language Models (Details / How to use)](#2)
    1. [Word2Vec/Doc2Vec](#2.1)
3. [Demonstrations / Tutorials](#3)
4. [References](#4)

--------------

<a name="0"></a>
## 0\. Accessing the Language Models


All our models can be found [here](https://drive.google.com/drive/folders/1tCccOXPLSEAEUQtcWXvED3YaNJi3p7la?usp=sharing).

Please contact *felipemaiapolo@gmail.com* if you have any problem accessing the language models. 

--------------

<a name="1"></a>
## 1\. Introduction / Installing package
*LegalNLP* is promising given the scarcity of Natural Language Processing resources focused on the Brazilian legal language. The library was made for Python, one of the most widely used programming languages for machine learning.


First, install the `huggingface_hub` library by running the following command in a terminal:
```sh
$ pip install huggingface_hub
```

Import `hf_hub_download`:

```python
from huggingface_hub import hf_hub_download
```

You can then download our Word2Vec (SG)/Doc2Vec (DBOW) and Word2Vec (CBOW)/Doc2Vec (DM) models with the following commands:

```python
# hf_hub_download returns the local path of the downloaded (cached) file
w2v_sg_d2v_dbow = hf_hub_download(repo_id="Projeto/LegalNLP", filename="w2v_d2v_dbow_size_100_window_15_epochs_20")
w2v_cbow_d2v_dm = hf_hub_download(repo_id="Projeto/LegalNLP", filename="w2v_d2v_dm_size_100_window_15_epochs_20")
```

--------------



<a name="2"></a>
## 2\. Language Models (Details / How to use)

<a name="2.1"></a>
### 2.1\. Word2Vec/Doc2Vec

Our first models for generating vector representations (embeddings) of tokens and
texts are variations of the Word2Vec [1, 2] and Doc2Vec [3] methods. In short, the
Word2Vec methods generate embeddings for tokens that capture
the meaning of the various textual elements, based on the contexts in which these
elements appear. Doc2Vec methods are extensions/modifications of Word2Vec
for generating representations of whole texts.

When preprocessing text for these models, remember to at least convert all letters to lowercase. Please check our paper or the [Gensim page](https://radimrehurek.com/gensim_3.8.3/models/doc2vec.html) for more details. Preferably use Gensim version 3.8.3.
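
The lowercasing step can be sketched as follows. This is a minimal illustration, not the preprocessing pipeline used to train the models: the helper name `lowercase_tokenize` is hypothetical, and whitespace splitting is an assumption that mirrors the `txt.split()` call used in the Doc2Vec example of this page.

```python
from typing import List


def lowercase_tokenize(text: str) -> List[str]:
    """Lowercase a text and split it on whitespace (illustrative helper)."""
    # Lowercasing is the minimum normalization recommended above.
    lowered = text.lower()
    # str.split() with no arguments also collapses repeated whitespace.
    return lowered.split()


tokens = lowercase_tokenize("Direito do  Consumidor Origem : Bangu")
print(tokens)
```

The resulting token list can be passed to the models in place of a manually lowercased string.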


Below we have a summary table with some important information about the trained models:



| Filenames | Doc2Vec | Word2Vec | Size | Window |
|:-------------------:|:--------------:|:--------------:|:--------------:|:--------------:|
| ```w2v_d2v_dm*``` | Distributed Memory (DM) | Continuous Bag-of-Words (CBOW) | 100, 200, 300 | 15 |
| ```w2v_d2v_dbow*``` | Distributed Bag-of-Words (DBOW) | Skip-Gram (SG) | 100, 200, 300 | 15 |


Here we make available both models with vector size 100 and window size 15.
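
As a convenience, the filename passed to `hf_hub_download` can be assembled from the table's parameters. The sketch below is hypothetical: `model_filename` is not part of LegalNLP, and the naming pattern is only inferred from the two filenames shown in the download example, so other sizes may be named differently or may not be hosted in this repository.

```python
def model_filename(variant: str, size: int = 100, window: int = 15, epochs: int = 20) -> str:
    """Build a model filename following the pattern seen in the download example."""
    # "dm" corresponds to Doc2Vec DM / Word2Vec CBOW,
    # "dbow" to Doc2Vec DBOW / Word2Vec SG (see the table above).
    if variant not in ("dm", "dbow"):
        raise ValueError("variant must be 'dm' or 'dbow'")
    return f"w2v_d2v_{variant}_size_{size}_window_{window}_epochs_{epochs}"


print(model_filename("dbow"))
print(model_filename("dm"))
```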

#### Using *Word2Vec*

Installing Gensim


```python
!pip install gensim=='3.8.3' 
```

Loading W2V:


```python
from gensim.models import KeyedVectors

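# Load the saved model (the same file holds the Doc2Vec model and its word vectors)
w2v = KeyedVectors.load(w2v_cbow_d2v_dm)
# Keep only the word vectors (the Word2Vec part)
w2v = w2v.wv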
```
Viewing the first 10 entries of the 'juiz' vector:


```python
w2v['juiz'][:10]
```




    array([ 6.570131  , -1.262787  ,  5.156106  , -8.943866  , -5.884408  ,
           -7.717058  ,  1.8819941 , -8.02803   , -0.66901577,  6.7223144 ],
          dtype=float32)




Viewing the closest tokens to 'juiz':

```python
w2v.most_similar('juiz')
```




    [('juíza', 0.8210258483886719),
     ('juiza', 0.7306275367736816),
     ('juíz', 0.691645085811615),
     ('juízo', 0.6605231165885925),
     ('magistrado', 0.6213295459747314),
     ('mmª_juíza', 0.5510469675064087),
     ('juizo', 0.5494943261146545),
     ('desembargador', 0.5313084721565247),
     ('mmjuiz', 0.5277603268623352),
     ('fabíola_melo_feijão_juíza', 0.5043971538543701)]


#### Using *Doc2Vec*
Installing Gensim


```python
!pip install gensim=='3.8.3' 
```

Loading D2V


```python
from gensim.models import Doc2Vec

#Loading a D2V model
d2v=Doc2Vec.load(w2v_cbow_d2v_dm)
```

Inferring vector for a text


```python
txt='direito do consumidor origem : bangu regional xxix juizado especial civel ação : [processo] - - recte : fundo de investimento em direitos creditórios'
tokens=txt.split()

txt_vec=d2v.infer_vector(tokens, epochs=20)
txt_vec[:10]
```




    array([ 0.02626514, -0.3876521 , -0.24873355, -0.0318402 ,  0.3343679 ,
           -0.21307918,  0.07193747,  0.02030687,  0.407305  ,  0.20065512],
          dtype=float32)
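
To turn several texts into a feature matrix, for instance as input to a classifier, the inference step can be wrapped in a small helper. This is a sketch under stated assumptions: `embed_texts` is a hypothetical helper, and `fake_infer` is a stand-in that only exists so the example runs without the downloaded model.

```python
from typing import Callable, List, Sequence


def embed_texts(
    texts: Sequence[str],
    infer: Callable[[List[str]], Sequence[float]],
) -> List[List[float]]:
    """Lowercase and whitespace-tokenize each text, then infer one vector per text."""
    return [list(infer(text.lower().split())) for text in texts]


def fake_infer(tokens: List[str]) -> List[float]:
    """Stand-in for a real model: returns (token count, total character count)."""
    return [float(len(tokens)), float(sum(len(t) for t in tokens))]


features = embed_texts(["Direito do Consumidor", "Juizado Especial Civel"], fake_infer)
print(features)
```

With the model loaded as above, the stand-in could be replaced by `lambda tokens: d2v.infer_vector(tokens, epochs=20)`.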




--------------

<a name="3"></a>
## 3\. Demonstrations / Tutorials

For a better understanding of the application of these models, below are the links to notebooks where we apply them to a legal dataset using various classification models such as Logistic Regression and CatBoost:

- **BERT notebook** : 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/felipemaiapolo/legalnlp/blob/main/demo/BERT/BERT_TUTORIAL.ipynb)
 
- **Word2Vec notebook** : 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/felipemaiapolo/legalnlp/blob/main/demo/Word2Vec/Word2Vec_TUTORIAL.ipynb)

- **Doc2Vec notebook** :
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/felipemaiapolo/legalnlp/blob/main/demo/Doc2Vec/Doc2Vec_TUTORIAL.ipynb)



--------------

<a name="4"></a>
## 4\. References

[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b).
Distributed representations of words and phrases and their compositionality.
In Advances in neural information processing systems, pages 3111–3119.

[2] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781.

[3] Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and
documents. In International conference on machine learning, pages 1188–1196.
PMLR.

[4] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching
word vectors with subword information. Transactions of the Association for
Computational Linguistics, 5:135–146.

[5] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training
of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805.

[6] Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: pretrained BERT
models for Brazilian Portuguese. In 9th Brazilian Conference on Intelligent
Systems, BRACIS, Rio Grande do Sul, Brazil, October 20-23