LegalBERTPT-br

LegalBERTPT-br is a sentence embedding model trained with SimCSE, a contrastive learning framework, on top of BERTimbau, a pre-trained language model for Portuguese.
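To illustrate the contrastive objective behind SimCSE, here is a minimal pure-Python sketch of an InfoNCE-style loss over a batch of paired sentence embeddings. It is not this model's training code. It assumes the unsupervised SimCSE setup, in which the two views of a sentence come from encoding it twice with different dropout masks, and the model card does not state which variant was used. The names `cosine` and `simcse_loss` and the default temperature of 0.05 are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    view_a[i] and view_b[i] are two embeddings of the same sentence
    (the positive pair); every other view_b[j] is an in-batch negative.
    The temperature value is an assumption, not taken from the model card.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Cross-entropy with the positive pair as the target class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional "embeddings" for a batch of two sentences.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
aligned = simcse_loss(view_a, view_b)
mismatched = simcse_loss(view_a, list(reversed(view_b)))
```

In this toy batch the loss is lower when each sentence is paired with its own second view than when the pairs are swapped. Minimizing it therefore pulls the two views of a sentence together and pushes different sentences apart.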

Corpora

– From this site, we used the 215,713 comments in the Conteudo column. To avoid bias, we removed the comments from PL 3723/2019, PEC 471/2005, and the Hashtag Corpus.

– From this site, we also used 147,008 bills. From these bills, we used the summary field, txtEmenta, and the core text field, txtExplicacaoEmenta.

– From Political Speeches, we used 462,831 texts, specifically the columns sumario, textodiscurso, and indexacao.

These corpora were segmented into sentences and concatenated, producing 2,307,426 sentences.
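The segment-and-concatenate step can be sketched as follows. The model card does not say which sentence segmenter was used, so the regex splitter here is a deliberately naive stand-in, and `split_sentences` and `build_sentence_corpus` are hypothetical names introduced only for illustration.

```python
import re

# Assumption: split after ".", "!" or "?" when followed by whitespace.
# A real pipeline would likely use a Portuguese-aware segmenter instead.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split one document into sentences, dropping empty fragments."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def build_sentence_corpus(*corpora):
    """Segment every document of every corpus and concatenate the results."""
    sentences = []
    for corpus in corpora:
        for document in corpus:
            sentences.extend(split_sentences(document))
    return sentences


# Made-up example texts standing in for comments, bills, and speeches.
comments = ["Sou a favor. E você?"]
bills = ["Altera a lei. Dá outras providências."]
corpus = build_sentence_corpus(comments, bills)
```

A splitter this simple also breaks on abbreviations such as "art. 5", so it is only meant to show the shape of the step, not to reproduce the reported sentence count.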

Citing and Authors

If you find this model helpful, feel free to cite our publication Evaluating Topic Models in Portuguese Political Comments About Bills from Brazil’s Chamber of Deputies:

@inproceedings{bracis,
 author = {Nádia Silva and Marília Silva and Fabíola Pereira and João Tarrega and João Beinotti and Márcio Fonseca and Francisco Andrade and André Carvalho},
 title = {Evaluating Topic Models in Portuguese Political Comments About Bills from Brazil’s Chamber of Deputies},
 booktitle = {Anais da X Brazilian Conference on Intelligent Systems},
 location = {Online},
 year = {2021},
 keywords = {},
 issn = {0000-0000},
 publisher = {SBC},
 address = {Porto Alegre, RS, Brasil},
 url = {https://sol.sbc.org.br/index.php/bracis/article/view/19061}
}