---
language: pt
license: mit
tags:
- sentence-transformers
duplicated_from: ulysses-camara/legal-bert-pt-br
---

# LegalBERTPT-br

LegalBERTPT-br is a sentence embedding model trained with SimCSE, a contrastive learning framework, on top of the Portuguese pre-trained language model [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased).

# Corpora

– From [this site](https://www2.camara.leg.br/transparencia/servicos-ao-cidadao/participacao-popular), we used the column `Conteudo`, with 215,713 comments. We removed the comments on PL 3723/2019, PEC 471/2005, and the Hashtag Corpus to avoid bias.

– From [this site](https://www2.camara.leg.br/transparencia/servicos-ao-cidadao/participacao-popular), we also used 147,008 bills. From these projects, we used the summary field `txtEmenta` and the project core text `txtExplicacaoEmenta`.

– From Political Speeches, we used 462,831 texts; specifically, the columns `sumario`, `textodiscurso`, and `indexacao`.

These corpora were segmented into sentences and concatenated, producing 2,307,426 sentences.

# Citing and Authors

If you find this model helpful, feel free to cite our publication [Evaluating Topic Models in Portuguese Political Comments About Bills from Brazil’s Chamber of Deputies](https://link.springer.com/chapter/10.1007/978-3-030-91699-2_8):

```bibtex
@inproceedings{bracis,
  author    = {Nádia Silva and Marília Silva and Fabíola Pereira and João Tarrega and João Beinotti and Márcio Fonseca and Francisco Andrade and André Carvalho},
  title     = {Evaluating Topic Models in Portuguese Political Comments About Bills from Brazil’s Chamber of Deputies},
  booktitle = {Anais da X Brazilian Conference on Intelligent Systems},
  location  = {Online},
  year      = {2021},
  keywords  = {},
  issn      = {0000-0000},
  publisher = {SBC},
  address   = {Porto Alegre, RS, Brasil},
  url       = {https://sol.sbc.org.br/index.php/bracis/article/view/19061}
}
```
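
# Usage

Since the model is tagged as a sentence-transformers model, it can presumably be loaded with the `sentence-transformers` library. The snippet below is a minimal sketch, assuming the model is available on the Hugging Face Hub under the id `ulysses-camara/legal-bert-pt-br` (the repository this card was duplicated from); adjust the id if your copy lives elsewhere.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed Hub id, taken from the `duplicated_from` field above; change it if needed.
model = SentenceTransformer("ulysses-camara/legal-bert-pt-br")

sentences = [
    "O projeto de lei altera o Código de Trânsito Brasileiro.",
    "A proposta modifica regras de trânsito no país.",
]

# Encode the sentences into fixed-size embeddings (768 dimensions for a BERT-base encoder).
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]))
```

If the repository only ships raw `transformers` weights, the same embeddings can be obtained with `AutoTokenizer`/`AutoModel` followed by a pooling step over the token embeddings.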