Emanuel Huber committed on
Commit
c541339
1 Parent(s): f48f476

Updated project description

Files changed (1)
  1. top.html +16 -10
top.html CHANGED
@@ -3,17 +3,23 @@
  <h1 style="font-weight: 900; font-size: 3rem; margin: 20px;">
  Porttagger
  </h1>
- <p class="slogan">A Brazilian Portuguese part-of-speech tagger according to Universal
- Dependencies</p>
+ <p class="slogan">A Brazilian Portuguese part of speech tagger according to the <a
+ href="https://universaldependencies.org/">Universal Dependencies</a> model
+ </p>
  </div>
  <p style="margin-top: 30px; margin-bottom: 10px; font-size: 94%; text-align: left;">
- Porttagger (Porttinari Part-Of-Speech) tagger was trained on the <a
- href="https://sites.google.com/icmc.usp.br/poetisa/resources-and-tools">Porttinari-base</a> corpus which is
- a collection of news extracted from the Folha de São Paulo newspaper site. The trained model is a fine-tuned
- version
- of <a href="https://huggingface.co/neuralmind/bert-base-portuguese-cased">Bertimbau</a> that receives tokens and
- outputs part-of-speech tags. Since the model expects a sequence of
- tokens
- for its inputs, <a src="https://spacy.io/models/pt">Spacy's</a> tokenization is used to tokenize the input text.
+ Porttagger is a state of the art part of speech tagger for Brazilian Portuguese that automatically assigns
+ morphosyntactic classes to the words of sentences, following the Universal Dependencies international model. You
+ may provide single sentences or multiple sentences (using plain text files with several sentences) to be tagged.
+ You may also choose which trained model to use. The options include a model trained on news texts (using the
+ <a href="https://sites.google.com/icmc.usp.br/poetisa/resources-and-tools">Porttinari-base</a> corpus), on stock
+ market tweets (from the <a
+ href="https://www.kaggle.com/datasets/fernandojvdasilva/stock-tweets-ptbr-emotions">DANTE</a> corpus), on
+ academic texts from the oil & gas
+ domain (from the <a
+ href="https://github.com/UniversalDependencies/UD_Portuguese-PetroGold/blob/master/README.md">PetroGold</a>
+ corpus), and on all of them together. To the interested reader, this initiative is
+ part of the <a href="https://sites.google.com/icmc.usp.br/poetisa/">POeTiSA</a> project, where much more
+ information is available.
  </p>
  </div>