mtreviso committed · Commit 798867e · verified · Parent: 22243cf

Upload README.md with huggingface_hub

Files changed (1): README.md (+69 −19)
README.md CHANGED
@@ -1,13 +1,31 @@
 # NILC Portuguese Word Embeddings — Wang2Vec Skip-Gram 50d

- Pretrained **static word embeddings** for **Portuguese** (Brazilian + European), trained by the [NILC group](http://nilc.icmc.usp.br/) on a large multi-genre corpus (~1.39B tokens, 17 sources).
- This repository contains the **Wang2Vec Skip-Gram 50d** model in safetensors format.

 ---

 ## 📂 Files
- - `embeddings.safetensors` → word vectors (`[vocab_size, 50]`)
  - `vocab.txt` → vocabulary (one token per line, aligned with rows)

 ---
@@ -15,16 +33,21 @@ This repository contains the **Wang2Vec Skip-Gram 50d** model in safetensors format.
 ## 🚀 Usage

 ```python
 from safetensors.numpy import load_file

- data = load_file("embeddings.safetensors")
 vectors = data["embeddings"]

- with open("vocab.txt") as f:
     vocab = [w.strip() for w in f]

- word2idx = {w: i for i, w in enumerate(vocab)}
- print(vectors[word2idx["rei"]])  # vector for "rei"
 ```

 Or in PyTorch:
@@ -37,19 +60,46 @@ vectors = tensors["embeddings"]  # torch.Tensor
 ---

- ## 📖 Reference
 ```bibtex
- @inproceedings{hartmann-etal-2017-portuguese,
-   title = {{P}ortuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks},
-   author = {Hartmann, Nathan and Fonseca, Erick and Shulby, Christopher and Treviso, Marcos and Silva, J{\'e}ssica and Alu{\'i}sio, Sandra},
-   year = 2017,
-   month = oct,
-   booktitle = {Proceedings of the 11th {B}razilian Symposium in Information and Human Language Technology},
-   publisher = {Sociedade Brasileira de Computa{\c{c}}{\~a}o},
-   address = {Uberl{\^a}ndia, Brazil},
-   pages = {122--131},
-   url = {https://aclanthology.org/W17-6615/},
-   editor = {Paetzold, Gustavo Henrique and Pinheiro, Vl{\'a}dia}
 }
 ```

+ ---
+ language: pt
+ tags:
+ - word-embeddings
+ - static
+ - portuguese
+ - wang2vec
+ - skip-gram
+ - 50d
+ license: cc-by-4.0
+ library_name: safetensors
+ pipeline_tag: feature-extraction
+ ---
+
 # NILC Portuguese Word Embeddings — Wang2Vec Skip-Gram 50d

+ NILC-Embeddings is a repository for storing and sharing **word embeddings** for the Portuguese language.
+ The goal is to provide ready-to-use vector resources for **Natural Language Processing (NLP)** and **Machine Learning** tasks.

+ The embeddings were trained on a large Portuguese corpus (Brazilian + European) composed of 17 corpora (~1.39B tokens).
+ Training was carried out with the following algorithms: **Word2Vec** [1], **FastText** [2], **Wang2Vec** [3], and **GloVe** [4].
+
+ This repository contains the **Wang2Vec Skip-Gram 50d** model in **safetensors** format.

 ---

 ## 📂 Files
+ - `embeddings.safetensors` → embedding matrix (`[vocab_size, 50]`)
  - `vocab.txt` → vocabulary (one token per line, aligned with rows)

 ---

 ## 🚀 Usage

 ```python
+ from huggingface_hub import hf_hub_download
 from safetensors.numpy import load_file

+ path = hf_hub_download(repo_id="nilc-nlp/wang2vec-skip-gram-50d",
+                        filename="embeddings.safetensors")
+
+ data = load_file(path)
 vectors = data["embeddings"]

+ vocab_path = hf_hub_download(repo_id="nilc-nlp/wang2vec-skip-gram-50d",
+                              filename="vocab.txt")
+ with open(vocab_path) as f:
     vocab = [w.strip() for w in f]

+ print(vectors.shape)
 ```
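With `vectors` and `vocab` loaded as above, a token-to-row map plus cosine similarity gives nearest-neighbor lookup. A minimal self-contained sketch — the toy vocabulary and random matrix stand in for the loaded arrays, and the helper `nearest` is illustrative, not part of this repository:

```python
import numpy as np

# Toy stand-ins for the loaded arrays (4 words, 50 dims).
rng = np.random.default_rng(0)
vocab = ["rei", "rainha", "homem", "mulher"]
vectors = rng.normal(size=(len(vocab), 50)).astype(np.float32)

# Map each token to its row in the embedding matrix.
word2idx = {w: i for i, w in enumerate(vocab)}

def nearest(word, k=3):
    """Return the k most cosine-similar vocabulary words, excluding the query."""
    q = vectors[word2idx[word]]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] != word][:k]

print(nearest("rei", k=2))
```

With the real matrix, out-of-vocabulary queries raise `KeyError`, so a `word in word2idx` check is worth adding before lookup.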

 Or in PyTorch:
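The unchanged PyTorch lines are collapsed by this diff; the hunk context above only shows they end with `vectors = tensors["embeddings"]  # torch.Tensor`. A hedged sketch of one common follow-up — wrapping the matrix in a frozen `nn.Embedding`; the random tensor stands in for the loaded weights:

```python
import torch
import torch.nn as nn

# Stand-in for tensors["embeddings"] as returned by safetensors.torch.load_file.
vectors = torch.randn(4, 50)

# Use the pretrained matrix as a frozen lookup table (no gradient updates).
emb = nn.Embedding.from_pretrained(vectors, freeze=True)

ids = torch.tensor([0, 2])  # row indices, e.g. from a word2idx dict
out = emb(ids)              # shape: (2, 50)
```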
 
 ---

+ ## 📊 Corpus
+
+ The embeddings were trained on a combination of 17 corpora (~1.39B tokens):
+
+ | Corpus | Tokens | Types | Genre | Description |
+ |--------|--------|-------|-------|-------------|
+ | LX-Corpus [Rodrigues et al. 2016] | 714,286,638 | 2,605,393 | Mixed genres | Large collection of texts from 19 sources, mostly European Portuguese |
+ | Wikipedia | 219,293,003 | 1,758,191 | Encyclopedic | Wikipedia dump (2016-10-20) |
+ | GoogleNews | 160,396,456 | 664,320 | Informative | News crawled from Google News |
+ | SubIMDB-PT | 129,975,149 | 500,302 | Spoken | Movie subtitles from IMDb |
+ | G1 | 105,341,070 | 392,635 | Informative | News from the G1 portal (2014–2015) |
+ | PLN-Br [Bruckschen et al. 2008] | 31,196,395 | 259,762 | Informative | Corpus of the PLN-BR project (1994–2005) |
+ | Domínio Público | 23,750,521 | 381,697 | Prose | 138,268 literary works |
+ | Lacio-Web [Aluísio et al. 2003] | 8,962,718 | 196,077 | Mixed | Literary, informative, scientific, law, and didactic texts |
+ | Literatura Brasileira | 1,299,008 | 66,706 | Prose | Classical Brazilian fiction e-books |
+ | Mundo Estranho | 1,047,108 | 55,000 | Informative | Texts from Mundo Estranho magazine |
+ | CHC | 941,032 | 36,522 | Informative | Texts from Ciência Hoje das Crianças |
+ | FAPESP | 499,008 | 31,746 | Science communication | Texts from Pesquisa FAPESP magazine |
+ | Textbooks | 96,209 | 11,597 | Didactic | Elementary school textbooks |
+ | Folhinha | 73,575 | 9,207 | Informative | Children’s news from Folhinha (Folha de São Paulo) |
+ | NILC subcorpus | 32,868 | 4,064 | Informative | Children’s texts (3rd–4th grade) |
+ | Para Seu Filho Ler | 21,224 | 3,942 | Informative | Children’s news from Zero Hora |
+ | SARESP | 13,308 | 3,293 | Didactic | School evaluation texts |
+ | **Total** | **1,395,926,282** | **3,827,725** | — | — |
+
+ ---

+ ## 📖 Paper
+
+ **Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks**
+ Hartmann, N. et al. (2017), STIL 2017.
+ [ArXiv Paper](https://arxiv.org/abs/1708.06025)
+
+ ### BibTeX
 ```bibtex
+ @inproceedings{hartmann2017nilc,
+   title={Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks},
+   author={Hartmann, Nathan and Fonseca, Erick and Shulby, Christopher and Treviso, Marcos and Rodrigues, Jéssica and Aluísio, Sandra},
+   booktitle={Proceedings of the Symposium in Information and Human Language Technology (STIL)},
+   year={2017}
 }
 ```