---
license: apache-2.0
datasets:
- wikipedia
language:
- it
---
--------------------------------------------------------------------------------------------------

<body>
<span class="vertical-text" style="background-color:lightgreen;border-radius: 3px;padding: 3px;"> </span>
<br>
<span class="vertical-text" style="background-color:orange;border-radius: 3px;padding: 3px;"> </span>
<br>
<span class="vertical-text" style="background-color:lightblue;border-radius: 3px;padding: 3px;"> Model: Word2Vec</span>
<br>
<span class="vertical-text" style="background-color:tomato;border-radius: 3px;padding: 3px;"> Lang: IT</span>
<br>
<span class="vertical-text" style="background-color:lightgrey;border-radius: 3px;padding: 3px;"> </span>
<br>
<span class="vertical-text" style="background-color:#CF9FFF;border-radius: 3px;padding: 3px;"> </span>
</body>

--------------------------------------------------------------------------------------------------

<h3>Model description</h3>

This model is a <b>lightweight</b>, uncased version of <b>Word2Vec</b> <b>[1]</b> for the <b>Italian</b> language. It is implemented in Gensim and provides embeddings for 560,509 uncased Italian words in a 100-dimensional vector space, resulting in a total model size of about 245 MB.

<h3>Training procedure</h3>

The model was trained on the Italian split of the Wikipedia dataset (about 3.7 GB, lowercased and pre-processed) for 10 epochs, using a window size of 5, a minimum word count of 10, and an initial learning rate of 2.5e-3.
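
For reference, a training run with these hyperparameters looks roughly like the sketch below (Gensim 4.x API). This is a minimal sketch, not the exact training script: the corpus file name and the worker count are assumptions.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Assumption: a pre-processed dump with one lowercased, tokenized sentence per line
corpus = LineSentence("itwiki_lowercased.txt")

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # 100-dimensional embeddings
    window=5,         # context window of 5
    min_count=10,     # discard words occurring fewer than 10 times
    alpha=2.5e-3,     # initial learning rate
    epochs=10,
    workers=4,        # assumption: tune to the available CPU cores
)

# Save only the word vectors, producing the file loaded in "Quick usage" below
model.wv.save("word2vec.wordvectors")
```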

<h3>Quick usage</h3>

Download the files in a local folder called "word2vec-light-uncased-it", then run:

```python
from gensim.models import KeyedVectors

# Load the word vectors, memory-mapped in read-only mode to keep RAM usage low
model = KeyedVectors.load("./word2vec-light-uncased-it/word2vec.wordvectors", mmap='r')

# Find the 5 words closest to "poesia" ("poetry") in the embedding space
model.most_similar("poesia", topn=5)
```

Expected output:

```
[('letteratura', 0.8193784356117249),
 ('poetica', 0.8115736246109009),
 ('narrativa', 0.7729100584983826),
 ('drammaturgia', 0.7576397061347961),
 ('prosa', 0.7552034854888916)]
```
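
Continuing from the snippet above, the loaded `KeyedVectors` object also exposes the rest of Gensim's standard query API. The word pairs below are arbitrary illustrative examples, actual scores depend on the trained vectors, and querying a word outside the 560,509-word vocabulary raises a `KeyError`:

```python
# Vocabulary size and dimensionality, matching the model description above
print(len(model), model.vector_size)  # 560509 100

# Cosine similarity between two words (illustrative pair)
print(model.similarity("libro", "romanzo"))

# Analogy query: "re" - "uomo" + "donna" ("king" - "man" + "woman")
print(model.most_similar(positive=["re", "donna"], negative=["uomo"], topn=3))

# Raw 100-dimensional vector for a word
print(model["poesia"].shape)  # (100,)
```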

<h3>Limitations</h3>

This lightweight model is trained on Wikipedia, so it is particularly suitable for natively digital text
from the world wide web, written in a correct and fluent form (like wikis, web pages, news, etc.).

However, it may show limitations on noisy text containing errors and slang expressions
(like social media posts), or on domain-specific text (like medical, financial, or legal content).

<h3>References</h3>

[1] Mikolov et al., "Efficient Estimation of Word Representations in Vector Space": https://arxiv.org/abs/1301.3781

<h3>License</h3>

The model is released under the <b>Apache-2.0</b> license.