|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Hailay/TigQA |
|
pipeline_tag: sentence-similarity |
|
--- |
|
|
|
# Geez Word2Vec Skipgram Model |
|
|
|
This repository contains a Word2Vec skip-gram model trained on the TIGQA dataset using a custom spaCy-based tokenizer.
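
The exact tokenizer used for training is not bundled with the model; a minimal sketch of a spaCy-based tokenizer for Geez-script text could look like the following (the blank "xx" multi-language pipeline and the sample words are illustrative assumptions, not the exact training setup):

```python
# Minimal sketch of a spaCy-based tokenizer for Geez-script (Tigrinya) text.
# NOTE: the blank "xx" multi-language pipeline and the sample words are assumptions;
# the custom tokenizer actually used for training may differ.
import spacy

nlp = spacy.blank("xx")  # rule-based tokenizer only, no trained components

def tokenize(text: str) -> list[str]:
    return [token.text for token in nlp(text)]

print(tokenize("α°α₯ αα ααα"))
```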
|
|
|
## Usage |
|
|
|
You can download and use the model in your Python code as follows: |
|
|
|
```python |
|
from gensim.models import Word2Vec
import urllib.request

# URL of the model file on Hugging Face
model_url = "https://huggingface.co/Hailay/Geez_word2vec_skipgram.model/resolve/main/Geez_word2vec_skipgram.model"

# Word2Vec.load expects a local file path, so download the model first
local_path, _ = urllib.request.urlretrieve(model_url, "Geez_word2vec_skipgram.model")

# Load the trained Word2Vec model
model = Word2Vec.load(local_path)
|
|
|
# Get a vector for a word |
|
word_vector = model.wv['α°α₯'] |
|
print(f"Vector for 'α°α₯': {word_vector}") |
|
|
|
# Find the most similar words |
|
similar_words = model.wv.most_similar('α°α₯') |
|
print(f"Words similar to 'α°α₯': {similar_words}") |
|
|
|
```

## Visualizing Word Vectors

You can visualize the word vectors using t-SNE:

```python
|
import matplotlib.pyplot as plt |
|
from sklearn.manifold import TSNE |
|
import numpy as np |
|
|
|
# Words to visualize; you can choose other words from the trained vocabulary
words = ['α°α₯', 'ααα', 'α°αα', 'ααα', 'αα', 'α£α
αͺ']
|
|
|
# Get the vectors for the words |
|
word_vectors = np.array([model.wv[word] for word in words]) |
|
|
|
# Reduce dimensionality using t-SNE with a lower perplexity value |
|
perplexity_value = min(5, len(words) - 1) |
|
tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=0) |
|
word_vectors_2d = tsne.fit_transform(word_vectors) |
|
|
|
# Create a scatter plot |
|
plt.figure(figsize=(10, 6)) |
|
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], edgecolors='k', c='r') |
|
|
|
# Add annotations to the points |
|
for i, word in enumerate(words): |
|
plt.annotate(word, xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]), xytext=(5, 2), |
|
textcoords='offset points', ha='right', va='bottom') |
|
|
|
plt.title('2D Visualization of Word2Vec Embeddings') |
|
plt.xlabel('TSNE Component 1') |
|
plt.ylabel('TSNE Component 2') |
|
plt.grid(True) |
|
plt.show()
```
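
If you want to reuse the embeddings outside gensim, you can export the keyed vectors to the standard plain-text word2vec format; the output filename below is only an example:

```python
# Export the word vectors in the standard word2vec text format
# (the filename "geez_word2vec_vectors.txt" is an arbitrary example).
model.wv.save_word2vec_format("geez_word2vec_vectors.txt", binary=False)
```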
|
|
|
|
|
## Dataset Source
|
|
|
The dataset used to train this model contains Tigrinya-language text written in the Geez script.

It is publicly available as an NLP resource for low-resource languages, intended for research and development.
|
|
|
For more information about the TIGQA dataset, see https://zenodo.org/records/11423987 and the HornMT project.
|
|
|
## License
|
This Word2Vec model and its associated files are released under the MIT License. |