---
license: apache-2.0
datasets:
- Hailay/TigQA
pipeline_tag: sentence-similarity
---

# Geez Word2Vec Skip-gram Model

This repository contains a Word2Vec skip-gram model trained on the TIGQA dataset, tokenized with a custom SpaCy tokenizer.

## Usage

`Word2Vec.load` expects a local file path (and access to any sidecar `.npy` arrays), so download the model file first, for example with `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download
from gensim.models import Word2Vec

# Download the model file from the Hugging Face Hub to a local cache
model_path = hf_hub_download(
    repo_id="Hailay/Geez_word2vec_skipgram.model",
    filename="Geez_word2vec_skipgram.model",
)

# Load the trained Word2Vec model
model = Word2Vec.load(model_path)

# Get the vector for a word
word_vector = model.wv['ሰብ']
print(f"Vector for 'ሰብ': {word_vector}")

# Find the most similar words
similar_words = model.wv.most_similar('ሰብ')
print(f"Words similar to 'ሰብ': {similar_words}")
```

## Visualizing Word Vectors

You can visualize the word vectors using t-SNE:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import numpy as np

# Words to visualize; substitute any words from the trained vocabulary
words = ['ሰብ', 'ዓለም', 'ሰላም', 'ሓይሊ', 'ጊዜ', 'ባህሪ']

# Get the vectors for the words
word_vectors = np.array([model.wv[word] for word in words])

# Reduce dimensionality with t-SNE; perplexity must be below the sample count
perplexity_value = min(5, len(words) - 1)
tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=0)
word_vectors_2d = tsne.fit_transform(word_vectors)

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], edgecolors='k', c='r')

# Annotate each point with its word
for i, word in enumerate(words):
    plt.annotate(word,
                 xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right', va='bottom')

plt.title('2D Visualization of Word2Vec Embeddings')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.grid(True)
plt.show()
```

## Dataset Source

The training data is Tigrinya text written in the Geez script. TIGQA is a publicly available NLP resource for low-resource languages, released for research and development. For more information, see the TIGQA dataset on Zenodo (https://zenodo.org/records/11423987); additional text comes from HornMT.

## License

This Word2Vec model and its associated files are released under the Apache 2.0 License.
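
## Training Sketch

This card states only that the model was trained with Gensim Word2Vec (skip-gram) on TIGQA using a custom SpaCy tokenizer. For orientation, here is a minimal sketch of that setup; the corpus path, the blank `xx` pipeline, and all hyperparameters (`vector_size`, `window`, `min_count`) are illustrative assumptions, not the original training configuration.

```python
import spacy
from gensim.models import Word2Vec

# Illustrative sketch only: the corpus path, tokenizer setup, and
# hyperparameters are assumptions, not the original training configuration.

# Blank multi-language pipeline standing in for the custom SpaCy tokenizer.
nlp = spacy.blank("xx")

# Hypothetical plain-text dump of the TIGQA corpus, one sentence per line.
corpus_path = "tigqa_corpus.txt"

sentences = []
with open(corpus_path, encoding="utf-8") as f:
    for line in f:
        doc = nlp(line.strip())
        tokens = [tok.text for tok in doc if not tok.is_space]
        if tokens:
            sentences.append(tokens)

# sg=1 selects the skip-gram architecture (sg=0 would be CBOW).
model = Word2Vec(
    sentences,
    sg=1,             # skip-gram
    vector_size=100,  # assumed embedding dimensionality
    window=5,         # assumed context window
    min_count=2,      # assumed minimum word frequency
    workers=4,
)
model.save("Geez_word2vec_skipgram.model")
```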