Hailay
/

Geez_word2vec_skipgram.model

Sentence Similarity

Inference Endpoints

Model card Files Files and versions Community

Geez_word2vec_skipgram.model / README.md

Hailay's picture

Update README.md

26e3b4b verified 4 months ago

|

history blame contribute delete

2.45 kB

	---
	license: apache-2.0
	datasets:
	- Hailay/TigQA
	pipeline_tag: sentence-similarity
	---
	datasets:
	- Hailay/TigQA

	# Geez Word2Vec Skipgram Model

	This repository contains a Word2Vec model trained on the TIGQA dataset using a custom tokenizer with SpaCy.

	## Usage

	You can download and use the model in your Python code as follows:

	```python
	from gensim.models import Word2Vec

	# URL of the model file on Hugging Face
	model_url = "https://huggingface.co/Hailay/Geez_word2vec_skipgram.model/resolve/main/Geez_word2vec_skipgram.model"

	# Load the trained Word2Vec model directly from the URL
	model = Word2Vec.load(model_url)

	# Get a vector for a word
	word_vector = model.wv['ሰብ']
	print(f"Vector for 'ሰብ': {word_vector}")

	# Find the most similar words
	similar_words = model.wv.most_similar('ሰብ')
	print(f"Words similar to 'ሰብ': {similar_words}")

	#Visualizing Word Vectors
	You can visualize the word vectors using t-SNE:
	import matplotlib.pyplot as plt
	from sklearn.manifold import TSNE
	import numpy as np

	# Words to visualize but you can change the words from the trained vocublary
	words = ['ሰብ', 'ዓለም', 'ሰላም', 'ሓይሊ','ጊዜ', 'ባህሪ']

	# Get the vectors for the words
	word_vectors = np.array([model.wv[word] for word in words])

	# Reduce dimensionality using t-SNE with a lower perplexity value
	perplexity_value = min(5, len(words) - 1)
	tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=0)
	word_vectors_2d = tsne.fit_transform(word_vectors)

	# Create a scatter plot
	plt.figure(figsize=(10, 6))
	plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], edgecolors='k', c='r')

	# Add annotations to the points
	for i, word in enumerate(words):
	plt.annotate(word, xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]), xytext=(5, 2),
	textcoords='offset points', ha='right', va='bottom')

	plt.title('2D Visualization of Word2Vec Embeddings')
	plt.xlabel('TSNE Component 1')
	plt.ylabel('TSNE Component 2')
	plt.grid(True)
	plt.show()


	##Dataset Source

	The dataset for training this model contains text data in the Geez script of the Tigrinya language.
	It is a publicly available dataset as part of an NLP resource for low-resource languages for research and development.

	For more information about the TIGQA dataset, visit this link. https://zenodo.org/records/11423987 and from HornMT

	License
	This Word2Vec model and its associated files are released under the MIT License.