Hailay committed on
Commit 26e3b4b • 1 Parent(s): e67ae0a

Update README.md

Files changed (1)
1. README.md +49 -17
README.md CHANGED
@@ -1,39 +1,71 @@
 
 
 
 
 
 
 datasets:
 - Hailay/TigQA
- # Geez Word2Vec Model
-
- This repository contains a Word2Vec model trained on the TIGQA dataset using a custom tokenizer with SpaCy.

- ## Model Description

- The Word2Vec model in this repository has been trained to generate word embeddings for Geez script Tigrinya text. The model captures semantic relationships between words in the Geez language based on their context in the TIGQA dataset.

 ## Usage

- To use the trained Word2Vec model, follow these steps:
-
- 1. Clone this repository to your local machine.
- 2. Install the required dependencies (`spacy`, `gensim`).
- 3. Load the model using the provided Python code.
- 4. Use the model to generate word embeddings for Geez script Tigrinya text.
-
- Example usage:

 ```python
 from gensim.models import Word2Vec

- # Load the trained Word2Vec model
- model = Word2Vec.load("Geez_word2vec_skipgram.model")

 # Get a vector for a word
- word_vector = model.wv['ሰα‰₯']
 print(f"Vector for 'ሰα‰₯': {word_vector}")

 # Find the most similar words
 similar_words = model.wv.most_similar('ሰα‰₯')
 print(f"Words similar to 'ሰα‰₯': {similar_words}")
 ```

- Dataset Source

 The dataset for training this model contains text data in the Geez script of the Tigrinya language.
 It is a publicly available dataset, released as an NLP resource for low-resource languages for research and development.
 
+ ---
+ license: apache-2.0
+ datasets:
+ - Hailay/TigQA
+ pipeline_tag: sentence-similarity
+ ---
 datasets:
 - Hailay/TigQA

+ # Geez Word2Vec Skipgram Model

+ This repository contains a Word2Vec model trained on the TIGQA dataset using a custom tokenizer with SpaCy.

 ## Usage

+ You can download and use the model in your Python code as follows:

 ```python
 from gensim.models import Word2Vec

+ # URL of the model file on Hugging Face
+ model_url = "https://huggingface.co/Hailay/Geez_word2vec_skipgram.model/resolve/main/Geez_word2vec_skipgram.model"
+
+ # Load the trained Word2Vec model directly from the URL
+ model = Word2Vec.load(model_url)

 # Get a vector for a word
+ word_vector = model.wv['ሰα‰₯']
 print(f"Vector for 'ሰα‰₯': {word_vector}")

 # Find the most similar words
 similar_words = model.wv.most_similar('ሰα‰₯')
 print(f"Words similar to 'ሰα‰₯': {similar_words}")
 ```

+ ## Visualizing Word Vectors
+
+ You can visualize the word vectors using t-SNE:
+
+ ```python
+ import matplotlib.pyplot as plt
+ from sklearn.manifold import TSNE
+ import numpy as np
+
+ # Words to visualize; you can swap in other words from the trained vocabulary
+ words = ['ሰα‰₯', 'α‹“αˆˆαˆ', 'αˆ°αˆ‹αˆ', 'αˆ“α‹­αˆŠ', 'αŒŠα‹œ', 'α‰£αˆ…αˆͺ']
+
+ # Get the vectors for the words
+ word_vectors = np.array([model.wv[word] for word in words])
+
+ # Reduce dimensionality using t-SNE with a lower perplexity value
+ perplexity_value = min(5, len(words) - 1)
+ tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=0)
+ word_vectors_2d = tsne.fit_transform(word_vectors)
+
+ # Create a scatter plot
+ plt.figure(figsize=(10, 6))
+ plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], edgecolors='k', c='r')
+
+ # Add annotations to the points
+ for i, word in enumerate(words):
+     plt.annotate(word, xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]), xytext=(5, 2),
+                  textcoords='offset points', ha='right', va='bottom')
+
+ plt.title('2D Visualization of Word2Vec Embeddings')
+ plt.xlabel('TSNE Component 1')
+ plt.ylabel('TSNE Component 2')
+ plt.grid(True)
+ plt.show()
+ ```
+
+ ## Dataset Source

 The dataset for training this model contains text data in the Geez script of the Tigrinya language.
 It is a publicly available dataset, released as an NLP resource for low-resource languages for research and development.
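
Note on loading: gensim resolves any auxiliary `.npy` array files relative to the path passed to `Word2Vec.load`, so loading straight from the URL as in the updated README may fail for models saved in several files. Below is a minimal sketch of the same usage that downloads the file locally first with `huggingface_hub`; the `repo_id` and `filename` are assumptions inferred from the URL above and may need adjusting.

```python
from gensim.models import Word2Vec
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download the model file into the local Hugging Face cache.
# repo_id and filename are assumptions inferred from the README's URL.
model_path = hf_hub_download(
    repo_id="Hailay/Geez_word2vec_skipgram.model",
    filename="Geez_word2vec_skipgram.model",
)

# Load the trained Word2Vec model from the local path
model = Word2Vec.load(model_path)

# Same queries as in the README
print(f"Vector for 'ሰα‰₯': {model.wv['ሰα‰₯']}")
print(f"Words similar to 'ሰα‰₯': {model.wv.most_similar('ሰα‰₯')}")
```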