CC100 GloVe Embeddings for Urdu (ur)
Model Description
- Language: ur
- Embedding Algorithm: GloVe (Global Vectors for Word Representation)
- Vocabulary Size: 682232
- Vector Dimensions: 300
- Training Data: CC100 dataset
Training Information
We trained the GloVe embeddings with the original C implementation. Training stochastically sampled nonzero elements from the co-occurrence matrix over 100 iterations and produced 300-dimensional vectors. The context window was symmetric: ten words to the left and ten words to the right. Words with fewer than 5 co-occurrences were excluded for languages with more than 1 million tokens of training data; for languages with smaller datasets, the threshold was 2.
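To make the windowing and threshold rules concrete, here is a minimal Python sketch of symmetric co-occurrence counting. It is not the original C code. The function names are hypothetical, and the 1/distance weighting is an assumption based on the default behavior of the reference GloVe implementation; this card does not state which weighting was used.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window=10):
    """Accumulate symmetric co-occurrence counts over a token list.

    Assumption: each pair is weighted by 1/distance, as the reference
    GloVe tool does by default. The model card does not confirm this.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Look back at up to `window` preceding tokens; updating both
        # (word, context) and (context, word) makes the window symmetric.
        for j in range(max(0, i - window), i):
            weight = 1.0 / (i - j)
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return counts


def min_count_threshold(num_tokens):
    """Exclusion threshold described above: 5 for languages with more
    than 1 million training tokens, otherwise 2."""
    return 5 if num_tokens > 1_000_000 else 2
```

Looking only backwards and updating both directions keeps the loop simple while still covering ten words on each side of every token.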
The training data for these static word embeddings came from CC100. We set x_max = 100 and α = 3/4 in the GloVe weighting function and optimized with AdaGrad, using an initial learning rate of 0.05.
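For readers unfamiliar with these hyperparameters, the sketch below shows the standard GloVe weighting function f(x) = (x / x_max)^α for x < x_max and 1 otherwise, followed by a textbook AdaGrad update. Both functions are illustrative and use hypothetical names; the reference C code handles its gradient bookkeeping somewhat differently, so treat this as an explanation of the settings rather than a copy of the training loop.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weights rare co-occurrences and caps the
    weight at 1.0 once the count reaches x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def adagrad_step(param, grad, grad_sq_sum, lr=0.05, eps=1e-8):
    """One textbook AdaGrad update for a single scalar parameter.

    The effective step size shrinks as squared gradients accumulate.
    `eps` is an assumption added here for numerical safety.
    """
    grad_sq_sum += grad * grad
    param -= lr * grad / (math.sqrt(grad_sq_sum) + eps)
    return param, grad_sq_sum
```

With x_max = 100, any pair whose co-occurrence count is 100 or more contributes to the loss with full weight, while a pair seen once receives a weight of roughly 0.03.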
Usage
These embeddings can be used for NLP tasks such as text classification and named entity recognition, or as input features for neural networks.
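As a starting point, the following sketch loads vectors and compares two words. It assumes the embeddings are distributed in the plain-text format written by the GloVe tool (one word per line followed by its space-separated vector components). This card does not name the file, so the path in the usage note is a placeholder, and the function names are hypothetical.

```python
import math


def load_glove_text(path, expected_dim=300):
    """Read GloVe-style text vectors into a dict of word -> list of floats.

    Assumption: each line is "<word> <v1> ... <vN>" separated by single
    spaces. Lines with a different number of fields are skipped.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != expected_dim + 1:
                continue
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

A call such as `vectors = load_glove_text("vectors.txt")`, where `vectors.txt` stands in for the actual file, returns a dictionary whose entries can be passed to `cosine_similarity` or stacked into an embedding matrix for a downstream model.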
Citation
If you use these embeddings in your research, please cite:
@misc{gurgurov2024gremlinrepositorygreenbaseline,
      title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge},
      author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
      year={2024},
      eprint={2409.18193},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18193},
}
License
These embeddings are released under the CC-BY-SA 4.0 License.