Sent2vec trained with data from the descriptive text corpus of the CelebA dataset
- Language: Spanish
- Data: CelebA_Sent2vec_Sp.
- Architecture: Sent2vec
Sent2vec can be used directly on English text: most of these algorithms were originally trained on English, so you only need to download the library and pass in the text to be encoded. However, because this work uses text in Spanish, the model had to be trained from scratch in that language. Training used the corpus generated in this repository and followed this process:
- A corpus composed of a set of descriptive sentences of characteristics of each of the faces of the CelebA dataset in Spanish has been generated. A total of 192,209 sentences are available for training.
- Apply a pre-processing step that removes accents. Stopwords and connectors were retained as part of the sentence structure during training.
- Install the Sent2vec and FastText libraries and configure the parameters. The parameters were fixed empirically after several tests: 4,800-dimensional feature vectors, 5,000 epochs, 200 threads, 2 n-grams and a learning rate of 0.05.
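The accent-removal step in the process above can be sketched with the Python standard library. This is a minimal illustration, not the script used to build the corpus: the function name `strip_accents` is hypothetical, and the sketch assumes that every combining mark is dropped, which also turns "ñ" into "n". Stopwords and connectors are left untouched, as described above.

```python
import unicodedata


def strip_accents(text):
    """Remove diacritics from text while keeping every base character."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("La mujer tiene el cabello marrón y una sonrisa tímida"))
# prints: La mujer tiene el cabello marron y una sonrisa timida
```

The same function should be applied to any caption before it is embedded, so that inference text matches the form of the training text.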
With this configuration, training took 7 hours with all CPUs running at maximum load. The result is a .bin file, which can be downloaded from this repository.
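For reference, the reported hyperparameters could be mapped onto the `fasttext sent2vec` command line that ships with the Sent2vec library roughly as follows. This is a sketch under assumptions: the corpus file name is hypothetical, the mapping of "2 n-grams" to `-wordNgrams 2` is a guess, and any option not listed in this document is left at the library default. The snippet only builds and prints the command; it does not start training.

```python
# Hypothetical corpus path: the source does not name the training file.
CORPUS_PATH = "celeba_captions_es.txt"
# Output prefix chosen so that the resulting file matches the published model name.
OUTPUT_PREFIX = "sent2vec_celebAEs-UNI"


def build_training_command(corpus_path, output_prefix):
    """Return the training command as an argument list (assumed flag mapping)."""
    return [
        "./fasttext", "sent2vec",
        "-input", corpus_path,
        "-output", output_prefix,
        "-dim", "4800",        # feature vector dimensions
        "-epoch", "5000",      # training epochs
        "-lr", "0.05",         # learning rate
        "-wordNgrams", "2",    # assumed meaning of "2 n-grams"
        "-thread", "200",      # worker threads
    ]


command = build_training_command(CORPUS_PATH, OUTPUT_PREFIX)
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch training, provided the `fasttext` binary from the Sent2vec repository has been compiled in the working directory.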
How to use
Download the model file, sent2vec_celebAEs-UNI.bin, and load it with the sent2vec library in Python as follows:

    import sent2vec

    # Path to the downloaded model file
    model_path = "sent2vec_celebAEs-UNI.bin"

    # Load the trained Spanish model
    s2vmodel = sent2vec.Sent2vecModel()
    s2vmodel.load_model(model_path)

    # Caption to encode (accents already removed, as in the training corpus)
    caption = (
        "El hombre luce una sombra a las 5 en punto. "
        "Su cabello es de color negro. "
        "Tiene una nariz grande con cejas tupidas. "
        "El hombre se ve atractivo"
    )

    vector = s2vmodel.embed_sentence(caption)
    print(vector)
The encoder returns a numeric array containing a single 4,800-dimensional vector for the input sentence (the values below are illustrative):

    >>> print(vector)
    [[0.1, 0.87, 0.51, ..., 0.7]]
    >>> len(vector[0])
    4800
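One common use of such vectors is to compare two captions. The sketch below is an illustration rather than part of the published model: the helper names `cosine_similarity` and `caption_similarity` are hypothetical, and it assumes that `embed_sentence` returns a 2-D array with one row, as in the output above.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length 1-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def caption_similarity(model, caption_a, caption_b):
    """Embed two captions with `model` and return their cosine similarity.

    Assumes model.embed_sentence returns an array of shape (1, dim),
    so row 0 is taken as the sentence vector.
    """
    vec_a = model.embed_sentence(caption_a)[0]
    vec_b = model.embed_sentence(caption_b)[0]
    return cosine_similarity(vec_a, vec_b)
```

With the model loaded as `s2vmodel`, a call such as `caption_similarity(s2vmodel, "Su cabello es de color negro", "El hombre tiene el cabello negro")` returns a value between -1 and 1, where higher values indicate more similar captions.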
For detailed information on using the trained model, see the following link.
This model is available under the CC BY-NC 4.0 license.
Citing: If you use the Sent2vec+CelebA model in your work, please cite the ????:
Universidad Nacional de Ingeniería, Ontology Engineering Group, Universidad Politécnica de Madrid.
See the full list of contributors here.