---
license: apache-2.0
datasets:
- AiresPucrs/sentiment-analysis
language:
- en
metrics:
- accuracy
library_name: keras
---
# Embedding-model-16

## Model Overview

The Embedding-model-16 is a language model for sentiment analysis.

### Details

- **Size:** 160,289 parameters
- **Model type:** word embeddings
- **Optimizer:** Adam
- **Number of epochs:** 20
- **Embedding size:** 16
- **Hardware:** Tesla V4
- **Emissions:** Not measured
- **Total energy consumption:** Not measured

### How to Use

To run inference on this model, you can use the following code snippet:

```python
import numpy as np
import tensorflow as tf
from huggingface_hub import hf_hub_download

# Download the model file
hf_hub_download(repo_id="AiresPucrs/english-embedding-vocabulary-16",
                filename="english_embedding_vocabulary_16.keras",
                local_dir="./",
                repo_type="model")

# Download the embedding vocabulary txt file
hf_hub_download(repo_id="AiresPucrs/english-embedding-vocabulary-16",
                filename="english_embedding_vocabulary.txt",
                local_dir="./",
                repo_type="model")

model = tf.keras.models.load_model('english_embedding_vocabulary_16.keras')

# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Load the vocabulary (one token per line)
with open('english_embedding_vocabulary.txt', encoding='utf-8') as fp:
    english_embedding_vocabulary = [line.strip() for line in fp]

# Extract the trained embedding matrix from the embedding layer
embeddings = model.get_layer('embedding').get_weights()[0]

words_embeddings = {}

# Map each vocabulary word to its embedding vector
for i, word in enumerate(english_embedding_vocabulary):
    # Skip embedding/token 0 (""), because it is just the PAD token
    if i == 0:
        continue
    words_embeddings[word] = embeddings[i]

print("Embeddings dimensions: ", np.array(list(words_embeddings.values())).shape)
print("Vocabulary size: ", len(words_embeddings.keys()))
```

## Intended Use

This model was created for research purposes only. We do not recommend any application of this model outside this scope.
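The inference snippet leaves you with a `words_embeddings` dictionary mapping each vocabulary word to a 16-dimensional vector. A common use of such vectors is measuring word similarity. Below is a minimal sketch using cosine similarity; the dictionary here is a toy stand-in with random vectors (the word choices and values are illustrative only, not the model's actual embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-in for the words_embeddings dict built above
# (16-dimensional vectors, matching the model's embedding size)
rng = np.random.default_rng(42)
words_embeddings = {w: rng.normal(size=16) for w in ("good", "great", "bad")}

sim = cosine_similarity(words_embeddings["good"], words_embeddings["great"])
print(f"cosine(good, great) = {sim:.3f}")
```

With the real `words_embeddings` from the snippet above, the same function can rank the vocabulary by similarity to any query word.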
## Performance Metrics

The model achieved an accuracy of 84% on validation data.

## Training Data

The model was trained on a dataset assembled by combining several sentiment-classification datasets available on [Kaggle](https://www.kaggle.com/):

- The `IMDB 50K` [dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv): _50K movie reviews for natural language processing or text analytics._
- The `Twitter US Airline Sentiment` [dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment): _originated from the [Crowdflower's Data for Everyone library](http://www.crowdflower.com/data-for-everyone)._
- Our `google_play_apps_review` dataset: _built using the `google_play_scraper` in [this notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/master/ML%20Explainability/NLP%20Interpreter%20(en)/scrape(en).ipynb)._
- The `EcoPreprocessed` [dataset](https://www.kaggle.com/datasets/pradeeshprabhakar/preprocessed-dataset-sentiment-analysis): _scraped Amazon product reviews._

## Limitations

We do not recommend using this model in real-world applications. It was solely developed for academic and educational purposes.

## Cite as

```latex
@misc{teenytinycastle,
  doi = {10.5281/zenodo.7112065},
  url = {https://github.com/Nkluge-correa/teeny-tiny_castle},
  author = {Nicholas Kluge Corr{\^e}a},
  title = {Teeny-Tiny Castle},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
}
```

## License

This model is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.