---
license: apache-2.0
datasets:
- AiresPucrs/sentiment-analysis
language:
- en
metrics:
- accuracy
library_name: keras
---
# english-embedding-vocabulary-16

## Model Overview
The english-embedding-vocabulary-16 is a word-embedding model trained for sentiment analysis of English text.
## Details
- Size: 160,289 parameters
- Model type: word embeddings
- Optimizer: Adam
- Number of Epochs: 20
- Embedding size: 16
- Hardware: Tesla V4
- Emissions: Not measured
- Total Energy Consumption: Not measured
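The card does not document the layer stack, but a common embedding-based classifier that is consistent with the stated parameter count (160,289) and embedding size (16) is an `Embedding` layer followed by global average pooling and a single sigmoid output unit. The vocabulary size of 10,017 below is inferred from that arithmetic, not stated in the card, so treat this as a hypothetical reconstruction:

```python
import tensorflow as tf

# Hypothetical architecture (NOT documented in the card). A vocabulary of
# 10,017 tokens with 16-dimensional embeddings plus one sigmoid unit matches
# the stated parameter count: 10,017 * 16 + (16 * 1 + 1) = 160,289.
VOCAB_SIZE = 10_017
EMBEDDING_DIM = 16

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Build the model by running a dummy batch through it
model(tf.zeros((1, 10), dtype=tf.int32))
print(model.count_params())  # 160289
```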
## How to Use
To run inference on this model, you can use the following code snippet:
```python
import numpy as np
import tensorflow as tf
from huggingface_hub import hf_hub_download

# Download the model
hf_hub_download(repo_id="AiresPucrs/english-embedding-vocabulary-16",
                filename="english_embedding_vocabulary_16.keras",
                local_dir="./",
                repo_type="model")

# Download the embedding vocabulary txt file
hf_hub_download(repo_id="AiresPucrs/english-embedding-vocabulary-16",
                filename="english_embedding_vocabulary.txt",
                local_dir="./",
                repo_type="model")

model = tf.keras.models.load_model('english_embedding_vocabulary_16.keras')

# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

with open('english_embedding_vocabulary.txt', encoding='utf-8') as fp:
    english_embedding_vocabulary = [line.strip() for line in fp]

embeddings = model.get_layer('embedding').get_weights()[0]

words_embeddings = {}

# Iterate over the vocabulary, skipping index 0 ("") since it is the PAD token
for i, word in enumerate(english_embedding_vocabulary):
    if i == 0:
        continue
    words_embeddings[word] = embeddings[i]

print("Embeddings Dimensions: ", np.array(list(words_embeddings.values())).shape)
print("Vocabulary Size: ", len(words_embeddings.keys()))
```
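Once the `words_embeddings` dictionary is built, a typical use is comparing words by cosine similarity. The sketch below is self-contained for illustration: the three toy vectors stand in for real rows of the embedding matrix, and `cosine_similarity` is a helper defined here, not part of the model:

```python
import numpy as np

# Toy stand-in for the words_embeddings dict built above; these vectors are
# made up for illustration (real embeddings come from the trained model).
words_embeddings = {
    "good":  np.array([0.9, 0.1, 0.2]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad":   np.array([-0.7, 0.1, 0.3]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_pos = cosine_similarity(words_embeddings["good"], words_embeddings["great"])
sim_neg = cosine_similarity(words_embeddings["good"], words_embeddings["bad"])
print(sim_pos > sim_neg)  # True: words with similar sentiment sit closer together
```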
## Intended Use
This model was created for research purposes only. We do not recommend any application of this model outside this scope.
## Performance Metrics
The model achieved an accuracy of 84% on validation data.
## Training Data
The model was trained on a dataset assembled by combining several sentiment classification datasets available on Kaggle:

- The IMDB 50K dataset: 50K movie reviews for natural language processing or text analytics.
- The Twitter US Airline Sentiment dataset: originated from Crowdflower's Data for Everyone library.
- Our google_play_apps_review dataset: built using the google_play_scraper in this notebook.
- The EcoPreprocessed dataset: scraped Amazon product reviews.
## Limitations
We do not recommend using this model in real-world applications. It was solely developed for academic and educational purposes.
## Cite as
```bibtex
@misc{teenytinycastle,
  doi = {10.5281/zenodo.7112065},
  url = {https://github.com/Nkluge-correa/teeny-tiny_castle},
  author = {Nicholas Kluge Corr{\^e}a},
  title = {Teeny-Tiny Castle},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
}
```
## License
This model is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.