language:
- en
tags:
- word2vec
- embeddings
- nlp
- sports
- outdoors
- amazon-reviews
metrics:
- semantic similarity
Word2Vec Model for Amazon Sports & Outdoors Reviews
Model Description
This is a Word2Vec model trained on Amazon product reviews from the Sports & Outdoors category. The model was trained using the Gensim library on 296,337 reviews to learn word embeddings that capture semantic relationships between words in the context of sports and outdoor product reviews.
- Model type: Word2Vec (CBOW if Gensim's default sg=0 was used; Skip-gram requires sg=1)
- Training data: Amazon Sports & Outdoors reviews (296,337 reviews)
- Vocabulary size: Not reported; the vocabulary contains every word that appears at least twice (min_count=2)
- Vector dimension: 100 (Gensim default)
- Window size: 10 words
Intended Uses & Limitations
Intended Use
This model is designed for:
- Semantic similarity tasks for sports and outdoor-related vocabulary
- Product recommendation systems
- Review analysis and sentiment tasks
- Keyword expansion and related term discovery
- Educational and research purposes
Limitations
- The model is specialized for the sports and outdoors domain
- Performance on vocabulary outside this domain may be limited
- Inherits any biases present in the Amazon review data
- May not perform well for very recent terminology not present in the training data
How to Use
Installation
pip install gensim pandas
Loading the Model
import gensim
# Load the model
model = gensim.models.Word2Vec.load("word2vec_model.model")
Getting Word Similarities
# Find words similar to "good"
similar_words = model.wv.most_similar("good", topn=5)
print(similar_words)
# Find words similar to "slow"
similar_words = model.wv.most_similar("slow", topn=5)
print(similar_words)
Additional Operations
# Get word vector
vector = model.wv['running']
# Calculate similarity between two words
similarity = model.wv.similarity('hiking', 'outdoors')
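# Find the odd one out (each entry must be a single token; simple_preprocess
# splits text into single words, so a phrase like 'sleeping bag' is not in the vocabulary)
odd_one = model.wv.doesnt_match(['tent', 'backpack', 'basketball'])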
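The similarity score returned by model.wv.similarity is the cosine similarity between the two word vectors. As an illustration of what that number means, here is a small pure-Python sketch; the three-dimensional vectors are made up for the example, whereas the real model uses 100 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
hiking = [0.9, 0.1, 0.3]
outdoors = [0.8, 0.2, 0.4]
print(round(cosine_similarity(hiking, outdoors), 3))
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and negative values indicate opposing directions.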
Training Details
Training Data
The model was trained on the Amazon Sports & Outdoors reviews dataset (https://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Sports_and_Outdoors_5.json.gz), which contains 296,337 reviews with 9 columns each. The review text was preprocessed with Gensim's simple_preprocess function.
Hyperparameters
- Window size: 10
- Minimum word count: 2
- Vector size: 100 (default)
- Training algorithm: CBOW (Gensim default, sg=0); Skip-gram would require sg=1
- Negative samples: 5 (default)
- Epochs: 5 (default)
Evaluation
The model can be evaluated by examining the semantic relationships it captures. For example:
- It should find "excellent", "great", and "nice" similar to "good"
- It should rank antonyms such as "fast" and "quick" close to "slow", because antonyms tend to appear in similar contexts
- It should maintain sports-specific relationships (e.g., "football" related to "soccer")
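The checks listed above can be turned into a small spot test. The sketch below assumes Gensim 4.x and a loaded model; the helper name neighbour_hits is hypothetical, and the choice of the top 10 neighbours is arbitrary.

```python
def neighbour_hits(wv, word, expected, topn=10):
    """Return the words from `expected` found among the top-n neighbours of `word`."""
    neighbours = {w for w, _score in wv.most_similar(word, topn=topn)}
    return [w for w in expected if w in neighbours]


# Usage, assuming `model` was loaded as shown in "Loading the Model":
# print(neighbour_hits(model.wv, "good", ["excellent", "great", "nice"]))
# print(neighbour_hits(model.wv, "slow", ["fast", "quick"]))
# print(neighbour_hits(model.wv, "football", ["soccer"]))
```

An empty result does not necessarily mean the model is poor; the expected word may simply rank below the chosen cutoff.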
Model Performance
No quantitative evaluation metrics, such as accuracy on analogy tasks, are reported. Evaluation so far is qualitative: inspecting nearest neighbors shows meaningful semantic relationships for vocabulary in the sports and outdoors domain.
Ethical Considerations
- The model may reflect biases present in the original Amazon reviews
- Should not be used for automated decision making without human oversight
- Users should be aware that word embeddings can amplify societal biases
Citation
If you use this model in your research, please cite one or both of the following papers for the original Amazon reviews dataset:
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016
Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
License
The model is shared for research purposes. The original data follows Amazon's terms of use.