---
license: mit
language:
- en
tags:
- topic-modeling
datasets:
- CCRss/arxiv_papers_cs
---

# Top2Vec Scientific Texts Model

![MindMap](markmap-main.png)

This repository hosts the `top2vec_scientific_texts` model, a specialized Top2Vec model trained on scientific texts for topic modeling and semantic search.

## Model Overview

The `top2vec_scientific_texts` model is built for analyzing scientific literature. It embeds texts with the Universal Sentence Encoder and uses Top2Vec for topic modeling.

### Key Features:

- **Domain-Specific:** Tailored for scientific texts.
- **Base Model:** Uses the Universal Sentence Encoder to produce text embeddings.
- **Topic Modeling:** Employs Top2Vec to discover topics in scientific documents.

## Installation

To use the model, install the following dependencies:

```bash
pip install top2vec
# Quote the extra so that shells such as zsh do not expand the brackets
pip install "top2vec[sentence_encoders]"
pip install tensorflow==2.8.0
pip install tensorflow-probability==0.16.0
```

## Model Training Process

Model training, dataset creation, and visualization are all documented in the `main.ipynb` Jupyter notebook in this repository. To explore the code and replicate the results:

- Open the `main.ipynb` notebook in Jupyter Lab or Jupyter Notebook.
- Execute the cells in sequence to run each stage of the analysis.
- Review the results in the notebook, where thematic group analysis, trend analysis, and visualizations of interest dynamics over the years are presented as tables and graphs.

## Usage

Here's an example of how to train a Top2Vec model for topic modeling and save it:

```python
from top2vec import Top2Vec

# Load your documents (a list of strings; replace the placeholders with real texts)
docs = ["Document 1 text", "Document 2 text", ...]

# Initialize and train the Top2Vec model
model = Top2Vec(
    documents=docs,
    speed='learn',
    workers=80,  # number of worker threads; adjust to your CPU core count
    embedding_model='universal-sentence-encoder',
    umap_args={'n_neighbors': 15, 'n_components': 5, 'metric': 'cosine', 'min_dist': 0.0, 'random_state': 42},
    hdbscan_args={'min_cluster_size': 15, 'metric': 'euclidean', 'cluster_selection_method': 'eom'}
)

# Save the model
model.save('top2vec_scientific_texts_model')
```
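
Once a model has been saved, it can be reloaded and inspected. The sketch below is a minimal illustration, not part of the original notebook: `load_model` and `summarize_topics` are hypothetical helper names, and the code assumes the standard Top2Vec methods `Top2Vec.load`, `get_num_topics`, `get_topic_sizes`, and `get_topics`. The `top2vec` import is deferred so the summary helper can be exercised without the library installed.

```python
def load_model(path="top2vec_scientific_texts_model"):
    """Load a saved Top2Vec model (path matches the save call above)."""
    # Imported lazily: only needed when a real model is loaded.
    from top2vec import Top2Vec
    return Top2Vec.load(path)


def summarize_topics(model, num_topics=5, num_words=8):
    """Return (topic_id, size, leading_words) for up to `num_topics` topics.

    `model` is assumed to expose the Top2Vec methods used below.
    """
    available = model.get_num_topics()
    topic_sizes, topic_ids = model.get_topic_sizes()
    size_by_topic = {int(t): int(s) for t, s in zip(topic_ids, topic_sizes)}

    # get_topics returns the topic words, their scores, and the topic numbers.
    topic_words, _word_scores, topic_nums = model.get_topics(min(num_topics, available))
    return [
        (int(topic), size_by_topic[int(topic)], [str(w) for w in words[:num_words]])
        for words, topic in zip(topic_words, topic_nums)
    ]
```

With `top2vec` installed and a saved model on disk, `summarize_topics(load_model())` should give one tuple per topic, which is a quick way to check that the discovered topics look sensible before running further analysis.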

## Dataset

The model was trained on scientific abstracts sourced from [arXiv](https://arxiv.org/). The dataset covers a range of computer science topics from 2010 to 2024.

The dataset is available on Hugging Face as [arxiv_papers_cs](https://huggingface.co/datasets/CCRss/arxiv_papers_cs).
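
As a rough sketch of how the abstracts might be turned into the `docs` list that Top2Vec expects, the snippet below loads the dataset with the Hugging Face `datasets` library and cleans the texts. The split name `"train"` and the column name `"abstract"` are assumptions, not confirmed by this README, so check the dataset card before relying on them. `clean_documents` and `load_abstracts` are hypothetical helper names.

```python
def clean_documents(texts):
    """Collapse whitespace, drop empty or non-string entries, and remove duplicates."""
    seen = set()
    docs = []
    for text in texts:
        if not isinstance(text, str):
            continue
        normalized = " ".join(text.split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            docs.append(normalized)
    return docs


def load_abstracts(repo_id="CCRss/arxiv_papers_cs", split="train", text_field="abstract"):
    """Load abstracts from the Hub; `split` and `text_field` are assumed names."""
    # Imported lazily so clean_documents works without `datasets` installed.
    from datasets import load_dataset
    dataset = load_dataset(repo_id, split=split)
    return clean_documents(row[text_field] for row in dataset)
```

The resulting list can be passed as `documents=` when constructing a Top2Vec model.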

## Use Cases

The `top2vec_scientific_texts` model can be used for various purposes, including:

- **Topic Discovery:** Identify the main topics within a collection of scientific texts.
- **Semantic Search:** Find documents that are semantically similar to a query text.
- **Trend Analysis:** Analyze the evolution of topics over time.
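
The semantic search use case can be sketched as follows. This is an illustrative helper, not code from the repository: `find_similar_documents` is a hypothetical name, and it assumes a Top2Vec version that provides `query_documents`, which embeds a free-text query with the model's sentence encoder and returns the closest documents, their scores, and their IDs.

```python
def find_similar_documents(model, query, num_docs=5, snippet_chars=200):
    """Return the documents most similar to `query` as a list of dicts.

    `model` is assumed to be a Top2Vec model trained with a sentence
    encoder and with the original documents kept in the model.
    """
    documents, scores, doc_ids = model.query_documents(query, num_docs=num_docs)
    return [
        {"doc_id": doc_id, "score": float(score), "snippet": str(doc)[:snippet_chars]}
        for doc, score, doc_id in zip(documents, scores, doc_ids)
    ]
```

For example, `find_similar_documents(model, "drones for disaster response")` would list the stored abstracts closest to that query, each truncated to a short snippet for display.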

## Examples

Here are some examples of the model's output for the thematic group "UAV in Disasters and Emergency":

### Trend Analysis for "UAV in Disasters and Emergency"

![Trend Analysis](disasters_and_emergency_plot.png)

This graph shows how interest in the use of UAVs in disaster and emergency situations has changed over time.

### Key Metrics Table

Analysis for Thematic Group: Disasters & Emergency

|   Year |   Number of Publications |   Growth Acceleration |   Change in Number of Publications | Relative Growth   |
|-------:|-------------------------:|----------------------:|-----------------------------------:|:------------------|
|   2010 |                       19 |                     0 |                                  0 | 0.0%              |
|   2011 |                       15 |                    -4 |                                 -4 | -21.05%           |
|   2012 |                       28 |                    17 |                                 13 | 86.67%            |
|   2013 |                       38 |                    -3 |                                 10 | 35.71%            |
|   2014 |                       28 |                   -20 |                                -10 | -26.32%           |
|   2015 |                       47 |                    29 |                                 19 | 67.86%            |
|   2016 |                       63 |                    -3 |                                 16 | 34.04%            |
|   2017 |                       94 |                    15 |                                 31 | 49.21%            |
|   2018 |                      173 |                    48 |                                 79 | 84.04%            |
|   2019 |                      266 |                    14 |                                 93 | 53.76%            |
|   2020 |                      337 |                   -22 |                                 71 | 26.69%            |
|   2021 |                      380 |                   -28 |                                 43 | 12.76%            |
|   2022 |                      453 |                    30 |                                 73 | 19.21%            |
|   2023 |                      509 |                   -17 |                                 56 | 12.36%            |
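
The README does not define the table's columns, but the values are consistent with the following reading: the change is the year-over-year difference in publication count, relative growth is that change as a percentage of the previous year's count, and growth acceleration is the difference between consecutive changes, with the first year set to zero as a baseline. The sketch below reproduces the table under that inferred interpretation. `growth_metrics` is a hypothetical helper, not the notebook's actual code.

```python
def growth_metrics(counts_by_year):
    """Compute per-year change, relative growth (%), and growth acceleration.

    Definitions are inferred from the table values:
      change       = count[t] - count[t-1]
      relative     = 100 * change / count[t-1]
      acceleration = change[t] - change[t-1]
    The first year has no predecessor, so all three are reported as 0.
    """
    rows = []
    prev_count = None
    prev_change = 0
    for year in sorted(counts_by_year):
        count = counts_by_year[year]
        if prev_count is None:
            change, relative = 0, 0.0
        else:
            change = count - prev_count
            # Guard against division by zero when the previous year had no papers.
            relative = round(100 * change / prev_count, 2) if prev_count else float("nan")
        rows.append({
            "year": year,
            "publications": count,
            "acceleration": change - prev_change,
            "change": change,
            "relative_growth_pct": relative,
        })
        prev_count, prev_change = count, change
    return rows


# Publication counts taken from the table above.
counts = {2010: 19, 2011: 15, 2012: 28, 2013: 38, 2014: 28, 2015: 47, 2016: 63,
          2017: 94, 2018: 173, 2019: 266, 2020: 337, 2021: 380, 2022: 453, 2023: 509}

for row in growth_metrics(counts):
    print(row)
```

Applying the same function to the yearly counts of any other thematic group gives a comparable set of metrics.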

## Contributions

We welcome contributions to the `top2vec_scientific_texts` model. If you have suggestions or improvements, or if you encounter any issues, please open an issue or submit a pull request.

## License

This project is licensed under the MIT License.