---
license: mit
language:
- en
tags:
- topic-modeling
datasets:
- CCRss/arxiv_papers_cs
---

# Top2Vec Scientific Texts Model

![MindMap](markmap-main.png)

This repository hosts the `top2vec_scientific_texts` model, a specialized Top2Vec model trained on scientific texts for topic modeling and semantic search.

## Model Overview

The `top2vec_scientific_texts` model is built for analyzing scientific literature. It embeds texts with the Universal Sentence Encoder and uses Top2Vec for topic modeling.

### Key Features:

- **Domain-Specific:** Tailored for scientific texts.
- **Base Model:** Utilizes the Universal Sentence Encoder for effective text embeddings.
- **Topic Modeling:** Employs Top2Vec for discovering topics in scientific documents.

## Installation

To use the model, install the following dependencies:

```bash
pip install top2vec
pip install "top2vec[sentence_encoders]"
pip install tensorflow==2.8.0
pip install tensorflow-probability==0.16.0
```

## Usage

Here's an example of how to train a Top2Vec model on your own documents and save it:

```python
from top2vec import Top2Vec

# Load your documents (replace the placeholders with your own corpus)
docs = ["Document 1 text", "Document 2 text", ...]

# Initialize and train the Top2Vec model
model = Top2Vec(
    documents=docs,
    speed='learn',
    workers=80,  # number of worker threads; adjust to your machine
    embedding_model='universal-sentence-encoder',
    umap_args={'n_neighbors': 15, 'n_components': 5, 'metric': 'cosine',
               'min_dist': 0.0, 'random_state': 42},
    hdbscan_args={'min_cluster_size': 15, 'metric': 'euclidean',
                  'cluster_selection_method': 'eom'}
)

# Save the model
model.save('top2vec_scientific_texts_model')
```

## Dataset

The model was trained on a dataset of scientific abstracts sourced from [arXiv](https://arxiv.org/). The dataset covers a range of computer science topics from 2010 to 2024 and is available as [arxiv_papers_cs](https://huggingface.co/datasets/CCRss/arxiv_papers_cs).
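
Since the corpus spans 2010 to 2024, per-year publication counts for a topic can be turned into simple trend metrics. The sketch below is illustrative and not necessarily how the authors produced their figures. The function name `yearly_metrics` and the dictionary keys are hypothetical, and the formulas are an assumption, chosen because they reproduce the first rows of the Key Metrics table in this README.

```python
def yearly_metrics(counts):
    """Derive year-over-year trend metrics from a {year: publication_count} dict.

    Assumed definitions (not confirmed by the model authors):
      change       = count[t] - count[t-1]
      acceleration = change[t] - change[t-1]
      relative     = 100 * change / count[t-1], rounded to 2 decimals
    The first year is treated as a baseline with all metrics set to zero.
    """
    rows = []
    prev_count = None
    prev_change = 0
    for year in sorted(counts):
        count = counts[year]
        if prev_count is None:
            change, acceleration, relative = 0, 0, 0.0
        else:
            change = count - prev_count
            acceleration = change - prev_change
            # Relative growth is undefined when the previous year had no publications
            relative = round(100 * change / prev_count, 2) if prev_count else None
        rows.append({
            "year": year,
            "publications": count,
            "acceleration": acceleration,
            "change": change,
            "relative_growth_pct": relative,
        })
        prev_count, prev_change = count, change
    return rows


# Counts for 2010-2013 taken from the Key Metrics table in this README
counts = {2010: 19, 2011: 15, 2012: 28, 2013: 38}
rows = yearly_metrics(counts)
for row in rows:
    print(row)
```

Keeping the first year as a zero baseline mirrors the 2010 row of the table; a stricter variant could report the metrics as missing for that year instead.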

## Use Cases

The `top2vec_scientific_texts` model can be used for various purposes, including:

- **Topic Discovery:** Identify the main topics within a collection of scientific texts.
- **Semantic Search:** Find documents that are semantically similar to a query text.
- **Trend Analysis:** Analyze the evolution of topics over time.

## Examples

Here are some examples of the model's output for the thematic group "UAV in Disasters and Emergency":

### Trend Analysis for "UAV in Disasters and Emergency"

![Trend Analysis](disasters_and_emergency_plot.png)

This graph shows how interest in the use of UAVs in disaster and emergency situations has changed over time.

### Key Metrics Table

Analysis for Thematic Group: Disasters & Emergency

In the table, Change in Number of Publications is the difference from the previous year's count, Growth Acceleration is the difference between that change and the previous year's change, and Relative Growth is the change as a percentage of the previous year's count. The 2010 row is the baseline, so its metrics are zero.

|   Year |   Number of Publications |   Growth Acceleration |   Change in Number of Publications | Relative Growth   |
|-------:|-------------------------:|----------------------:|-----------------------------------:|:------------------|
|   2010 |                       19 |                     0 |                                  0 | 0.0%              |
|   2011 |                       15 |                    -4 |                                 -4 | -21.05%           |
|   2012 |                       28 |                    17 |                                 13 | 86.67%            |
|   2013 |                       38 |                    -3 |                                 10 | 35.71%            |
|   2014 |                       28 |                   -20 |                                -10 | -26.32%           |
|   2015 |                       47 |                    29 |                                 19 | 67.86%            |
|   2016 |                       63 |                    -3 |                                 16 | 34.04%            |
|   2017 |                       94 |                    15 |                                 31 | 49.21%            |
|   2018 |                      173 |                    48 |                                 79 | 84.04%            |
|   2019 |                      266 |                    14 |                                 93 | 53.76%            |
|   2020 |                      337 |                   -22 |                                 71 | 26.69%            |
|   2021 |                      380 |                   -28 |                                 43 | 12.76%            |
|   2022 |                      453 |                    30 |                                 73 | 19.21%            |
|   2023 |                      509 |                   -17 |                                 56 | 12.36%            |

## Contributions

We welcome contributions to the `top2vec_scientific_texts` model. If you have suggestions or improvements, or encounter any issues, please open an issue or submit a pull request.

## License

This project is licensed under the MIT License.