{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "3YOdAD8OzLqz" }, "source": [ "# Sharing BERTopic models on the Hugging Face Hub\n", "\n", "This notebook shows the steps involved in sharing a BERTopic model on the Hugging Face Hub. As an example, we'll train a topic model on GitHub issue titles for the Transformers library." ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": { "id": "TkOqnFiZzVRZ" }, "source": [ "First, we need to install `BERTopic` along with the `huggingface_hub` library. We can optionally also install [`safetensors`](https://huggingface.co/docs/safetensors/index), a simple format for storing tensors safely (as opposed to pickle) that is still fast (zero-copy). If this library is installed, BERTopic can use the `safetensors` format for model serialization." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nGhtumgZTvlE", "outputId": "6de65d7d-e963-4d3a-f927-0a3ece0277b4", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "%pip install git+https://github.com/MaartenGr/BERTopic huggingface_hub safetensors -qqq" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": { "id": "Lfx9GTwnz6S4" }, "source": [ "We can use a [dataset](https://github.com/nlp-with-transformers/notebooks) that was created for the [Natural Language Processing with Transformers](https://github.com/nlp-with-transformers/notebooks) book. This dataset contains issue titles, along with some metadata, from the Transformers library GitHub repository.\n", "\n", "GitHub issues are an example of a domain where we might assume that some set of topics exists in the corpus, but we probably don't have an exact sense of what all of these topics would be. This is the type of problem where topic modelling can give us a better sense of the corpus and potentially be useful for classifying new issues into topics.\n", "\n", "We'll start by loading the data into a pandas DataFrame." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "nnMEq1vMT5Kv", "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "dataset_url = \"https://raw.githubusercontent.com/nlp-with-transformers/notebooks/main/data/github-issues-transformers.jsonl\"\n", "df_issues = pd.read_json(dataset_url, lines=True)\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 494 }, "id": "s4kh0GNZtkg7", "outputId": "fea4924d-bf9f-4bde-80be-793b7fe32bf7", "vscode": { "languageId": "python" } }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "|   | url | repository_url | labels_url | comments_url | events_url | html_url | id | node_id | number | title | ... | milestone | comments | created_at | updated_at | closed_at | author_association | active_lock_reason | body | performed_via_github_app | pull_request |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://github.com/huggingface/transformers/is... | 849568459 | MDU6SXNzdWU4NDk1Njg0NTk= | 11046 | Potential incorrect application of layer norm ... | ... | NaN | 0 | 2021-04-03 03:37:32 | 2021-04-03 03:37:32 | NaT | NONE | None | In BlenderbotSmallDecoder, layer norm is appl... | NaN | None |\n",
"| 1 | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://github.com/huggingface/transformers/is... | 849544374 | MDU6SXNzdWU4NDk1NDQzNzQ= | 11045 | Multi-GPU seq2seq example evaluation significa... | ... | NaN | 0 | 2021-04-03 00:52:24 | 2021-04-03 00:52:24 | NaT | NONE | None | \\r\\n### Who can help\\r\\n@patil-suraj @sgugger ... | NaN | None |\n",
"| 2 | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://github.com/huggingface/transformers/is... | 849529761 | MDU6SXNzdWU4NDk1Mjk3NjE= | 11044 | [DeepSpeed] ZeRO stage 3 integration: getting ... | ... | NaN | 0 | 2021-04-02 23:40:42 | 2021-04-03 00:00:18 | NaT | COLLABORATOR | None | **[This is not yet alive, preparing for the re... | NaN | None |\n",
"| 3 | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://api.github.com/repos/huggingface/trans... | https://github.com/huggingface/transformers/is... | 849499734 | MDU6SXNzdWU4NDk0OTk3MzQ= | 11043 | Can't load model to estimater | ... | NaN | 0 | 2021-04-02 21:51:44 | 2021-04-02 21:51:44 | NaT | NONE | None | I was trying to follow the Sagemaker instructi... | NaN | None |\n",
"\n", "4 rows × 26 columns
\n", "\n",
"|   | Topic | Count | Name | Representation | Representative_Docs |\n",
"|---|---|---|---|---|---|\n",
"| 0 | -1 | 2106 | -1_bert_tensorflow_model_models | [bert, tensorflow, model, models, tf, tokenize... | [t5 model card, TFDistilBERT ValueError when l... |\n",
"| 1 | 0 | 1774 | 0_bert_bertforsequenceclassification_berttoken... | [bert, bertforsequenceclassification, berttoke... | [The output to be used for getting sentence em... |\n",
"| 2 | 1 | 1122 | 1_gpt2_trainertrain_gpt_trainer | [gpt2, trainertrain, gpt, trainer, training, c... | [Training GPT2 and Reformer from scratch. , A... |\n",
"| 3 | 2 | 516 | 2_typos_typo_fix_fixed | [typos, typo, fix, fixed, correction, error, c... | [fix typo, Fix doc link in README, [doc] typo ... |\n",
"| 4 | 3 | 464 | 3_s2s_seq2seq_examplesseq2seq_seq2seqdataset | [s2s, seq2seq, examplesseq2seq, seq2seqdataset... | [[s2s] --eval_max_generate_length, [s2s] s/alp... |\n",
"| 5 | 4 | 404 | 4_modelcard_modelcards_card_model | [modelcard, modelcards, card, model, cards, mo... | [Add model card, Add model card, Model Card fo... |\n",
"| 6 | 5 | 368 | 5_attributeerror_valueerror_typeerror_error | [attributeerror, valueerror, typeerror, error,... | [TypeError: on_init_end() got an unexpected ke... |\n",
"| 7 | 6 | 347 | 6_summarization_summaries_questionansweringpip... | [summarization, summaries, questionansweringpi... | [Bug in the question answering pipeline, Add t... |\n",
"| 8 | 7 | 329 | 7_longformer_tf_longformers_tftrainer | [longformer, tf, longformers, tftrainer, longf... | [TF Longformer, Fix TF Longformer, Fix TF Long... |\n",
"| 9 | 8 | 227 | 8_testing_ci_tests_tf | [testing, ci, tests, tf, test, slow, t5, bench... | [Fix the CI, Ci test tf super slow, TF Slow te... |\n", "