allenai/scibert_scivocab_uncased

Sep 30, 2022

Dear AllenAI,

I am trying to use SciBERT (scibert_scivocab_uncased) as a pre-trained model for topic modelling on scientific papers.
Unfortunately, I cannot load it with SentenceTransformers, and transformers pipelines won't work either, since no pipeline type is specified for the model.

How do you suggest using it in Python?

cornpopper

Oct 18, 2022

See https://github.com/allenai/scibert; there's a .sh script there to get you started. Something like:

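from transformers import AutoModel, AutoTokenizer, pipeline

# Note: this pipeline loads the library's default sentiment-analysis model,
# not SciBERT; the two lines below are what actually load SciBERT.
classifier = pipeline("sentiment-analysis")
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')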
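
Loading the tokenizer and model as above gives you per-token vectors, not one vector per document. A minimal sketch of one common way to get document embeddings, assuming PyTorch is installed and that masked mean pooling over the last hidden state is acceptable for your task (the helper name `embed` is made up for illustration, and mean pooling is my assumption, not something the SciBERT authors prescribe):

```python
MODEL_NAME = "allenai/scibert_scivocab_uncased"


def embed(texts, model_name=MODEL_NAME):
    """Return one mean-pooled embedding per input text as a torch tensor."""
    # Heavy dependencies are imported here so importing this file stays cheap.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout

    # Pad to the longest text in the batch and cut off at BERT's 512-token limit.
    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        token_embeddings = model(**batch).last_hidden_state  # (batch, tokens, hidden)

    # Average over real tokens only; padding positions have attention mask 0.
    mask = batch["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)


# Example (downloads the weights on first use):
# vectors = embed(["Graphene exhibits high electron mobility."])
# print(vectors.shape)
```

The resulting tensor can be converted with `.numpy()` if a downstream library expects NumPy arrays.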

owood

Nov 1, 2022

Thanks. It also works if I skip the pipeline, i.e.:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
embed_model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

However, I am unsure whether BERTopic is actually using it or just falling back to a default model. When I run

from bertopic import BERTopic

topic_model = BERTopic(embedding_model=embed_model, language="english", nr_topics="auto", verbose=True)
topics, probs = topic_model.fit_transform(docs)

the verbose output is:

loading configuration file .cache\torch\sentence_transformers\sentence-transformers_all-MiniLM-L6-v2\config.json
Model config BertConfig {
  "_name_or_path": ".cache\\torch\\sentence_transformers\\sentence-transformers_all-MiniLM-L6-v2\\",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file .cache\torch\sentence_transformers\sentence-transformers_all-MiniLM-L6-v2\pytorch_model.bin
All model checkpoint weights were used when initializing BertModel.

All the weights of BertModel were initialized from the model checkpoint at C:\Users\oskar/.cache\torch\sentence_transformers\sentence-transformers_all-MiniLM-L6-v2\.

This might of course be an issue in the BERTopic package.
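
A possible explanation, going by the log above: BERTopic does not seem to treat a bare transformers AutoModel as an embedding backend, so it loads all-MiniLM-L6-v2 instead. A minimal sketch of one workaround, assuming sentence-transformers and bertopic are installed: wrap SciBERT in a SentenceTransformer (a Transformer module plus a Pooling module) and pass that object to BERTopic. The function name `build_scibert_topic_model` is made up for illustration.

```python
MODEL_NAME = "allenai/scibert_scivocab_uncased"


def build_scibert_topic_model(model_name=MODEL_NAME):
    """Wrap SciBERT as a SentenceTransformer and hand it to BERTopic."""
    # Heavy dependencies are imported here so importing this file stays cheap.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer, models

    # Transformer module: loads the tokenizer and SciBERT weights from the Hub.
    word_embedding_model = models.Transformer(model_name, max_seq_length=512)
    # Pooling module: averages the token vectors (mean pooling is its default).
    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension()
    )
    sentence_model = SentenceTransformer(
        modules=[word_embedding_model, pooling_model]
    )

    return BERTopic(embedding_model=sentence_model, nr_topics="auto", verbose=True)


# Example, where `docs` is your own list of strings:
# topic_model = build_scibert_topic_model()
# topics, probs = topic_model.fit_transform(docs)
```

If this works as intended, the verbose output should mention the SciBERT checkpoint and no longer mention all-MiniLM-L6-v2.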

SteadySurfdom

Nov 1, 2023

I have downloaded the recommended TensorFlow model from the GitHub README, but I am not sure how to use that model in Python now. I am new to using BERT and its derivatives, so any help would be appreciated.
