Usage in python #2

by owood - opened

Dear AllenAI,

I am trying to use scivocab as a pre-trained model for topic modelling on scientific papers.
Unfortunately, I cannot download scivocab using SentenceTransformers, and transformers.pipeline won't work either, since no pipeline type is specified for the model.

How do you suggest using it in Python?

From here and there's a .sh to get you started. Something like...

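from transformers import AutoModel, AutoTokenizer, pipeline

# Default sentiment-analysis pipeline; it loads its own default model, unrelated to SciBERT
classifier = pipeline("sentiment-analysis")
# Load the SciBERT tokenizer and encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')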

Thanks. It also works if I skip the pipeline, i.e.

from transformers import AutoModel, AutoTokenizer

# Load only the SciBERT tokenizer and encoder; no pipeline needed
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
embed_model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

However, I am unsure whether BERTopic is actually using it or just falling back to a default model. When I run

from bertopic import BERTopic

topic_model = BERTopic(embedding_model=embed_model, language="english", nr_topics="auto", verbose=True)
topics, probs = topic_model.fit_transform(docs)

the verbose output is:

loading configuration file .cache\torch\sentence_transformers\sentence-transformers_all-MiniLM-L6-v2\config.json
Model config BertConfig {
  "_name_or_path": ".cache\\torch\\sentence_transformers\\sentence-transformers_all-MiniLM-L6-v2\\",
  "architectures": [
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522

loading weights file .cache\torch\sentence_transformers\sentence-transformers_all-MiniLM-L6-v2\pytorch_model.bin
All model checkpoint weights were used when initializing BertModel.

All the weights of BertModel were initialized from the model checkpoint at C:\Users\oskar/.cache\torch\sentence_transformers\sentence-transformers_all-MiniLM-L6-v2\.

This might, of course, be an issue in the BERTopic package.
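
One possible way to rule out the fallback, sketched under assumptions rather than confirmed by the maintainers: BERTopic's `fit_transform` accepts precomputed `embeddings`, so the SciBERT vectors can be computed directly with transformers and handed over, which sidesteps the question of which embedding backend BERTopic selects. Mean pooling over tokens is an assumption here (SciBERT was not trained as a sentence encoder, so other pooling choices may work better), and `mean_pool`, `embed_documents`, and `fit_topics` are hypothetical helper names, not part of either library.

```python
import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per document, ignoring padded positions."""
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    # Shape (batch, tokens, 1) so the mask broadcasts over the hidden dimension
    mask = np.asarray(attention_mask, dtype=float)[..., None]
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def embed_documents(docs, model_name="allenai/scibert_scivocab_uncased", batch_size=16):
    """Embed documents with SciBERT (assumption: mean pooling of the last layer)."""
    # Heavy imports live here so mean_pool can be used without them
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    chunks = []
    with torch.no_grad():
        for start in range(0, len(docs), batch_size):
            batch = tokenizer(
                docs[start:start + batch_size],
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors="pt",
            )
            hidden = model(**batch).last_hidden_state
            chunks.append(mean_pool(hidden.numpy(), batch["attention_mask"].numpy()))
    return np.vstack(chunks)


def fit_topics(docs):
    """Fit BERTopic on precomputed SciBERT embeddings instead of a backend model."""
    from bertopic import BERTopic

    embeddings = embed_documents(docs)
    topic_model = BERTopic(language="english", nr_topics="auto", verbose=True)
    topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
    return topic_model, topics, probs
```

With this, `embeddings.shape[1]` should match SciBERT's hidden size rather than the 384 shown in the log above, which gives a quick check that the intended model produced the vectors.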