BART-TL-all

Model description

This is the BART-TL-all model from the paper BART-TL: Weakly-Supervised Topic Label Generation. It addresses the topic labeling task by generating labels, rather than selecting them from a pool of candidate labels as in previous state-of-the-art work.

For more details not covered here, you can read the paper or look at the open-source implementation: https://github.com/CristianViorelPopa/BART-TL-topic-label-generation.

The paper makes two models available: BART-TL-all (this model) and BART-TL-ng. Both appear in the evaluation results below.

Intended uses & limitations

How to use

The model takes in a topic, represented as a space-separated series of words. Such topics can be generated using LDA, as was done for gathering the fine-tuning dataset for the model.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

mname = "cristian-popa/bart-tl-all"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# A topic is passed as its top words, separated by spaces.
topic = "site web google search website online internet social content user"
enc = tokenizer(topic, return_tensors="pt", truncation=True, padding="max_length", max_length=128)
outputs = model.generate(
    input_ids=enc.input_ids,
    attention_mask=enc.attention_mask,
    max_length=15,
    min_length=1,
    do_sample=False,
    num_beams=25,
    length_penalty=1.0,
    repetition_penalty=1.5
)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded) # application programming interface
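The call above decodes only the highest-scoring beam. Since the evaluation reports top-3 and top-5 scores, it can be useful to look at several candidate labels per topic. The following is a minimal sketch, not the authors' procedure: `generate_labels` and `dedupe_labels` are hypothetical helper names, and comparing labels case-insensitively is an assumption. It relies on the `num_return_sequences` argument of `generate`, which must not exceed `num_beams`.

```python
def dedupe_labels(candidates, k=3):
    """Keep the first k distinct labels, preserving beam order.

    Assumption: labels that differ only in case or spacing count as duplicates.
    """
    seen = set()
    labels = []
    for text in candidates:
        label = " ".join(text.split())  # collapse repeated whitespace
        key = label.lower()
        if label and key not in seen:
            seen.add(key)
            labels.append(label)
        if len(labels) == k:
            break
    return labels


def generate_labels(model, tokenizer, topic, k=3, num_beams=25):
    """Return up to k distinct labels for one space-separated topic string."""
    enc = tokenizer(topic, return_tensors="pt", truncation=True, max_length=128)
    outputs = model.generate(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        max_length=15,
        min_length=1,
        do_sample=False,
        num_beams=num_beams,
        # Ask for extra beams so that k labels can survive deduplication.
        num_return_sequences=min(num_beams, 2 * k),
        length_penalty=1.0,
        repetition_penalty=1.5,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return dedupe_labels(candidates, k)
```

With the `model` and `tokenizer` loaded as above, `generate_labels(model, tokenizer, "site web google search", k=3)` would return up to three distinct labels in beam order.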

Limitations and bias

The model may not generate accurate labels for topics from domains unrelated to the ones it was fine-tuned on, such as gastronomy.

Training data

The model was fine-tuned on five StackExchange corpora: English, biology, economics, law, and photography (the full list of available corpora is at https://archive.org/download/stackexchange). For each corpus, 100 topics were extracted with LDA and filtered for coherence; the remaining topics were used to fine-tune the final model.
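An LDA topic is a weighted distribution over words, while the model expects the topic as a space-separated string of its top words. As a minimal sketch of that conversion, the hypothetical helper `format_topic` below sorts a word-to-weight mapping and joins the highest-weighted words. The default of ten words mirrors the usage example and is an assumption, and the weights shown are illustrative rather than taken from any real corpus.

```python
def format_topic(word_weights, top_n=10):
    """Return the top_n highest-weighted words joined by single spaces."""
    ranked = sorted(word_weights.items(), key=lambda item: item[1], reverse=True)
    return " ".join(word for word, _ in ranked[:top_n])


# Hypothetical word weights for one LDA topic (illustrative numbers only).
topic_weights = {"site": 0.09, "web": 0.08, "google": 0.07, "search": 0.05, "the": 0.001}
print(format_topic(topic_weights, top_n=4))  # site web google search
```

Most LDA libraries expose per-topic word weights in some form, so a helper like this only needs the mapping for one topic at a time.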

Training procedure

The large Facebook BART model is fine-tuned in a weakly supervised manner, using the unsupervised candidate selection of the NETL method together with other heuristic labels, such as n-grams from the topics, relevant sentences from the corpora, and noun phrases. The resulting dataset is a one-to-many mapping from topics to labels. More details on training and parameters can be found in the paper or by following this notebook.
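A one-to-many mapping has to be turned into individual source and target sequences before a sequence-to-sequence model can train on it. One plausible way to do this, shown here as a sketch and not as the authors' actual preprocessing, is to emit one (topic, label) pair per weak label. The helper name `to_training_pairs` and the example labels are hypothetical.

```python
def to_training_pairs(topic_to_labels):
    """Flatten {topic: [label, ...]} into a list of (source, target) pairs."""
    pairs = []
    for topic, labels in topic_to_labels.items():
        for label in labels:
            pairs.append((topic, label))
    return pairs


# Hypothetical weak labels for one topic (illustrative only).
weak_labels = {
    "site web google search website": ["web search", "search engine", "website"],
}
for source, target in to_training_pairs(weak_labels):
    print(f"{source}\t{target}")
```

Under this scheme the same topic string appears as the source of several training examples, each with a different target label.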

Eval results

model	Top-1 Avg.	Top-3 Avg.	Top-5 Avg.	nDCG-1	nDCG-3	nDCG-5
NETL (U)	2.66	2.59	2.50	0.83	0.85	0.87
NETL (S)	2.74	2.57	2.49	0.88	0.85	0.88
BART-TL-all	2.64	2.52	2.43	0.83	0.84	0.87
BART-TL-ng	2.62	2.50	2.33	0.82	0.84	0.85
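The table reports two kinds of scores at each cutoff k. As a rough illustration, assuming each generated label receives a numeric human rating and the labels are listed in the model's ranking order, the sketch below computes the mean rating of the top k labels and one common formulation of nDCG (linear gain, log2 discount). The function names and the example ratings are hypothetical, and the paper's exact formulation may differ.

```python
import math


def top_k_avg(ratings, k):
    """Mean rating of the first k labels (ratings are in model rank order)."""
    top = ratings[:k]
    return sum(top) / len(top)


def dcg(ratings, k):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings[:k]))


def ndcg(ratings, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ratings, reverse=True), k)
    return dcg(ratings, k) / ideal if ideal > 0 else 0.0


# Hypothetical ratings for the labels generated for one topic.
ratings = [2, 3, 1]
print(top_k_avg(ratings, 3))
print(ndcg(ratings, 3))
```

A ranking that already places the highest-rated labels first gets an nDCG of 1.0, so the metric rewards ordering as well as label quality.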

BibTeX entry and citation info

@inproceedings{popa-rebedea-2021-bart,
    title = "{BART}-{TL}: Weakly-Supervised Topic Label Generation",
    author = "Popa, Cristian  and
      Rebedea, Traian",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.121",
    pages = "1418--1425",
    abstract = "We propose a novel solution for assigning labels to topic models by using multiple weak labelers. The method leverages generative transformers to learn accurate representations of the most important topic terms and candidate labels. This is achieved by fine-tuning pre-trained BART models on a large number of potential labels generated by state of the art non-neural models for topic labeling, enriched with different techniques. The proposed BART-TL model is able to generate valuable and novel labels in a weakly-supervised manner and can be improved by adding other weak labelers or distant supervision on similar tasks.",
}