--- language: - en tags: - topic labeling license: apache-2.0 metrics: - ndcg --- # MyModel ## Model description This is the `BART-TL-all` model from the paper [BART-TL: Weakly-Supervised Topic Label Generation](https://www.aclweb.org/anthology/2021.eacl-main.121.pdf). We aim to solve the topic labeling task using generative methods, rather than selection from a pool of labels as was done in previous State of the Art works. For more details not covered here, you can read the paper or look at the open-source implementation: https://github.com/CristianViorelPopa/BART-TL-topic-label-generation. There are two models made available from the paper: * [BART-TL-all](https://huggingface.co/cristian-popa/bart-tl-all) * [BART-TL-ng](https://huggingface.co/cristian-popa/bart-tl-ng) ## Intended uses & limitations #### How to use The model takes in a topic, represented as a space-separated series of words. Such topics can be generated using LDA, as was done for gathering the fine-tuning dataset for the model. ```python from transformers import AutoTokenizer, AutoModelForSeq2SeqLM mname = "cristian-popa/bart-tl-all" tokenizer = AutoTokenizer.from_pretrained(mname) model = AutoModelForSeq2SeqLM.from_pretrained(mname) input = "site web google search website online internet social content user" enc = tokenizer(input, return_tensors="pt", truncation=True, padding="max_length", max_length=128) outputs = model.generate( input_ids=enc.input_ids, attention_mask=enc.attention_mask, max_length=15, min_length=1, do_sample=False, num_beams=25, length_penalty=1.0, repetition_penalty=1.5 ) decoded = tokenizer.decode(outputs[0], skip_special_tokens=True) print(decoded) # application programming interface ``` #### Limitations and bias The model may not generate accurate labels for topics from domains unrelated to the ones it was fine-tuned on, such as gastronomy. ## Training data The model was fine-tuned on 5 different StackExchange corpora (see https://archive.org/download/stackexchange for a full list of existing such corpora): English, biology, economics, law, and photography. 100 topics are extracted using LDA for each of these corpora, filtered for coherence and then used for obtaining the final model here. ## Training procedure The large Facebook BART model is fine-tuned in a weakly-supervised manner, making use of the unsupervised candidate selection of the [NETL](https://www.aclweb.org/anthology/C16-1091.pdf) method, along with other heuristic labels, such as n-grams from the topics, relevant sentences in the corpora and noun phrases. The dataset is a one-to-many mapping from topics to labels. More details on training and parameters can be found in the [paper](https://www.aclweb.org/anthology/2021.eacl-main.121.pdf) or by following [this notebook](https://github.com/CristianViorelPopa/BART-TL-topic-label-generation/blob/main/notebooks/end_to_end_workflow.ipynb). ## Eval results model | Top-1 Avg. | Top-3 Avg. | Top-5 Avg. | nDCG-1 | nDCG-3 | nDCG-5 ------------|------------|------------|------------|--------|--------|------- NETL (U) | 2.66 | 2.59 | 2.50 | 0.83 | 0.85 | 0.87 NETL (S) | 2.74 | 2.57 | 2.49 | 0.88 | 0.85 | 0.88 BART-TL-all | 2.64 | 2.52 | 2.43 | 0.83 | 0.84 | 0.87 BART-TL-ng | 2.62 | 2.50 | 2.33 | 0.82 | 0.84 | 0.85 ### BibTeX entry and citation info ```bibtex @inproceedings{popa-rebedea-2021-bart, title = "{BART}-{TL}: Weakly-Supervised Topic Label Generation", author = "Popa, Cristian and Rebedea, Traian", booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume", month = apr, year = "2021", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2021.eacl-main.121", pages = "1418--1425", abstract = "We propose a novel solution for assigning labels to topic models by using multiple weak labelers. The method leverages generative transformers to learn accurate representations of the most important topic terms and candidate labels. This is achieved by fine-tuning pre-trained BART models on a large number of potential labels generated by state of the art non-neural models for topic labeling, enriched with different techniques. The proposed BART-TL model is able to generate valuable and novel labels in a weakly-supervised manner and can be improved by adding other weak labelers or distant supervision on similar tasks.", } ```