---
language:
- en
pipeline_tag: sentence-similarity
---

# Model Card for gowitheflow/LASER-cubed-bert-base-unsup

Official model checkpoint of **LA(SER)³** (LASER-cubed) from the EMNLP 2023 paper "Length is a Curse and a Blessing for Document-level Semantics".

### Model Summary

LASER-cubed-bert-base-unsup is an **unsupervised** model trained on the wiki1M dataset. Without requiring long texts in its training set, it generalizes surprisingly well to long-document retrieval.

- **Developed by:** Chenghao Xiao, Yizhi Li, G Thomas Hudson, Chenghua Lin, Noura Al-Moubayed
- **Shared by:** Chenghao Xiao
- **Model type:** BERT-base
- **Language(s) (NLP):** English
- **Finetuned from model:** BERT-base-uncased

### Model Sources

- **GitHub Repo:** https://github.com/gowitheflow-1998/LA-SER-cubed
- **Paper:** https://aclanthology.org/2023.emnlp-main.86/

### Usage

Use the model with Sentence Transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gowitheflow/LASER-cubed-bert-base-unsup")
text = "LASER-cubed is a dope model - it generalizes to long texts without needing the training sets to have long texts."
representation = model.encode(text)
```

### Evaluation

Evaluate it with the BEIR framework:

```python
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# download the dataset yourself first, following the original BEIR repo
data_path = './datasets/arguana'
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

model = DRES(models.SentenceBERT("gowitheflow/LASER-cubed-bert-base-unsup"), batch_size=512)
retriever = EvaluateRetrieval(model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```

### Downstream Use

Information retrieval.

### Out-of-Scope Use

The model is not intended for further fine-tuning on other tasks (such as classification), as it is trained for representation tasks based on similarity matching.

## Training Details

Max sequence length 256, batch size 128, learning rate 3e-05, 1 epoch, 10% warmup, trained on a single A100. An unofficial training sketch is provided at the end of this card.

### Training Data

wiki1M (1M English Wikipedia sentences).

### Training Procedure

Please refer to the paper.

## Citation

**BibTeX:**

```bibtex
@inproceedings{xiao2023length,
  title={Length is a Curse and a Blessing for Document-level Semantics},
  author={Xiao, Chenghao and Li, Yizhi and Hudson, G Thomas and Lin, Chenghua and Al Moubayed, Noura},
  booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  pages={1385--1396},
  year={2023}
}
```
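
### Unofficial Training Sketch

The exact LA(SER)³ objective and length-aware augmentation are described in the paper and are not reproduced here. The snippet below is only a minimal sketch of a comparable unsupervised contrastive setup with the hyperparameters listed under Training Details, using `sentence-transformers`. The data file name (`wiki1m_for_simcse.txt`), the dropout-noise (SimCSE-style) positive construction, and the mean-pooling head are assumptions, not the official recipe.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a BERT-base encoder with mean pooling (pooling choice is an assumption).
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Hypothetical data file: one Wikipedia sentence per line. Pairing each sentence
# with itself relies on dropout noise for a SimCSE-style positive; the actual
# LA(SER)^3 augmentation differs -- see the paper.
train_examples = []
with open("wiki1m_for_simcse.txt") as f:
    for line in f:
        text = line.strip()
        if text:
            train_examples.append(InputExample(texts=[text, text]))

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=int(0.1 * len(train_dataloader)),  # 10% warmup
    optimizer_params={"lr": 3e-5},
)
model.save("laser-cubed-repro-sketch")
```

For the official training code and the exact augmentation, refer to the GitHub repo linked above.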