nickmuchi committed
Commit
50dd923
1 Parent(s): 7fcc2a5

Upload 17 files

sentence-transformers/.DS_Store ADDED
Binary file (8.2 kB)
sentence-transformers/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,5 @@
+ # Code of Conduct
+
+ Facebook has adopted a Code of Conduct that we expect project participants to adhere to.
+ Please read the [full text](https://code.fb.com/codeofconduct/)
+ so that you can understand what actions will and will not be tolerated.
sentence-transformers/CONTRIBUTING.md ADDED
@@ -0,0 +1,16 @@
+ # Contributing to this repo
+
+ ## Pull Requests
+
+ In order to accept your pull request, we need you to submit a CLA. You only need
+ to do this once to work on any of Facebook's open source projects.
+
+ Complete your CLA here: <https://code.facebook.com/cla>
+
+ ## Issues
+ We use GitHub issues to track public bugs. Please ensure your description is
+ clear and has sufficient instructions to be able to reproduce the issue.
+
+ ## License
+ By contributing to this repo, you agree that your contributions will be licensed
+ under the LICENSE file in the root directory of this source tree.
sentence-transformers/LICENSE ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "{}"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright 2019 Nils Reimers
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
sentence-transformers/NOTICE.txt ADDED
@@ -0,0 +1,5 @@
+ -------------------------------------------------------------------------------
+ Copyright 2019
+ Ubiquitous Knowledge Processing (UKP) Lab
+ Technische Universität Darmstadt
+ -------------------------------------------------------------------------------
sentence-transformers/README.md ADDED
@@ -0,0 +1,182 @@
+ <!--- BADGES: START --->
+ [![GitHub - License](https://img.shields.io/github/license/UKPLab/sentence-transformers?logo=github&style=flat&color=green)][#github-license]
+ [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sentence-transformers?logo=pypi&style=flat&color=blue)][#pypi-package]
+ [![PyPI - Package Version](https://img.shields.io/pypi/v/sentence-transformers?logo=pypi&style=flat&color=orange)][#pypi-package]
+ [![Conda - Platform](https://img.shields.io/conda/pn/conda-forge/sentence-transformers?logo=anaconda&style=flat)][#conda-forge-package]
+ [![Conda (channel only)](https://img.shields.io/conda/vn/conda-forge/sentence-transformers?logo=anaconda&style=flat&color=orange)][#conda-forge-package]
+ [![Docs - GitHub.io](https://img.shields.io/static/v1?logo=github&style=flat&color=pink&label=docs&message=sentence-transformers)][#docs-package]
+ <!---
+ [![PyPI - Downloads](https://img.shields.io/pypi/dm/sentence-transformers?logo=pypi&style=flat&color=green)][#pypi-package]
+ [![Conda](https://img.shields.io/conda/dn/conda-forge/sentence-transformers?logo=anaconda)][#conda-forge-package]
+ --->
+
+ [#github-license]: https://github.com/UKPLab/sentence-transformers/blob/master/LICENSE
+ [#pypi-package]: https://pypi.org/project/sentence-transformers/
+ [#conda-forge-package]: https://anaconda.org/conda-forge/sentence-transformers
+ [#docs-package]: https://www.sbert.net/
+ <!--- BADGES: END --->
+
+ # Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.
+
+ This framework provides an easy method to compute dense vector representations for **sentences**, **paragraphs**, and **images**. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa and achieve state-of-the-art performance on various tasks. Text is embedded in a vector space such that similar texts are close together and can be found efficiently using cosine similarity.
+
+ We provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/pretrained_models.html)** for more than 100 languages, fine-tuned for various use cases.
+
+ Further, this framework allows easy **[fine-tuning of custom embedding models](https://www.sbert.net/docs/training/overview.html)** to achieve maximal performance on your specific task.
+
+ For the **full documentation**, see **[www.SBERT.net](https://www.sbert.net)**.
+
+ The following publications are integrated in this framework:
+
+ - [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP 2019)
+ - [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) (EMNLP 2020)
+ - [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (NAACL 2021)
+ - [The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes](https://arxiv.org/abs/2012.14210) (arXiv 2020)
+ - [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979) (arXiv 2021)
+ - [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) (arXiv 2021)
+
+ ## Installation
+
+ We recommend **Python 3.6** or higher, **[PyTorch 1.6.0](https://pytorch.org/get-started/locally/)** or higher, and **[transformers v4.6.0](https://github.com/huggingface/transformers)** or higher. The code does **not** work with Python 2.7.
+
+ **Install with pip**
+
+ Install *sentence-transformers* with `pip`:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ **Install with conda**
+
+ You can install *sentence-transformers* with `conda`:
+
+ ```
+ conda install -c conda-forge sentence-transformers
+ ```
+
+ **Install from source**
+
+ Alternatively, you can clone the latest version from the [repository](https://github.com/UKPLab/sentence-transformers) and install it directly from the source code:
+
+ ```
+ pip install -e .
+ ```
+
+ **PyTorch with CUDA**
+
+ If you want to use a GPU / CUDA, you must install PyTorch with a matching CUDA version. Follow
+ [PyTorch - Get Started](https://pytorch.org/get-started/locally/) for details on how to install PyTorch.
+
+ ## Getting Started
+
+ See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documentation.
+
+ [This example](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/computing-embeddings/computing_embeddings.py) shows you how to use an already trained Sentence Transformer model to embed sentences for another task.
+
+ First, download a pretrained model.
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ model = SentenceTransformer('all-MiniLM-L6-v2')
+ ```
+
+ Then provide some sentences to the model.
+
+ ```python
+ sentences = ['This framework generates embeddings for each input sentence',
+              'Sentences are passed as a list of strings.',
+              'The quick brown fox jumps over the lazy dog.']
+ sentence_embeddings = model.encode(sentences)
+ ```
+
+ And that's it. We now have a list of numpy arrays with the embeddings.
+
+ ```python
+ for sentence, embedding in zip(sentences, sentence_embeddings):
+     print("Sentence:", sentence)
+     print("Embedding:", embedding)
+     print("")
+ ```
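To compare the embeddings produced above, cosine similarity is the usual choice. Here is a minimal numpy sketch of that comparison; the embedding values are made up for illustration (with sentence-transformers installed, `util.cos_sim` does this for you):

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product of the two L2-normalized vectors.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Toy vectors standing in for model.encode(...) output.
emb1 = np.array([0.1, 0.3, 0.5])
emb2 = np.array([0.5, 0.1, 0.2])

print(cos_sim(emb1, emb1))  # identical vectors -> approximately 1.0
print(cos_sim(emb1, emb2))  # somewhere in (0, 1) for these vectors
```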
+
+ ## Pre-Trained Models
+
+ We provide a large list of [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html) for more than 100 languages. Some models are general-purpose, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`.
+
+ [» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)
+
+ ## Training
+
+ This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get sentence embeddings well suited to your specific task.
+
+ See [Training Overview](https://www.sbert.net/docs/training/overview.html) for an introduction to training your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) of how to train models on various datasets.
+
+ Some highlights are:
+ - Support for various transformer networks, including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
+ - Multilingual and multi-task learning
+ - Evaluation during training to find the optimal model
+ - [10+ loss functions](https://www.sbert.net/docs/package_reference/losses.html) for tuning models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, and contrastive loss
+
+ ## Performance
+
+ Our models are evaluated extensively on 15+ datasets, including challenging domains like tweets, Reddit posts, and emails. They achieve by far the **best performance** among all available sentence embedding methods. Further, we provide several **smaller models** that are **optimized for speed**.
+
+ [» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)
+
+ ## Application Examples
+
+ You can use this framework for:
+
+ - [Computing Sentence Embeddings](https://www.sbert.net/examples/applications/computing-embeddings/README.html)
+ - [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
+ - [Clustering](https://www.sbert.net/examples/applications/clustering/README.html)
+ - [Paraphrase Mining](https://www.sbert.net/examples/applications/paraphrase-mining/README.html)
+ - [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html)
+ - [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
+ - [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html)
+ - [Text Summarization](https://www.sbert.net/examples/applications/text-summarization/README.html)
+ - [Multilingual Image Search, Clustering & Duplicate Detection](https://www.sbert.net/examples/applications/image-search/README.html)
+
+ and many more use cases.
+
+ For all examples, see [examples/applications](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications).
+
+ ## Citing & Authors
+
+ If you find this repository helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):
+
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ If you use one of the multilingual models, feel free to cite our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813):
+
+ ```bibtex
+ @inproceedings{reimers-2020-multilingual-sentence-bert,
+     title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2020",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/2004.09813",
+ }
+ ```
+
+ Please have a look at [Publications](https://www.sbert.net/docs/publications.html) for our different publications that are integrated into SentenceTransformers.
+
+ Contact person: [Nils Reimers](https://www.nils-reimers.de), [info@nils-reimers.de](mailto:info@nils-reimers.de)
+
+ https://www.ukp.tu-darmstadt.de/
+
+ Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
+
+ > This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
sentence-transformers/eval_beir.py ADDED
@@ -0,0 +1,89 @@
+ # Copyright (c) Facebook, Inc. and its affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ import argparse
+ import logging
+ import os
+
+ import src.slurm
+ import src.contriever
+ import src.beir_utils
+ import src.utils
+ import src.dist_utils
+
+ logger = logging.getLogger(__name__)
+
+
+ def main(args):
+     src.slurm.init_distributed_mode(args)
+     src.slurm.init_signal_handler()
+
+     os.makedirs(args.output_dir, exist_ok=True)
+
+     logger = src.utils.init_logger(args)
+
+     model, tokenizer, _ = src.contriever.load_retriever(args.model_name_or_path)
+     model = model.cuda()
+     model.eval()
+     query_encoder = model
+     doc_encoder = model
+
+     logger.info("Start indexing")
+
+     metrics = src.beir_utils.evaluate_model(
+         query_encoder=query_encoder,
+         doc_encoder=doc_encoder,
+         tokenizer=tokenizer,
+         dataset=args.dataset,
+         batch_size=args.per_gpu_batch_size,
+         norm_query=args.norm_query,
+         norm_doc=args.norm_doc,
+         is_main=src.dist_utils.is_main(),
+         split="dev" if args.dataset == "msmarco" else "test",
+         score_function=args.score_function,
+         beir_dir=args.beir_dir,
+         save_results_path=args.save_results_path,
+         lower_case=args.lower_case,
+         normalize_text=args.normalize_text,
+     )
+
+     if src.dist_utils.is_main():
+         for key, value in metrics.items():
+             logger.info(f"{args.dataset} : {key}: {value:.1f}")
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+
+     parser.add_argument("--dataset", type=str, help="Evaluation dataset from the BEIR benchmark")
+     parser.add_argument("--beir_dir", type=str, default="./", help="Directory to save and load beir datasets")
+     parser.add_argument("--text_maxlength", type=int, default=512, help="Maximum text length")
+
+     parser.add_argument("--per_gpu_batch_size", default=128, type=int, help="Batch size per GPU/CPU for indexing.")
+     parser.add_argument("--output_dir", type=str, default="./my_experiment", help="Output directory")
+     parser.add_argument("--model_name_or_path", type=str, help="Model name or path")
+     parser.add_argument(
+         "--score_function", type=str, default="dot", help="Metric used to compute similarity between two embeddings"
+     )
+     parser.add_argument("--norm_query", action="store_true", help="Normalize query representation")
+     parser.add_argument("--norm_doc", action="store_true", help="Normalize document representation")
+     parser.add_argument("--lower_case", action="store_true", help="Lowercase query and document text")
+     parser.add_argument(
+         "--normalize_text", action="store_true", help="Apply function to normalize some common characters"
+     )
+     parser.add_argument("--save_results_path", type=str, default=None, help="Path to save result object")
+
+     parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
+     parser.add_argument("--main_port", type=int, default=-1, help="Main port (for multi-node SLURM jobs)")
+
+     args, _ = parser.parse_known_args()
+     main(args)
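The `--score_function` flag above selects dot-product or cosine scoring between query and document embeddings. The two only differ when embeddings are not unit-normalized, as this small numpy sketch shows (the `score` helper is illustrative, not part of this repo):

```python
import numpy as np

def score(query, doc, score_function="dot"):
    # Dot product rewards longer vectors; cosine ignores vector magnitude.
    if score_function == "cos":
        query = query / np.linalg.norm(query)
        doc = doc / np.linalg.norm(doc)
    return float(np.dot(query, doc))

q = np.array([1.0, 2.0])
d = np.array([2.0, 4.0])  # same direction as q, twice the length

print(score(q, d, "dot"))  # 10.0
print(score(q, d, "cos"))  # approximately 1.0 (parallel vectors)
```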
sentence-transformers/evaluate_retrieved_passages.py ADDED
@@ -0,0 +1,66 @@
+ # Copyright (c) Facebook, Inc. and its affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ import argparse
+ import glob
+ import json
+ import logging
+
+ import src.utils
+
+ from src.evaluation import calculate_matches
+
+ logger = logging.getLogger(__name__)
+
+
+ def validate(data, workers_num):
+     match_stats = calculate_matches(data, workers_num)
+     top_k_hits = match_stats.top_k_hits
+
+     # Convert absolute hit counts into per-question accuracy.
+     top_k_hits = [v / len(data) for v in top_k_hits]
+     return top_k_hits
+
+
+ def main(args):
+     logger = src.utils.init_logger(args, stdout_only=True)
+     datapaths = glob.glob(args.data)
+     r20, r100 = [], []
+     for path in datapaths:
+         data = []
+         with open(path, 'r') as fin:
+             for line in fin:
+                 data.append(json.loads(line))
+         top_k_hits = validate(data, args.validation_workers)
+         message = f"Evaluate results from {path}:"
+         for k in [5, 10, 20, 100]:
+             if k <= len(top_k_hits):
+                 recall = 100 * top_k_hits[k-1]
+                 if k == 20:
+                     r20.append(f"{recall:.1f}")
+                 if k == 100:
+                     r100.append(f"{recall:.1f}")
+                 message += f' R@{k}: {recall:.1f}'
+         logger.info(message)
+     print(datapaths)
+     print('\t'.join(r20))
+     print('\t'.join(r100))
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser()
+
+     parser.add_argument('--data', required=True, type=str, default=None)
+     parser.add_argument('--validation_workers', type=int, default=16,
+                         help="Number of parallel processes to validate results")
+
+     args = parser.parse_args()
+     main(args)
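The R@k numbers printed by the script above are cumulative top-k hit counts divided by the number of questions. A minimal sketch of that computation on toy data (the rank values here are invented for illustration):

```python
# Rank (0-based) of the first retrieved passage containing the answer,
# or None when no retrieved passage matched, for five toy questions.
first_hit_ranks = [0, 3, None, 1, 50]
n_docs = 100

# top_k_hits[k-1] = number of questions answered within the top k passages.
top_k_hits = [0] * n_docs
for rank in first_hit_ranks:
    if rank is not None:
        top_k_hits[rank] += 1
# Cumulative sum: a hit at rank r counts toward every k > r.
for i in range(1, n_docs):
    top_k_hits[i] += top_k_hits[i - 1]

recall_at = {k: 100 * top_k_hits[k - 1] / len(first_hit_ranks) for k in (5, 20, 100)}
print(recall_at)  # {5: 60.0, 20: 60.0, 100: 80.0}
```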
sentence-transformers/finetuning.py ADDED
@@ -0,0 +1,249 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
+
+ import pdb
+ import os
+ import time
+ import sys
+ import torch
+ from torch.utils.tensorboard import SummaryWriter
+ import logging
+ import json
+ import numpy as np
+ import torch.distributed as dist
+ from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
+
+ from src.options import Options
+ from src import data, beir_utils, slurm, dist_utils, utils, contriever, finetuning_data, inbatch
+
+ import train
+
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+ logger = logging.getLogger(__name__)
+
+
+ def finetuning(opt, model, optimizer, scheduler, tokenizer, step):
+
+     run_stats = utils.WeightedAvgStats()
+
+     tb_logger = utils.init_tb_logger(opt.output_dir)
+
+     if hasattr(model, "module"):
+         eval_model = model.module
+     else:
+         eval_model = model
+     eval_model = eval_model.get_encoder()
+
+     train_dataset = finetuning_data.Dataset(
+         datapaths=opt.train_data,
+         negative_ctxs=opt.negative_ctxs,
+         negative_hard_ratio=opt.negative_hard_ratio,
+         negative_hard_min_idx=opt.negative_hard_min_idx,
+         normalize=opt.eval_normalize_text,
+         global_rank=dist_utils.get_rank(),
+         world_size=dist_utils.get_world_size(),
+         maxload=opt.maxload,
+         training=True,
+     )
+     collator = finetuning_data.Collator(tokenizer, passage_maxlength=opt.chunk_length)
+     train_sampler = RandomSampler(train_dataset)
+     train_dataloader = DataLoader(
+         train_dataset,
+         sampler=train_sampler,
+         batch_size=opt.per_gpu_batch_size,
+         drop_last=True,
+         num_workers=opt.num_workers,
+         collate_fn=collator,
+     )
+
+     train.eval_model(opt, eval_model, None, tokenizer, tb_logger, step)
+     evaluate(opt, eval_model, tokenizer, tb_logger, step)
+
+     epoch = 1
+
+     model.train()
+     prev_ids, prev_mask = None, None
+     while step < opt.total_steps:
+         logger.info(f"Start epoch {epoch}, number of batches: {len(train_dataloader)}")
+         for i, batch in enumerate(train_dataloader):
+             batch = {key: value.cuda() if isinstance(value, torch.Tensor) else value for key, value in batch.items()}
+             step += 1
+
+             train_loss, iter_stats = model(**batch, stats_prefix="train")
+             train_loss.backward()
+
+             if opt.optim == "sam" or opt.optim == "asam":
+                 optimizer.first_step(zero_grad=True)
+
+                 sam_loss, _ = model(**batch, stats_prefix="train/sam_opt")
+                 sam_loss.backward()
+                 optimizer.second_step(zero_grad=True)
+             else:
+                 optimizer.step()
+             scheduler.step()
+             optimizer.zero_grad()
+
+             run_stats.update(iter_stats)
+
+             if step % opt.log_freq == 0:
+                 log = f"{step} / {opt.total_steps}"
+                 for k, v in sorted(run_stats.average_stats.items()):
+                     log += f" | {k}: {v:.3f}"
+                     if tb_logger:
+                         tb_logger.add_scalar(k, v, step)
+                 log += f" | lr: {scheduler.get_last_lr()[0]:0.3g}"
+                 log += f" | Memory: {torch.cuda.max_memory_allocated()//1e9} GiB"
+
+                 logger.info(log)
+                 run_stats.reset()
+
+             if step % opt.eval_freq == 0:
+
+                 train.eval_model(opt, eval_model, None, tokenizer, tb_logger, step)
+                 evaluate(opt, eval_model, tokenizer, tb_logger, step)
+
+                 if step % opt.save_freq == 0 and dist_utils.get_rank() == 0:
+                     utils.save(
+                         eval_model,
+                         optimizer,
+                         scheduler,
+                         step,
+                         opt,
+                         opt.output_dir,
+                         f"step-{step}",
+                     )
+                 model.train()
+
+             if step >= opt.total_steps:
+                 break
+
+         epoch += 1
+
+
+ def evaluate(opt, model, tokenizer, tb_logger, step):
+     dataset = finetuning_data.Dataset(
+         datapaths=opt.eval_data,
+         normalize=opt.eval_normalize_text,
+         global_rank=dist_utils.get_rank(),
+         world_size=dist_utils.get_world_size(),
+         maxload=opt.maxload,
+         training=False,
+     )
+     collator = finetuning_data.Collator(tokenizer, passage_maxlength=opt.chunk_length)
+     sampler = SequentialSampler(dataset)
+     dataloader = DataLoader(
+         dataset,
+         sampler=sampler,
+         batch_size=opt.per_gpu_batch_size,
+         drop_last=False,
+         num_workers=opt.num_workers,
+         collate_fn=collator,
+     )
+
+     model.eval()
+     if hasattr(model, "module"):
+         model = model.module
+     correct_samples, total_samples, total_step = 0, 0, 0
+     all_q, all_g, all_n = [], [], []
+     with torch.no_grad():
+         for i, batch in enumerate(dataloader):
+             batch = {key: value.cuda() if isinstance(value, torch.Tensor) else value for key, value in batch.items()}
+
+             all_tokens = torch.cat([batch["g_tokens"], batch["n_tokens"]], dim=0)
+             all_mask = torch.cat([batch["g_mask"], batch["n_mask"]], dim=0)
+
+             q_emb = model(input_ids=batch["q_tokens"], attention_mask=batch["q_mask"], normalize=opt.norm_query)
+             all_emb = model(input_ids=all_tokens, attention_mask=all_mask, normalize=opt.norm_doc)
+
+             g_emb, n_emb = torch.split(all_emb, [len(batch["g_tokens"]), len(batch["n_tokens"])])
+
+             all_q.append(q_emb)
+             all_g.append(g_emb)
+             all_n.append(n_emb)
+
+         all_q = torch.cat(all_q, dim=0)
+         all_g = torch.cat(all_g, dim=0)
+         all_n = torch.cat(all_n, dim=0)
+
+         labels = torch.arange(0, len(all_q), device=all_q.device, dtype=torch.long)
+
+         all_sizes = dist_utils.get_varsize(all_g)
+         all_g = dist_utils.varsize_gather_nograd(all_g)
+         all_n = dist_utils.varsize_gather_nograd(all_n)
+         labels = labels + sum(all_sizes[: dist_utils.get_rank()])
+
+         scores_pos = torch.einsum("id, jd->ij", all_q, all_g)
+         scores_neg = torch.einsum("id, jd->ij", all_q, all_n)
+         scores = torch.cat([scores_pos, scores_neg], dim=-1)
+
+         argmax_idx = torch.argmax(scores, dim=1)
+         sorted_scores, indices = torch.sort(scores, descending=True)
+         isrelevant = indices == labels[:, None]
+         rs = [r.cpu().numpy().nonzero()[0] for r in isrelevant]
+         mrr = np.mean([1.0 / (r[0] + 1) if r.size else 0.0 for r in rs])
+
+         acc = (argmax_idx == labels).sum() / all_q.size(0)
+         acc, total = dist_utils.weighted_average(acc, all_q.size(0))
+         mrr, _ = dist_utils.weighted_average(mrr, all_q.size(0))
+         acc = 100 * acc
+
+         message = []
+         if dist_utils.is_main():
+             message = [f"eval acc: {acc:.2f}%", f"eval mrr: {mrr:.3f}"]
+             logger.info(" | ".join(message))
+             if tb_logger is not None:
+                 tb_logger.add_scalar(f"eval_acc", acc, step)
+                 tb_logger.add_scalar(f"mrr", mrr, step)
+
+
+ def main():
+     logger.info("Start")
+
+     options = Options()
+     opt = options.parse()
+
+     torch.manual_seed(opt.seed)
+     slurm.init_distributed_mode(opt)
+     slurm.init_signal_handler()
+
+     directory_exists = os.path.isdir(opt.output_dir)
+     if dist.is_initialized():
+         dist.barrier()
+     os.makedirs(opt.output_dir, exist_ok=True)
+     if not directory_exists and dist_utils.is_main():
+         options.print_options(opt)
+     if dist.is_initialized():
+         dist.barrier()
+     utils.init_logger(opt)
+
+     step = 0
+
+     retriever, tokenizer, retriever_model_id = contriever.load_retriever(opt.model_path, opt.pooling, opt.random_init)
+     opt.retriever_model_id = retriever_model_id
+     model = inbatch.InBatch(opt, retriever, tokenizer)
+
+     model = model.cuda()
+
+     optimizer, scheduler = utils.set_optim(opt, model)
+     # if dist_utils.is_main():
+     #     utils.save(model, optimizer, scheduler, global_step, 0., opt, opt.output_dir, f"step-{0}")
+     logger.info(utils.get_parameters(model))
+
+     for name, module in model.named_modules():
+         if isinstance(module, torch.nn.Dropout):
+             module.p = opt.dropout
+
+     if torch.distributed.is_initialized():
+         model = torch.nn.parallel.DistributedDataParallel(
+             model,
+             device_ids=[opt.local_rank],
+             output_device=opt.local_rank,
+             find_unused_parameters=False,
+         )
+
+     logger.info("Start training")
+     finetuning(opt, model, optimizer, scheduler, tokenizer, step)
+
+
+ if __name__ == "__main__":
+     main()
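The MRR computation in `evaluate()` above ranks each query's relevant document among all sorted scores and averages the reciprocal ranks. A standalone NumPy sketch of the same logic (the toy `scores` matrix is ours, for illustration only):

```python
import numpy as np

# Each row holds a query's similarity scores; labels[i] is the column
# of query i's relevant document.
scores = np.array([
    [0.9, 0.1, 0.2],  # relevant doc 0 ranked first  -> reciprocal rank 1
    [0.3, 0.2, 0.8],  # relevant doc 1 ranked third  -> reciprocal rank 1/3
])
labels = np.array([0, 1])

indices = np.argsort(-scores, axis=1)  # descending sort, like torch.sort(descending=True)
isrelevant = indices == labels[:, None]
rs = [row.nonzero()[0] for row in isrelevant]
mrr = np.mean([1.0 / (r[0] + 1) if r.size else 0.0 for r in rs])
print(mrr)  # mean of 1 and 1/3
```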
sentence-transformers/generate_passage_embeddings.py ADDED
@@ -0,0 +1,124 @@
+ # Copyright (c) Facebook, Inc. and its affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ import os
+
+ import argparse
+ import csv
+ import logging
+ import pickle
+
+ import numpy as np
+ import torch
+
+ import transformers
+
+ import src.slurm
+ import src.contriever
+ import src.utils
+ import src.data
+ import src.normalize_text
+
+
+ def embed_passages(args, passages, model, tokenizer):
+     total = 0
+     allids, allembeddings = [], []
+     batch_ids, batch_text = [], []
+     with torch.no_grad():
+         for k, p in enumerate(passages):
+             batch_ids.append(p["id"])
+             if args.no_title or "title" not in p:
+                 text = p["text"]
+             else:
+                 text = p["title"] + " " + p["text"]
+             if args.lowercase:
+                 text = text.lower()
+             if args.normalize_text:
+                 text = src.normalize_text.normalize(text)
+             batch_text.append(text)
+
+             if len(batch_text) == args.per_gpu_batch_size or k == len(passages) - 1:
+
+                 encoded_batch = tokenizer.batch_encode_plus(
+                     batch_text,
+                     return_tensors="pt",
+                     max_length=args.passage_maxlength,
+                     padding=True,
+                     truncation=True,
+                 )
+
+                 encoded_batch = {k: v.cuda() for k, v in encoded_batch.items()}
+                 embeddings = model(**encoded_batch)
+
+                 embeddings = embeddings.cpu()
+                 total += len(batch_ids)
+                 allids.extend(batch_ids)
+                 allembeddings.append(embeddings)
+
+                 batch_text = []
+                 batch_ids = []
+                 if k % 100000 == 0 and k > 0:
+                     print(f"Encoded passages {total}")
+
+     allembeddings = torch.cat(allembeddings, dim=0).numpy()
+     return allids, allembeddings
+
+
+ def main(args):
+     model, tokenizer, _ = src.contriever.load_retriever(args.model_name_or_path)
+     print(f"Model loaded from {args.model_name_or_path}.", flush=True)
+     model.eval()
+     model = model.cuda()
+     if not args.no_fp16:
+         model = model.half()
+
+     passages = src.data.load_passages(args.passages)
+
+     shard_size = len(passages) // args.num_shards
+     start_idx = args.shard_id * shard_size
+     end_idx = start_idx + shard_size
+     if args.shard_id == args.num_shards - 1:
+         end_idx = len(passages)
+
+     passages = passages[start_idx:end_idx]
+     print(f"Embedding generation for {len(passages)} passages from idx {start_idx} to {end_idx}.")
+
+     allids, allembeddings = embed_passages(args, passages, model, tokenizer)
+
+     save_file = os.path.join(args.output_dir, args.prefix + f"_{args.shard_id:02d}")
+     os.makedirs(args.output_dir, exist_ok=True)
+     print(f"Saving {len(allids)} passage embeddings to {save_file}.")
+     with open(save_file, mode="wb") as f:
+         pickle.dump((allids, allembeddings), f)
+
+     print(f"Total passages processed {len(allids)}. Written to {save_file}.")
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+
+     parser.add_argument("--passages", type=str, default=None, help="Path to passages (.tsv file)")
+     parser.add_argument("--output_dir", type=str, default="wikipedia_embeddings", help="dir path to save embeddings")
+     parser.add_argument("--prefix", type=str, default="passages", help="prefix path to save embeddings")
+     parser.add_argument("--shard_id", type=int, default=0, help="Id of the current shard")
+     parser.add_argument("--num_shards", type=int, default=1, help="Total number of shards")
+     parser.add_argument(
+         "--per_gpu_batch_size", type=int, default=512, help="Batch size for the passage encoder forward pass"
+     )
+     parser.add_argument("--passage_maxlength", type=int, default=512, help="Maximum number of tokens in a passage")
+     parser.add_argument(
+         "--model_name_or_path", type=str, help="path to directory containing model weights and config file"
+     )
+     parser.add_argument("--no_fp16", action="store_true", help="inference in fp32")
+     parser.add_argument("--no_title", action="store_true", help="title not added to the passage body")
+     parser.add_argument("--lowercase", action="store_true", help="lowercase text before encoding")
+     parser.add_argument("--normalize_text", action="store_true", help="normalize text before encoding")
+
+     args = parser.parse_args()
+
+     src.slurm.init_distributed_mode(args)
+
+     main(args)
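The shard slicing in `main()` above gives each worker a contiguous slice of the passages, with the last shard absorbing the remainder of the integer division. The arithmetic can be sketched and checked in isolation (the helper name `shard_bounds` is ours):

```python
def shard_bounds(n_passages, shard_id, num_shards):
    # Same arithmetic as in main(): fixed-size slices, last shard
    # takes the leftover passages when n_passages % num_shards != 0.
    shard_size = n_passages // num_shards
    start_idx = shard_id * shard_size
    end_idx = start_idx + shard_size
    if shard_id == num_shards - 1:
        end_idx = n_passages
    return start_idx, end_idx

# 10 passages over 3 shards: slice sizes 3, 3 and 4.
bounds = [shard_bounds(10, s, 3) for s in range(3)]
print(bounds)
```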
sentence-transformers/index.rst ADDED
@@ -0,0 +1,189 @@
+ SentenceTransformers Documentation
+ =================================================
+
+ SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in our paper `Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks <https://arxiv.org/abs/1908.10084>`_.
+
+ You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for `semantic textual similarity <docs/usage/semantic_textual_similarity.html>`_, `semantic search <examples/applications/semantic-search/README.html>`_, or `paraphrase mining <examples/applications/paraphrase-mining/README.html>`_.
+
+ The framework is based on `PyTorch <https://pytorch.org/>`_ and `Transformers <https://huggingface.co/transformers/>`_ and offers a large collection of `pre-trained models <docs/pretrained_models.html>`_ tuned for various tasks. Further, it is easy to `fine-tune your own models <docs/training/overview.html>`_.
+
+
+ Installation
+ =================================================
+
+ You can install it using pip:
+
+ .. code-block:: bash
+
+     pip install -U sentence-transformers
+
+
+ We recommend **Python 3.6** or higher, and at least **PyTorch 1.6.0**. See `installation <docs/installation.html>`_ for further installation options, especially if you want to use a GPU.
+
+
+
+ Usage
+ =================================================
+ The usage is as simple as:
+
+ .. code-block:: python
+
+     from sentence_transformers import SentenceTransformer
+     model = SentenceTransformer('all-MiniLM-L6-v2')
+
+     # Our sentences we like to encode
+     sentences = ['This framework generates embeddings for each input sentence',
+                  'Sentences are passed as a list of strings.',
+                  'The quick brown fox jumps over the lazy dog.']
+
+     # Sentences are encoded by calling model.encode()
+     embeddings = model.encode(sentences)
+
+     # Print the embeddings
+     for sentence, embedding in zip(sentences, embeddings):
+         print("Sentence:", sentence)
+         print("Embedding:", embedding)
+         print("")
+
+
+
+
+ Performance
+ =========================
+
+ Our models are evaluated extensively and achieve state-of-the-art performance on various tasks. Further, the code is tuned to provide the highest possible speed. Have a look at `Pre-Trained Models <https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/>`_ for an overview of available models and the respective performance on different tasks.
+
+
+
+
+
+
+ Contact
+ =========================
+
+ Contact person: Nils Reimers, info@nils-reimers.de
+
+ https://www.ukp.tu-darmstadt.de/
+
+
+ Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
+
+ *This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.*
+
+
+ Citing & Authors
+ =========================
+
+ If you find this repository helpful, feel free to cite our publication `Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks <https://arxiv.org/abs/1908.10084>`_:
+
+ .. code-block:: bibtex
+
+     @inproceedings{reimers-2019-sentence-bert,
+         title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+         author = "Reimers, Nils and Gurevych, Iryna",
+         booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+         month = "11",
+         year = "2019",
+         publisher = "Association for Computational Linguistics",
+         url = "https://arxiv.org/abs/1908.10084",
+     }
+
+
+
+ If you use one of the multilingual models, feel free to cite our publication `Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation <https://arxiv.org/abs/2004.09813>`_:
+
+ .. code-block:: bibtex
+
+     @inproceedings{reimers-2020-multilingual-sentence-bert,
+         title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
+         author = "Reimers, Nils and Gurevych, Iryna",
+         booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
+         month = "11",
+         year = "2020",
+         publisher = "Association for Computational Linguistics",
+         url = "https://arxiv.org/abs/2004.09813",
+     }
+
+
+
+ If you use the code for `data augmentation <https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/data_augmentation>`_, feel free to cite our publication `Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks <https://arxiv.org/abs/2010.08240>`_:
+
+ .. code-block:: bibtex
+
+     @inproceedings{thakur-2020-AugSBERT,
+         title = "Augmented {SBERT}: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
+         author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and Gurevych, Iryna",
+         booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
+         month = jun,
+         year = "2021",
+         address = "Online",
+         publisher = "Association for Computational Linguistics",
+         url = "https://www.aclweb.org/anthology/2021.naacl-main.28",
+         pages = "296--310",
+     }
+
+
+
+ .. toctree::
+     :maxdepth: 2
+     :caption: Overview
+
+     docs/installation
+     docs/quickstart
+     docs/pretrained_models
+     docs/pretrained_cross-encoders
+     docs/publications
+     docs/hugging_face
+
+ .. toctree::
+     :maxdepth: 2
+     :caption: Usage
+
+     examples/applications/computing-embeddings/README
+     docs/usage/semantic_textual_similarity
+     examples/applications/semantic-search/README
+     examples/applications/retrieve_rerank/README
+     examples/applications/clustering/README
+     examples/applications/paraphrase-mining/README
+     examples/applications/parallel-sentence-mining/README
+     examples/applications/cross-encoder/README
+     examples/applications/image-search/README
+
+ .. toctree::
+     :maxdepth: 2
+     :caption: Training
+
+     docs/training/overview
+     examples/training/multilingual/README
+     examples/training/distillation/README
+     examples/training/cross-encoder/README
+     examples/training/data_augmentation/README
+
+ .. toctree::
+     :maxdepth: 2
+     :caption: Training Examples
+
+     examples/training/sts/README
+     examples/training/nli/README
+     examples/training/paraphrases/README
+     examples/training/quora_duplicate_questions/README
+     examples/training/ms_marco/README
+
+ .. toctree::
+     :maxdepth: 2
+     :caption: Unsupervised Learning
+
+     examples/unsupervised_learning/README
+     examples/domain_adaptation/README
+
+ .. toctree::
+     :maxdepth: 1
+     :caption: Package Reference
+
+     docs/package_reference/SentenceTransformer
+     docs/package_reference/util
+     docs/package_reference/models
+     docs/package_reference/losses
+     docs/package_reference/evaluation
+     docs/package_reference/datasets
+     docs/package_reference/cross_encoder
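The index.rst above says embeddings "can then be compared e.g. with cosine-similarity". A minimal NumPy sketch of that comparison, with hand-made vectors standing in for `model.encode` outputs:

```python
import numpy as np

def cos_sim(a, b):
    # cosine similarity: dot product of the vectors divided by
    # the product of their lengths; 1.0 = same direction, 0.0 = orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb1 = np.array([1.0, 0.0, 1.0])
emb2 = np.array([1.0, 0.0, 1.0])
emb3 = np.array([0.0, 1.0, 0.0])

print(cos_sim(emb1, emb2))  # identical vectors -> 1.0
print(cos_sim(emb1, emb3))  # orthogonal vectors -> 0.0
```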
sentence-transformers/passage_retrieval.py ADDED
@@ -0,0 +1,249 @@
+ # Copyright (c) Facebook, Inc. and its affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ import os
+ import argparse
+ import csv
+ import json
+ import logging
+ import pickle
+ import time
+ import glob
+ from pathlib import Path
+
+ import numpy as np
+ import torch
+ import transformers
+
+ import src.index
+ import src.contriever
+ import src.utils
+ import src.slurm
+ import src.data
+ from src.evaluation import calculate_matches
+ import src.normalize_text
+
+ os.environ["TOKENIZERS_PARALLELISM"] = "true"
+
+
+ def embed_queries(args, queries, model, tokenizer):
+     model.eval()
+     embeddings, batch_question = [], []
+     with torch.no_grad():
+
+         for k, q in enumerate(queries):
+             if args.lowercase:
+                 q = q.lower()
+             if args.normalize_text:
+                 q = src.normalize_text.normalize(q)
+             batch_question.append(q)
+
+             if len(batch_question) == args.per_gpu_batch_size or k == len(queries) - 1:
+
+                 encoded_batch = tokenizer.batch_encode_plus(
+                     batch_question,
+                     return_tensors="pt",
+                     max_length=args.question_maxlength,
+                     padding=True,
+                     truncation=True,
+                 )
+                 encoded_batch = {k: v.cuda() for k, v in encoded_batch.items()}
+                 output = model(**encoded_batch)
+                 embeddings.append(output.cpu())
+
+                 batch_question = []
+
+     embeddings = torch.cat(embeddings, dim=0)
+     print(f"Questions embeddings shape: {embeddings.size()}")
+
+     return embeddings.numpy()
+
+
+ def index_encoded_data(index, embedding_files, indexing_batch_size):
+     allids = []
+     allembeddings = np.array([])
+     for i, file_path in enumerate(embedding_files):
+         print(f"Loading file {file_path}")
+         with open(file_path, "rb") as fin:
+             ids, embeddings = pickle.load(fin)
+
+         allembeddings = np.vstack((allembeddings, embeddings)) if allembeddings.size else embeddings
+         allids.extend(ids)
+         while allembeddings.shape[0] > indexing_batch_size:
+             allembeddings, allids = add_embeddings(index, allembeddings, allids, indexing_batch_size)
+
+     while allembeddings.shape[0] > 0:
+         allembeddings, allids = add_embeddings(index, allembeddings, allids, indexing_batch_size)
+
+     print("Data indexing completed.")
+
+
+ def add_embeddings(index, embeddings, ids, indexing_batch_size):
+     end_idx = min(indexing_batch_size, embeddings.shape[0])
+     ids_toadd = ids[:end_idx]
+     embeddings_toadd = embeddings[:end_idx]
+     ids = ids[end_idx:]
+     embeddings = embeddings[end_idx:]
+     index.index_data(ids_toadd, embeddings_toadd)
+     return embeddings, ids
+
+
+ def validate(data, workers_num):
+     match_stats = calculate_matches(data, workers_num)
+     top_k_hits = match_stats.top_k_hits
+
+     print(f"Validation results: top k documents hits {top_k_hits}")
+     top_k_hits = [v / len(data) for v in top_k_hits]
+     message = ""
+     for k in [5, 10, 20, 100]:
+         if k <= len(top_k_hits):
+             message += f"R@{k}: {top_k_hits[k-1]} "
+     print(message)
+     return match_stats.questions_doc_hits
+
+
+ def add_passages(data, passages, top_passages_and_scores):
+     # add passages to original data
+     merged_data = []
+     assert len(data) == len(top_passages_and_scores)
+     for i, d in enumerate(data):
+         results_and_scores = top_passages_and_scores[i]
+         docs = [passages[doc_id] for doc_id in results_and_scores[0]]
+         scores = [str(score) for score in results_and_scores[1]]
+         ctxs_num = len(docs)
+         d["ctxs"] = [
+             {
+                 "id": results_and_scores[0][c],
+                 "title": docs[c]["title"],
+                 "text": docs[c]["text"],
+                 "score": scores[c],
+             }
+             for c in range(ctxs_num)
+         ]
+
+
+ def add_hasanswer(data, hasanswer):
+     # add hasanswer to data
+     for i, ex in enumerate(data):
+         for k, d in enumerate(ex["ctxs"]):
+             d["hasanswer"] = hasanswer[i][k]
+
+
+ def load_data(data_path):
+     if data_path.endswith(".json"):
+         with open(data_path, "r") as fin:
+             data = json.load(fin)
+     elif data_path.endswith(".jsonl"):
+         data = []
+         with open(data_path, "r") as fin:
+             for k, example in enumerate(fin):
+                 example = json.loads(example)
+                 data.append(example)
+     return data
+
+
+ def main(args):
+
+     print(f"Loading model from: {args.model_name_or_path}")
+     model, tokenizer, _ = src.contriever.load_retriever(args.model_name_or_path)
+     model.eval()
+     model = model.cuda()
+     if not args.no_fp16:
+         model = model.half()
+
+     index = src.index.Indexer(args.projection_size, args.n_subquantizers, args.n_bits)
+
+     # index all passages
+     input_paths = glob.glob(args.passages_embeddings)
+     input_paths = sorted(input_paths)
+     embeddings_dir = os.path.dirname(input_paths[0])
+     index_path = os.path.join(embeddings_dir, "index.faiss")
+     if args.save_or_load_index and os.path.exists(index_path):
+         index.deserialize_from(embeddings_dir)
+     else:
+         print(f"Indexing passages from files {input_paths}")
+         start_time_indexing = time.time()
+         index_encoded_data(index, input_paths, args.indexing_batch_size)
+         print(f"Indexing time: {time.time()-start_time_indexing:.1f} s.")
+         if args.save_or_load_index:
+             index.serialize(embeddings_dir)
+
+     # load passages
+     passages = src.data.load_passages(args.passages)
+     passage_id_map = {x["id"]: x for x in passages}
+
+     data_paths = glob.glob(args.data)
+     alldata = []
+     for path in data_paths:
+         data = load_data(path)
+         output_path = os.path.join(args.output_dir, os.path.basename(path))
+
+         queries = [ex["question"] for ex in data]
+         questions_embedding = embed_queries(args, queries, model, tokenizer)
+
+         # get top k results
+         start_time_retrieval = time.time()
+         top_ids_and_scores = index.search_knn(questions_embedding, args.n_docs)
+         print(f"Search time: {time.time()-start_time_retrieval:.1f} s.")
+
+         add_passages(data, passage_id_map, top_ids_and_scores)
+         hasanswer = validate(data, args.validation_workers)
+         add_hasanswer(data, hasanswer)
+         os.makedirs(os.path.dirname(output_path), exist_ok=True)
+         with open(output_path, "w") as fout:
+             for ex in data:
+                 json.dump(ex, fout, ensure_ascii=False)
+                 fout.write("\n")
+         print(f"Saved results to {output_path}")
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+
+     parser.add_argument(
+         "--data",
+         required=True,
+         type=str,
+         default=None,
+         help=".json file containing question and answers, similar format to reader data",
+     )
+     parser.add_argument("--passages", type=str, default=None, help="Path to passages (.tsv file)")
+     parser.add_argument("--passages_embeddings", type=str, default=None, help="Glob path to encoded passages")
+     parser.add_argument(
+         "--output_dir", type=str, default=None, help="Results are written to output_dir with data suffix"
+     )
+     parser.add_argument("--n_docs", type=int, default=100, help="Number of documents to retrieve per question")
+     parser.add_argument(
+         "--validation_workers", type=int, default=32, help="Number of parallel processes to validate results"
+     )
+     parser.add_argument("--per_gpu_batch_size", type=int, default=64, help="Batch size for question encoding")
+     parser.add_argument(
+         "--save_or_load_index", action="store_true", help="If enabled, save index and load index if it exists"
+     )
+     parser.add_argument(
+         "--model_name_or_path", type=str, help="path to directory containing model weights and config file"
+     )
+     parser.add_argument("--no_fp16", action="store_true", help="inference in fp32")
+     parser.add_argument("--question_maxlength", type=int, default=512, help="Maximum number of tokens in a question")
+     parser.add_argument(
+         "--indexing_batch_size", type=int, default=1000000, help="Batch size of the number of passages indexed"
+     )
+     parser.add_argument("--projection_size", type=int, default=768)
+     parser.add_argument(
+         "--n_subquantizers",
+         type=int,
+         default=0,
+         help="Number of subquantizers used for vector quantization, if 0 flat index is used",
+     )
+     parser.add_argument("--n_bits", type=int, default=8, help="Number of bits per subquantizer")
+     parser.add_argument("--lang", nargs="+")
+     parser.add_argument("--dataset", type=str, default="none")
+     parser.add_argument("--lowercase", action="store_true", help="lowercase text before encoding")
+     parser.add_argument("--normalize_text", action="store_true", help="normalize text")
+
+     args = parser.parse_args()
+     src.slurm.init_distributed_mode(args)
+     main(args)
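The `index_encoded_data` / `add_embeddings` pair above implements a front-consuming buffer: loaded embeddings accumulate, and fixed-size batches are repeatedly sliced off the front and pushed into the index, with the unconsumed tail returned. A minimal sketch of the same pattern, using a plain list in place of the FAISS-backed `Indexer` (the names `add_batch` and `index` here are ours):

```python
def add_batch(index, embeddings, ids, batch_size):
    # Consume up to batch_size items from the front, like add_embeddings();
    # index.extend stands in for index.index_data().
    end = min(batch_size, len(embeddings))
    index.extend(zip(ids[:end], embeddings[:end]))
    return embeddings[end:], ids[end:]

index = []
embeddings = list(range(7))
ids = [f"doc{i}" for i in range(7)]

# Drain the buffer in batches of 3 until nothing is left.
while len(embeddings) > 0:
    embeddings, ids = add_batch(index, embeddings, ids, batch_size=3)

print(len(index))  # every item indexed exactly once, in order
```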
sentence-transformers/preprocess.py ADDED
@@ -0,0 +1,68 @@
+ # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
+
+ import os
+ import argparse
+ import torch
+
+ import transformers
+ from src.normalize_text import normalize
+
+
+ def save(tensor, split_path):
+     if not os.path.exists(os.path.dirname(split_path)):
+         os.makedirs(os.path.dirname(split_path))
+     with open(split_path, 'wb') as fout:
+         torch.save(tensor, fout)
+
+ def apply_tokenizer(path, tokenizer, normalize_text=False):
+     alltokens = []
+     lines = []
+     with open(path, "r", encoding="utf-8") as fin:
+         for k, line in enumerate(fin):
+             if normalize_text:
+                 line = normalize(line)
+
+             lines.append(line)
+             # Encode in chunks of ~1M lines to bound peak memory
+             if len(lines) > 1000000:
+                 tokens = tokenizer.batch_encode_plus(lines, add_special_tokens=False)['input_ids']
+                 tokens = [torch.tensor(x, dtype=torch.int) for x in tokens]
+                 alltokens.extend(tokens)
+                 lines = []
+
+     tokens = tokenizer.batch_encode_plus(lines, add_special_tokens=False)['input_ids']
+     tokens = [torch.tensor(x, dtype=torch.int) for x in tokens]
+     alltokens.extend(tokens)
+
+     alltokens = torch.cat(alltokens)
+     return alltokens
+
+ def tokenize_file(args):
+     filename = os.path.basename(args.datapath)
+     savepath = os.path.join(args.outdir, f"{filename}.pkl")
+     if os.path.exists(savepath):
+         if args.overwrite:
+             print(f"File {savepath} already exists, overwriting")
+         else:
+             print(f"File {savepath} already exists, exiting")
+             return
+     try:
+         tokenizer = transformers.AutoTokenizer.from_pretrained(args.tokenizer, local_files_only=True)
+     except Exception:
+         tokenizer = transformers.AutoTokenizer.from_pretrained(args.tokenizer, local_files_only=False)
+     print(f"Encoding {args.datapath}...")
+     tokens = apply_tokenizer(args.datapath, tokenizer, normalize_text=args.normalize_text)
+
+     print(f"Saving at {savepath}...")
+     save(tokens, savepath)
+
+
+ if __name__ == '__main__':
+     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+     parser.add_argument("--datapath", type=str)
+     parser.add_argument("--outdir", type=str)
+     parser.add_argument("--tokenizer", type=str)
+     parser.add_argument("--overwrite", action="store_true")
+     parser.add_argument("--normalize_text", action="store_true")
+
+     args, _ = parser.parse_known_args()
+     tokenize_file(args)
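The buffering in `apply_tokenizer` above (accumulate lines, batch-encode when the buffer fills, then flush the remainder) can be sketched without the `transformers` dependency. `batch_encode` below is a hypothetical stand-in for `tokenizer.batch_encode_plus`; only the chunking pattern matches the script:

```python
def batch_encode(lines):
    # Stand-in for tokenizer.batch_encode_plus: one "token" per word.
    return [line.split() for line in lines]

def encode_chunked(lines, chunk_size=3):
    alltokens, buffer = [], []
    for line in lines:
        buffer.append(line)
        # Flush the buffer once it reaches the chunk size, bounding memory.
        if len(buffer) >= chunk_size:
            for tokens in batch_encode(buffer):
                alltokens.extend(tokens)
            buffer = []
    # Encode whatever is left in the buffer after the loop.
    for tokens in batch_encode(buffer):
        alltokens.extend(tokens)
    return alltokens

print(encode_chunked(["a b", "c", "d e f", "g"]))  # → ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```

The same token sequence comes out regardless of `chunk_size`; only peak memory changes.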
sentence-transformers/requirements.txt ADDED
@@ -0,0 +1,11 @@
+ transformers>=4.6.0,<5.0.0
+ tokenizers>=0.10.3
+ tqdm
+ torch>=1.6.0
+ torchvision
+ numpy
+ scikit-learn
+ scipy
+ nltk
+ sentencepiece
+ huggingface-hub
sentence-transformers/setup.cfg ADDED
@@ -0,0 +1,2 @@
+ [metadata]
+ description-file = README.md
sentence-transformers/setup.py ADDED
@@ -0,0 +1,41 @@
+ from setuptools import setup, find_packages
+
+ with open("README.md", mode="r", encoding="utf-8") as readme_file:
+     readme = readme_file.read()
+
+
+ setup(
+     name="sentence-transformers",
+     version="2.2.2",
+     author="Nils Reimers",
+     author_email="info@nils-reimers.de",
+     description="Multilingual text embeddings",
+     long_description=readme,
+     long_description_content_type="text/markdown",
+     license="Apache License 2.0",
+     url="https://www.SBERT.net",
+     download_url="https://github.com/UKPLab/sentence-transformers/",
+     packages=find_packages(),
+     python_requires=">=3.6.0",
+     install_requires=[
+         'transformers>=4.6.0,<5.0.0',
+         'tqdm',
+         'torch>=1.6.0',
+         'torchvision',
+         'numpy',
+         'scikit-learn',
+         'scipy',
+         'nltk',
+         'sentencepiece',
+         'huggingface-hub>=0.4.0'
+     ],
+     classifiers=[
+         "Development Status :: 5 - Production/Stable",
+         "Intended Audience :: Science/Research",
+         "License :: OSI Approved :: Apache Software License",
+         "Programming Language :: Python :: 3.6",
+         "Topic :: Scientific/Engineering :: Artificial Intelligence"
+     ],
+     keywords="Transformer Networks BERT XLNet sentence embedding PyTorch NLP deep learning"
+ )
sentence-transformers/train.py ADDED
@@ -0,0 +1,195 @@
+ # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
+
+ import os
+ import time
+ import sys
+ import torch
+ import logging
+ import json
+ import numpy as np
+ import random
+ import pickle
+
+ import torch.distributed as dist
+ from torch.utils.data import DataLoader, RandomSampler
+
+ from src.options import Options
+ from src import data, beir_utils, slurm, dist_utils, utils
+ from src import moco, inbatch
+
+
+ logger = logging.getLogger(__name__)
+
+
+ def train(opt, model, optimizer, scheduler, step):
+
+     run_stats = utils.WeightedAvgStats()
+
+     tb_logger = utils.init_tb_logger(opt.output_dir)
+
+     logger.info("Data loading")
+     if isinstance(model, torch.nn.parallel.DistributedDataParallel):
+         tokenizer = model.module.tokenizer
+     else:
+         tokenizer = model.tokenizer
+     collator = data.Collator(opt=opt)
+     train_dataset = data.load_data(opt, tokenizer)
+     logger.warning(f"Data loading finished for rank {dist_utils.get_rank()}")
+
+     train_sampler = RandomSampler(train_dataset)
+     train_dataloader = DataLoader(
+         train_dataset,
+         sampler=train_sampler,
+         batch_size=opt.per_gpu_batch_size,
+         drop_last=True,
+         num_workers=opt.num_workers,
+         collate_fn=collator,
+     )
+
+     epoch = 1
+
+     model.train()
+     while step < opt.total_steps:
+         train_dataset.generate_offset()
+
+         logger.info(f"Start epoch {epoch}")
+         for i, batch in enumerate(train_dataloader):
+             step += 1
+
+             batch = {key: value.cuda() if isinstance(value, torch.Tensor) else value for key, value in batch.items()}
+             train_loss, iter_stats = model(**batch, stats_prefix="train")
+
+             train_loss.backward()
+             optimizer.step()
+
+             scheduler.step()
+             model.zero_grad()
+
+             run_stats.update(iter_stats)
+
+             if step % opt.log_freq == 0:
+                 log = f"{step} / {opt.total_steps}"
+                 for k, v in sorted(run_stats.average_stats.items()):
+                     log += f" | {k}: {v:.3f}"
+                     if tb_logger:
+                         tb_logger.add_scalar(k, v, step)
+                 log += f" | lr: {scheduler.get_last_lr()[0]:0.3g}"
+                 log += f" | Memory: {torch.cuda.max_memory_allocated() // 1e9} GB"
+
+                 logger.info(log)
+                 run_stats.reset()
+
+             if step % opt.eval_freq == 0:
+                 if isinstance(model, torch.nn.parallel.DistributedDataParallel):
+                     encoder = model.module.get_encoder()
+                 else:
+                     encoder = model.get_encoder()
+                 eval_model(
+                     opt, query_encoder=encoder, doc_encoder=encoder, tokenizer=tokenizer, tb_logger=tb_logger, step=step
+                 )
+
+                 if dist_utils.is_main():
+                     utils.save(model, optimizer, scheduler, step, opt, opt.output_dir, "lastlog")
+
+                 model.train()
+
+             if dist_utils.is_main() and step % opt.save_freq == 0:
+                 utils.save(model, optimizer, scheduler, step, opt, opt.output_dir, f"step-{step}")
+
+             if step > opt.total_steps:
+                 break
+         epoch += 1
+
+
+ def eval_model(opt, query_encoder, doc_encoder, tokenizer, tb_logger, step):
+     for datasetname in opt.eval_datasets:
+         metrics = beir_utils.evaluate_model(
+             query_encoder,
+             doc_encoder,
+             tokenizer,
+             dataset=datasetname,
+             batch_size=opt.per_gpu_eval_batch_size,
+             norm_doc=opt.norm_doc,
+             norm_query=opt.norm_query,
+             beir_dir=opt.eval_datasets_dir,
+             score_function=opt.score_function,
+             lower_case=opt.lower_case,
+             normalize_text=opt.eval_normalize_text,
+         )
+
+         message = []
+         if dist_utils.is_main():
+             for metric in ["NDCG@10", "Recall@10", "Recall@100"]:
+                 message.append(f"{datasetname}/{metric}: {metrics[metric]:.2f}")
+                 if tb_logger is not None:
+                     tb_logger.add_scalar(f"{datasetname}/{metric}", metrics[metric], step)
+             logger.info(" | ".join(message))
+
+
+ if __name__ == "__main__":
+     logger.info("Start")
+
+     options = Options()
+     opt = options.parse()
+
+     torch.manual_seed(opt.seed)
+     slurm.init_distributed_mode(opt)
+     slurm.init_signal_handler()
+
+     directory_exists = os.path.isdir(opt.output_dir)
+     if dist.is_initialized():
+         dist.barrier()
+     os.makedirs(opt.output_dir, exist_ok=True)
+     if not directory_exists and dist_utils.is_main():
+         options.print_options(opt)
+     if dist.is_initialized():
+         dist.barrier()
+     utils.init_logger(opt)
+
+     os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+     if opt.contrastive_mode == "moco":
+         model_class = moco.MoCo
+     elif opt.contrastive_mode == "inbatch":
+         model_class = inbatch.InBatch
+     else:
+         raise ValueError(f"contrastive mode: {opt.contrastive_mode} not recognised")
+
+     if not directory_exists and opt.model_path == "none":
+         model = model_class(opt)
+         model = model.cuda()
+         optimizer, scheduler = utils.set_optim(opt, model)
+         step = 0
+     elif directory_exists:
+         model_path = os.path.join(opt.output_dir, "checkpoint", "latest")
+         model, optimizer, scheduler, opt_checkpoint, step = utils.load(
+             model_class,
+             model_path,
+             opt,
+             reset_params=False,
+         )
+         logger.info(f"Model loaded from {opt.output_dir}")
+     else:
+         model, optimizer, scheduler, opt_checkpoint, step = utils.load(
+             model_class,
+             opt.model_path,
+             opt,
+             reset_params=False if opt.continue_training else True,
+         )
+         if not opt.continue_training:
+             step = 0
+         logger.info(f"Model loaded from {opt.model_path}")
+
+     logger.info(utils.get_parameters(model))
+
+     if dist.is_initialized():
+         model = torch.nn.parallel.DistributedDataParallel(
+             model,
+             device_ids=[opt.local_rank],
+             output_device=opt.local_rank,
+             find_unused_parameters=False,
+         )
+         dist.barrier()
+
+     logger.info("Start training")
+     train(opt, model, optimizer, scheduler, step)
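The checkpoint-selection branching in train.py's `__main__` block (fresh model vs. resume from `output_dir` vs. warm-start from `--model_path`) can be summarised as a small pure function. This is an illustrative sketch; the function name and return labels are not part of the repo:

```python
def resume_mode(directory_exists, model_path, continue_training):
    # Mirrors the three-way branch at the bottom of train.py.
    if not directory_exists and model_path == "none":
        return "fresh"      # build a new model, optimizer; step starts at 0
    if directory_exists:
        return "resume"     # reload <output_dir>/checkpoint/latest, keep step
    if continue_training:
        return "continue"   # load model_path, keep optimizer state and step
    return "init-only"      # load model_path weights, reset params, step = 0

print(resume_mode(False, "none", False))  # → fresh
```

Note that an existing `output_dir` takes precedence: if a checkpoint directory is found, training resumes from it even when `--model_path` is set.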