doberst committed
Commit e81f464
1 Parent(s): 98fbfb0

Upload README.md

Files changed (1):
  1. README.md +14 -43
README.md CHANGED
@@ -13,66 +13,41 @@ industry-bert-sec-v0.1 is part of a series of industry-fine-tuned sentence_trans
 
 <!-- Provide a longer summary of what this model is. -->
 
-BERT-based 768-parameter drop-in substitute for non-industry-specific embeddings model. This model was trained on a wide range of
-publicly available U.S. Securities and Exchange Commission (SEC) regulatory filings and related documents.
+industry-bert-sec-v0.1 is a domain fine-tuned BERT-based Sentence Transformer model with 768-dimensional embeddings, intended as a "drop-in"
+substitute for generic embedding models in financial and regulatory domains. This model was trained on a wide range of publicly available U.S. Securities and Exchange Commission (SEC) regulatory filings and related documents.
 
 - **Developed by:** llmware
-- **Shared by [optional]:** Darren Oberst
 - **Model type:** BERT-based Industry domain fine-tuned Sentence Transformer architecture
 - **Language(s) (NLP):** English
 - **License:** Apache 2.0
 - **Finetuned from model [optional]:** BERT-based model, fine-tuning methodology described below.
 
-### Model Sources [optional]
-
-<!-- Provide the basic links for the model. -->
-
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-
-## Uses
-
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
-### Direct Use
-
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
-This model is intended to be used as a sentence embedding model, specifically for financial services and use cases involving regulatory and financial filing documents.
-
-[More Information Needed]
-
-### Downstream Use [optional]
-
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
-[More Information Needed]
-
-### Out-of-Scope Use
-
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
-[More Information Needed]
+## Model Use
+
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("llmware/industry-bert-sec-v0.1")
+model = AutoModel.from_pretrained("llmware/industry-bert-sec-v0.1")
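Note that the snippet added above loads the model but returns token-level hidden states rather than a single vector per sentence, so a pooling step is needed to obtain sentence embeddings. A minimal sketch follows; mean pooling is an assumption here, since the card does not state which pooling strategy was used during fine-tuning.

```python
# Minimal sketch of producing sentence embeddings with this model.
# Mean pooling is an assumption -- the model card does not specify
# the pooling strategy used at training time.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("llmware/industry-bert-sec-v0.1")
model = AutoModel.from_pretrained("llmware/industry-bert-sec-v0.1")

sentences = [
    "The registrant filed its quarterly report on Form 10-Q.",
    "Revenue increased due to higher net interest income.",
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average the token states, masking out padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```

Cosine similarity between the resulting rows then gives a semantic similarity score between the two sentences.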
 
 ## Bias, Risks, and Limitations
 
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
-[More Information Needed]
+This is a semantic embedding model, fine-tuned on public domain SEC filings and regulatory documents. Results may vary if used outside of this
+domain, and like any embedding model, there is always the potential for anomalies in the vector embedding space. No specific safeguards have been
+put in place for safety or to mitigate potential bias in the dataset.
 
 ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
-This model was fine-tuned using a custom self-supervised procedure that combined contrastive techniques with stochastic injections of
-distortions in the samples. The methodology was derived, adapted and inspired primarily from three research papers cited below:
-TSDAE (Reimers), DeClutr (Giorgi), and Contrastive Tension (Carlsson).
+This model was fine-tuned using a custom self-supervised procedure and a custom dataset that combined contrastive techniques
+with stochastic injections of distortions in the samples. The methodology was derived from, adapted from, and inspired primarily by
+three research papers cited below: TSDAE (Reimers), DeClutr (Giorgi), and Contrastive Tension (Carlsson).
 
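The custom procedure and dataset behind this model are not published. As one concrete reference point, the TSDAE-style denoising component cited below can be reproduced with the `sentence-transformers` library roughly as sketched here; the corpus path, base checkpoint, and hyperparameters are illustrative placeholders, not the actual recipe used for industry-bert-sec-v0.1.

```python
# Rough sketch of TSDAE-style denoising fine-tuning with sentence-transformers.
# The corpus path, base checkpoint, and hyperparameters are illustrative
# placeholders -- not the actual recipe behind industry-bert-sec-v0.1.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

# Hypothetical corpus: one SEC-filing sentence per line.
with open("sec_sentences.txt") as f:
    train_sentences = [line.strip() for line in f if line.strip()]

# Assemble a sentence transformer from a generic BERT encoder with CLS pooling.
word_embedding = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding, pooling])

# The dataset wrapper injects noise (random token deletion) into each sentence;
# the loss trains the encoder so a tied decoder can reconstruct the original.
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
)
```
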
 ## Citation [optional]
 
-Custom training protocol used to train the model, which was derived and inspired by the following papers:
+Custom self-supervised training protocol used to train the model, which was derived from and inspired by the following papers:
 
 @article{wang-2021-TSDAE,
     title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning",
@@ -107,12 +82,8 @@ Custom training protocol used to train the model, which was derived and inspired
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 
 
-## Model Card Authors [optional]
-
-[More Information Needed]
-
 ## Model Card Contact
 
-[More Information Needed]
+Darren Oberst @ llmware
 