full set E-I triplets

Files changed (13)

1_Pooling/config.json +10 -0
README.md +877 -0
config.json +58 -0
config_sentence_transformers.json +10 -0
configuration_hf_nomic_bert.py +56 -0
model.safetensors +3 -0
modeling_hf_nomic_bert.py +1234 -0
modules.json +14 -0
sentence_bert_config.json +4 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +62 -0
vocab.txt +0 -0

1_Pooling/config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "word_embedding_dimension": 768,
+  "pooling_mode_cls_token": false,
+  "pooling_mode_mean_tokens": true,
+  "pooling_mode_max_tokens": false,
+  "pooling_mode_mean_sqrt_len_tokens": false,
+  "pooling_mode_weightedmean_tokens": false,
+  "pooling_mode_lasttoken": false,
+  "include_prompt": true
+}
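
The pooling config above turns on only `pooling_mode_mean_tokens`, so the sentence embedding is the average of the token embeddings. As a rough illustration, here is a minimal pure-Python sketch of masked mean pooling. The helper name `mean_pool` and the toy inputs are hypothetical and not part of this commit; the actual `Pooling` module in Sentence Transformers works on batched tensors rather than Python lists.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the embeddings of the tokens whose mask entry is 1.

    Hypothetical helper for illustration only.
    """
    if not token_embeddings:
        raise ValueError("at least one token embedding is required")
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")

    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value

    # Guard against an all-padding sequence to avoid division by zero.
    kept = max(kept, 1)
    return [total / kept for total in totals]


# Toy example: three tokens of dimension 2; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The padding token is excluded from both the sum and the count, so it does not pull the average toward its values. In this model the real vectors have 768 components, matching `word_embedding_dimension`.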

README.md ADDED

	@@ -0,0 +1,877 @@

+---
+language: []
+library_name: sentence-transformers
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- dataset_size:100K<n<1M
+- loss:CachedMultipleNegativesRankingLoss
+base_model: nomic-ai/nomic-embed-text-v1.5
+metrics:
+- cosine_accuracy
+- dot_accuracy
+- manhattan_accuracy
+- euclidean_accuracy
+- max_accuracy
+widget:
+- source_sentence: 'search_query: adorime'
+  sentences:
+  - 'search_query: green air scents llc'
+  - 'search_query: dpms sbr accessories'
+  - 'search_query: sweaters cowl neck men'
+- source_sentence: 'search_query: serving'
+  sentences:
+  - 'search_query: ceramic cups without handles'
+  - 'search_query: 100 mm cigarette case'
+  - 'search_query: toddler girl leopard midi'
+- source_sentence: 'search_query: haierc'
+  sentences:
+  - 'search_query: homder'
+  - 'search_query: 3d milling metal cnc'
+  - 'search_query: sandals for women'
+- source_sentence: 'search_query: poppies'
+  sentences:
+  - 'search_query: fake plants without pot'
+  - 'search_query: tonsil stone remover'
+  - 'search_query: vestido corto sexy de mujer'
+- source_sentence: 'search_query: dab rig'
+  sentences:
+  - 'search_query: volcano weed vaporizer'
+  - 'search_query: 22 gold chain for men'
+  - 'search_query: apple watch screen protector'
+pipeline_tag: sentence-similarity
+model-index:
+- name: SentenceTransformer based on nomic-ai/nomic-embed-text-v1.5
+  results:
+  - task:
+      type: triplet
+      name: Triplet
+    dataset:
+      name: triplet esci
+      type: triplet-esci
+    metrics:
+    - type: cosine_accuracy
+      value: 0.7405
+      name: Cosine Accuracy
+    - type: dot_accuracy
+      value: 0.269
+      name: Dot Accuracy
+    - type: manhattan_accuracy
+      value: 0.7432
+      name: Manhattan Accuracy
+    - type: euclidean_accuracy
+      value: 0.7457
+      name: Euclidean Accuracy
+    - type: max_accuracy
+      value: 0.7457
+      name: Max Accuracy
+---
+# SentenceTransformer based on nomic-ai/nomic-embed-text-v1.5
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+## Model Details
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Base model:** [nomic-ai/nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) <!-- at revision 91d2d6bfdddf0b0da840f901b533e99bae30d757 -->
+- **Maximum Sequence Length:** 8192 tokens
+- **Output Dimensionality:** 768 dimensions
+- **Similarity Function:** Cosine Similarity
+<!-- - **Training Dataset:** Unknown -->
+<!-- - **Language:** Unknown -->
+<!-- - **License:** Unknown -->
+### Model Sources
+- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+### Full Model Architecture
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel
+  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+)
+```
+## Usage
+### Direct Usage (Sentence Transformers)
+First install the Sentence Transformers library:
+```bash
+pip install -U sentence-transformers
+```
+Then you can load this model and run inference.
+```python
+from sentence_transformers import SentenceTransformer
+# Download from the 🤗 Hub
+model = SentenceTransformer("sentence_transformers_model_id", trust_remote_code=True)  # custom NomicBert code requires trust_remote_code
+# Run inference
+sentences = [
+    'search_query: dab rig',
+    'search_query: volcano weed vaporizer',
+    'search_query: 22 gold chain for men',
+]
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# (3, 768)
+# Get the similarity scores for the embeddings
+similarities = model.similarity(embeddings, embeddings)
+print(similarities.shape)
+# torch.Size([3, 3])
+```
+<!--
+### Direct Usage (Transformers)
+<details><summary>Click to see the direct usage in Transformers</summary>
+</details>
+-->
+<!--
+### Downstream Usage (Sentence Transformers)
+You can finetune this model on your own dataset.
+<details><summary>Click to expand</summary>
+</details>
+-->
+<!--
+### Out-of-Scope Use
+*List how the model may foreseeably be misused and address what users ought not to do with the model.*
+-->
+## Evaluation
+### Metrics
+#### Triplet
+* Dataset: `triplet-esci`
+* Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)
+| Metric              | Value      |
+|:--------------------|:-----------|
+| **cosine_accuracy** | **0.7405** |
+| dot_accuracy        | 0.269      |
+| manhattan_accuracy  | 0.7432     |
+| euclidean_accuracy  | 0.7457     |
+| max_accuracy        | 0.7457     |
+<!--
+## Bias, Risks and Limitations
+*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+-->
+<!--
+### Recommendations
+*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+-->
+## Training Details
+### Training Dataset
+#### Unnamed Dataset
+* Size: 167,039 training samples
+* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
+* Approximate statistics based on the first 1000 samples:
+  |         | anchor                                                                           | positive                                                                            | negative                                                                           |
+  |:--------|:---------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|
+  | type    | string                                                                           | string                                                                              | string                                                                             |
+  | details | <ul><li>min: 7 tokens</li><li>mean: 11.1 tokens</li><li>max: 38 tokens</li></ul> | <ul><li>min: 14 tokens</li><li>mean: 43.23 tokens</li><li>max: 124 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 43.16 tokens</li><li>max: 97 tokens</li></ul> |
+* Samples:
+  | anchor                                                  | positive                                                                                                                                                                                             | negative                                                                                                                                                                                                                                  |
+  |:--------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+  | <code>search_query: foos ball coffee table</code>       | <code>search_document: KICK Vanquish 55" in Foosball Table, KICK, Blue/Gray</code>                                                                                                                   | <code>search_document: KICK Legend 55" Foosball Table (Black), KICK, Black</code>                                                                                                                                                         |
+  | <code>search_query: bathroom rugs white washable</code> | <code>search_document: Luxury Bath Mat Floor Towel Set - Absorbent Cotton Hotel Spa Shower/Bathtub Mats [Not a Bathroom Rug] 22"x34" | White | 2 Pack, White Classic, White</code>                   | <code>search_document: Utopia Towels Cotton Banded Bath Mats, White [Not a Bathroom Rug] 21 x 34 Inches, 100% Ring Spun Cotton - Highly Absorbent and Machine Washable Shower Bathroom Floor Mat (Pack of 2), Utopia Towels, White</code> |
+  | <code>search_query: kids gloves</code>                  | <code>search_document: EvridWear Boys Girls Magic Stretch Gripper Gloves 3 Pair Pack Assortment, Kids One Size Winter Warm Gloves Children (8-14Years, 3 Pairs Camo), Evridwear, 3 Pairs Camo</code> | <code>search_document: Body Glove Little Boys 2-Piece UPF 50+ Rash Guard Swimsuit Set (2 Piece), All Black, Size 5, Body Glove, All Black</code>                                                                                          |
+* Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
+  ```json
+  {
+      "scale": 20.0,
+      "similarity_fct": "cos_sim"
+  }
+  ```
+### Evaluation Dataset
+#### Unnamed Dataset
+* Size: 10,000 evaluation samples
+* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
+* Approximate statistics based on the first 1000 samples:
+  |         | anchor                                                                            | positive                                                                           | negative                                                                            |
+  |:--------|:----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
+  | type    | string                                                                            | string                                                                             | string                                                                              |
+  | details | <ul><li>min: 7 tokens</li><li>mean: 11.44 tokens</li><li>max: 31 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 42.26 tokens</li><li>max: 92 tokens</li></ul> | <ul><li>min: 16 tokens</li><li>mean: 42.28 tokens</li><li>max: 105 tokens</li></ul> |
+* Samples:
+  | anchor                                                              | positive                                                                                                                                                                                                                                                          | negative                                                                                                                                                                                          |
+  |:--------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+  | <code>search_query: defender series iphone 8</code>                 | <code>search_document: Hand-e Muscle Series Belt Clip Case for Apple iPhone 7 / iPhone 8 / iPhone SE “2020” (4.7”) 2-in-1 Protective Defender w Screen Protector & Holster & Kickstand/Shock & Drop Proof – Camouflage/Orange, Hand-e, Camouflage / Orange</code> | <code>search_document: OtterBox Defender Series Rugged Case for iPhone 8 PLUS & iPhone 7 PLUS - Case Only - Non-Retail Packaging - Dark Lake - With Microbial Defense, OtterBox, Dark Lake</code> |
+  | <code>search_query: joy mangano</code>                              | <code>search_document: Joy by Joy Mangano 11-Piece Complete Luxury Towel Set, Ivory, Joy Mangano, Ivory</code>                                                                                                                                                    | <code>search_document: BAGSMART Jewelry Organizer Case Travel Jewelry Storage Bag for Necklace, Earrings, Rings, Bracelet, Soft Pink, BAGSMART, Soft Pink</code>                                  |
+  | <code>search_query: cashel fly masks for horses without ears</code> | <code>search_document: Cashel Crusader Designer Horse Fly Mask, Leopard, Weanling, Cashel, Leopard</code>                                                                                                                                                         | <code>search_document: Cashel Crusader Designer Horse Fly Mask with Ears, Teal Tribal, Weanling, Cashel, Teal Tribal</code>                                                                       |
+* Loss: [<code>CachedMultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) with these parameters:
+  ```json
+  {
+      "scale": 20.0,
+      "similarity_fct": "cos_sim"
+  }
+  ```
+### Training Hyperparameters
+#### Non-Default Hyperparameters
+- `per_device_train_batch_size`: 4
+- `per_device_eval_batch_size`: 4
+- `gradient_accumulation_steps`: 4
+- `learning_rate`: 1e-06
+- `num_train_epochs`: 5
+- `lr_scheduler_type`: cosine_with_restarts
+- `warmup_ratio`: 0.1
+- `dataloader_drop_last`: True
+- `dataloader_num_workers`: 4
+- `dataloader_prefetch_factor`: 2
+- `load_best_model_at_end`: True
+- `batch_sampler`: no_duplicates
+#### All Hyperparameters
+<details><summary>Click to expand</summary>
+- `overwrite_output_dir`: False
+- `do_predict`: False
+- `prediction_loss_only`: True
+- `per_device_train_batch_size`: 4
+- `per_device_eval_batch_size`: 4
+- `per_gpu_train_batch_size`: None
+- `per_gpu_eval_batch_size`: None
+- `gradient_accumulation_steps`: 4
+- `eval_accumulation_steps`: None
+- `learning_rate`: 1e-06
+- `weight_decay`: 0.0
+- `adam_beta1`: 0.9
+- `adam_beta2`: 0.999
+- `adam_epsilon`: 1e-08
+- `max_grad_norm`: 1.0
+- `num_train_epochs`: 5
+- `max_steps`: -1
+- `lr_scheduler_type`: cosine_with_restarts
+- `lr_scheduler_kwargs`: {}
+- `warmup_ratio`: 0.1
+- `warmup_steps`: 0
+- `log_level`: passive
+- `log_level_replica`: warning
+- `log_on_each_node`: True
+- `logging_nan_inf_filter`: True
+- `save_safetensors`: True
+- `save_on_each_node`: False
+- `save_only_model`: False
+- `no_cuda`: False
+- `use_cpu`: False
+- `use_mps_device`: False
+- `seed`: 42
+- `data_seed`: None
+- `jit_mode_eval`: False
+- `use_ipex`: False
+- `bf16`: False
+- `fp16`: False
+- `fp16_opt_level`: O1
+- `half_precision_backend`: auto
+- `bf16_full_eval`: False
+- `fp16_full_eval`: False
+- `tf32`: None
+- `local_rank`: 0
+- `ddp_backend`: None
+- `tpu_num_cores`: None
+- `tpu_metrics_debug`: False
+- `debug`: []
+- `dataloader_drop_last`: True
+- `dataloader_num_workers`: 4
+- `dataloader_prefetch_factor`: 2
+- `past_index`: -1
+- `disable_tqdm`: False
+- `remove_unused_columns`: True
+- `label_names`: None
+- `load_best_model_at_end`: True
+- `ignore_data_skip`: False
+- `fsdp`: []
+- `fsdp_min_num_params`: 0
+- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+- `fsdp_transformer_layer_cls_to_wrap`: None
+- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True}
+- `deepspeed`: None
+- `label_smoothing_factor`: 0.0
+- `optim`: adamw_torch
+- `optim_args`: None
+- `adafactor`: False
+- `group_by_length`: False
+- `length_column_name`: length
+- `ddp_find_unused_parameters`: None
+- `ddp_bucket_cap_mb`: None
+- `ddp_broadcast_buffers`: False
+- `dataloader_pin_memory`: True
+- `dataloader_persistent_workers`: False
+- `skip_memory_metrics`: True
+- `use_legacy_prediction_loop`: False
+- `push_to_hub`: False
+- `resume_from_checkpoint`: None
+- `hub_model_id`: None
+- `hub_strategy`: every_save
+- `hub_private_repo`: False
+- `hub_always_push`: False
+- `gradient_checkpointing`: False
+- `gradient_checkpointing_kwargs`: None
+- `include_inputs_for_metrics`: False
+- `fp16_backend`: auto
+- `push_to_hub_model_id`: None
+- `push_to_hub_organization`: None
+- `mp_parameters`:
+- `auto_find_batch_size`: False
+- `full_determinism`: False
+- `torchdynamo`: None
+- `ray_scope`: last
+- `ddp_timeout`: 1800
+- `torch_compile`: False
+- `torch_compile_backend`: None
+- `torch_compile_mode`: None
+- `dispatch_batches`: None
+- `split_batches`: None
+- `include_tokens_per_second`: False
+- `include_num_input_tokens_seen`: False
+- `neftune_noise_alpha`: None
+- `batch_sampler`: no_duplicates
+- `multi_dataset_batch_sampler`: proportional
+</details>
+### Training Logs
+<details><summary>Click to expand</summary>
+| Epoch  | Step  | Training Loss | loss   | triplet-esci_cosine_accuracy |
+|:------:|:-----:|:-------------:|:------:|:----------------------------:|
+| 0.0096 | 100   | 0.6669        | -      | -                            |
+| 0.0192 | 200   | 0.6633        | -      | -                            |
+| 0.0287 | 300   | 0.6575        | -      | -                            |
+| 0.0383 | 400   | 0.6638        | -      | -                            |
+| 0.0479 | 500   | 0.6191        | -      | -                            |
+| 0.0575 | 600   | 0.6464        | -      | -                            |
+| 0.0671 | 700   | 0.6291        | -      | -                            |
+| 0.0766 | 800   | 0.5973        | -      | -                            |
+| 0.0862 | 900   | 0.605         | -      | -                            |
+| 0.0958 | 1000  | 0.6278        | 0.6525 | 0.7269                       |
+| 0.1054 | 1100  | 0.6041        | -      | -                            |
+| 0.1149 | 1200  | 0.6077        | -      | -                            |
+| 0.1245 | 1300  | 0.589         | -      | -                            |
+| 0.1341 | 1400  | 0.5811        | -      | -                            |
+| 0.1437 | 1500  | 0.5512        | -      | -                            |
+| 0.1533 | 1600  | 0.5907        | -      | -                            |
+| 0.1628 | 1700  | 0.5718        | -      | -                            |
+| 0.1724 | 1800  | 0.5446        | -      | -                            |
+| 0.1820 | 1900  | 0.546         | -      | -                            |
+| 0.1916 | 2000  | 0.5141        | 0.6105 | 0.7386                       |
+| 0.2012 | 2100  | 0.5359        | -      | -                            |
+| 0.2107 | 2200  | 0.5093        | -      | -                            |
+| 0.2203 | 2300  | 0.5384        | -      | -                            |
+| 0.2299 | 2400  | 0.5582        | -      | -                            |
+| 0.2395 | 2500  | 0.5038        | -      | -                            |
+| 0.2490 | 2600  | 0.5031        | -      | -                            |
+| 0.2586 | 2700  | 0.5393        | -      | -                            |
+| 0.2682 | 2800  | 0.4979        | -      | -                            |
+| 0.2778 | 2900  | 0.5221        | -      | -                            |
+| 0.2874 | 3000  | 0.4956        | 0.5852 | 0.7495                       |
+| 0.2969 | 3100  | 0.506         | -      | -                            |
+| 0.3065 | 3200  | 0.4962        | -      | -                            |
+| 0.3161 | 3300  | 0.4713        | -      | -                            |
+| 0.3257 | 3400  | 0.5016        | -      | -                            |
+| 0.3353 | 3500  | 0.4749        | -      | -                            |
+| 0.3448 | 3600  | 0.4732        | -      | -                            |
+| 0.3544 | 3700  | 0.4789        | -      | -                            |
+| 0.3640 | 3800  | 0.4825        | -      | -                            |
+| 0.3736 | 3900  | 0.4803        | -      | -                            |
+| 0.3832 | 4000  | 0.4471        | 0.5743 | 0.7546                       |
+| 0.3927 | 4100  | 0.4593        | -      | -                            |
+| 0.4023 | 4200  | 0.4481        | -      | -                            |
+| 0.4119 | 4300  | 0.4603        | -      | -                            |
+| 0.4215 | 4400  | 0.4569        | -      | -                            |
+| 0.4310 | 4500  | 0.4807        | -      | -                            |
+| 0.4406 | 4600  | 0.4368        | -      | -                            |
+| 0.4502 | 4700  | 0.4532        | -      | -                            |
+| 0.4598 | 4800  | 0.4432        | -      | -                            |
+| 0.4694 | 4900  | 0.4802        | -      | -                            |
+| 0.4789 | 5000  | 0.4643        | 0.5663 | 0.7593                       |
+| 0.4885 | 5100  | 0.4154        | -      | -                            |
+| 0.4981 | 5200  | 0.4441        | -      | -                            |
+| 0.5077 | 5300  | 0.4156        | -      | -                            |
+| 0.5173 | 5400  | 0.4273        | -      | -                            |
+| 0.5268 | 5500  | 0.3988        | -      | -                            |
+| 0.5364 | 5600  | 0.3942        | -      | -                            |
+| 0.5460 | 5700  | 0.4186        | -      | -                            |
+| 0.5556 | 5800  | 0.423         | -      | -                            |
+| 0.5651 | 5900  | 0.434         | -      | -                            |
+| 0.5747 | 6000  | 0.4136        | 0.5704 | 0.7616                       |
+| 0.5843 | 6100  | 0.3968        | -      | -                            |
+| 0.5939 | 6200  | 0.4045        | -      | -                            |
+| 0.6035 | 6300  | 0.4122        | -      | -                            |
+| 0.6130 | 6400  | 0.3618        | -      | -                            |
+| 0.6226 | 6500  | 0.341         | -      | -                            |
+| 0.6322 | 6600  | 0.3689        | -      | -                            |
+| 0.6418 | 6700  | 0.3621        | -      | -                            |
+| 0.6514 | 6800  | 0.3774        | -      | -                            |
+| 0.6609 | 6900  | 0.3519        | -      | -                            |
+| 0.6705 | 7000  | 0.3974        | 0.5729 | 0.7644                       |
+| 0.6801 | 7100  | 0.3443        | -      | -                            |
+| 0.6897 | 7200  | 0.3665        | -      | -                            |
+| 0.6993 | 7300  | 0.3683        | -      | -                            |
+| 0.7088 | 7400  | 0.3593        | -      | -                            |
+| 0.7184 | 7500  | 0.3419        | -      | -                            |
+| 0.7280 | 7600  | 0.3587        | -      | -                            |
+| 0.7376 | 7700  | 0.3463        | -      | -                            |
+| 0.7471 | 7800  | 0.3417        | -      | -                            |
+| 0.7567 | 7900  | 0.32          | -      | -                            |
+| 0.7663 | 8000  | 0.32          | 0.5735 | 0.7677                       |
+| 0.7759 | 8100  | 0.3296        | -      | -                            |
+| 0.7855 | 8200  | 0.3492        | -      | -                            |
+| 0.7950 | 8300  | 0.3022        | -      | -                            |
+| 0.8046 | 8400  | 0.3159        | -      | -                            |
+| 0.8142 | 8500  | 0.3172        | -      | -                            |
+| 0.8238 | 8600  | 0.3157        | -      | -                            |
+| 0.8334 | 8700  | 0.3271        | -      | -                            |
+| 0.8429 | 8800  | 0.337         | -      | -                            |
+| 0.8525 | 8900  | 0.322         | -      | -                            |
+| 0.8621 | 9000  | 0.3187        | 0.5803 | 0.7652                       |
+| 0.8717 | 9100  | 0.307         | -      | -                            |
+| 0.8812 | 9200  | 0.2984        | -      | -                            |
+| 0.8908 | 9300  | 0.2727        | -      | -                            |
+| 0.9004 | 9400  | 0.304         | -      | -                            |
+| 0.9100 | 9500  | 0.321         | -      | -                            |
+| 0.9196 | 9600  | 0.304         | -      | -                            |
+| 0.9291 | 9700  | 0.3302        | -      | -                            |
+| 0.9387 | 9800  | 0.3302        | -      | -                            |
+| 0.9483 | 9900  | 0.3134        | -      | -                            |
+| 0.9579 | 10000 | 0.2936        | 0.5858 | 0.7671                       |
+| 0.9675 | 10100 | 0.2953        | -      | -                            |
+| 0.9770 | 10200 | 0.3035        | -      | -                            |
+| 0.9866 | 10300 | 0.303         | -      | -                            |
+| 0.9962 | 10400 | 0.2606        | -      | -                            |
+| 1.0058 | 10500 | 0.2615        | -      | -                            |
+| 1.0153 | 10600 | 0.2703        | -      | -                            |
+| 1.0249 | 10700 | 0.2761        | -      | -                            |
+| 1.0345 | 10800 | 0.2559        | -      | -                            |
+| 1.0441 | 10900 | 0.2672        | -      | -                            |
+| 1.0537 | 11000 | 0.2656        | 0.5933 | 0.7676                       |
+| 1.0632 | 11100 | 0.2825        | -      | -                            |
+| 1.0728 | 11200 | 0.2484        | -      | -                            |
+| 1.0824 | 11300 | 0.2472        | -      | -                            |
+| 1.0920 | 11400 | 0.2678        | -      | -                            |
+| 1.1016 | 11500 | 0.2443        | -      | -                            |
+| 1.1111 | 11600 | 0.2685        | -      | -                            |
+| 1.1207 | 11700 | 0.2504        | -      | -                            |
+| 1.1303 | 11800 | 0.2431        | -      | -                            |
+| 1.1399 | 11900 | 0.2248        | -      | -                            |
+| 1.1495 | 12000 | 0.2229        | 0.5958 | 0.7688                       |
+| 1.1590 | 12100 | 0.228         | -      | -                            |
+| 1.1686 | 12200 | 0.2304        | -      | -                            |
+| 1.1782 | 12300 | 0.2193        | -      | -                            |
+| 1.1878 | 12400 | 0.2238        | -      | -                            |
+| 1.1973 | 12500 | 0.1957        | -      | -                            |
+| 1.2069 | 12600 | 0.2075        | -      | -                            |
+| 1.2165 | 12700 | 0.2014        | -      | -                            |
+| 1.2261 | 12800 | 0.2222        | -      | -                            |
+| 1.2357 | 12900 | 0.2059        | -      | -                            |
+| 1.2452 | 13000 | 0.2051        | 0.6077 | 0.7651                       |
+| 1.2548 | 13100 | 0.2076        | -      | -                            |
+| 1.2644 | 13200 | 0.226         | -      | -                            |
+| 1.2740 | 13300 | 0.1941        | -      | -                            |
+| 1.2836 | 13400 | 0.2053        | -      | -                            |
+| 1.2931 | 13500 | 0.2003        | -      | -                            |
+| 1.3027 | 13600 | 0.1947        | -      | -                            |
+| 1.3123 | 13700 | 0.1914        | -      | -                            |
+| 1.3219 | 13800 | 0.1956        | -      | -                            |
+| 1.3314 | 13900 | 0.1862        | -      | -                            |
+| 1.3410 | 14000 | 0.1873        | 0.6110 | 0.7646                       |
+| 1.3506 | 14100 | 0.1812        | -      | -                            |
+| 1.3602 | 14200 | 0.1828        | -      | -                            |
+| 1.3698 | 14300 | 0.1696        | -      | -                            |
+| 1.3793 | 14400 | 0.1705        | -      | -                            |
+| 1.3889 | 14500 | 0.1746        | -      | -                            |
+| 1.3985 | 14600 | 0.1756        | -      | -                            |
+| 1.4081 | 14700 | 0.1682        | -      | -                            |
+| 1.4177 | 14800 | 0.1769        | -      | -                            |
+| 1.4272 | 14900 | 0.1795        | -      | -                            |
+| 1.4368 | 15000 | 0.1736        | 0.6278 | 0.7616                       |
+| 1.4464 | 15100 | 0.1546        | -      | -                            |
+| 1.4560 | 15200 | 0.1643        | -      | -                            |
+| 1.4656 | 15300 | 0.1903        | -      | -                            |
+| 1.4751 | 15400 | 0.1902        | -      | -                            |
+| 1.4847 | 15500 | 0.1531        | -      | -                            |
+| 1.4943 | 15600 | 0.1711        | -      | -                            |
+| 1.5039 | 15700 | 0.1546        | -      | -                            |
+| 1.5134 | 15800 | 0.1503        | -      | -                            |
+| 1.5230 | 15900 | 0.1429        | -      | -                            |
+| 1.5326 | 16000 | 0.147         | 0.6306 | 0.7623                       |
+| 1.5422 | 16100 | 0.1507        | -      | -                            |
+| 1.5518 | 16200 | 0.152         | -      | -                            |
+| 1.5613 | 16300 | 0.1602        | -      | -                            |
+| 1.5709 | 16400 | 0.1541        | -      | -                            |
+| 1.5805 | 16500 | 0.1491        | -      | -                            |
+| 1.5901 | 16600 | 0.1378        | -      | -                            |
+| 1.5997 | 16700 | 0.1505        | -      | -                            |
+| 1.6092 | 16800 | 0.1334        | -      | -                            |
+| 1.6188 | 16900 | 0.1288        | -      | -                            |
+| 1.6284 | 17000 | 0.1168        | 0.6372 | 0.7629                       |
+| 1.6380 | 17100 | 0.135         | -      | -                            |
+| 1.6475 | 17200 | 0.1239        | -      | -                            |
+| 1.6571 | 17300 | 0.1398        | -      | -                            |
+| 1.6667 | 17400 | 0.1292        | -      | -                            |
+| 1.6763 | 17500 | 0.1414        | -      | -                            |
+| 1.6859 | 17600 | 0.116         | -      | -                            |
+| 1.6954 | 17700 | 0.1302        | -      | -                            |
+| 1.7050 | 17800 | 0.1194        | -      | -                            |
+| 1.7146 | 17900 | 0.1394        | -      | -                            |
+| 1.7242 | 18000 | 0.1316        | 0.6561 | 0.7592                       |
+| 1.7338 | 18100 | 0.1246        | -      | -                            |
+| 1.7433 | 18200 | 0.1277        | -      | -                            |
+| 1.7529 | 18300 | 0.1055        | -      | -                            |
+| 1.7625 | 18400 | 0.1211        | -      | -                            |
+| 1.7721 | 18500 | 0.1107        | -      | -                            |
+| 1.7817 | 18600 | 0.1145        | -      | -                            |
+| 1.7912 | 18700 | 0.1162        | -      | -                            |
+| 1.8008 | 18800 | 0.1114        | -      | -                            |
+| 1.8104 | 18900 | 0.1182        | -      | -                            |
+| 1.8200 | 19000 | 0.1152        | 0.6567 | 0.7591                       |
+| 1.8295 | 19100 | 0.1212        | -      | -                            |
+| 1.8391 | 19200 | 0.1253        | -      | -                            |
+| 1.8487 | 19300 | 0.115         | -      | -                            |
+| 1.8583 | 19400 | 0.1292        | -      | -                            |
+| 1.8679 | 19500 | 0.1151        | -      | -                            |
+| 1.8774 | 19600 | 0.1005        | -      | -                            |
+| 1.8870 | 19700 | 0.1079        | -      | -                            |
+| 1.8966 | 19800 | 0.0954        | -      | -                            |
+| 1.9062 | 19900 | 0.1045        | -      | -                            |
+| 1.9158 | 20000 | 0.1086        | 0.6727 | 0.7554                       |
+| 1.9253 | 20100 | 0.1174        | -      | -                            |
+| 1.9349 | 20200 | 0.1108        | -      | -                            |
+| 1.9445 | 20300 | 0.0992        | -      | -                            |
+| 1.9541 | 20400 | 0.1168        | -      | -                            |
+| 1.9636 | 20500 | 0.1028        | -      | -                            |
+| 1.9732 | 20600 | 0.1126        | -      | -                            |
+| 1.9828 | 20700 | 0.1113        | -      | -                            |
+| 1.9924 | 20800 | 0.1065        | -      | -                            |
+| 2.0020 | 20900 | 0.078         | -      | -                            |
+| 2.0115 | 21000 | 0.0921        | 0.6727 | 0.7568                       |
+| 2.0211 | 21100 | 0.0866        | -      | -                            |
+| 2.0307 | 21200 | 0.0918        | -      | -                            |
+| 2.0403 | 21300 | 0.0893        | -      | -                            |
+| 2.0499 | 21400 | 0.0882        | -      | -                            |
+| 2.0594 | 21500 | 0.0986        | -      | -                            |
+| 2.0690 | 21600 | 0.0923        | -      | -                            |
+| 2.0786 | 21700 | 0.0805        | -      | -                            |
+| 2.0882 | 21800 | 0.0887        | -      | -                            |
+| 2.0978 | 21900 | 0.1           | -      | -                            |
+| 2.1073 | 22000 | 0.0957        | 0.6854 | 0.7539                       |
+| 2.1169 | 22100 | 0.0921        | -      | -                            |
+| 2.1265 | 22200 | 0.0892        | -      | -                            |
+| 2.1361 | 22300 | 0.0805        | -      | -                            |
+| 2.1456 | 22400 | 0.0767        | -      | -                            |
+| 2.1552 | 22500 | 0.0715        | -      | -                            |
+| 2.1648 | 22600 | 0.083         | -      | -                            |
+| 2.1744 | 22700 | 0.0755        | -      | -                            |
+| 2.1840 | 22800 | 0.075         | -      | -                            |
+| 2.1935 | 22900 | 0.0724        | -      | -                            |
+| 2.2031 | 23000 | 0.0822        | 0.6913 | 0.7534                       |
+| 2.2127 | 23100 | 0.0623        | -      | -                            |
+| 2.2223 | 23200 | 0.0765        | -      | -                            |
+| 2.2319 | 23300 | 0.0755        | -      | -                            |
+| 2.2414 | 23400 | 0.0786        | -      | -                            |
+| 2.2510 | 23500 | 0.0651        | -      | -                            |
+| 2.2606 | 23600 | 0.081         | -      | -                            |
+| 2.2702 | 23700 | 0.0664        | -      | -                            |
+| 2.2797 | 23800 | 0.0906        | -      | -                            |
+| 2.2893 | 23900 | 0.0714        | -      | -                            |
+| 2.2989 | 24000 | 0.0703        | 0.6971 | 0.7536                       |
+| 2.3085 | 24100 | 0.0672        | -      | -                            |
+| 2.3181 | 24200 | 0.0754        | -      | -                            |
+| 2.3276 | 24300 | 0.0687        | -      | -                            |
+| 2.3372 | 24400 | 0.0668        | -      | -                            |
+| 2.3468 | 24500 | 0.0616        | -      | -                            |
+| 2.3564 | 24600 | 0.0693        | -      | -                            |
+| 2.3660 | 24700 | 0.0587        | -      | -                            |
+| 2.3755 | 24800 | 0.0612        | -      | -                            |
+| 2.3851 | 24900 | 0.0559        | -      | -                            |
+| 2.3947 | 25000 | 0.0676        | 0.7128 | 0.7497                       |
+| 2.4043 | 25100 | 0.0607        | -      | -                            |
+| 2.4139 | 25200 | 0.0727        | -      | -                            |
+| 2.4234 | 25300 | 0.0573        | -      | -                            |
+| 2.4330 | 25400 | 0.0717        | -      | -                            |
+| 2.4426 | 25500 | 0.0493        | -      | -                            |
+| 2.4522 | 25600 | 0.0558        | -      | -                            |
+| 2.4617 | 25700 | 0.0676        | -      | -                            |
+| 2.4713 | 25800 | 0.0757        | -      | -                            |
+| 2.4809 | 25900 | 0.0735        | -      | -                            |
+| 2.4905 | 26000 | 0.056         | 0.7044 | 0.7513                       |
+| 2.5001 | 26100 | 0.0687        | -      | -                            |
+| 2.5096 | 26200 | 0.0592        | -      | -                            |
+| 2.5192 | 26300 | 0.057         | -      | -                            |
+| 2.5288 | 26400 | 0.0444        | -      | -                            |
+| 2.5384 | 26500 | 0.0547        | -      | -                            |
+| 2.5480 | 26600 | 0.0605        | -      | -                            |
+| 2.5575 | 26700 | 0.066         | -      | -                            |
+| 2.5671 | 26800 | 0.0631        | -      | -                            |
+| 2.5767 | 26900 | 0.0634        | -      | -                            |
+| 2.5863 | 27000 | 0.0537        | 0.7127 | 0.7512                       |
+| 2.5958 | 27100 | 0.0535        | -      | -                            |
+| 2.6054 | 27200 | 0.0572        | -      | -                            |
+| 2.6150 | 27300 | 0.0473        | -      | -                            |
+| 2.6246 | 27400 | 0.0418        | -      | -                            |
+| 2.6342 | 27500 | 0.0585        | -      | -                            |
+| 2.6437 | 27600 | 0.0475        | -      | -                            |
+| 2.6533 | 27700 | 0.0549        | -      | -                            |
+| 2.6629 | 27800 | 0.0452        | -      | -                            |
+| 2.6725 | 27900 | 0.0514        | -      | -                            |
+| 2.6821 | 28000 | 0.0449        | 0.7337 | 0.7482                       |
+| 2.6916 | 28100 | 0.0544        | -      | -                            |
+| 2.7012 | 28200 | 0.041         | -      | -                            |
+| 2.7108 | 28300 | 0.0599        | -      | -                            |
+| 2.7204 | 28400 | 0.057         | -      | -                            |
+| 2.7300 | 28500 | 0.0503        | -      | -                            |
+| 2.7395 | 28600 | 0.0487        | -      | -                            |
+| 2.7491 | 28700 | 0.0503        | -      | -                            |
+| 2.7587 | 28800 | 0.0446        | -      | -                            |
+| 2.7683 | 28900 | 0.042         | -      | -                            |
+| 2.7778 | 29000 | 0.0501        | 0.7422 | 0.7469                       |
+| 2.7874 | 29100 | 0.0494        | -      | -                            |
+| 2.7970 | 29200 | 0.0423        | -      | -                            |
+| 2.8066 | 29300 | 0.0508        | -      | -                            |
+| 2.8162 | 29400 | 0.0459        | -      | -                            |
+| 2.8257 | 29500 | 0.0514        | -      | -                            |
+| 2.8353 | 29600 | 0.0484        | -      | -                            |
+| 2.8449 | 29700 | 0.0571        | -      | -                            |
+| 2.8545 | 29800 | 0.0558        | -      | -                            |
+| 2.8641 | 29900 | 0.0466        | -      | -                            |
+| 2.8736 | 30000 | 0.0465        | 0.7478 | 0.7447                       |
+| 2.8832 | 30100 | 0.0463        | -      | -                            |
+| 2.8928 | 30200 | 0.0362        | -      | -                            |
+| 2.9024 | 30300 | 0.0435        | -      | -                            |
+| 2.9119 | 30400 | 0.0419        | -      | -                            |
+| 2.9215 | 30500 | 0.046         | -      | -                            |
+| 2.9311 | 30600 | 0.0451        | -      | -                            |
+| 2.9407 | 30700 | 0.0458        | -      | -                            |
+| 2.9503 | 30800 | 0.052         | -      | -                            |
+| 2.9598 | 30900 | 0.0454        | -      | -                            |
+| 2.9694 | 31000 | 0.0433        | 0.7580 | 0.745                        |
+| 2.9790 | 31100 | 0.0438        | -      | -                            |
+| 2.9886 | 31200 | 0.0537        | -      | -                            |
+| 2.9982 | 31300 | 0.033         | -      | -                            |
+| 3.0077 | 31400 | 0.0384        | -      | -                            |
+| 3.0173 | 31500 | 0.0349        | -      | -                            |
+| 3.0269 | 31600 | 0.0365        | -      | -                            |
+| 3.0365 | 31700 | 0.0397        | -      | -                            |
+| 3.0460 | 31800 | 0.0396        | -      | -                            |
+| 3.0556 | 31900 | 0.0358        | -      | -                            |
+| 3.0652 | 32000 | 0.0443        | 0.7592 | 0.7454                       |
+| 3.0748 | 32100 | 0.0323        | -      | -                            |
+| 3.0844 | 32200 | 0.0418        | -      | -                            |
+| 3.0939 | 32300 | 0.0463        | -      | -                            |
+| 3.1035 | 32400 | 0.0397        | -      | -                            |
+| 3.1131 | 32500 | 0.0425        | -      | -                            |
+| 3.1227 | 32600 | 0.0406        | -      | -                            |
+| 3.1323 | 32700 | 0.0454        | -      | -                            |
+| 3.1418 | 32800 | 0.0287        | -      | -                            |
+| 3.1514 | 32900 | 0.0267        | -      | -                            |
+| 3.1610 | 33000 | 0.0341        | 0.7672 | 0.7431                       |
+| 3.1706 | 33100 | 0.0357        | -      | -                            |
+| 3.1802 | 33200 | 0.0322        | -      | -                            |
+| 3.1897 | 33300 | 0.0367        | -      | -                            |
+| 3.1993 | 33400 | 0.0419        | -      | -                            |
+| 3.2089 | 33500 | 0.0349        | -      | -                            |
+| 3.2185 | 33600 | 0.0327        | -      | -                            |
+| 3.2280 | 33700 | 0.0377        | -      | -                            |
+| 3.2376 | 33800 | 0.0353        | -      | -                            |
+| 3.2472 | 33900 | 0.0305        | -      | -                            |
+| 3.2568 | 34000 | 0.0362        | 0.7668 | 0.7463                       |
+| 3.2664 | 34100 | 0.0311        | -      | -                            |
+| 3.2759 | 34200 | 0.0405        | -      | -                            |
+| 3.2855 | 34300 | 0.0401        | -      | -                            |
+| 3.2951 | 34400 | 0.0361        | -      | -                            |
+| 3.3047 | 34500 | 0.0302        | -      | -                            |
+| 3.3143 | 34600 | 0.0379        | -      | -                            |
+| 3.3238 | 34700 | 0.03          | -      | -                            |
+| 3.3334 | 34800 | 0.039         | -      | -                            |
+| 3.3430 | 34900 | 0.0288        | -      | -                            |
+| 3.3526 | 35000 | 0.0318        | 0.7782 | 0.7436                       |
+| 3.3621 | 35100 | 0.0283        | -      | -                            |
+| 3.3717 | 35200 | 0.029         | -      | -                            |
+| 3.3813 | 35300 | 0.0287        | -      | -                            |
+| 3.3909 | 35400 | 0.0343        | -      | -                            |
+| 3.4005 | 35500 | 0.0326        | -      | -                            |
+| 3.4100 | 35600 | 0.031         | -      | -                            |
+| 3.4196 | 35700 | 0.0304        | -      | -                            |
+| 3.4292 | 35800 | 0.0314        | -      | -                            |
+| 3.4388 | 35900 | 0.0286        | -      | -                            |
+| 3.4484 | 36000 | 0.0229        | 0.7978 | 0.7428                       |
+| 3.4579 | 36100 | 0.0258        | -      | -                            |
+| 3.4675 | 36200 | 0.043         | -      | -                            |
+| 3.4771 | 36300 | 0.042         | -      | -                            |
+| 3.4867 | 36400 | 0.029         | -      | -                            |
+| 3.4963 | 36500 | 0.0343        | -      | -                            |
+| 3.5058 | 36600 | 0.0317        | -      | -                            |
+| 3.5154 | 36700 | 0.0307        | -      | -                            |
+| 3.5250 | 36800 | 0.0251        | -      | -                            |
+| 3.5346 | 36900 | 0.025         | -      | -                            |
+| 3.5441 | 37000 | 0.0309        | 0.8002 | 0.7446                       |
+| 3.5537 | 37100 | 0.031         | -      | -                            |
+| 3.5633 | 37200 | 0.0345        | -      | -                            |
+| 3.5729 | 37300 | 0.0332        | -      | -                            |
+| 3.5825 | 37400 | 0.0346        | -      | -                            |
+| 3.5920 | 37500 | 0.026         | -      | -                            |
+| 3.6016 | 37600 | 0.0293        | -      | -                            |
+| 3.6112 | 37700 | 0.0268        | -      | -                            |
+| 3.6208 | 37800 | 0.0264        | -      | -                            |
+| 3.6304 | 37900 | 0.0259        | -      | -                            |
+| 3.6399 | 38000 | 0.032         | 0.7896 | 0.7438                       |
+| 3.6495 | 38100 | 0.0246        | -      | -                            |
+| 3.6591 | 38200 | 0.0279        | -      | -                            |
+| 3.6687 | 38300 | 0.0274        | -      | -                            |
+| 3.6782 | 38400 | 0.0241        | -      | -                            |
+| 3.6878 | 38500 | 0.027         | -      | -                            |
+| 3.6974 | 38600 | 0.022         | -      | -                            |
+| 3.7070 | 38700 | 0.0305        | -      | -                            |
+| 3.7166 | 38800 | 0.0368        | -      | -                            |
+| 3.7261 | 38900 | 0.0304        | -      | -                            |
+| 3.7357 | 39000 | 0.0249        | 0.7978 | 0.7437                       |
+| 3.7453 | 39100 | 0.0312        | -      | -                            |
+| 3.7549 | 39200 | 0.0257        | -      | -                            |
+| 3.7645 | 39300 | 0.0273        | -      | -                            |
+| 3.7740 | 39400 | 0.0209        | -      | -                            |
+| 3.7836 | 39500 | 0.0298        | -      | -                            |
+| 3.7932 | 39600 | 0.0282        | -      | -                            |
+| 3.8028 | 39700 | 0.028         | -      | -                            |
+| 3.8124 | 39800 | 0.0279        | -      | -                            |
+| 3.8219 | 39900 | 0.0283        | -      | -                            |
+| 3.8315 | 40000 | 0.0239        | 0.7982 | 0.7424                       |
+| 3.8411 | 40100 | 0.0378        | -      | -                            |
+| 3.8507 | 40200 | 0.028         | -      | -                            |
+| 3.8602 | 40300 | 0.0321        | -      | -                            |
+| 3.8698 | 40400 | 0.0289        | -      | -                            |
+| 3.8794 | 40500 | 0.027         | -      | -                            |
+| 3.8890 | 40600 | 0.0224        | -      | -                            |
+| 3.8986 | 40700 | 0.0236        | -      | -                            |
+| 3.9081 | 40800 | 0.0267        | -      | -                            |
+| 3.9177 | 40900 | 0.0228        | -      | -                            |
+| 3.9273 | 41000 | 0.0322        | 0.8101 | 0.7415                       |
+| 3.9369 | 41100 | 0.0262        | -      | -                            |
+| 3.9465 | 41200 | 0.0276        | -      | -                            |
+| 3.9560 | 41300 | 0.0292        | -      | -                            |
+| 3.9656 | 41400 | 0.0278        | -      | -                            |
+| 3.9752 | 41500 | 0.0262        | -      | -                            |
+| 3.9848 | 41600 | 0.0306        | -      | -                            |
+| 3.9943 | 41700 | 0.0238        | -      | -                            |
+| 4.0039 | 41800 | 0.0165        | -      | -                            |
+| 4.0135 | 41900 | 0.0241        | -      | -                            |
+| 4.0231 | 42000 | 0.0211        | 0.8092 | 0.742                        |
+| 4.0327 | 42100 | 0.0257        | -      | -                            |
+| 4.0422 | 42200 | 0.0236        | -      | -                            |
+| 4.0518 | 42300 | 0.0254        | -      | -                            |
+| 4.0614 | 42400 | 0.0248        | -      | -                            |
+| 4.0710 | 42500 | 0.026         | -      | -                            |
+| 4.0806 | 42600 | 0.0245        | -      | -                            |
+| 4.0901 | 42700 | 0.0325        | -      | -                            |
+| 4.0997 | 42800 | 0.0209        | -      | -                            |
+| 4.1093 | 42900 | 0.033         | -      | -                            |
+| 4.1189 | 43000 | 0.0265        | 0.8105 | 0.7412                       |
+| 4.1285 | 43100 | 0.027         | -      | -                            |
+| 4.1380 | 43200 | 0.0208        | -      | -                            |
+| 4.1476 | 43300 | 0.0179        | -      | -                            |
+| 4.1572 | 43400 | 0.0194        | -      | -                            |
+| 4.1668 | 43500 | 0.0217        | -      | -                            |
+| 4.1763 | 43600 | 0.0212        | -      | -                            |
+| 4.1859 | 43700 | 0.0226        | -      | -                            |
+| 4.1955 | 43800 | 0.0252        | -      | -                            |
+| 4.2051 | 43900 | 0.0293        | -      | -                            |
+| 4.2147 | 44000 | 0.0216        | 0.8029 | 0.7414                       |
+| 4.2242 | 44100 | 0.029         | -      | -                            |
+| 4.2338 | 44200 | 0.0216        | -      | -                            |
+| 4.2434 | 44300 | 0.0251        | -      | -                            |
+| 4.2530 | 44400 | 0.018         | -      | -                            |
+| 4.2626 | 44500 | 0.025         | -      | -                            |
+| 4.2721 | 44600 | 0.0225        | -      | -                            |
+| 4.2817 | 44700 | 0.0303        | -      | -                            |
+| 4.2913 | 44800 | 0.028         | -      | -                            |
+| 4.3009 | 44900 | 0.0203        | -      | -                            |
+| 4.3104 | 45000 | 0.026         | 0.8081 | 0.7405                       |
+</details>
+### Framework Versions
+- Python: 3.10.12
+- Sentence Transformers: 3.0.0
+- Transformers: 4.38.2
+- PyTorch: 2.1.2+cu121
+- Accelerate: 0.27.2
+- Datasets: 2.19.1
+- Tokenizers: 0.15.2
+## Citation
+### BibTeX
+#### Sentence Transformers
+```bibtex
+@inproceedings{reimers-2019-sentence-bert,
+    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+    author = "Reimers, Nils and Gurevych, Iryna",
+    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+    month = "11",
+    year = "2019",
+    publisher = "Association for Computational Linguistics",
+    url = "https://arxiv.org/abs/1908.10084",
+}
+```
+#### CachedMultipleNegativesRankingLoss
+```bibtex
+@misc{gao2021scaling,
+    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
+    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
+    year={2021},
+    eprint={2101.06983},
+    archivePrefix={arXiv},
+    primaryClass={cs.LG}
+}
+```
+<!--
+## Glossary
+*Clearly define terms in order to be accessible across audiences.*
+-->
+<!--
+## Model Card Authors
+*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+-->
+<!--
+## Model Card Contact
+*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+-->
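
The log rows above show the training loss falling while the evaluation loss rises, so the last checkpoint is not automatically the best one. Below is a minimal sketch of picking the best evaluated step from such rows. It assumes that the fourth column is the evaluation loss and the fifth is the accuracy metric (the table header lies outside this excerpt), and `parse_eval_rows` is a hypothetical helper, not a sentence-transformers function.

```python
# Sketch: find the evaluated step with the highest accuracy in a log excerpt.
# Column meanings are an assumption; the rows are copied from the table above.
LOG_ROWS = """
+| 1.4368 | 15000 | 0.1736        | 0.6278 | 0.7616                       |
+| 1.5326 | 16000 | 0.147         | 0.6306 | 0.7623                       |
+| 1.6284 | 17000 | 0.1168        | 0.6372 | 0.7629                       |
+| 1.6380 | 17100 | 0.135         | -      | -                            |
+| 2.9694 | 31000 | 0.0433        | 0.7580 | 0.745                        |
+| 4.3104 | 45000 | 0.026         | 0.8081 | 0.7405                       |
"""


def parse_eval_rows(text):
    """Return one dict per row that carries evaluation numbers."""
    rows = []
    for line in text.splitlines():
        # Drop the diff "+" prefix and the outer pipes, then split the cells.
        cells = [c.strip() for c in line.lstrip("+").strip().strip("|").split("|")]
        if len(cells) != 5 or cells[3] == "-":
            continue  # blank line, or a step that was not evaluated
        rows.append(
            {
                "epoch": float(cells[0]),
                "step": int(cells[1]),
                "train_loss": float(cells[2]),
                "eval_loss": float(cells[3]),
                "accuracy": float(cells[4]),
            }
        )
    return rows


rows = parse_eval_rows(LOG_ROWS)
best = max(rows, key=lambda r: r["accuracy"])
print(best["step"], best["accuracy"])
```

Among the rows copied here, the highest accuracy belongs to step 17000, whereas `_name_or_path` in config.json points at `checkpoint-45000`.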

config.json ADDED

	@@ -0,0 +1,58 @@

+{
+  "_name_or_path": "models/nomic-embed-text-esci/checkpoint-45000",
+  "activation_function": "swiglu",
+  "architectures": [
+    "NomicBertModel"
+  ],
+  "attn_pdrop": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_hf_nomic_bert.NomicBertConfig",
+    "AutoModel": "modeling_hf_nomic_bert.NomicBertModel",
+    "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining"
+  },
+  "bos_token_id": null,
+  "causal": false,
+  "dense_seq_output": true,
+  "embd_pdrop": 0.0,
+  "eos_token_id": null,
+  "fused_bias_fc": true,
+  "fused_dropout_add_ln": true,
+  "initializer_range": 0.02,
+  "layer_norm_epsilon": 1e-12,
+  "max_trained_positions": 2048,
+  "mlp_fc1_bias": false,
+  "mlp_fc2_bias": false,
+  "model_type": "nomic_bert",
+  "n_embd": 768,
+  "n_head": 12,
+  "n_inner": 3072,
+  "n_layer": 12,
+  "n_positions": 8192,
+  "pad_vocab_size_multiple": 64,
+  "parallel_block": false,
+  "parallel_block_tied_norm": false,
+  "prenorm": false,
+  "qkv_proj_bias": false,
+  "reorder_and_upcast_attn": false,
+  "resid_pdrop": 0.0,
+  "rotary_emb_base": 1000,
+  "rotary_emb_fraction": 1.0,
+  "rotary_emb_interleaved": false,
+  "rotary_emb_scale_base": null,
+  "rotary_scaling_factor": null,
+  "scale_attn_by_inverse_layer_idx": false,
+  "scale_attn_weights": true,
+  "summary_activation": null,
+  "summary_first_dropout": 0.0,
+  "summary_proj_to_labels": true,
+  "summary_type": "cls_index",
+  "summary_use_proj": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.38.2",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "use_flash_attn": true,
+  "use_rms_norm": false,
+  "use_xentropy": true,
+  "vocab_size": 30528
+}
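
A few of the fields above determine the attention geometry. The sketch below derives it under two assumptions that this section does not confirm: the usual even split of `n_embd` across `n_head` heads, and the textbook rotary inverse-frequency formula. The modeling code that consumes these values is not shown here, so treat this as an illustration rather than the model's implementation.

```python
import json

# Fragment of the config.json above; only the fields used below are copied.
CONFIG_FRAGMENT = """
{
  "n_embd": 768,
  "n_head": 12,
  "n_positions": 8192,
  "max_trained_positions": 2048,
  "rotary_emb_base": 1000,
  "rotary_emb_fraction": 1.0
}
"""

cfg = json.loads(CONFIG_FRAGMENT)

# Per-head width under the standard multi-head split (assumption).
head_dim = cfg["n_embd"] // cfg["n_head"]

# Number of per-head dimensions that receive rotary position embeddings.
rotary_dim = int(head_dim * cfg["rotary_emb_fraction"])

# Textbook RoPE inverse frequencies: base ** (-2i / rotary_dim).
inv_freq = [
    cfg["rotary_emb_base"] ** (-2 * i / rotary_dim)
    for i in range(rotary_dim // 2)
]

# Ratio between the configured and the trained context length.
context_ratio = cfg["n_positions"] / cfg["max_trained_positions"]

print(head_dim, rotary_dim, len(inv_freq), context_ratio)
```

The configured context (`n_positions`) is four times `max_trained_positions`, so inputs longer than 2048 positions rely on extrapolation beyond the trained range.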

config_sentence_transformers.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "__version__": {
+    "sentence_transformers": "2.4.0.dev0",
+    "transformers": "4.37.2",
+    "pytorch": "2.1.0+cu121"
+  },
+  "prompts": {},
+  "default_prompt_name": null,
+  "similarity_fn_name": null
+}

configuration_hf_nomic_bert.py ADDED

	@@ -0,0 +1,56 @@

+from transformers import GPT2Config
+class NomicBertConfig(GPT2Config):
+    model_type = "nomic_bert"
+    def __init__(
+        self,
+        prenorm=False,
+        parallel_block=False,
+        parallel_block_tied_norm=False,
+        rotary_emb_fraction=0.0,
+        fused_dropout_add_ln=False,
+        fused_bias_fc=False,
+        use_flash_attn=False,
+        use_xentropy=False,
+        qkv_proj_bias=True,
+        rotary_emb_base=10_000,
+        rotary_emb_scale_base=None,
+        rotary_emb_interleaved=False,
+        mlp_fc1_bias=True,
+        mlp_fc2_bias=True,
+        use_rms_norm=False,
+        causal=False,
+        type_vocab_size=2,
+        dense_seq_output=True,
+        pad_vocab_size_multiple=1,
+        tie_word_embeddings=True,
+        rotary_scaling_factor=None,
+        max_trained_positions=2048,
+        **kwargs,
+    ):
+        self.prenorm = prenorm
+        self.parallel_block = parallel_block
+        self.parallel_block_tied_norm = parallel_block_tied_norm
+        self.rotary_emb_fraction = rotary_emb_fraction
+        self.tie_word_embeddings = tie_word_embeddings
+        self.fused_dropout_add_ln = fused_dropout_add_ln
+        self.fused_bias_fc = fused_bias_fc
+        self.use_flash_attn = use_flash_attn
+        self.use_xentropy = use_xentropy
+        self.qkv_proj_bias = qkv_proj_bias
+        self.rotary_emb_base = rotary_emb_base
+        self.rotary_emb_scale_base = rotary_emb_scale_base
+        self.rotary_emb_interleaved = rotary_emb_interleaved
+        self.mlp_fc1_bias = mlp_fc1_bias
+        self.mlp_fc2_bias = mlp_fc2_bias
+        self.use_rms_norm = use_rms_norm
+        self.causal = causal
+        self.type_vocab_size = type_vocab_size
+        self.dense_seq_output = dense_seq_output
+        self.pad_vocab_size_multiple = pad_vocab_size_multiple
+        self.rotary_scaling_factor = rotary_scaling_factor
+        self.max_trained_positions = max_trained_positions
+        super().__init__(**kwargs)
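
Several constructor defaults above differ from the values stored in config.json, so the checkpoint's behaviour depends on that JSON being loaded rather than on the class defaults. The sketch below lists the differences. Both dictionaries are copied by hand from this diff and cover only a subset of the fields, and `changed` is an illustrative name rather than anything the library provides.

```python
# Defaults copied from NomicBertConfig.__init__ above (subset of fields).
CLASS_DEFAULTS = {
    "prenorm": False,
    "causal": False,
    "rotary_emb_fraction": 0.0,
    "rotary_emb_base": 10_000,
    "fused_dropout_add_ln": False,
    "fused_bias_fc": False,
    "use_flash_attn": False,
    "use_xentropy": False,
    "qkv_proj_bias": True,
    "mlp_fc1_bias": True,
    "mlp_fc2_bias": True,
    "pad_vocab_size_multiple": 1,
}

# Values for the same fields as they appear in config.json in this commit.
CHECKPOINT_VALUES = {
    "prenorm": False,
    "causal": False,
    "rotary_emb_fraction": 1.0,
    "rotary_emb_base": 1000,
    "fused_dropout_add_ln": True,
    "fused_bias_fc": True,
    "use_flash_attn": True,
    "use_xentropy": True,
    "qkv_proj_bias": False,
    "mlp_fc1_bias": False,
    "mlp_fc2_bias": False,
    "pad_vocab_size_multiple": 64,
}

# Map each overridden field to a (class default, checkpoint value) pair.
changed = {
    name: (default, CHECKPOINT_VALUES[name])
    for name, default in CLASS_DEFAULTS.items()
    if CHECKPOINT_VALUES[name] != default
}

for name, (default, actual) in sorted(changed.items()):
    print(f"{name}: default={default!r} checkpoint={actual!r}")
```

Instantiating `NomicBertConfig()` with no arguments would therefore describe a different architecture (no rotary embeddings, biased projections) from the one this checkpoint was trained with.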

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:69fe41349d5efc8669c5d8ac9e0fe86fec944f8f2886d10641b6ab278c7f634b
+size 546938168
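
The three lines above are a Git LFS pointer, not the weights themselves. The sketch below parses the pointer and uses its `size` field for a rough upper bound on the parameter count. It assumes every tensor is stored as 4-byte float32 (config.json lists `"torch_dtype": "float32"`) and ignores the safetensors header. `parse_lfs_pointer` is a hypothetical helper.

```python
# Pointer text copied from the diff above, without the "+" prefixes.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:69fe41349d5efc8669c5d8ac9e0fe86fec944f8f2886d10641b6ab278c7f634b
size 546938168
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # object size in bytes
    return fields


pointer = parse_lfs_pointer(POINTER)

# Upper bound on the parameter count, assuming 4 bytes per float32 value.
approx_params = pointer["size"] // 4

print(pointer["oid"], approx_params)
```

Under those assumptions the file holds at most about 137 million parameters.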

modeling_hf_nomic_bert.py ADDED

	@@ -0,0 +1,1234 @@

+# Copyright (c) 2022, Tri Dao.
+# This BERT implementation is based on our MLPerf 2.0 and MLPerf 2.1 BERT implementation.
+# https://github.com/mlcommons/training_results_v2.0/blob/main/HazyResearch/benchmarks/bert/implementations/pytorch/modeling.py
+# https://github.com/mlcommons/training_results_v2.1/blob/main/Azure-HazyResearch/benchmarks/bert/implementations/ND96amsr_A100_v4/modeling.py
+# Inspired by https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py
+import logging
+import os
+import re
+from collections import OrderedDict
+from functools import partial
+from typing import List, Optional, Tuple, Union
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from einops import rearrange, repeat
+from safetensors.torch import load_file as safe_load_file
+from transformers import GPT2Config, PreTrainedModel
+from transformers.models.bert.modeling_bert import (
+    BaseModelOutputWithPoolingAndCrossAttentions,
+    MaskedLMOutput,
+    SequenceClassifierOutput,
+)
+from transformers.utils import SAFE_WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_NAME, WEIGHTS_INDEX_NAME, WEIGHTS_NAME
+from transformers.utils.hub import cached_file, get_checkpoint_shard_files
+from .configuration_hf_nomic_bert import NomicBertConfig
+logger = logging.getLogger(__name__)
+# adapted from flash attention, added safe serialization option for hf models
+def state_dict_from_pretrained(model_name, safe_serialization=False, device=None, dtype=None):
+    # If not fp32, then we don't want to load directly to the GPU
+    mapped_device = "cpu" if dtype not in [torch.float32, None] else device
+    is_sharded = False
+    load_safe = False
+    resolved_archive_file = None
+    weights_path = os.path.join(model_name, WEIGHTS_NAME)
+    weights_index_path = os.path.join(model_name, WEIGHTS_INDEX_NAME)
+    safe_weights_path = os.path.join(model_name, SAFE_WEIGHTS_NAME)
+    safe_weights_index_path = os.path.join(model_name, SAFE_WEIGHTS_INDEX_NAME)
+    if os.path.isfile(weights_path):
+        resolved_archive_file = cached_file(model_name, WEIGHTS_NAME, _raise_exceptions_for_missing_entries=False)
+    elif os.path.isfile(weights_index_path):
+        resolved_archive_file = cached_file(model_name, WEIGHTS_INDEX_NAME, _raise_exceptions_for_missing_entries=False)
+        is_sharded = True
+    elif os.path.isfile(safe_weights_path):
+        resolved_archive_file = cached_file(model_name, SAFE_WEIGHTS_NAME, _raise_exceptions_for_missing_entries=False)
+        load_safe = True
+    elif os.path.isfile(safe_weights_index_path):
+        resolved_archive_file = cached_file(
+            model_name, SAFE_WEIGHTS_INDEX_NAME, _raise_exceptions_for_missing_entries=False
+        )
+        is_sharded = True
+        load_safe = True
+    else:  # Try loading from HF hub instead of from local files
+        resolved_archive_file = None
+        for weight_name in [WEIGHTS_NAME, SAFE_WEIGHTS_NAME, WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_INDEX_NAME]:
+            resolved_archive_file = cached_file(
+                model_name, weight_name, _raise_exceptions_for_missing_entries=False
+            )
+            if resolved_archive_file is not None:
+                if weight_name in [SAFE_WEIGHTS_NAME, SAFE_WEIGHTS_INDEX_NAME]:
+                    load_safe = True
+                if weight_name in [WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_INDEX_NAME]:
+                    is_sharded = True
+                break
+    if resolved_archive_file is None:
+        raise EnvironmentError(f"Model name {model_name} was not found.")
+    if load_safe:
+        loader = partial(safe_load_file, device=mapped_device)
+    else:
+        loader = partial(torch.load, map_location=mapped_device)
+    if is_sharded:
+        # resolved_archive_file becomes a list of files that point to the different
+        # checkpoint shards in this case.
+        resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(model_name, resolved_archive_file)
+        state_dict = {}
+        for sharded_file in resolved_archive_file:
+            state_dict.update(loader(sharded_file))
+    else:
+        state_dict = loader(resolved_archive_file)
+    # Convert dtype before moving to GPU to save memory
+    if dtype is not None:
+        state_dict = {k: v.to(dtype=dtype) for k, v in state_dict.items()}
+    state_dict = {k: v.to(device=device) for k, v in state_dict.items()}
+    return state_dict
+def filter_shapes(state_dict, model):
+    """
+    Keep only the entries whose key exists in the model and whose shape matches.
+    """
+    # Build the model's state dict once instead of on every loop iteration.
+    model_state = model.state_dict()
+    filtered_state_dict = {}
+    for key, value in state_dict.items():
+        if key in model_state and value.shape == model_state[key].shape:
+            filtered_state_dict[key] = value
+    return filtered_state_dict
+def remap_bert_state_dict(
+    state_dict,
+    config,
+    remove_bert=False,
+    remove_cls_weights=False,
+    add_pooling_layer=False,
+):
+    """
+    Map the state_dict of a Huggingface BERT model to be flash_attn compatible.
+    """
+    def add_bert_prefix(key):
+        # Prepend "bert." unless the key already carries a "bert." or "cls." prefix
+        if key.startswith("bert.") or key.startswith("cls."):
+            return key
+        return f"bert.{key}"
+    state_dict = OrderedDict((add_bert_prefix(k), v) for k, v in state_dict.items())
+    # LayerNorm
+    def key_mapping_ln_gamma_beta(key):
+        key = re.sub(r"LayerNorm.gamma$", "LayerNorm.weight", key)
+        key = re.sub(r"LayerNorm.beta$", "LayerNorm.bias", key)
+        return key
+    state_dict = OrderedDict((key_mapping_ln_gamma_beta(k), v) for k, v in state_dict.items())
+    # Layers
+    def key_mapping_layers(key):
+        return re.sub(r"^bert.encoder.layer\.", "bert.encoder.layers.", key)
+    state_dict = OrderedDict((key_mapping_layers(k), v) for k, v in state_dict.items())
+    # LayerNorm
+    def key_mapping_ln(key):
+        key = re.sub(r"^bert.embeddings.LayerNorm.", "bert.emb_ln.", key)
+        key = re.sub(
+            r"^bert.encoder.layers.(\d+).attention.output.LayerNorm.(weight|bias)",
+            r"bert.encoder.layers.\1.norm1.\2",
+            key,
+        )
+        key = re.sub(
+            r"^bert.encoder.layers.(\d+).output.LayerNorm.(weight|bias)",
+            r"bert.encoder.layers.\1.norm2.\2",
+            key,
+        )
+        key = re.sub(
+            r"^cls.predictions.transform.LayerNorm.(weight|bias)",
+            r"cls.predictions.transform.layer_norm.\1",
+            key,
+        )
+        return key
+    state_dict = OrderedDict((key_mapping_ln(k), v) for k, v in state_dict.items())
+    # MLP
+    def key_mapping_mlp(key):
+        key = re.sub(
+            r"^bert.encoder.layers.(\d+).intermediate.dense.(weight|bias)",
+            r"bert.encoder.layers.\1.mlp.fc1.\2",
+            key,
+        )
+        key = re.sub(
+            r"^bert.encoder.layers.(\d+).output.dense.(weight|bias)",
+            r"bert.encoder.layers.\1.mlp.fc2.\2",
+            key,
+        )
+        return key
+    state_dict = OrderedDict((key_mapping_mlp(k), v) for k, v in state_dict.items())
+    # Attention
+    last_layer_subset = getattr(config, "last_layer_subset", False)
+    for d in range(config.num_hidden_layers):
+        if f"bert.encoder.layers.{d}.attention.self.query.weight" not in state_dict:
+            continue
+        Wq = state_dict.pop(f"bert.encoder.layers.{d}.attention.self.query.weight")
+        Wk = state_dict.pop(f"bert.encoder.layers.{d}.attention.self.key.weight")
+        Wv = state_dict.pop(f"bert.encoder.layers.{d}.attention.self.value.weight")
+        bq = state_dict.pop(f"bert.encoder.layers.{d}.attention.self.query.bias")
+        bk = state_dict.pop(f"bert.encoder.layers.{d}.attention.self.key.bias")
+        bv = state_dict.pop(f"bert.encoder.layers.{d}.attention.self.value.bias")
+        if not (last_layer_subset and d == config.num_hidden_layers - 1):
+            state_dict[f"bert.encoder.layers.{d}.attn.Wqkv.weight"] = torch.cat([Wq, Wk, Wv], dim=0)
+            state_dict[f"bert.encoder.layers.{d}.attn.Wqkv.bias"] = torch.cat([bq, bk, bv], dim=0)
+        else:
+            state_dict[f"bert.encoder.layers.{d}.attn.Wq.weight"] = Wq
+            state_dict[f"bert.encoder.layers.{d}.attn.Wkv.weight"] = torch.cat([Wk, Wv], dim=0)
+            state_dict[f"bert.encoder.layers.{d}.attn.Wq.bias"] = bq
+            state_dict[f"bert.encoder.layers.{d}.attn.Wkv.bias"] = torch.cat([bk, bv], dim=0)
+    def key_mapping_attn(key):
+        return re.sub(
+            r"^bert.encoder.layers.(\d+).attention.output.dense.(weight|bias)",
+            r"bert.encoder.layers.\1.attn.out_proj.\2",
+            key,
+        )
+    state_dict = OrderedDict((key_mapping_attn(k), v) for k, v in state_dict.items())
+    def key_mapping_decoder_bias(key):
+        return re.sub(r"^cls.predictions.bias", "cls.predictions.decoder.bias", key)
+    # Remove the NSP (next-sentence prediction) weights and the position_ids buffer; neither is used.
+    state_dict.pop("cls.seq_relationship.weight", None)
+    state_dict.pop("cls.seq_relationship.bias", None)
+    state_dict.pop("bert.embeddings.position_ids", None)
+    state_dict = OrderedDict((key_mapping_decoder_bias(k), v) for k, v in state_dict.items())
+    if remove_cls_weights:
+        cls_weights = [
+            "cls.predictions.decoder.bias",
+            "cls.predictions.transform.dense.weight",
+            "cls.predictions.transform.dense.bias",
+            "cls.predictions.transform.layer_norm.weight",
+            "cls.predictions.transform.layer_norm.bias",
+            "cls.predictions.decoder.weight",
+        ]
+        for weight in cls_weights:
+            state_dict.pop(weight, None)
+    # Word embedding
+    pad_vocab_size_multiple = getattr(config, "pad_vocab_size_multiple", 1)
+    if pad_vocab_size_multiple > 1:
+        word_embeddings = state_dict["bert.embeddings.word_embeddings.weight"]
+        state_dict["bert.embeddings.word_embeddings.weight"] = F.pad(
+            word_embeddings, (0, 0, 0, config.vocab_size - word_embeddings.shape[0])
+        )
+        if not remove_cls_weights:
+            decoder_weight = state_dict["cls.predictions.decoder.weight"]
+            state_dict["cls.predictions.decoder.weight"] = F.pad(
+                decoder_weight, (0, 0, 0, config.vocab_size - decoder_weight.shape[0])
+            )
+            # If the vocab was padded, we want to set the decoder bias for those padded indices to be
+            # strongly negative (i.e. the decoder shouldn't predict those indices).
+            # TD [2022-05-09]: I don't think it affects the MLPerf training.
+            if "cls.predictions.decoder.bias" in state_dict:
+                decoder_bias = state_dict["cls.predictions.decoder.bias"]
+                state_dict["cls.predictions.decoder.bias"] = F.pad(
+                    decoder_bias, (0, config.vocab_size - decoder_bias.shape[0]), value=-100.0
+                )
+    if add_pooling_layer is False:
+        pooler_weights = [
+            "bert.pooler.dense.weight",
+            "bert.pooler.dense.bias",
+        ]
+        for key in pooler_weights:
+            state_dict.pop(key, None)
+    if remove_bert:
+        def remove_bert_prefix(key):
+            key = re.sub(r"^bert.", "", key)
+            return key
+        state_dict = OrderedDict((remove_bert_prefix(k), v) for k, v in state_dict.items())
+    return state_dict
+class NomicBertPreTrainedModel(PreTrainedModel):
+    """An abstract class to handle weights initialization and
+    a simple interface for downloading and loading pretrained models.
+    """
+    config_class = NomicBertConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["Block"]
+    _skip_keys_device_placement = "past_key_values"
+    def __init__(self, config, *inputs, **kwargs):
+        super().__init__(config)
+        if not isinstance(config, GPT2Config):
+            raise ValueError(
+                "Parameter config in `{}(config)` should be an instance of class `GPT2Config`. "
+                "To create a model from a Google pretrained model use "
+                "`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`".format(
+                    self.__class__.__name__, self.__class__.__name__
+                )
+            )
+        self.config = config
+    @classmethod
+    def from_pretrained(cls, model_name, config=None, *inputs, **kwargs):
+        """
+        Instantiate a NomicBertPreTrainedModel from a local directory or a pretrained model name.
+        Download and cache the pre-trained model file if needed.
+        Params:
+            model_name: either:
+                - a path to a local directory containing `pytorch_model.bin` or
+                  `model.safetensors`; the weights are loaded as-is, without remapping
+                - any other name, which is resolved by `state_dict_from_pretrained` and
+                  remapped from the Hugging Face BERT layout by `remap_bert_state_dict`
+            config: optional config; if None, it is loaded with
+                `cls.config_class.from_pretrained(model_name)`
+            *inputs, **kwargs: additional input for the specific NomicBert class
+                (ex: num_labels for NomicBertForSequenceClassification)
+        """
+        # Instantiate model.
+        if config is None:
+            config = cls.config_class.from_pretrained(model_name)
+        remove_cls = cls != NomicBertForPreTraining
+        remove_bert_prefix = cls != NomicBertForPreTraining and cls != NomicBertForSequenceClassification
+        ignore_mismatched_shapes = kwargs.pop("ignore_mismatched_sizes", False)
+        num_labels = kwargs.pop("num_labels", None)
+        rotary_scaling_factor = kwargs.pop("rotary_scaling_factor", None)
+        strict = kwargs.pop("strict", True)
+        if rotary_scaling_factor:
+            config.rotary_scaling_factor = rotary_scaling_factor
+        if config.n_positions <= 0 and config.rotary_emb_fraction > 0:
+            config.n_positions = 2048
+        if num_labels:
+            config.num_labels = num_labels
+        if "add_pooling_layer" in kwargs:
+            model = cls(config, *inputs, add_pooling_layer=kwargs.pop("add_pooling_layer"))
+        else:
+            if cls == NomicBertModel:
+                model = cls(config, *inputs, add_pooling_layer=False)
+            else:
+                model = cls(config, *inputs)
+        # TODO: fix this
+        # Assuming we know what we're doing when loading from disk
+        # Prob a bad assumption but i'm tired and want to train this asap
+        if os.path.exists(model_name):
+            model_path = f"{model_name}/pytorch_model.bin"
+            if os.path.exists(model_path):
+                state_dict = torch.load(model_path)
+            else:
+                model_path = f"{model_name}/model.safetensors"
+                if not os.path.exists(model_path):
+                    raise ValueError(f"Model path {model_path} not found")
+                state_dict = safe_load_file(model_path)
+            if ignore_mismatched_shapes:
+                state_dict = filter_shapes(state_dict, model)
+            load_return = model.load_state_dict(state_dict, strict=False)
+        else:
+            # TODO: can probably check config class and see if we need to remap from a bert model
+            state_dict = state_dict_from_pretrained(model_name)
+            state_dict = remap_bert_state_dict(
+                state_dict,
+                config,
+                remove_bert=remove_bert_prefix,
+                remove_cls_weights=remove_cls,
+                add_pooling_layer=getattr(config, "add_pooling_layer", False),
+            )
+            if ignore_mismatched_shapes:
+                state_dict = filter_shapes(state_dict, model)
+            load_return = model.load_state_dict(state_dict, strict=strict)
+        logger.warning(load_return)
+        return model
+    def _set_gradient_checkpointing(self, module, value=False):
+        if isinstance(module, NomicBertEncoder):
+            module.gradient_checkpointing = value
+# https://github.com/huggingface/transformers/blob/7032e0203262ebb2ebf55da8d2e01f873973e835/src/transformers/models/bert/modeling_bert.py#L748
+def _init_weights(module, initializer_range=0.02):
+    if isinstance(module, nn.Linear):
+        nn.init.normal_(module.weight, std=initializer_range)
+        if module.bias is not None:
+            nn.init.zeros_(module.bias)
+    elif isinstance(module, nn.Embedding):
+        nn.init.normal_(module.weight, std=initializer_range)
+        if module.padding_idx is not None:
+            nn.init.zeros_(module.weight[module.padding_idx])
+class NomicBertEmbeddings(nn.Module):
+    def __init__(self, config):
+        """
+        If max_position_embeddings <= 0, there are no position embeddings
+        If type_vocab_size <= 0, there are no token type embeddings
+        """
+        super().__init__()
+        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
+        self.max_position_embeddings = config.max_position_embeddings if config.rotary_emb_fraction <= 0 else 0
+        self.type_vocab_size = config.type_vocab_size
+        if self.max_position_embeddings > 0 and config.rotary_emb_fraction <= 0:
+            self.position_embeddings = nn.Embedding(
+                config.max_position_embeddings,
+                config.hidden_size,
+            )
+        if self.type_vocab_size > 0:
+            self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
+    def forward(self, input_ids, position_ids=None, token_type_ids=None):
+        """
+        input_ids: (batch, seqlen)
+        position_ids: (batch, seqlen)
+        token_type_ids: (batch, seqlen)
+        """
+        batch_size, seqlen = input_ids.shape
+        embeddings = self.word_embeddings(input_ids)
+        if self.type_vocab_size > 0:
+            if token_type_ids is None:
+                token_type_ids = torch.zeros(seqlen, dtype=torch.long, device=input_ids.device)
+            token_type_embeddings = self.token_type_embeddings(token_type_ids)
+            embeddings = embeddings + token_type_embeddings
+        if self.max_position_embeddings > 0:
+            if position_ids is None:
+                position_ids = torch.arange(seqlen, dtype=torch.long, device=input_ids.device)
+            position_embeddings = self.position_embeddings(position_ids)
+            embeddings = embeddings + position_embeddings
+        return embeddings
+class NomicBertMLP(nn.Module):
+    def __init__(
+        self,
+        in_features,
+        hidden_features=None,
+        out_features=None,
+        activation=F.gelu,
+        bias1=True,
+        bias2=True,
+        return_residual=False,
+        fused_bias_fc=False,
+    ):
+        super().__init__()
+        out_features = out_features if out_features is not None else in_features
+        hidden_features = hidden_features if hidden_features is not None else in_features * 4
+        self.return_residual = return_residual
+        self.fc1 = nn.Linear(in_features, hidden_features, bias=bias1)
+        approximate = "tanh" if activation in ["gelu_new", "gelu_fast", "gelu_pytorch_tanh"] else "none"
+        self.activation = nn.GELU(approximate=approximate) if activation == "gelu" else activation
+        self.fc2 = nn.Linear(hidden_features, out_features, bias=bias2)
+    def forward(self, x):
+        y = self.fc1(x)
+        y = self.activation(y)
+        y = self.fc2(y)
+        return y if not self.return_residual else (y, x)
+class NomciBertGatedMLP(nn.Module):
+    def __init__(
+        self,
+        in_features,
+        hidden_features=None,
+        out_features=None,
+        activation=F.sigmoid,
+        bias1=True,
+        bias2=True,
+        multiple_of=256,
+        return_residual=False,
+        fused_bias_fc=True,
+        device=None,
+        dtype=None,
+    ):
+        super().__init__()
+        out_features = out_features if out_features is not None else in_features
+        hidden_features = hidden_features if hidden_features is not None else int(8 * in_features / 3)
+        hidden_features = (hidden_features + multiple_of - 1) // multiple_of * multiple_of
+        self.return_residual = return_residual
+        self.fc11 = nn.Linear(in_features, hidden_features, bias=bias1)
+        self.fc12 = nn.Linear(in_features, hidden_features, bias=bias1)
+        self.activation = activation
+        self.fc2 = nn.Linear(hidden_features, out_features, bias=bias2)
+    def forward(self, x):
+        y = self.fc11(x)
+        gate = self.fc12(x)
+        if self.activation == F.sigmoid:  # Special case for GLU
+            y = F.glu(torch.cat([y, gate], dim=-1), dim=-1)
+        else:
+            y = y * self.activation(gate)
+        y = self.fc2(y)
+        return y if not self.return_residual else (y, x)
+def rotate_half(x, interleaved=False):
+    if not interleaved:
+        x1, x2 = x.chunk(2, dim=-1)
+        return torch.cat((-x2, x1), dim=-1)
+    else:
+        x1, x2 = x[..., ::2], x[..., 1::2]
+        return rearrange(torch.stack((-x2, x1), dim=-1), "... d two -> ... (d two)", two=2)
+def apply_rotary_emb(x, cos, sin, offset=0, interleaved=False):
+    """
+    x: (batch_size, seqlen, nheads, headdim)
+    cos, sin: (seqlen, rotary_dim / 2) or (batch_size, seqlen, rotary_dim / 2)
+    """
+    ro_dim = cos.shape[-1] * 2
+    assert ro_dim <= x.shape[-1]
+    cos, sin = (
+        cos[offset : offset + x.shape[1]],
+        sin[offset : offset + x.shape[1]],
+    )
+    cos = repeat(cos, "... d -> ... 1 (2 d)" if not interleaved else "... d -> ... 1 (d 2)")
+    sin = repeat(sin, "... d -> ... 1 (2 d)" if not interleaved else "... d -> ... 1 (d 2)")
+    return torch.cat(
+        [x[..., :ro_dim] * cos + rotate_half(x[..., :ro_dim], interleaved) * sin, x[..., ro_dim:]],
+        dim=-1,
+    )
+class NomicBertRotaryEmbedding(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        base=10000.0,
+        interleaved=False,
+        scale_base=None,
+        pos_idx_in_fp32=True,
+        device=None,
+    ):
+        """
+        interleaved: if True, rotate pairs of even and odd dimensions (GPT-J style) instead
+            of 1st half and 2nd half (GPT-NeoX style).
+        pos_idx_in_fp32: if True, the position indices [0.0, ..., seqlen - 1] are in fp32,
+            otherwise they might be in lower precision.
+            This option was added because previously (before 2023-07-02), when we construct
+            the position indices, we use the dtype of self.inv_freq. In most cases this would
+            be fp32, but if the model is trained in pure bf16 (not mixed precision), then
+            self.inv_freq would be bf16, and the position indices are also in bf16.
+            Because of the limited precision of bf16 (e.g. 1995.0 is rounded to 1992.0), the
+            embeddings for some positions will coincide.
+            To maintain compatibility with models previously trained in pure bf16,
+            we add this option.
+        """
+        super().__init__()
+        self.dim = dim
+        self.base = float(base)
+        self.pos_idx_in_fp32 = pos_idx_in_fp32
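+        # Generate and save the inverse frequency buffer (non trainable).
+        # Pass device by keyword: NomicBertDynamicNTKRotaryEmbedding overrides
+        # _compute_inv_freq with `base` as its first positional parameter.
+        inv_freq = self._compute_inv_freq(device=device)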
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self.interleaved = interleaved
+        self.scale_base = scale_base
+        scale = (
+            (torch.arange(0, dim, 2, device=device, dtype=torch.float32) + 0.4 * dim) / (1.4 * dim)
+            if scale_base is not None
+            else None
+        )
+        self.register_buffer("scale", scale, persistent=False)
+        self._seq_len_cached = 0
+        self._cos_cached = None
+        self._sin_cached = None
+        self._cos_k_cached = None
+        self._sin_k_cached = None
+    def _compute_inv_freq(self, device=None):
+        return 1.0 / (self.base ** (torch.arange(0, self.dim, 2, device=device, dtype=torch.float32) / self.dim))
+    def _update_cos_sin_cache(self, seqlen, device=None, dtype=None):
+        # Reset the tables if the sequence length has changed,
+        # if we're on a new device (possibly due to tracing for instance),
+        # or if we're switching from inference mode to training
+        if (
+            seqlen > self._seq_len_cached
+            or self._cos_cached is None
+            or self._cos_cached.device != device
+            or self._cos_cached.dtype != dtype
+            or (self.training and self._cos_cached.is_inference())
+        ):
+            self._seq_len_cached = seqlen
+            # We want fp32 here, not self.inv_freq.dtype, since the model could be loaded in bf16
+            # And the output of arange can be quite large, so bf16 would lose a lot of precision.
+            # However, for compatibility reason, we add an option to use the dtype of self.inv_freq.
+            if self.pos_idx_in_fp32:
+                t = torch.arange(seqlen, device=device, dtype=torch.float32)
+                # We want fp32 here as well since inv_freq will be multiplied with t, and the output
+                # will be large. Having it in bf16 will lose a lot of precision and cause the
+                # cos & sin output to change significantly.
+                # We want to recompute self.inv_freq if it was not loaded in fp32
+                if self.inv_freq.dtype != torch.float32:
+                    inv_freq = self._compute_inv_freq(device=device)
+                else:
+                    inv_freq = self.inv_freq
+            else:
+                t = torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)
+                inv_freq = self.inv_freq
+            # Don't do einsum, it converts fp32 to fp16 under AMP
+            # freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+            freqs = torch.outer(t, inv_freq)
+            self._cos_cached = torch.cos(freqs).to(dtype)
+            self._sin_cached = torch.sin(freqs).to(dtype)
+    def forward(
+        self,
+        qkv: torch.Tensor,
+        kv: Optional[torch.Tensor] = None,
+        seqlen_offset: Union[int, torch.Tensor] = 0,
+        max_seqlen: Optional[int] = None,
+    ) -> torch.Tensor:
+        """
+        qkv: (batch, seqlen, 3, nheads, headdim)
+        kv: accepted but unused in this implementation
+        seqlen_offset: (batch_size,) or int. Each sequence in x is shifted by this amount.
+            Most commonly used in inference when we have KV cache.
+            If it's a tensor of shape (batch_size,), then to update the cos / sin cache, one
+            should pass in max_seqlen, which will update the cos / sin cache up to that length.
+        Returns a new tensor with the same shape as qkv: the rotary embedding is applied
+        to q and k, and v is passed through unchanged.
+        """
+        seqlen = qkv.shape[1]
+        if seqlen > self._seq_len_cached:
+            self._update_cos_sin_cache(seqlen, device=qkv.device, dtype=qkv.dtype)
+        elif max_seqlen is not None:
+            self._update_cos_sin_cache(max_seqlen, device=qkv.device, dtype=qkv.dtype)
+        elif isinstance(seqlen_offset, int):
+            self._update_cos_sin_cache(seqlen + seqlen_offset, device=qkv.device, dtype=qkv.dtype)
+        q_rot = apply_rotary_emb(qkv[:, :, 0], self._cos_cached, self._sin_cached, seqlen_offset, self.interleaved)
+        k_rot = apply_rotary_emb(qkv[:, :, 1], self._cos_cached, self._sin_cached, seqlen_offset, self.interleaved)
+        return torch.stack((q_rot, k_rot, qkv[:, :, 2]), dim=2)
+class NomicBertDynamicNTKRotaryEmbedding(NomicBertRotaryEmbedding):
+    def __init__(self, rotary_scaling_factor, max_position_embeddings, **kwargs):
+        super().__init__(**kwargs)
+        self.rotary_scaling_factor = rotary_scaling_factor
+        self.max_position_embeddings = max_position_embeddings
+    def _compute_inv_freq(self, base=None, device=None):
+        if base is None:
+            base = self.base
+        return 1.0 / (base ** (torch.arange(0, self.dim, 2, device=device, dtype=torch.float32) / self.dim))
+    def _update_cos_sin_cache(self, seqlen, device=None, dtype=None):
+        # Dynamic NTK scaling: for sequences longer than the trained context,
+        # enlarge the rotary base and recompute inv_freq.
+        if seqlen > self.max_position_embeddings:
+            base = self.base * (
+                (self.rotary_scaling_factor * seqlen / self.max_position_embeddings) - (self.rotary_scaling_factor - 1)
+            ) ** (self.dim / (self.dim - 2))
+            inv_freq = self._compute_inv_freq(base=base, device=device)
+            self.register_buffer("inv_freq", inv_freq, persistent=False)
+        # Reset the tables if the sequence length has changed,
+        # if we're on a new device (possibly due to tracing for instance),
+        # or if we're switching from inference mode to training
+        if (
+            seqlen > self._seq_len_cached
+            or self._cos_cached is None
+            or self._cos_cached.device != device
+            or self._cos_cached.dtype != dtype
+            or (self.training and self._cos_cached.is_inference())
+        ):
+            self._seq_len_cached = seqlen
+            # We want fp32 here, not self.inv_freq.dtype, since the model could be loaded in bf16
+            # And the output of arange can be quite large, so bf16 would lose a lot of precision.
+            # However, for compatibility reason, we add an option to use the dtype of self.inv_freq.
+            if self.pos_idx_in_fp32:
+                t = torch.arange(seqlen, device=device, dtype=torch.float32)
+                # We want fp32 here as well since inv_freq will be multiplied with t, and the output
+                # will be large. Having it in bf16 will lose a lot of precision and cause the
+                # cos & sin output to change significantly.
+                # We want to recompute self.inv_freq if it was not loaded in fp32
+                if self.inv_freq.dtype != torch.float32:
+                    if seqlen > self.max_position_embeddings:
+                        base = self.base * (
+                            (self.rotary_scaling_factor * seqlen / self.max_position_embeddings) - (self.rotary_scaling_factor - 1)
+                        ) ** (self.dim / (self.dim - 2))
+                    else:
+                        base = self.base
+                    inv_freq = self._compute_inv_freq(device=device, base=base)
+                else:
+                    inv_freq = self.inv_freq
+            else:
+                t = torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)
+                inv_freq = self.inv_freq
+            # Don't do einsum, it converts fp32 to fp16 under AMP
+            # freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+            freqs = torch.outer(t, inv_freq)
+            if self.scale is None:
+                self._cos_cached = torch.cos(freqs).to(dtype)
+                self._sin_cached = torch.sin(freqs).to(dtype)
+            else:
+                power = (
+                    torch.arange(seqlen, dtype=self.scale.dtype, device=self.scale.device) - seqlen // 2
+                ) / self.scale_base
+                scale = self.scale.to(device=power.device) ** rearrange(power, "s -> s 1")
+                # We want the multiplication by scale to happen in fp32
+                self._cos_cached = (torch.cos(freqs) * scale).to(dtype)
+                self._sin_cached = (torch.sin(freqs) * scale).to(dtype)
+                self._cos_k_cached = (torch.cos(freqs) / scale).to(dtype)
+                self._sin_k_cached = (torch.sin(freqs) / scale).to(dtype)
+class NomicBertAttention(nn.Module):
+    """Multi-head self-attention and cross-attention"""
+    def __init__(
+        self,
+        config,
+    ) -> None:
+        """
+        num_heads_kv: can be used to toggle MQA / GQA. If None, use num_heads.
+        return_residual: whether to return the input x along with the output. This is for
+            performance reason: for post-norm architecture, returning the input allows us
+            to fuse the backward of nn.Linear with the residual connection.
+        """
+        super().__init__()
+        self.embed_dim = config.n_embd
+        self.use_flash_attn = config.use_flash_attn
+        self.fused_bias_fc = config.fused_bias_fc
+        self.num_heads = config.n_head
+        self.num_heads_kv = config.num_heads_kv if getattr(config, "num_heads_kv", None) is not None else self.num_heads
+        assert self.embed_dim % self.num_heads == 0, "embed_dim must be divisible by num_heads"
+        self.head_dim = self.embed_dim // self.num_heads
+        # we don't really support mqa / gqa for now
+        qkv_dim = self.head_dim * (self.num_heads + 2 * self.num_heads_kv)
+        self.register_buffer(
+            "norm_factor",
+            torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32)).to(torch.get_default_dtype()),
+            persistent=False,
+        )
+        self.rotary_emb_dim = self.head_dim * config.rotary_emb_fraction
+        if self.rotary_emb_dim > 0:
+            if getattr(config, "rotary_scaling_factor", None):
+                self.rotary_emb = NomicBertDynamicNTKRotaryEmbedding(
+                    dim=self.rotary_emb_dim,
+                    base=config.rotary_emb_base,
+                    scale_base=config.rotary_emb_scale_base,
+                    interleaved=config.rotary_emb_interleaved,
+                    rotary_scaling_factor=config.rotary_scaling_factor,
+                    max_position_embeddings=config.max_trained_positions,
+                )
+            else:
+                self.rotary_emb = NomicBertRotaryEmbedding(
+                    dim=self.rotary_emb_dim,
+                    base=config.rotary_emb_base,
+                    scale_base=config.rotary_emb_scale_base,
+                    interleaved=config.rotary_emb_interleaved,
+                )
+            # bug in xformers: https://github.com/facebookresearch/xformers/issues/841
+            # uses the head dimension instead of the sequence dimension
+            self.rotary_head_dim = getattr(config, "rotary_head_dim", False)
+        self.Wqkv = nn.Linear(self.embed_dim, qkv_dim, bias=config.qkv_proj_bias)
+        self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.qkv_proj_bias)
+        self.causal = config.causal
+        self.drop = nn.Dropout(config.attn_pdrop)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+        is_padded_inputs: Optional[bool] = True,
+        cu_seqlens: Optional[torch.Tensor] = None,
+        max_seq_len: Optional[int] = None,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        has_layer_past = past_key_value is not None
+        if has_layer_past:
+            # past_key_value is a (cache, length) tuple; read the length before unpacking the cache.
+            past_len = past_key_value[1]
+            past_key_value = past_key_value[0]
+        else:
+            past_len = 0
+        qkv = self.Wqkv(hidden_states)
+        qkv = rearrange(qkv, "... (three h d) -> ... three h d", three=3, d=self.head_dim)
+        past_key_value = (past_key_value, past_len + qkv.size(1)) if use_cache else None
+        if self.rotary_emb_dim > 0:
+            if self.rotary_head_dim:
+                qkv = rearrange(qkv, "b s three h d -> b h three s d")
+            qkv = self.rotary_emb(qkv, seqlen_offset=past_len)
+            if self.rotary_head_dim:
+                qkv = rearrange(qkv, "b h three s d -> b s three h d")
+        query, key, value = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
+        query = query.permute(0, 2, 1, 3)
+        key = key.permute(0, 2, 1, 3)
+        value = value.permute(0, 2, 1, 3)
+        attention_scores = torch.matmul(query, key.transpose(-1, -2)) / self.norm_factor
+        if attention_mask is not None:
+            attention_scores = attention_scores + attention_mask
+        attentions_probs = F.softmax(attention_scores, dim=-1)
+        attentions_probs = self.drop(attentions_probs)
+        attn_output = torch.matmul(attentions_probs, value)
+        attn_output = rearrange(attn_output.permute(0, 2, 1, 3), "... h d -> ... (h d)")
+        attn_output = self.out_proj(attn_output)
+        return attn_output
+class NomicBertBlock(NomicBertPreTrainedModel):
+    def __init__(
+        self,
+        config,
+    ):
+        super().__init__(config=config)
+        self.prenorm = config.prenorm
+        self.fused_dropout_add_ln = config.fused_dropout_add_ln
+        self.attn = NomicBertAttention(config)
+        activation = (
+            F.sigmoid
+            if config.activation_function == "glu"
+            else (F.silu if config.activation_function == "swiglu" else F.gelu)
+        )
+        if config.activation_function in ["glu", "swiglu", "geglu"]:
+            self.mlp = NomciBertGatedMLP(
+                config.n_embd,
+                hidden_features=config.n_inner,
+                bias1=config.mlp_fc1_bias,
+                bias2=config.mlp_fc2_bias,
+                activation=activation,
+                fused_bias_fc=config.fused_bias_fc,
+            )
+        else:
+            self.mlp = NomicBertMLP(
+                config.n_embd,
+                hidden_features=config.n_inner,
+                bias1=config.mlp_fc1_bias,
+                bias2=config.mlp_fc2_bias,
+                activation=activation,
+                fused_bias_fc=config.fused_bias_fc,
+            )
+        self.dropout1 = nn.Dropout(config.resid_pdrop)
+        self.norm1 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
+        self.norm2 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
+        self.dropout2 = nn.Dropout(config.resid_pdrop)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        hidden_states2: torch.Tensor,
+        residual: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        is_padded_inputs: Optional[bool] = True,
+        output_attentions: Optional[bool] = False,
+        use_cache: Optional[bool] = False,
+        cu_seqlens: Optional[torch.Tensor] = None,
+        max_seq_len: Optional[int] = None,
+    ):
+        r"""Pass the input through the encoder layer.
+        Args:
+            hidden_states: the input sequence to the encoder layer (required).
+            hidden_states2: not used by this block; accepted so the encoder can thread it through.
+            residual: if postnorm, residual=None. If prenorm, residual is the running residual
+                stream, so that hidden_states = Attn/MLP(LN(residual)).
+        """
+        if self.prenorm:
+            dropped = self.dropout1(hidden_states)
+            residual = (dropped + residual) if residual is not None else dropped
+            hidden_states = self.norm1(residual.to(dtype=self.norm1.weight.dtype))
+            hidden_states = self.attn(
+                hidden_states,
+                attention_mask=attention_mask,
+                is_padded_inputs=is_padded_inputs,
+                cu_seqlens=cu_seqlens,
+                max_seq_len=max_seq_len,
+            )
+            dropped = self.dropout2(hidden_states)
+            residual = (dropped + residual) if residual is not None else dropped
+            hidden_states = self.norm2(residual.to(dtype=self.norm2.weight.dtype))
+            hidden_states = self.mlp(hidden_states)
+            return hidden_states, None, residual
+        else:
+            assert residual is None
+            attn_outputs = self.attn(
+                hidden_states,
+                attention_mask=attention_mask,
+                is_padded_inputs=is_padded_inputs,
+                cu_seqlens=cu_seqlens,
+                max_seq_len=max_seq_len,
+            )
+            hidden_states = self.norm1((self.dropout1(attn_outputs) + hidden_states).to(dtype=self.norm1.weight.dtype))
+            mlp_out = self.mlp(hidden_states)
+            hidden_states = self.norm2((self.dropout2(mlp_out) + hidden_states).to(dtype=self.norm2.weight.dtype))
+            return hidden_states, None, None
+class NomicBertEncoder(nn.Module):
+    def __init__(self, config: GPT2Config):
+        super().__init__()
+        self.layers = nn.ModuleList([NomicBertBlock(config) for _ in range(config.n_layer)])
+        self.gradient_checkpointing = False
+        self.config = config
+    def forward(
+        self,
+        hidden_states: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        is_padded_inputs: Optional[bool] = True,
+    ):
+        """Run hidden_states through every layer in order and return the final hidden states.
+        past_key_values, inputs_embeds, output_hidden_states and return_dict are accepted
+        but not used in this method.
+        """
+        hidden_states2 = None
+        residual = None
+        for _, layer in enumerate(self.layers):
+            if self.gradient_checkpointing and self.training:
+                def create_custom_forward(module):
+                    def custom_forward(*inputs):
+                        # None for past_key_value
+                        return module(*inputs)
+                    return custom_forward
+                hidden_states, hidden_states2, residual = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(layer),
+                    hidden_states,
+                    hidden_states2,
+                    residual,
+                    attention_mask,
+                    None,
+                    None,
+                    is_padded_inputs,
+                    # if you freeze ANY layers, you need `use_reentrant=False`
+                    # https://github.com/huggingface/transformers/issues/21381
+                    # https://discuss.pytorch.org/t/checkpoint-with-no-grad-requiring-inputs-problem/19117/7
+                    use_reentrant=False,
+                )
+            else:
+                hidden_states, hidden_states2, residual = layer(
+                    hidden_states,
+                    hidden_states2,
+                    residual,
+                    attention_mask,
+                    position_ids,
+                    None,
+                    is_padded_inputs,
+                    output_attentions,
+                    use_cache,
+                )
+        return hidden_states
+class NomicBertPooler(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.n_embd, config.n_embd)
+        self.activation = nn.Tanh()
+    def forward(self, hidden_states, pool=True):
+        # We "pool" the model by simply taking the hidden state corresponding
+        # to the first token.
+        first_token_tensor = hidden_states[:, 0] if pool else hidden_states
+        pooled_output = self.dense(first_token_tensor)
+        pooled_output = self.activation(pooled_output)
+        return pooled_output
+class NomicBertPredictionHeadTransform(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.n_embd, config.n_embd, bias=config.mlp_fc1_bias)
+        approximate = "tanh" if config.activation_function in ["gelu_new", "gelu_fast", "gelu_pytorch_tanh"] else "none"
+        if config.activation_function == "swiglu":
+            self.transform_act_fn = F.silu
+        else:
+            self.transform_act_fn = nn.GELU(approximate=approximate)
+        self.layer_norm = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.dense(hidden_states)
+        hidden_states = self.transform_act_fn(hidden_states)
+        hidden_states = self.layer_norm(hidden_states)
+        return hidden_states
+class NomicBertLMPredictionHead(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.transform = NomicBertPredictionHeadTransform(config)
+        self.decoder = nn.Linear(config.n_embd, config.vocab_size, bias=config.mlp_fc1_bias)
+    def forward(self, hidden_states):
+        hidden_states = self.transform(hidden_states)
+        hidden_states = self.decoder(hidden_states)
+        return hidden_states
+class NomicBertPreTrainingHeads(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.predictions = NomicBertLMPredictionHead(config)
+    def forward(self, sequence_output):
+        prediction_scores = self.predictions(sequence_output)
+        return prediction_scores
+class NomicBertModel(NomicBertPreTrainedModel):
+    def __init__(self, config: GPT2Config, add_pooling_layer=True):
+        super().__init__(config)
+        self.pad_vocab_size_multiple = getattr(config, "pad_vocab_size_multiple", 1)
+        if config.vocab_size % self.pad_vocab_size_multiple != 0:
+            config.vocab_size += self.pad_vocab_size_multiple - (config.vocab_size % self.pad_vocab_size_multiple)
+        assert config.activation_function in [
+            "gelu",
+            "gelu_new",
+            "gelu_fast",
+            "gelu_pytorch_tanh",
+            "swiglu",
+            "geglu",
+            "glu",
+        ]
+        self.embeddings = NomicBertEmbeddings(config)
+        self.emb_drop = nn.Dropout(config.resid_pdrop)
+        self.emb_ln = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
+        self.encoder = NomicBertEncoder(config)
+        self.pooler = NomicBertPooler(config) if add_pooling_layer else None
+        self.apply(partial(_init_weights, initializer_range=config.initializer_range))
+    def forward(
+        self,
+        input_ids,
+        attention_mask=None,
+        position_ids=None,
+        token_type_ids=None,
+        return_dict=None,
+        matryoshka_dim=None,
+    ):
+        if token_type_ids is None:
+            token_type_ids = torch.zeros_like(input_ids)
+        hidden_states = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
+        hidden_states = self.emb_ln(hidden_states)
+        hidden_states = self.emb_drop(hidden_states)
+        attention_mask = self.get_extended_attention_mask(attention_mask, input_ids.shape)
+        sequence_output = self.encoder(hidden_states, attention_mask=attention_mask, return_dict=return_dict)
+        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
+        if matryoshka_dim:
+            # sequence_output is (batch, seqlen, hidden); truncate the hidden dimension, not seqlen
+            sequence_output = sequence_output[..., :matryoshka_dim]
+        return BaseModelOutputWithPoolingAndCrossAttentions(
+            last_hidden_state=sequence_output,
+            pooler_output=pooled_output,
+        )
+class NomicBertForPreTraining(NomicBertPreTrainedModel):
+    _tied_weights_keys = ["predictions.decoder.bias", "cls.predictions.decoder.weight"]
+    def __init__(self, config: GPT2Config):
+        super().__init__(config)
+        self.bert = NomicBertModel(config, add_pooling_layer=getattr(config, "add_pooling_layer", False))
+        self.cls = NomicBertPreTrainingHeads(config)
+        self.mlm_loss = nn.CrossEntropyLoss()
+        # Initialize weights and apply final processing
+        self.apply(partial(_init_weights, initializer_range=config.initializer_range))
+        self.tie_weights()
+    def tie_weights(self):
+        self.cls.predictions.decoder.weight = self.bert.embeddings.word_embeddings.weight
+    def forward(
+        self,
+        input_ids,
+        position_ids=None,
+        token_type_ids=None,
+        attention_mask=None,
+        labels=None,
+    ):
+        """
+        If labels are provided, they must be -100 at every position that should not contribute
+        to the loss (nn.CrossEntropyLoss ignores index -100 by default).
+        Outputs:
+            A MaskedLMOutput with
+            - loss: the masked language modeling loss if `labels` is not `None`, otherwise `None`, and
+            - logits: the masked language modeling logits of shape [batch_size, sequence_length, vocab_size].
+        This model has no next sentence classification head.
+        """
+        outputs = self.bert(
+            input_ids,
+            position_ids=position_ids,
+            token_type_ids=token_type_ids,
+            attention_mask=attention_mask.bool() if attention_mask is not None else None,
+        )
+        sequence_output, _ = outputs.last_hidden_state, outputs.pooler_output
+        prediction_scores = self.cls(sequence_output)
+        total_loss = None
+        if labels is not None:
+            masked_lm_loss = self.mlm_loss(
+                rearrange(prediction_scores, "... v -> (...) v"),
+                rearrange(labels, "... -> (...)"),
+            )
+            total_loss = masked_lm_loss.float()
+        return MaskedLMOutput(
+            loss=total_loss,
+            logits=prediction_scores,
+            hidden_states=outputs.hidden_states,
+            attentions=None,
+        )
+class NomicBertForSequenceClassification(NomicBertPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.config = config
+        self.bert = NomicBertModel(config)
+        classifier_dropout = getattr(config, "classifier_dropout", config.embd_pdrop)
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.n_embd, config.num_labels)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ):
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        outputs = self.bert(
+            input_ids,
+            position_ids=position_ids,
+            token_type_ids=token_type_ids,
+            attention_mask=attention_mask.bool() if attention_mask is not None else None,
+        )
+        pooled_output = outputs[1]
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+        loss = None
+        if labels is not None:
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+            if self.config.problem_type == "regression":
+                loss_fct = nn.MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = nn.CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = nn.BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+        return SequenceClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
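
For readers following the non-fused attention path in `modeling_hf_nomic_bert.py` above, its core (scaled scores, additive mask, softmax, weighted sum of values) can be sketched for a single head in plain Python. This is an illustration under stated assumptions, not the model's implementation: `softmax` and `attention` are hypothetical names, rotary embeddings, dropout and the batch/head dimensions are left out, and the scale is assumed to be `sqrt(head_dim)`, which this part of the diff does not show for `self.norm_factor`.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, key, value, mask=None):
    """Single-head attention over lists of per-token vectors.

    mask is an additive bias per key position: 0.0 keeps a token and a large
    negative number removes it, in the spirit of an extended attention mask.
    """
    scale = math.sqrt(len(query[0]))  # assumption: norm_factor == sqrt(head_dim)
    outputs = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in key]
        if mask is not None:
            scores = [s + m for s, m in zip(scores, mask)]
        probs = softmax(scores)
        width = len(value[0])
        outputs.append([sum(p * v[d] for p, v in zip(probs, value)) for d in range(width)])
    return outputs


query = [[1.0, 0.0]]
key = [[1.0, 0.0], [0.0, 1.0]]
value = [[1.0, 2.0], [3.0, 4.0]]
masked = attention(query, key, value, mask=[0.0, -1e9])
```

With the second key position masked out, the single query attends only to the first token, so `masked` reproduces that token's value vector.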

modules.json ADDED

	@@ -0,0 +1,14 @@

+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  }
+]
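
`modules.json` chains a `Transformer` module into a `Pooling` module, and `1_Pooling/config.json` enables only `pooling_mode_mean_tokens`. What that second stage computes for one sequence can be sketched as below. The function name `mean_pool` is hypothetical and the sketch works on plain lists; the library itself operates on batched tensors, so treat this as an illustration of the idea only.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    count = max(len(kept), 1)  # avoid dividing by zero on a fully masked input
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]


# Two real tokens followed by one padding position.
sentence_embedding = mean_pool(
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [1, 1, 0],
)
```

The padding vector does not influence the result because its mask entry is 0.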

sentence_bert_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "max_seq_length": 8192,
+  "do_lower_case": false
+}

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,62 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "max_length": 8192,
+  "model_max_length": 8192,
+  "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}
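
The IDs in `added_tokens_decoder` (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102) together with `padding_side` and `truncation_side` set to `right` determine how a tokenized sequence is framed. A minimal sketch follows, assuming the usual BERT layout of `[CLS] tokens [SEP]`, which `tokenizer_class: BertTokenizer` suggests but the config does not spell out. The helper `build_inputs` and its parameters are hypothetical and are not part of the tokenizer API.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # taken from added_tokens_decoder


def build_inputs(token_ids, max_length, pad_to=None):
    """Frame word-piece IDs with special tokens, truncating and padding on the right."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    if pad_to is not None and len(input_ids) < pad_to:
        extra = pad_to - len(input_ids)
        input_ids += [PAD_ID] * extra  # padding_side = right
        attention_mask += [0] * extra
    return input_ids, attention_mask


ids, mask = build_inputs([7, 8, 9], max_length=4, pad_to=6)
```

In this example the third word-piece is dropped by right truncation and two `[PAD]` IDs are appended, each with a mask value of 0.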

vocab.txt ADDED

The diff for this file is too large to render. See raw diff