seongil-dn committed on
Commit
2ab58eb
1 Parent(s): fac4cce

Add new SentenceTransformer model

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+{
+    "word_embedding_dimension": 768,
+    "pooling_mode_cls_token": true,
+    "pooling_mode_mean_tokens": false,
+    "pooling_mode_max_tokens": false,
+    "pooling_mode_mean_sqrt_len_tokens": false,
+    "pooling_mode_weightedmean_tokens": false,
+    "pooling_mode_lasttoken": false,
+    "include_prompt": true
+}
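This config selects CLS pooling: the Pooling module keeps only the first ([CLS]) token's embedding instead of averaging token embeddings. A minimal sketch of the selected operation (illustrative tensors, not the library's internal implementation):

```python
import torch

# (batch, seq_len, hidden) token embeddings from the transformer
token_embeddings = torch.randn(2, 16, 768)

# pooling_mode_cls_token=true: keep only the first token per sequence
sentence_embeddings = token_embeddings[:, 0]  # shape (2, 768)
```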
README.md ADDED
@@ -0,0 +1,1674 @@
+---
+tags:
+- sentence-transformers
+- sentence-similarity
+- feature-extraction
+- generated_from_trainer
+- dataset_size:816532
+- loss:MultipleNegativesRankingLoss
+base_model: Alibaba-NLP/gte-multilingual-base
+widget:
+- source_sentence: 김택용이 스타크래프트2에서 첫 승리를 거둔 시기는 언제인가?
+  sentences:
+  - 2008년 11월 22일, 김택용은 클럽데이 온라인 MSL 결승전에서 허영무에게 선승을 내준 후 내리 3연승, 3:1 쾌승을 거두며 자신의
+    세 번째 MSL 우승을 달성하였다. 이를 통해 김택용은 프로토스 최초 개인리그 3회 우승자 및 역대 네 번째 금배지(MSL 3회 우승의 상징)
+    획득자가 되었다.
+  - '김택용은 새로 개막한 SK플래닛 프로리그 시즌2에서 스타크래프트: 브루드 워, 스타크래프트 Ⅱ를 병행해서 출전했다. 스타크래프트 브루드워
+    실력은 여전히 건재하지만, 스타Ⅱ에서는 스타크래프트 브루드워에서의 실력을 내지 못했다. 2012년 8월까지 택뱅리쌍 일원 중에서 김택용만 유일하게
+    스타Ⅱ에서의 승리를 하지 못했다. (0승 6패) 더군다나 2012년 봄까지만 해도 스타Ⅱ를 완전히 이해하지 못한듯한 플레이를 보이고 있었지만,
+    김택용은 2012년 여름이 되어서 스타Ⅱ를 서서히 실력을 쌓고 있었다. 기존의 스타크래프트 브루드워 스타리그가 스타크래프트 Ⅱ로 종목 전환한
+    뒤에 열린 첫 예선에 참가했으나, 스타Ⅱ의 부족한 실력을 여실히 들어내면서 1:2로 신예선수에게 지며 예선탈락하였다. 또한 GSL 선수들과
+    맞붙은 WCS 예선에서 프나틱의 장재호를 만나 무기력하게 0:2로 패배하여 탈락하였고, WCG 2012 예선에서도 백동준에게 0:2로 패배해
+    스타Ⅱ 종목으로 열린 경기에서 모두 패배하였다. 김택용은 스타2리그 뿐만아니라 스타1리그에서도 2010년 여름부터 3년째 스타리그에 이름을
+    올리지 못했다. 2012년 8월 12일 마침내 염보성을 상대로 어렵게 프로리그 스타2 종목에서 처음으로 승리를 거두었다(1승 6패). 결국
+    부진을 극복하지 못한 채 2012년 8월 케스파 랭킹 22위로까지 떨어지고 말았다. 하지만 그 후 2012년 8월 18일 김정우 마저 김택용의
+    스타2 승리 제물이 되었다. 엘리전까지 가는 혈전 끝에 스타Ⅱ에서 두각을 돋보이는 김정우를 격파하였고, 2012년 9월 2일 SK플래닛 스타
+    프로리그 시즌2 준플레이오프 2차전에서 다시 한번 염보성을 스타Ⅱ로 격파하면서 조금씩 기세를 올렸다.'
+  - 이소룡의 아버지는 유명한 광둥 경극 배우였으며, 아버지의 뒤를 이어 아주 어린 나이부터 영화를 접하게 되었고, 생후 3개월에 《금문녀》라는
+    영화로 데뷔하였다. 그가 18세가 되었을 때 이미 그는 스무 편의 영화에 출연한 상태였다.
+- source_sentence: 페니스가 없는 여성의 심리적 반응은 어떠한가?
+  sentences:
+  - PIRA는 무장해제위원회(Decommingsioning Commission)에 의해 2005년 10월 무장투쟁을 포기했음을 확인받았으며, 우익
+    민주연합당(DUP)를 제외한 정당들도 이를 인정했다. 단, DUP에서는 증거가 없다며 무장투쟁포기사실을 인정하지 않았는데, 이는 DUP가 PIRA를
+    통해서 존재할 수 있기 때문이다. 그 실례로 북아일랜드의 수도 벨파스트에서 발행하는 일간지에선 PIRA 지도자 오닐이 무장투쟁을 포기하자,
+    민주연합당 지도자 이언 페이즐리(Ian Paisley)가 "가지마! 난 네가 필요해!"라고 말하는 내용의 풍자만화를 실었다.
+  - 성적 만족을 위해서라면 정신적인 사랑 없이 육체적 결합이 가능하다고 주장하였다. 정분이 없이도 성교가 가능하며 성관계는 일종의 오락 내지는
+    친밀행위에 지나지 않는다고 보았다. 그러나 이는 보수적인 유학자들 외에도 남성 지식인과 기독교계열의 반발을 불러왔다.
+  - 첫째는 "자신에게 페니스가 없는"것을 강하게 자각하고, 완전하게 페니스가 없는 존재로 받아들일 것이다. 이것은 열등감을 가진 여자를 만든다.
+    이 경우 무기력한 인간이 되어버린다고 한다. 둘째는 "자신은 페니스가 언젠가 나오고, 나는 남자"라고 믿고, 남성적인 성격을 갖출 경우이다.
+    세 번째는 성기라는 대상을 선망할 때 성기를 "페니스 → 아이"라는 상징으로 생각하고, 아이를 손에 넣는 길을 선택하는 경우이다.
+- source_sentence: 신탁청은 언제 해체되었는가?
+  sentences:
+  - 신탁통치령(信託統治領, ) 혹은 신탁통치 지역(信託統治 地域)은 국제 연맹 위임통치령의 후신으로 제2차 세계 대전의 종전과 함께 국제 연맹이
+    유엔으로 대체됨에 따라 생겨났다.다음 11개 지역이 신탁통치령이었다. 1994년 10월 팔라우 독립을 마지막으로 신탁통치령은 소멸되었다.
+  - 히가시코게 역()은 일본 돗토리현 야즈 군 야즈 정에 위치한 서일본 여객철도 인비 선의 철도역이다. 단선 승강장 1면 1선의 구조를 갖춘 지상역이다.
+  - 신탁청은 1994년 12월 31일 해체될 때까지 15,102개의 기업체를 매각하고 4358개의 기업체를 재사유화했으며, 호텔, 식당, 약국
+    및 서점 등 소규모 사업장 25,030개를 사유화하고 46,552건의 부동산을 매각해 총 91,042건의 사유화를 기록했다. 이를 통해 666억
+    마르크의 매각수익을 올리고, 2111억 마르크의 투자와 150만 개의 일자리를 보장받았다. 초기에 추산되었던 기업가치가 약 6000억 마르크였던
+    것에 비하면 1/10 수준밖에 되지 않은 턱없이 낮은 매각수익이다. 사유화된 15,000여 기업 중 구동독인들에 의한 매입은― 주로 경영자기업인수(MBO)
+    혹은 종업원기업인수(EBO) ― 6%에 지나지않았고, 외국인 투자자 매입도 사유화 전체 기업 중 9% 정도로 나타났다.
+- source_sentence: 석신산의 탈수 반응 생성물은 무엇인가요?
+  sentences:
+  - 석신산은 푸마르산으로 산화되거나 다이에틸석시네이트(diethylsuccinate, (CHCOCHCH))와 같은 다이에스터로 전환될 수 있다.
+    이러한 다이에틸 에스터(diethyl ester)는 스토브 축합(Stobbe condensation) 반응의 기질이다. 석신산의 탈수는 석신산
+    무수물을 생성한다. 석신산은 1,4-뷰테인다이올, 말레산 무수물, 석신이미드, 2-피롤리디논 및 테트라하이드로푸란을 유도하는데 사용될 수 있다.
+  - 2006년 ‘동의대 5·3 동지회’ 회원 등은 “동의대 사건 이후 경찰 조사 과정에서 고문 등 인권침해가 있었다”며 진실·화해를 위한 과거사
+    정리 위원회(이하 진실화해위)에 진실규명을 신청하였다. 이로 인해 진실화해위 소위원회는 “구타 등 인권침해가 있어 국가가 사과해야 한다”는
+    내용의 조사 결과 보고서를 심의·의결, 2010년 1월 19일에 열린 진실화해위 전원위원회에 상정했으나, “진실화해위는 ‘권위주의 통치’ 시기에
+    일어난 일을 조사 대상으로 삼는데, 동의대 사건은 노태우 정권 시절에 일어난 일이므로 조사 대상 자체가 되지 않는다”며 재적위원 과반수가 이
+    사건을 각하하기로 의결해 사건이 각하되었다. 다음날인 1월 20일에는 조사하지 않기로 했다고 밝힘으로서, 보고서 내용은 논의조차 되지 못한
+    것으로 전해졌다.
+  - 저산소 상태에서 석신산의 축적은 활성 산소 생산의 증가에 의한 허혈 재관류 손상(reperfusion injury)과 관련이 있다. 허혈(ischemia)
+    동안 푸마르산은 퓨린 뉴클레오타이드의 분해 및 말산-아스파르트산 셔틀의 역방향 반응의 일부분으로부터 형성된다. 과도한 푸마르산은 석신산 탈수소효소의
+    역반응을 통해 석신산의 생산 및 축적을 야기한다. 재관류시 석신산은 신속하게 산화되어 활성산소의 갑작스럽고 광범위한 생성을 초래한다. 활성산소는
+    세포자살 기작을 촉발시키거나 단백질, 세포막, 세포소기관 등에 산화적 손상을 유발한다. 동물 모델에서 허혈성 석신산 축적의 약리학적 억제는
+    허혈 재관류 손상을 개선시켰다. 현재 석신산 매개 활성산소 생성의 억제는 약물 치료의 표적으로 조사 중이다.
+- source_sentence: 파올로 말디니는 어떤 선수인가요?
+  sentences:
+  - 체사레 말디니는 1954년부터 1966년까지 AC 밀란에서 뛰었고, 아들 파올로 말디니는 1985년부터 2009년까지 AC 밀란에서 뛰었으며,
+    손자 크리스티안 말디니가 2005년 10월 18일 AC 밀란 유스팀에 입단해 3부자가 모두 AC 밀란에서 활약하게 되었다.
+  - 파올로 체사레 말디니 (, 1968년 6월 26일, 이탈리아 밀라노 ~ )는 이탈리아의 은퇴한 축구 선수로, 포지션은 왼쪽 풀백과 센터백이었다.
+    그는 밀란의 전설적인 수비수 였을 뿐 아니라 역대 최고 수비수로도 불릴 만큼 대단한 선수였다. 현재 밀란의 스포츠 전략 & 개발 디렉터로 활동하고
+    있다.
+  - 조 주니어(Joe Junior, 본명은 Jose Maria Rodrigues, Jr.(조즈 마리아 로드리게스 주니어), 중문명(中文名)은 羅利期(뤄리지,
+    나이기), 1947년 7월 22일 ~ )는 영국 국적자 신분의 포르투갈계 영국인 남성으로 중화인민공화국 마카오 특별행정구에서 출생한 중화인민공화국
+    홍콩 특별행정구의 가수, 작사가, 영화배우, 텔레비전 연기자이다.
+pipeline_tag: sentence-similarity
+library_name: sentence-transformers
+---
+
+# SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
+
+This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+## Model Details
+
+### Model Description
+- **Model Type:** Sentence Transformer
+- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) <!-- at revision 3f013725dc4dcee1e4ca72d1ce7e053c0dcee5ef -->
+- **Maximum Sequence Length:** 1024 tokens
+- **Output Dimensionality:** 768 dimensions
+- **Similarity Function:** Cosine Similarity
+<!-- - **Training Dataset:** Unknown -->
+<!-- - **Language:** Unknown -->
+<!-- - **License:** Unknown -->
+
+### Model Sources
+
+- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+### Full Model Architecture
+
+```
+SentenceTransformer(
+  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: NewModel
+  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+  (2): Normalize()
+)
+```
+
+## Usage
+
+### Direct Usage (Sentence Transformers)
+
+First install the Sentence Transformers library:
+
+```bash
+pip install -U sentence-transformers
+```
+
+Then you can load this model and run inference.
+```python
+from sentence_transformers import SentenceTransformer
+
+# Download from the 🤗 Hub
+# trust_remote_code is required because the gte base uses a custom NewModel architecture
+model = SentenceTransformer("seongil-dn/gte-base-250k-answerableHN", trust_remote_code=True)
+# Run inference
+sentences = [
+    '파올로 말디니는 어떤 선수인가요?',
+    '파올로 체사레 말디니 (, 1968년 6월 26일, 이탈리아 밀라노 ~ )는 이탈리아의 은퇴한 축구 선수로, 포지션은 왼쪽 풀백과 센터백이었다. 그는 밀란의 전설적인 수비수 였을 뿐 아니라 역대 최고 수비수로도 불릴 만큼 대단한 선수였다. 현재 밀란의 스포츠 전략 & 개발 디렉터로 활동하고 있다.',
+    '체사레 말디니는 1954년부터 1966년까지 AC 밀란에서 뛰었고, 아들 파올로 말디니는 1985년부터 2009년까지 AC 밀란에서 뛰었으며, 손자 크리스티안 말디니가 2005년 10월 18일 AC 밀란 유스팀에 입단해 3부자가 모두 AC 밀란에서 활약하게 되었다.',
+]
+embeddings = model.encode(sentences)
+print(embeddings.shape)
+# [3, 768]
+
+# Get the similarity scores for the embeddings
+similarities = model.similarity(embeddings, embeddings)
+print(similarities.shape)
+# [3, 3]
+```
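+
+Because the architecture ends with a `Normalize()` module, the embeddings come back unit-length, so a plain dot product reproduces the cosine similarities. A quick check continuing the example above (`model.encode` returns NumPy arrays by default):
+
+```python
+import numpy as np
+
+# Each row is L2-normalized by the final Normalize() module.
+print(np.linalg.norm(embeddings, axis=1))  # ~[1. 1. 1.]
+# Dot products of unit vectors equal cosine similarities.
+print(embeddings @ embeddings.T)  # same values as `similarities` above
+```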
+
+<!--
+### Direct Usage (Transformers)
+
+<details><summary>Click to see the direct usage in Transformers</summary>
+
+</details>
+-->
+
+<!--
+### Downstream Usage (Sentence Transformers)
+
+You can finetune this model on your own dataset.
+
+<details><summary>Click to expand</summary>
+
+</details>
+-->
+
+<!--
+### Out-of-Scope Use
+
+*List how the model may foreseeably be misused and address what users ought not to do with the model.*
+-->
+
+<!--
+## Bias, Risks and Limitations
+
+*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+-->
+
+<!--
+### Recommendations
+
+*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+-->
+
+## Training Details
+
+### Training Dataset
+
+#### Unnamed Dataset
+
+
+* Size: 816,532 training samples
+* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
+* Approximate statistics based on the first 1000 samples:
+  | | anchor | positive | negative |
+  |:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
+  | type | string | string | string |
+  | details | <ul><li>min: 9 tokens</li><li>mean: 17.22 tokens</li><li>max: 32 tokens</li></ul> | <ul><li>min: 46 tokens</li><li>mean: 144.47 tokens</li><li>max: 621 tokens</li></ul> | <ul><li>min: 46 tokens</li><li>mean: 169.92 tokens</li><li>max: 1024 tokens</li></ul> |
+* Samples:
+  | anchor | positive | negative |
+  |:-------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+  | <code>별의 나이는 어떻게 측정하는가?</code> | <code>별의 나이는 토륨과 다른 성분들에 의해 만들어진 스펙트럼선들의 상대적인 힘을 측정하기 위해 초거대망원경의 자외선 분광기를 사용하여 추측한다. 선의 힘은 여러 가지 다양한 동위원소를 만들어내는데, 그것들로부터 핵우주 연대학을 사용하여 별의 나이를 짐작하는 것이다.</code> | <code>아들이 아버지보다 나이가 많을 수 없는 것처럼, 우주 안의 천체는 당연히 우주보다는 젊어야 하기 때문에, 여러 종류의 천체를 관측하여 그 나이를 추정하는 것으로 우주의 나이의 하한선을 얻을 수 있다. 가장 많이 쓰이는 방법 중 하나는 가장 온도가 낮은 백색왜성의 나이를 측정하는 것이다. 백색왜성은 태양과 비슷한 질량을 가진 별들이 죽으면서 만들어지는데, 백색왜성은 당시 가지고 있던 열 이외에 다른 에너지원이 없기 때문에 나이가 들면서 점점 식고, 어두워지게 된다. 따라서 가장 어둡고, 가장 온도가 낮은 백색왜성을 찾아서 그 냉각 나이를 측정하면 우주의 나이의 하한선을 얻을 수 있다.</code> |
+  | <code>별의 나이는 어떻게 측정하는가?</code> | <code>별의 나이는 토륨과 다른 성분들에 의해 만들어진 스펙트럼선들의 상대적인 힘을 측정하기 위해 초거대망원경의 자외선 분광기를 사용하여 추측한다. 선의 힘은 여러 가지 다양한 동위원소를 만들어내는데, 그것들로부터 핵우주 연대학을 사용하여 별의 나이를 짐작하는 것이다.</code> | <code>이 별의 물리적 수치는 태양과 비슷한데 분광형이 태양과 똑같은 G2V 여서 유사 태양으로 분류할 수 있다. 질량은 태양보다 9 퍼센트 무겁고 반지름은 태양보다 1 퍼센트 작다. 나이는 상대적으로 젊어 약 8천만 ~ 2억 년으로 보인다. 젊은 별인만큼 자전 속도는 3.5일에 한 번 돌 정도로 빠르며 자전축은 시선방향에 대해 21도(오차범위 +8, -9도) 기울어져 있다.</code> |
+  | <code>별의 나이는 어떻게 측정하는가?</code> | <code>별의 나이는 토륨과 다른 성분들에 의해 만들어진 스펙트럼선들의 상대적인 힘을 측정하기 위해 초거대망원경의 자외선 분광기를 사용하여 추측한다. 선의 힘은 여러 가지 다양한 동위원소를 만들어내는데, 그것들로부터 핵우주 연대학을 사용하여 별의 나이를 짐작하는 것이다.</code> | <code>여기서 "v"는 적도에서의 각속도이며 "t"는 별의 나이이다. 이 관계식은 1972년 앤드류 P. 스쿠마니치가 발견했으며 그의 이름을 따서 '스쿠마니치의 법칙'으로 불린다. 자이로연대학(Gyrochronology)은 태양의 속도를 기준점으로 한 항성의 자전 속도에 기초하여, 그 별의 나이를 결정하는 것이다.</code> |
+* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+  ```json
+  {
+      "scale": 20.0,
+      "similarity_fct": "cos_sim"
+  }
+  ```
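+
+For readers who want to reproduce this setup, the sketch below shows how (anchor, positive, negative) triplets with these column names are typically fed to `MultipleNegativesRankingLoss` in sentence-transformers v3-style training. The toy rows and trainer wiring are illustrative assumptions, not the exact training script; only the loss parameters and columns above come from this card.
+
+```python
+from datasets import Dataset
+from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
+from sentence_transformers.losses import MultipleNegativesRankingLoss
+
+model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)
+
+# Toy stand-ins for the 816,532 (anchor, positive, negative) samples.
+train_dataset = Dataset.from_dict({
+    "anchor": ["별의 나이는 어떻게 측정하는가?"],
+    "positive": ["별의 나이는 스펙트럼선들의 상대적인 힘을 측정하여 추측한다."],
+    "negative": ["가장 온도가 낮은 백색왜성의 나이는 우주 나이의 하한선을 준다."],
+})
+
+# scale=20.0 with the default cosine similarity, matching the parameters above;
+# in-batch negatives are used in addition to the explicit `negative` column.
+loss = MultipleNegativesRankingLoss(model, scale=20.0)
+
+trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
+trainer.train()
+```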
+
+### Training Hyperparameters
+#### Non-Default Hyperparameters
+
+- `per_device_train_batch_size`: 40
+- `gradient_accumulation_steps`: 4
+- `learning_rate`: 0.0001
+- `adam_epsilon`: 1e-07
+- `num_train_epochs`: 1
+- `warmup_ratio`: 0.1
+- `bf16`: True
+- `batch_sampler`: no_duplicates
+
+#### All Hyperparameters
+<details><summary>Click to expand</summary>
+
+- `overwrite_output_dir`: False
+- `do_predict`: False
+- `eval_strategy`: no
+- `prediction_loss_only`: True
+- `per_device_train_batch_size`: 40
+- `per_device_eval_batch_size`: 8
+- `per_gpu_train_batch_size`: None
+- `per_gpu_eval_batch_size`: None
+- `gradient_accumulation_steps`: 4
+- `eval_accumulation_steps`: None
+- `torch_empty_cache_steps`: None
+- `learning_rate`: 0.0001
+- `weight_decay`: 0.0
+- `adam_beta1`: 0.9
+- `adam_beta2`: 0.999
+- `adam_epsilon`: 1e-07
+- `max_grad_norm`: 1.0
+- `num_train_epochs`: 1
+- `max_steps`: -1
+- `lr_scheduler_type`: linear
+- `lr_scheduler_kwargs`: {}
+- `warmup_ratio`: 0.1
+- `warmup_steps`: 0
+- `log_level`: passive
+- `log_level_replica`: warning
+- `log_on_each_node`: True
+- `logging_nan_inf_filter`: True
+- `save_safetensors`: True
+- `save_on_each_node`: False
+- `save_only_model`: False
+- `restore_callback_states_from_checkpoint`: False
+- `no_cuda`: False
+- `use_cpu`: False
+- `use_mps_device`: False
+- `seed`: 42
+- `data_seed`: None
+- `jit_mode_eval`: False
+- `use_ipex`: False
+- `bf16`: True
+- `fp16`: False
+- `fp16_opt_level`: O1
+- `half_precision_backend`: auto
+- `bf16_full_eval`: False
+- `fp16_full_eval`: False
+- `tf32`: None
+- `local_rank`: 0
+- `ddp_backend`: None
+- `tpu_num_cores`: None
+- `tpu_metrics_debug`: False
+- `debug`: []
+- `dataloader_drop_last`: True
+- `dataloader_num_workers`: 0
+- `dataloader_prefetch_factor`: None
+- `past_index`: -1
+- `disable_tqdm`: False
+- `remove_unused_columns`: True
+- `label_names`: None
+- `load_best_model_at_end`: False
+- `ignore_data_skip`: False
+- `fsdp`: []
+- `fsdp_min_num_params`: 0
+- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+- `fsdp_transformer_layer_cls_to_wrap`: None
+- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+- `deepspeed`: None
+- `label_smoothing_factor`: 0.0
+- `optim`: adamw_torch
+- `optim_args`: None
+- `adafactor`: False
+- `group_by_length`: False
+- `length_column_name`: length
+- `ddp_find_unused_parameters`: None
+- `ddp_bucket_cap_mb`: None
+- `ddp_broadcast_buffers`: False
+- `dataloader_pin_memory`: True
+- `dataloader_persistent_workers`: False
+- `skip_memory_metrics`: True
+- `use_legacy_prediction_loop`: False
+- `push_to_hub`: False
+- `resume_from_checkpoint`: None
+- `hub_model_id`: None
+- `hub_strategy`: every_save
+- `hub_private_repo`: False
+- `hub_always_push`: False
+- `gradient_checkpointing`: False
+- `gradient_checkpointing_kwargs`: None
+- `include_inputs_for_metrics`: False
+- `eval_do_concat_batches`: True
+- `fp16_backend`: auto
+- `push_to_hub_model_id`: None
+- `push_to_hub_organization`: None
+- `mp_parameters`: 
+- `auto_find_batch_size`: False
+- `full_determinism`: False
+- `torchdynamo`: None
+- `ray_scope`: last
+- `ddp_timeout`: 1800
+- `torch_compile`: False
+- `torch_compile_backend`: None
+- `torch_compile_mode`: None
+- `dispatch_batches`: None
+- `split_batches`: None
+- `include_tokens_per_second`: False
+- `include_num_input_tokens_seen`: False
+- `neftune_noise_alpha`: None
+- `optim_target_modules`: None
+- `batch_eval_metrics`: False
+- `eval_on_start`: False
+- `eval_use_gather_object`: False
+- `batch_sampler`: no_duplicates
+- `multi_dataset_batch_sampler`: proportional
+
+</details>
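+
+As a rough guide, the non-default values above map onto `SentenceTransformerTrainingArguments` as in this minimal sketch (assuming sentence-transformers v3; `output_dir` is a hypothetical path):
+
+```python
+from sentence_transformers import SentenceTransformerTrainingArguments
+from sentence_transformers.training_args import BatchSamplers
+
+args = SentenceTransformerTrainingArguments(
+    output_dir="output/gte-base-answerableHN",  # hypothetical
+    per_device_train_batch_size=40,
+    gradient_accumulation_steps=4,
+    learning_rate=1e-4,
+    adam_epsilon=1e-7,
+    num_train_epochs=1,
+    warmup_ratio=0.1,
+    bf16=True,
+    # avoids duplicate texts acting as false in-batch negatives for MNRL
+    batch_sampler=BatchSamplers.NO_DUPLICATES,
+)
+```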
+
+### Training Logs
+<details><summary>Click to expand</summary>
+
+| Epoch | Step | Training Loss |
+|:------:|:----:|:-------------:|
+| 0.0008 | 1 | 0.4813 |
+| 0.0016 | 2 | 0.5643 |
+| 0.0024 | 3 | 0.4872 |
+| 0.0031 | 4 | 0.3838 |
+| 0.0039 | 5 | 0.4269 |
+| 0.0047 | 6 | 0.434 |
+| 0.0055 | 7 | 0.5153 |
+| 0.0063 | 8 | 0.4429 |
+| 0.0071 | 9 | 0.4464 |
+| 0.0078 | 10 | 0.4187 |
+| 0.0086 | 11 | 0.468 |
+| 0.0094 | 12 | 0.402 |
+| 0.0102 | 13 | 0.3745 |
+| 0.0110 | 14 | 0.3623 |
+| 0.0118 | 15 | 0.3358 |
+| 0.0125 | 16 | 0.3927 |
+| 0.0133 | 17 | 0.4539 |
+| 0.0141 | 18 | 0.3177 |
+| 0.0149 | 19 | 0.2902 |
+| 0.0157 | 20 | 0.3559 |
+| 0.0165 | 21 | 0.2641 |
+| 0.0172 | 22 | 0.2968 |
+| 0.0180 | 23 | 0.2008 |
+| 0.0188 | 24 | 0.2742 |
+| 0.0196 | 25 | 0.3565 |
+| 0.0204 | 26 | 0.2706 |
+| 0.0212 | 27 | 0.2544 |
+| 0.0219 | 28 | 0.2721 |
+| 0.0227 | 29 | 0.2795 |
+| 0.0235 | 30 | 0.2647 |
+| 0.0243 | 31 | 0.164 |
+| 0.0251 | 32 | 0.2574 |
+| 0.0259 | 33 | 0.1962 |
+| 0.0267 | 34 | 0.2739 |
+| 0.0274 | 35 | 0.2286 |
+| 0.0282 | 36 | 0.2376 |
+| 0.0290 | 37 | 0.3125 |
+| 0.0298 | 38 | 0.2401 |
+| 0.0306 | 39 | 0.1922 |
+| 0.0314 | 40 | 0.2479 |
+| 0.0321 | 41 | 0.1851 |
+| 0.0329 | 42 | 0.1813 |
+| 0.0337 | 43 | 0.2471 |
+| 0.0345 | 44 | 0.2561 |
+| 0.0353 | 45 | 0.2568 |
+| 0.0361 | 46 | 0.3049 |
+| 0.0368 | 47 | 0.2404 |
+| 0.0376 | 48 | 0.231 |
+| 0.0384 | 49 | 0.261 |
+| 0.0392 | 50 | 0.2581 |
+| 0.0400 | 51 | 0.2184 |
+| 0.0408 | 52 | 0.2002 |
+| 0.0415 | 53 | 0.2586 |
+| 0.0423 | 54 | 0.1532 |
+| 0.0431 | 55 | 0.2023 |
+| 0.0439 | 56 | 0.2272 |
+| 0.0447 | 57 | 0.2207 |
+| 0.0455 | 58 | 0.2364 |
+| 0.0462 | 59 | 0.2044 |
+| 0.0470 | 60 | 0.2387 |
+| 0.0478 | 61 | 0.2289 |
+| 0.0486 | 62 | 0.1616 |
+| 0.0494 | 63 | 0.1753 |
+| 0.0502 | 64 | 0.1803 |
+| 0.0510 | 65 | 0.2033 |
+| 0.0517 | 66 | 0.2061 |
+| 0.0525 | 67 | 0.2128 |
+| 0.0533 | 68 | 0.2046 |
+| 0.0541 | 69 | 0.1685 |
+| 0.0549 | 70 | 0.1985 |
+| 0.0557 | 71 | 0.1713 |
+| 0.0564 | 72 | 0.21 |
+| 0.0572 | 73 | 0.2085 |
+| 0.0580 | 74 | 0.2144 |
+| 0.0588 | 75 | 0.2099 |
+| 0.0596 | 76 | 0.223 |
+| 0.0604 | 77 | 0.2342 |
+| 0.0611 | 78 | 0.2327 |
+| 0.0619 | 79 | 0.1812 |
+| 0.0627 | 80 | 0.2068 |
+| 0.0635 | 81 | 0.1826 |
+| 0.0643 | 82 | 0.1792 |
+| 0.0651 | 83 | 0.2363 |
+| 0.0658 | 84 | 0.1842 |
+| 0.0666 | 85 | 0.1673 |
+| 0.0674 | 86 | 0.2068 |
+| 0.0682 | 87 | 0.2386 |
+| 0.0690 | 88 | 0.1905 |
+| 0.0698 | 89 | 0.22 |
+| 0.0705 | 90 | 0.2351 |
+| 0.0713 | 91 | 0.2444 |
+| 0.0721 | 92 | 0.1984 |
+| 0.0729 | 93 | 0.1823 |
+| 0.0737 | 94 | 0.201 |
+| 0.0745 | 95 | 0.1548 |
+| 0.0752 | 96 | 0.1824 |
+| 0.0760 | 97 | 0.2315 |
+| 0.0768 | 98 | 0.2042 |
+| 0.0776 | 99 | 0.1579 |
+| 0.0784 | 100 | 0.1906 |
+| 0.0792 | 101 | 0.2058 |
+| 0.0800 | 102 | 0.2094 |
+| 0.0807 | 103 | 0.2149 |
+| 0.0815 | 104 | 0.2138 |
+| 0.0823 | 105 | 0.1932 |
+| 0.0831 | 106 | 0.1874 |
+| 0.0839 | 107 | 0.1945 |
+| 0.0847 | 108 | 0.1705 |
+| 0.0854 | 109 | 0.1832 |
+| 0.0862 | 110 | 0.2075 |
+| 0.0870 | 111 | 0.1586 |
+| 0.0878 | 112 | 0.139 |
+| 0.0886 | 113 | 0.1496 |
+| 0.0894 | 114 | 0.1843 |
+| 0.0901 | 115 | 0.2377 |
+| 0.0909 | 116 | 0.1998 |
+| 0.0917 | 117 | 0.1491 |
+| 0.0925 | 118 | 0.1763 |
+| 0.0933 | 119 | 0.128 |
+| 0.0941 | 120 | 0.1595 |
+| 0.0948 | 121 | 0.1816 |
+| 0.0956 | 122 | 0.2252 |
+| 0.0964 | 123 | 0.1829 |
+| 0.0972 | 124 | 0.1505 |
+| 0.0980 | 125 | 0.1726 |
+| 0.0988 | 126 | 0.2009 |
+| 0.0995 | 127 | 0.2219 |
+| 0.1003 | 128 | 0.1384 |
+| 0.1011 | 129 | 0.1243 |
+| 0.1019 | 130 | 0.2139 |
+| 0.1027 | 131 | 0.1677 |
+| 0.1035 | 132 | 0.1957 |
+| 0.1043 | 133 | 0.1683 |
+| 0.1050 | 134 | 0.168 |
+| 0.1058 | 135 | 0.2021 |
+| 0.1066 | 136 | 0.2112 |
+| 0.1074 | 137 | 0.2093 |
+| 0.1082 | 138 | 0.2279 |
+| 0.1090 | 139 | 0.2001 |
+| 0.1097 | 140 | 0.179 |
+| 0.1105 | 141 | 0.1954 |
+| 0.1113 | 142 | 0.172 |
+| 0.1121 | 143 | 0.1969 |
+| 0.1129 | 144 | 0.1561 |
+| 0.1137 | 145 | 0.1802 |
+| 0.1144 | 146 | 0.1885 |
+| 0.1152 | 147 | 0.1438 |
+| 0.1160 | 148 | 0.1791 |
+| 0.1168 | 149 | 0.1905 |
+| 0.1176 | 150 | 0.2506 |
+| 0.1184 | 151 | 0.2024 |
+| 0.1191 | 152 | 0.2059 |
+| 0.1199 | 153 | 0.2393 |
+| 0.1207 | 154 | 0.1531 |
+| 0.1215 | 155 | 0.1888 |
+| 0.1223 | 156 | 0.1831 |
+| 0.1231 | 157 | 0.1378 |
+| 0.1238 | 158 | 0.1553 |
+| 0.1246 | 159 | 0.2004 |
+| 0.1254 | 160 | 0.2071 |
+| 0.1262 | 161 | 0.1909 |
+| 0.1270 | 162 | 0.1763 |
+| 0.1278 | 163 | 0.1914 |
+| 0.1286 | 164 | 0.1365 |
+| 0.1293 | 165 | 0.2272 |
+| 0.1301 | 166 | 0.1484 |
+| 0.1309 | 167 | 0.2181 |
+| 0.1317 | 168 | 0.2386 |
+| 0.1325 | 169 | 0.2005 |
+| 0.1333 | 170 | 0.1757 |
+| 0.1340 | 171 | 0.1679 |
+| 0.1348 | 172 | 0.1707 |
+| 0.1356 | 173 | 0.1448 |
+| 0.1364 | 174 | 0.1703 |
+| 0.1372 | 175 | 0.1739 |
+| 0.1380 | 176 | 0.1376 |
+| 0.1387 | 177 | 0.1906 |
+| 0.1395 | 178 | 0.2542 |
+| 0.1403 | 179 | 0.1438 |
+| 0.1411 | 180 | 0.1786 |
+| 0.1419 | 181 | 0.1838 |
+| 0.1427 | 182 | 0.1592 |
+| 0.1434 | 183 | 0.1991 |
+| 0.1442 | 184 | 0.1702 |
+| 0.1450 | 185 | 0.1787 |
+| 0.1458 | 186 | 0.1631 |
+| 0.1466 | 187 | 0.2697 |
+| 0.1474 | 188 | 0.1654 |
+| 0.1481 | 189 | 0.2037 |
+| 0.1489 | 190 | 0.1751 |
+| 0.1497 | 191 | 0.212 |
+| 0.1505 | 192 | 0.1531 |
+| 0.1513 | 193 | 0.1802 |
+| 0.1521 | 194 | 0.1421 |
+| 0.1529 | 195 | 0.236 |
+| 0.1536 | 196 | 0.1702 |
+| 0.1544 | 197 | 0.1869 |
+| 0.1552 | 198 | 0.1796 |
+| 0.1560 | 199 | 0.1537 |
+| 0.1568 | 200 | 0.1646 |
+| 0.1576 | 201 | 0.1603 |
+| 0.1583 | 202 | 0.1662 |
+| 0.1591 | 203 | 0.1323 |
+| 0.1599 | 204 | 0.1672 |
+| 0.1607 | 205 | 0.2217 |
+| 0.1615 | 206 | 0.144 |
+| 0.1623 | 207 | 0.1889 |
+| 0.1630 | 208 | 0.159 |
+| 0.1638 | 209 | 0.1298 |
+| 0.1646 | 210 | 0.1245 |
+| 0.1654 | 211 | 0.1815 |
+| 0.1662 | 212 | 0.1771 |
+| 0.1670 | 213 | 0.1441 |
+| 0.1677 | 214 | 0.1834 |
+| 0.1685 | 215 | 0.1997 |
+| 0.1693 | 216 | 0.203 |
+| 0.1701 | 217 | 0.1986 |
+| 0.1709 | 218 | 0.1965 |
+| 0.1717 | 219 | 0.1682 |
+| 0.1724 | 220 | 0.1485 |
+| 0.1732 | 221 | 0.1531 |
+| 0.1740 | 222 | 0.16 |
+| 0.1748 | 223 | 0.1554 |
+| 0.1756 | 224 | 0.1705 |
+| 0.1764 | 225 | 0.1771 |
+| 0.1772 | 226 | 0.1507 |
+| 0.1779 | 227 | 0.1623 |
+| 0.1787 | 228 | 0.1527 |
+| 0.1795 | 229 | 0.1332 |
+| 0.1803 | 230 | 0.1556 |
+| 0.1811 | 231 | 0.1504 |
+| 0.1819 | 232 | 0.1581 |
+| 0.1826 | 233 | 0.15 |
+| 0.1834 | 234 | 0.2012 |
+| 0.1842 | 235 | 0.1587 |
+| 0.1850 | 236 | 0.2141 |
+| 0.1858 | 237 | 0.1431 |
+| 0.1866 | 238 | 0.1092 |
+| 0.1873 | 239 | 0.1688 |
+| 0.1881 | 240 | 0.2185 |
+| 0.1889 | 241 | 0.2071 |
+| 0.1897 | 242 | 0.1575 |
+| 0.1905 | 243 | 0.1251 |
+| 0.1913 | 244 | 0.1692 |
+| 0.1920 | 245 | 0.1746 |
+| 0.1928 | 246 | 0.2024 |
+| 0.1936 | 247 | 0.2074 |
+| 0.1944 | 248 | 0.2422 |
+| 0.1952 | 249 | 0.1994 |
+| 0.1960 | 250 | 0.1672 |
+| 0.1967 | 251 | 0.1474 |
+| 0.1975 | 252 | 0.1888 |
+| 0.1983 | 253 | 0.2173 |
+| 0.1991 | 254 | 0.1448 |
+| 0.1999 | 255 | 0.2403 |
+| 0.2007 | 256 | 0.1652 |
+| 0.2015 | 257 | 0.1929 |
+| 0.2022 | 258 | 0.1272 |
+| 0.2030 | 259 | 0.193 |
+| 0.2038 | 260 | 0.1665 |
+| 0.2046 | 261 | 0.1677 |
+| 0.2054 | 262 | 0.1558 |
+| 0.2062 | 263 | 0.1825 |
+| 0.2069 | 264 | 0.1549 |
+| 0.2077 | 265 | 0.199 |
+| 0.2085 | 266 | 0.1495 |
+| 0.2093 | 267 | 0.1478 |
+| 0.2101 | 268 | 0.168 |
+| 0.2109 | 269 | 0.1015 |
+| 0.2116 | 270 | 0.1924 |
+| 0.2124 | 271 | 0.1397 |
+| 0.2132 | 272 | 0.1449 |
+| 0.2140 | 273 | 0.1797 |
+| 0.2148 | 274 | 0.1689 |
+| 0.2156 | 275 | 0.1738 |
+| 0.2163 | 276 | 0.1758 |
+| 0.2171 | 277 | 0.1298 |
+| 0.2179 | 278 | 0.1889 |
+| 0.2187 | 279 | 0.1377 |
+| 0.2195 | 280 | 0.1592 |
+| 0.2203 | 281 | 0.1506 |
+| 0.2210 | 282 | 0.1622 |
+| 0.2218 | 283 | 0.1484 |
+| 0.2226 | 284 | 0.1493 |
+| 0.2234 | 285 | 0.1305 |
+| 0.2242 | 286 | 0.1131 |
+| 0.2250 | 287 | 0.1466 |
+| 0.2257 | 288 | 0.1267 |
+| 0.2265 | 289 | 0.1426 |
+| 0.2273 | 290 | 0.1649 |
+| 0.2281 | 291 | 0.1263 |
+| 0.2289 | 292 | 0.2029 |
+| 0.2297 | 293 | 0.1845 |
+| 0.2305 | 294 | 0.1364 |
+| 0.2312 | 295 | 0.1688 |
+| 0.2320 | 296 | 0.2093 |
+| 0.2328 | 297 | 0.1605 |
+| 0.2336 | 298 | 0.1206 |
+| 0.2344 | 299 | 0.2165 |
+| 0.2352 | 300 | 0.2139 |
+| 0.2359 | 301 | 0.1673 |
+| 0.2367 | 302 | 0.1455 |
+| 0.2375 | 303 | 0.1617 |
+| 0.2383 | 304 | 0.1663 |
+| 0.2391 | 305 | 0.1649 |
+| 0.2399 | 306 | 0.1358 |
+| 0.2406 | 307 | 0.1746 |
+| 0.2414 | 308 | 0.1664 |
+| 0.2422 | 309 | 0.1135 |
+| 0.2430 | 310 | 0.1612 |
+| 0.2438 | 311 | 0.1529 |
+| 0.2446 | 312 | 0.1367 |
+| 0.2453 | 313 | 0.1709 |
+| 0.2461 | 314 | 0.1757 |
+| 0.2469 | 315 | 0.1885 |
+| 0.2477 | 316 | 0.1792 |
+| 0.2485 | 317 | 0.1195 |
+| 0.2493 | 318 | 0.1451 |
+| 0.2500 | 319 | 0.1684 |
+| 0.2508 | 320 | 0.1299 |
+| 0.2516 | 321 | 0.1867 |
+| 0.2524 | 322 | 0.1899 |
+| 0.2532 | 323 | 0.1329 |
+| 0.2540 | 324 | 0.1403 |
+| 0.2548 | 325 | 0.1862 |
+| 0.2555 | 326 | 0.1407 |
+| 0.2563 | 327 | 0.1756 |
+| 0.2571 | 328 | 0.1465 |
+| 0.2579 | 329 | 0.1638 |
+| 0.2587 | 330 | 0.1506 |
+| 0.2595 | 331 | 0.1431 |
+| 0.2602 | 332 | 0.1975 |
+| 0.2610 | 333 | 0.1678 |
+| 0.2618 | 334 | 0.1695 |
+| 0.2626 | 335 | 0.1905 |
+| 0.2634 | 336 | 0.1754 |
+| 0.2642 | 337 | 0.145 |
+| 0.2649 | 338 | 0.1787 |
+| 0.2657 | 339 | 0.1464 |
+| 0.2665 | 340 | 0.1598 |
+| 0.2673 | 341 | 0.1159 |
+| 0.2681 | 342 | 0.1573 |
+| 0.2689 | 343 | 0.2009 |
+| 0.2696 | 344 | 0.2046 |
+| 0.2704 | 345 | 0.1523 |
+| 0.2712 | 346 | 0.1293 |
+| 0.2720 | 347 | 0.1614 |
+| 0.2728 | 348 | 0.1538 |
+| 0.2736 | 349 | 0.1418 |
+| 0.2743 | 350 | 0.158 |
+| 0.2751 | 351 | 0.1443 |
+| 0.2759 | 352 | 0.1437 |
+| 0.2767 | 353 | 0.1506 |
+| 0.2775 | 354 | 0.1452 |
+| 0.2783 | 355 | 0.1637 |
+| 0.2791 | 356 | 0.1015 |
+| 0.2798 | 357 | 0.1531 |
+| 0.2806 | 358 | 0.162 |
+| 0.2814 | 359 | 0.1166 |
+| 0.2822 | 360 | 0.1968 |
+| 0.2830 | 361 | 0.1828 |
+| 0.2838 | 362 | 0.1281 |
+| 0.2845 | 363 | 0.1738 |
+| 0.2853 | 364 | 0.1785 |
+| 0.2861 | 365 | 0.1475 |
+| 0.2869 | 366 | 0.179 |
+| 0.2877 | 367 | 0.1322 |
+| 0.2885 | 368 | 0.234 |
+| 0.2892 | 369 | 0.1465 |
+| 0.2900 | 370 | 0.125 |
+| 0.2908 | 371 | 0.1945 |
+| 0.2916 | 372 | 0.1728 |
+| 0.2924 | 373 | 0.1246 |
+| 0.2932 | 374 | 0.1662 |
+| 0.2939 | 375 | 0.1881 |
+| 0.2947 | 376 | 0.1409 |
+| 0.2955 | 377 | 0.188 |
+| 0.2963 | 378 | 0.1482 |
+| 0.2971 | 379 | 0.1451 |
+| 0.2979 | 380 | 0.1562 |
+| 0.2986 | 381 | 0.1606 |
+| 0.2994 | 382 | 0.1437 |
+| 0.3002 | 383 | 0.1271 |
+| 0.3010 | 384 | 0.1796 |
+| 0.3018 | 385 | 0.14 |
+| 0.3026 | 386 | 0.1645 |
+| 0.3034 | 387 | 0.1589 |
+| 0.3041 | 388 | 0.1668 |
+| 0.3049 | 389 | 0.1176 |
+| 0.3057 | 390 | 0.1651 |
+| 0.3065 | 391 | 0.1425 |
+| 0.3073 | 392 | 0.194 |
+| 0.3081 | 393 | 0.13 |
+| 0.3088 | 394 | 0.1302 |
+| 0.3096 | 395 | 0.1224 |
+| 0.3104 | 396 | 0.1249 |
+| 0.3112 | 397 | 0.1821 |
+| 0.3120 | 398 | 0.1551 |
+| 0.3128 | 399 | 0.1444 |
+| 0.3135 | 400 | 0.1841 |
+| 0.3143 | 401 | 0.1276 |
+| 0.3151 | 402 | 0.1733 |
+| 0.3159 | 403 | 0.1595 |
+| 0.3167 | 404 | 0.2037 |
+| 0.3175 | 405 | 0.1601 |
+| 0.3182 | 406 | 0.1501 |
+| 0.3190 | 407 | 0.1467 |
+| 0.3198 | 408 | 0.1194 |
+| 0.3206 | 409 | 0.1532 |
+| 0.3214 | 410 | 0.1292 |
+| 0.3222 | 411 | 0.1576 |
+| 0.3229 | 412 | 0.1431 |
+| 0.3237 | 413 | 0.151 |
+| 0.3245 | 414 | 0.1024 |
+| 0.3253 | 415 | 0.1696 |
+| 0.3261 | 416 | 0.129 |
+| 0.3269 | 417 | 0.1934 |
+| 0.3277 | 418 | 0.2072 |
+| 0.3284 | 419 | 0.1387 |
+| 0.3292 | 420 | 0.146 |
+| 0.3300 | 421 | 0.1325 |
+| 0.3308 | 422 | 0.1555 |
+| 0.3316 | 423 | 0.1281 |
+| 0.3324 | 424 | 0.1869 |
+| 0.3331 | 425 | 0.1802 |
+| 0.3339 | 426 | 0.1774 |
+| 0.3347 | 427 | 0.1495 |
+| 0.3355 | 428 | 0.1022 |
+| 0.3363 | 429 | 0.1546 |
+| 0.3371 | 430 | 0.1512 |
+| 0.3378 | 431 | 0.1734 |
+| 0.3386 | 432 | 0.1285 |
+| 0.3394 | 433 | 0.1562 |
+| 0.3402 | 434 | 0.1437 |
+| 0.3410 | 435 | 0.1485 |
+| 0.3418 | 436 | 0.1443 |
+| 0.3425 | 437 | 0.1304 |
+| 0.3433 | 438 | 0.1479 |
+| 0.3441 | 439 | 0.1544 |
+| 0.3449 | 440 | 0.1947 |
+| 0.3457 | 441 | 0.1685 |
+| 0.3465 | 442 | 0.1715 |
+| 0.3472 | 443 | 0.1269 |
+| 0.3480 | 444 | 0.1739 |
+| 0.3488 | 445 | 0.1798 |
+| 0.3496 | 446 | 0.1329 |
+| 0.3504 | 447 | 0.1737 |
+| 0.3512 | 448 | 0.1197 |
+| 0.3519 | 449 | 0.1326 |
+| 0.3527 | 450 | 0.131 |
+| 0.3535 | 451 | 0.1498 |
+| 0.3543 | 452 | 0.1836 |
+| 0.3551 | 453 | 0.115 |
+| 0.3559 | 454 | 0.1766 |
+| 0.3567 | 455 | 0.1289 |
+| 0.3574 | 456 | 0.1359 |
+| 0.3582 | 457 | 0.1245 |
+| 0.3590 | 458 | 0.1793 |
+| 0.3598 | 459 | 0.1615 |
+| 0.3606 | 460 | 0.1122 |
+| 0.3614 | 461 | 0.1767 |
+| 0.3621 | 462 | 0.1464 |
+| 0.3629 | 463 | 0.1377 |
+| 0.3637 | 464 | 0.1341 |
+| 0.3645 | 465 | 0.1511 |
+| 0.3653 | 466 | 0.1444 |
+| 0.3661 | 467 | 0.1407 |
+| 0.3668 | 468 | 0.1602 |
+| 0.3676 | 469 | 0.1352 |
+| 0.3684 | 470 | 0.1203 |
+| 0.3692 | 471 | 0.1367 |
+| 0.3700 | 472 | 0.1554 |
+| 0.3708 | 473 | 0.1006 |
+| 0.3715 | 474 | 0.1499 |
+| 0.3723 | 475 | 0.1324 |
+| 0.3731 | 476 | 0.1654 |
+| 0.3739 | 477 | 0.1509 |
+| 0.3747 | 478 | 0.1237 |
+| 0.3755 | 479 | 0.1298 |
+| 0.3762 | 480 | 0.1403 |
+| 0.3770 | 481 | 0.1314 |
+| 0.3778 | 482 | 0.1704 |
+| 0.3786 | 483 | 0.1285 |
+| 0.3794 | 484 | 0.1896 |
+| 0.3802 | 485 | 0.1358 |
+| 0.3810 | 486 | 0.1065 |
+| 0.3817 | 487 | 0.1382 |
+| 0.3825 | 488 | 0.1372 |
+| 0.3833 | 489 | 0.1215 |
+| 0.3841 | 490 | 0.2131 |
+| 0.3849 | 491 | 0.1512 |
+| 0.3857 | 492 | 0.1323 |
+| 0.3864 | 493 | 0.1398 |
+| 0.3872 | 494 | 0.151 |
+| 0.3880 | 495 | 0.1297 |
+| 0.3888 | 496 | 0.1852 |
+| 0.3896 | 497 | 0.1044 |
+| 0.3904 | 498 | 0.1185 |
+| 0.3911 | 499 | 0.1724 |
+| 0.3919 | 500 | 0.097 |
+| 0.3927 | 501 | 0.1486 |
+| 0.3935 | 502 | 0.1124 |
+| 0.3943 | 503 | 0.1264 |
+| 0.3951 | 504 | 0.0993 |
+| 0.3958 | 505 | 0.1369 |
+| 0.3966 | 506 | 0.1587 |
+| 0.3974 | 507 | 0.1455 |
+| 0.3982 | 508 | 0.1236 |
+| 0.3990 | 509 | 0.1547 |
+| 0.3998 | 510 | 0.1286 |
+| 0.4005 | 511 | 0.1257 |
+| 0.4013 | 512 | 0.1452 |
+| 0.4021 | 513 | 0.1595 |
+| 0.4029 | 514 | 0.1479 |
+| 0.4037 | 515 | 0.166 |
+| 0.4045 | 516 | 0.1623 |
+| 0.4053 | 517 | 0.136 |
+| 0.4060 | 518 | 0.149 |
+| 0.4068 | 519 | 0.1496 |
+| 0.4076 | 520 | 0.1154 |
+| 0.4084 | 521 | 0.1493 |
+| 0.4092 | 522 | 0.113 |
+| 0.4100 | 523 | 0.137 |
+| 0.4107 | 524 | 0.2077 |
+| 0.4115 | 525 | 0.112 |
+| 0.4123 | 526 | 0.1491 |
+| 0.4131 | 527 | 0.1608 |
+| 0.4139 | 528 | 0.1446 |
+| 0.4147 | 529 | 0.1188 |
+| 0.4154 | 530 | 0.137 |
+| 0.4162 | 531 | 0.1072 |
+| 0.4170 | 532 | 0.088 |
+| 0.4178 | 533 | 0.1182 |
+| 0.4186 | 534 | 0.2556 |
+| 0.4194 | 535 | 0.1907 |
+| 0.4201 | 536 | 0.1156 |
+| 0.4209 | 537 | 0.1676 |
+| 0.4217 | 538 | 0.1236 |
+| 0.4225 | 539 | 0.1009 |
+| 0.4233 | 540 | 0.1567 |
+| 0.4241 | 541 | 0.2222 |
+| 0.4248 | 542 | 0.148 |
+| 0.4256 | 543 | 0.1182 |
+| 0.4264 | 544 | 0.1267 |
+| 0.4272 | 545 | 0.127 |
+| 0.4280 | 546 | 0.1372 |
+| 0.4288 | 547 | 0.1299 |
+| 0.4296 | 548 | 0.1711 |
+| 0.4303 | 549 | 0.1608 |
+| 0.4311 | 550 | 0.1278 |
+| 0.4319 | 551 | 0.106 |
+| 0.4327 | 552 | 0.1494 |
+| 0.4335 | 553 | 0.1093 |
+| 0.4343 | 554 | 0.1833 |
+| 0.4350 | 555 | 0.1876 |
+| 0.4358 | 556 | 0.1774 |
+| 0.4366 | 557 | 0.1443 |
+| 0.4374 | 558 | 0.1351 |
+| 0.4382 | 559 | 0.1094 |
+| 0.4390 | 560 | 0.1485 |
+| 0.4397 | 561 | 0.1156 |
+| 0.4405 | 562 | 0.1324 |
+| 0.4413 | 563 | 0.1314 |
+| 0.4421 | 564 | 0.1601 |
+| 0.4429 | 565 | 0.1434 |
+| 0.4437 | 566 | 0.1785 |
+| 0.4444 | 567 | 0.1044 |
+| 0.4452 | 568 | 0.1123 |
+| 0.4460 | 569 | 0.1235 |
+| 0.4468 | 570 | 0.1384 |
+| 0.4476 | 571 | 0.1357 |
+| 0.4484 | 572 | 0.1357 |
+| 0.4491 | 573 | 0.1276 |
+| 0.4499 | 574 | 0.1554 |
+| 0.4507 | 575 | 0.1235 |
+| 0.4515 | 576 | 0.1319 |
+| 0.4523 | 577 | 0.1862 |
+| 0.4531 | 578 | 0.1523 |
+| 0.4539 | 579 | 0.1224 |
+| 0.4546 | 580 | 0.1629 |
+| 0.4554 | 581 | 0.1113 |
+| 0.4562 | 582 | 0.1261 |
+| 0.4570 | 583 | 0.1246 |
+| 0.4578 | 584 | 0.1461 |
+| 0.4586 | 585 | 0.1831 |
+| 0.4593 | 586 | 0.138 |
+| 0.4601 | 587 | 0.1206 |
+| 0.4609 | 588 | 0.1269 |
+| 0.4617 | 589 | 0.1512 |
+| 0.4625 | 590 | 0.1131 |
+| 0.4633 | 591 | 0.1206 |
+| 0.4640 | 592 | 0.1555 |
+| 0.4648 | 593 | 0.1404 |
+| 0.4656 | 594 | 0.101 |
+| 0.4664 | 595 | 0.0881 |
+| 0.4672 | 596 | 0.1793 |
+| 0.4680 | 597 | 0.0995 |
+| 0.4687 | 598 | 0.1369 |
+| 0.4695 | 599 | 0.141 |
+| 0.4703 | 600 | 0.1494 |
+| 0.4711 | 601 | 0.1824 |
+| 0.4719 | 602 | 0.1671 |
+| 0.4727 | 603 | 0.1805 |
+| 0.4734 | 604 | 0.1475 |
+| 0.4742 | 605 | 0.1128 |
+| 0.4750 | 606 | 0.1748 |
+| 0.4758 | 607 | 0.1564 |
+| 0.4766 | 608 | 0.0922 |
+| 0.4774 | 609 | 0.1008 |
+| 0.4782 | 610 | 0.1324 |
+| 0.4789 | 611 | 0.1022 |
+| 0.4797 | 612 | 0.1604 |
+| 0.4805 | 613 | 0.145 |
+| 0.4813 | 614 | 0.1621 |
+| 0.4821 | 615 | 0.15 |
+| 0.4829 | 616 | 0.1092 |
+| 0.4836 | 617 | 0.1239 |
+| 0.4844 | 618 | 0.1352 |
+| 0.4852 | 619 | 0.1098 |
+| 0.4860 | 620 | 0.1341 |
+| 0.4868 | 621 | 0.1538 |
+| 0.4876 | 622 | 0.1146 |
+| 0.4883 | 623 | 0.1498 |
+| 0.4891 | 624 | 0.1358 |
+| 0.4899 | 625 | 0.1571 |
+| 0.4907 | 626 | 0.1508 |
+| 0.4915 | 627 | 0.1424 |
+| 0.4923 | 628 | 0.1731 |
+| 0.4930 | 629 | 0.1398 |
+| 0.4938 | 630 | 0.1234 |
+| 0.4946 | 631 | 0.1409 |
+| 0.4954 | 632 | 0.136 |
+| 0.4962 | 633 | 0.1294 |
+| 0.4970 | 634 | 0.1612 |
+| 0.4977 | 635 | 0.1597 |
+| 0.4985 | 636 | 0.1685 |
+| 0.4993 | 637 | 0.1723 |
+| 0.5001 | 638 | 0.1643 |
+| 0.5009 | 639 | 0.1831 |
+| 0.5017 | 640 | 0.0791 |
+| 0.5024 | 641 | 0.1109 |
+| 0.5032 | 642 | 0.1189 |
+| 0.5040 | 643 | 0.1484 |
+| 0.5048 | 644 | 0.1399 |
+| 0.5056 | 645 | 0.1519 |
+| 0.5064 | 646 | 0.1182 |
+| 0.5072 | 647 | 0.1969 |
+| 0.5079 | 648 | 0.1729 |
+| 0.5087 | 649 | 0.1119 |
+| 0.5095 | 650 | 0.099 |
+| 0.5103 | 651 | 0.1265 |
+| 0.5111 | 652 | 0.1068 |
+| 0.5119 | 653 | 0.173 |
+| 0.5126 | 654 | 0.1059 |
+| 0.5134 | 655 | 0.1622 |
+| 0.5142 | 656 | 0.1787 |
+| 0.5150 | 657 | 0.2004 |
+| 0.5158 | 658 | 0.1282 |
+| 0.5166 | 659 | 0.1218 |
+| 0.5173 | 660 | 0.1457 |
+| 0.5181 | 661 | 0.0966 |
+| 0.5189 | 662 | 0.1101 |
+| 0.5197 | 663 | 0.1581 |
+| 0.5205 | 664 | 0.1162 |
+| 0.5213 | 665 | 0.1724 |
+| 0.5220 | 666 | 0.1455 |
+| 0.5228 | 667 | 0.1586 |
+| 0.5236 | 668 | 0.1283 |
+| 0.5244 | 669 | 0.1475 |
+| 0.5252 | 670 | 0.1136 |
+| 0.5260 | 671 | 0.1461 |
+| 0.5267 | 672 | 0.1789 |
+| 0.5275 | 673 | 0.1617 |
+| 0.5283 | 674 | 0.1344 |
+| 0.5291 | 675 | 0.1603 |
+| 0.5299 | 676 | 0.1529 |
+| 0.5307 | 677 | 0.1135 |
+| 0.5315 | 678 | 0.1312 |
+| 0.5322 | 679 | 0.1493 |
+| 0.5330 | 680 | 0.158 |
+| 0.5338 | 681 | 0.1032 |
+| 0.5346 | 682 | 0.1082 |
+| 0.5354 | 683 | 0.1043 |
+| 0.5362 | 684 | 0.1127 |
+| 0.5369 | 685 | 0.105 |
+| 0.5377 | 686 | 0.1703 |
+| 0.5385 | 687 | 0.1805 |
+| 0.5393 | 688 | 0.1098 |
+| 0.5401 | 689 | 0.1161 |
+| 0.5409 | 690 | 0.107 |
+| 0.5416 | 691 | 0.1619 |
+| 0.5424 | 692 | 0.1076 |
+| 0.5432 | 693 | 0.1248 |
+| 0.5440 | 694 | 0.117 |
+| 0.5448 | 695 | 0.1158 |
+| 0.5456 | 696 | 0.1665 |
+| 0.5463 | 697 | 0.1261 |
+| 0.5471 | 698 | 0.1074 |
+| 0.5479 | 699 | 0.1018 |
+| 0.5487 | 700 | 0.1425 |
+| 0.5495 | 701 | 0.1119 |
+| 0.5503 | 702 | 0.1608 |
+| 0.5510 | 703 | 0.1732 |
+| 0.5518 | 704 | 0.1324 |
+| 0.5526 | 705 | 0.1151 |
+| 0.5534 | 706 | 0.1368 |
+| 0.5542 | 707 | 0.1507 |
+| 0.5550 | 708 | 0.1703 |
+| 0.5558 | 709 | 0.1286 |
+| 0.5565 | 710 | 0.1305 |
+| 0.5573 | 711 | 0.1771 |
+| 0.5581 | 712 | 0.1106 |
+| 0.5589 | 713 | 0.1431 |
+| 0.5597 | 714 | 0.1381 |
+| 0.5605 | 715 | 0.1388 |
+| 0.5612 | 716 | 0.1536 |
+| 0.5620 | 717 | 0.1843 |
+| 0.5628 | 718 | 0.1695 |
+| 0.5636 | 719 | 0.1179 |
+| 0.5644 | 720 | 0.1113 |
+| 0.5652 | 721 | 0.0922 |
+| 0.5659 | 722 | 0.1341 |
+| 0.5667 | 723 | 0.1129 |
+| 0.5675 | 724 | 0.1344 |
+| 0.5683 | 725 | 0.1571 |
+| 0.5691 | 726 | 0.1257 |
+| 0.5699 | 727 | 0.126 |
+| 0.5706 | 728 | 0.1706 |
+| 0.5714 | 729 | 0.1245 |
+| 0.5722 | 730 | 0.1703 |
+| 0.5730 | 731 | 0.1304 |
+| 0.5738 | 732 | 0.1552 |
+| 0.5746 | 733 | 0.1036 |
+| 0.5753 | 734 | 0.1269 |
+| 0.5761 | 735 | 0.1355 |
+| 0.5769 | 736 | 0.1153 |
+| 0.5777 | 737 | 0.0923 |
+| 0.5785 | 738 | 0.1359 |
+| 0.5793 | 739 | 0.1495 |
+| 0.5801 | 740 | 0.1818 |
+| 0.5808 | 741 | 0.1325 |
+| 0.5816 | 742 | 0.1755 |
+| 0.5824 | 743 | 0.1443 |
+| 0.5832 | 744 | 0.1255 |
+| 0.5840 | 745 | 0.1248 |
+| 0.5848 | 746 | 0.1161 |
+| 0.5855 | 747 | 0.1513 |
+| 0.5863 | 748 | 0.1117 |
+| 0.5871 | 749 | 0.156 |
+| 0.5879 | 750 | 0.1238 |
+| 0.5887 | 751 | 0.1318 |
+| 0.5895 | 752 | 0.1406 |
+| 0.5902 | 753 | 0.1065 |
+| 0.5910 | 754 | 0.1227 |
+| 0.5918 | 755 | 0.1444 |
+| 0.5926 | 756 | 0.1059 |
+| 0.5934 | 757 | 0.1307 |
+| 0.5942 | 758 | 0.1253 |
+| 0.5949 | 759 | 0.0993 |
+| 0.5957 | 760 | 0.1243 |
+| 0.5965 | 761 | 0.1326 |
+| 0.5973 | 762 | 0.1638 |
+| 0.5981 | 763 | 0.1423 |
+| 0.5989 | 764 | 0.1804 |
+| 0.5996 | 765 | 0.1176 |
+| 0.6004 | 766 | 0.1022 |
+| 0.6012 | 767 | 0.1451 |
+| 0.6020 | 768 | 0.1497 |
+| 0.6028 | 769 | 0.1407 |
+| 0.6036 | 770 | 0.1235 |
+| 0.6044 | 771 | 0.1017 |
+| 0.6051 | 772 | 0.1705 |
+| 0.6059 | 773 | 0.1385 |
+| 0.6067 | 774 | 0.1194 |
+| 0.6075 | 775 | 0.1029 |
+| 0.6083 | 776 | 0.139 |
+| 0.6091 | 777 | 0.1298 |
+| 0.6098 | 778 | 0.1878 |
+| 0.6106 | 779 | 0.1353 |
+| 0.6114 | 780 | 0.1413 |
+| 0.6122 | 781 | 0.1129 |
+| 0.6130 | 782 | 0.1296 |
+| 0.6138 | 783 | 0.1532 |
+| 0.6145 | 784 | 0.1769 |
+| 0.6153 | 785 | 0.1235 |
+| 0.6161 | 786 | 0.1059 |
+| 0.6169 | 787 | 0.1224 |
+| 0.6177 | 788 | 0.1591 |
+| 0.6185 | 789 | 0.1127 |
+| 0.6192 | 790 | 0.1519 |
+| 0.6200 | 791 | 0.1473 |
+| 0.6208 | 792 | 0.0953 |
+| 0.6216 | 793 | 0.1302 |
+| 0.6224 | 794 | 0.149 |
+| 0.6232 | 795 | 0.1053 |
+| 0.6239 | 796 | 0.1712 |
+| 0.6247 | 797 | 0.1342 |
+| 0.6255 | 798 | 0.1199 |
+| 0.6263 | 799 | 0.1099 |
+| 0.6271 | 800 | 0.1545 |
+| 0.6279 | 801 | 0.1158 |
+| 0.6286 | 802 | 0.1541 |
+| 0.6294 | 803 | 0.1234 |
+| 0.6302 | 804 | 0.1451 |
+| 0.6310 | 805 | 0.1069 |
+| 0.6318 | 806 | 0.1282 |
+| 0.6326 | 807 | 0.1589 |
+| 0.6334 | 808 | 0.1358 |
+| 0.6341 | 809 | 0.1515 |
+| 0.6349 | 810 | 0.1334 |
+| 0.6357 | 811 | 0.1232 |
+| 0.6365 | 812 | 0.1612 |
+| 0.6373 | 813 | 0.1379 |
+| 0.6381 | 814 | 0.1347 |
+| 0.6388 | 815 | 0.1588 |
+| 0.6396 | 816 | 0.1173 |
+| 0.6404 | 817 | 0.1318 |
+| 0.6412 | 818 | 0.1541 |
+| 0.6420 | 819 | 0.1054 |
+| 0.6428 | 820 | 0.1117 |
+| 0.6435 | 821 | 0.1684 |
+| 0.6443 | 822 | 0.1234 |
+| 0.6451 | 823 | 0.1422 |
+| 0.6459 | 824 | 0.0979 |
+| 0.6467 | 825 | 0.1365 |
+| 0.6475 | 826 | 0.1177 |
+| 0.6482 | 827 | 0.1656 |
+| 0.6490 | 828 | 0.1288 |
+| 0.6498 | 829 | 0.1198 |
+| 0.6506 | 830 | 0.1546 |
+| 0.6514 | 831 | 0.1397 |
+| 0.6522 | 832 | 0.1578 |
+| 0.6529 | 833 | 0.1736 |
+| 0.6537 | 834 | 0.1174 |
+| 0.6545 | 835 | 0.1275 |
+| 0.6553 | 836 | 0.0971 |
+| 0.6561 | 837 | 0.1285 |
+| 0.6569 | 838 | 0.1285 |
+| 0.6577 | 839 | 0.1563 |
+| 0.6584 | 840 | 0.155 |
+| 0.6592 | 841 | 0.1398 |
+| 0.6600 | 842 | 0.1465 |
+| 0.6608 | 843 | 0.1201 |
+| 0.6616 | 844 | 0.1278 |
+| 0.6624 | 845 | 0.1155 |
+| 0.6631 | 846 | 0.0946 |
+| 0.6639 | 847 | 0.1152 |
+| 0.6647 | 848 | 0.1191 |
+| 0.6655 | 849 | 0.1175 |
+| 0.6663 | 850 | 0.133 |
+| 0.6671 | 851 | 0.1134 |
+| 0.6678 | 852 | 0.1664 |
+| 0.6686 | 853 | 0.1803 |
+| 0.6694 | 854 | 0.1155 |
+| 0.6702 | 855 | 0.1188 |
+| 0.6710 | 856 | 0.1283 |
+| 0.6718 | 857 | 0.0995 |
+| 0.6725 | 858 | 0.1438 |
+| 0.6733 | 859 | 0.1105 |
+| 0.6741 | 860 | 0.1114 |
+| 0.6749 | 861 | 0.089 |
+| 0.6757 | 862 | 0.1249 |
+| 0.6765 | 863 | 0.1194 |
+| 0.6772 | 864 | 0.1591 |
+| 0.6780 | 865 | 0.128 |
+| 0.6788 | 866 | 0.0787 |
+| 0.6796 | 867 | 0.13 |
+| 0.6804 | 868 | 0.0992 |
+| 0.6812 | 869 | 0.1229 |
+| 0.6820 | 870 | 0.095 |
+| 0.6827 | 871 | 0.1234 |
+| 0.6835 | 872 | 0.1201 |
+| 0.6843 | 873 | 0.1069 |
+| 0.6851 | 874 | 0.1282 |
+| 0.6859 | 875 | 0.1602 |
+| 0.6867 | 876 | 0.1 |
+| 0.6874 | 877 | 0.1437 |
+| 0.6882 | 878 | 0.1167 |
+| 0.6890 | 879 | 0.1841 |
+| 0.6898 | 880 | 0.1011 |
+| 0.6906 | 881 | 0.1264 |
+| 0.6914 | 882 | 0.1249 |
+| 0.6921 | 883 | 0.1261 |
+| 0.6929 | 884 | 0.1608 |
+| 0.6937 | 885 | 0.1398 |
+| 0.6945 | 886 | 0.15 |
+| 0.6953 | 887 | 0.1562 |
+| 0.6961 | 888 | 0.1092 |
+| 0.6968 | 889 | 0.1311 |
+| 0.6976 | 890 | 0.1564 |
+| 0.6984 | 891 | 0.1224 |
+| 0.6992 | 892 | 0.1126 |
+| 0.7000 | 893 | 0.0974 |
+| 0.7008 | 894 | 0.1638 |
+| 0.7015 | 895 | 0.118 |
+| 0.7023 | 896 | 0.1156 |
+| 0.7031 | 897 | 0.1141 |
+| 0.7039 | 898 | 0.1756 |
+| 0.7047 | 899 | 0.1165 |
+| 0.7055 | 900 | 0.142 |
+| 0.7063 | 901 | 0.1705 |
+| 0.7070 | 902 | 0.1311 |
+| 0.7078 | 903 | 0.1045 |
+| 0.7086 | 904 | 0.1034 |
+| 0.7094 | 905 | 0.1205 |
+| 0.7102 | 906 | 0.1448 |
+| 0.7110 | 907 | 0.1318 |
+| 0.7117 | 908 | 0.1369 |
+| 0.7125 | 909 | 0.1427 |
+| 0.7133 | 910 | 0.1218 |
+| 0.7141 | 911 | 0.103 |
+| 0.7149 | 912 | 0.1147 |
+| 0.7157 | 913 | 0.1297 |
+| 0.7164 | 914 | 0.1089 |
+| 0.7172 | 915 | 0.1371 |
+| 0.7180 | 916 | 0.1182 |
+| 0.7188 | 917 | 0.1273 |
+| 0.7196 | 918 | 0.1238 |
+| 0.7204 | 919 | 0.144 |
+| 0.7211 | 920 | 0.0859 |
+| 0.7219 | 921 | 0.0939 |
+| 0.7227 | 922 | 0.0999 |
+| 0.7235 | 923 | 0.1143 |
+| 0.7243 | 924 | 0.1251 |
+| 0.7251 | 925 | 0.107 |
+| 0.7258 | 926 | 0.1077 |
+| 0.7266 | 927 | 0.138 |
+| 0.7274 | 928 | 0.155 |
+| 0.7282 | 929 | 0.0977 |
+| 0.7290 | 930 | 0.1003 |
+| 0.7298 | 931 | 0.1382 |
+| 0.7306 | 932 | 0.1006 |
+| 0.7313 | 933 | 0.1027 |
+| 0.7321 | 934 | 0.1124 |
+| 0.7329 | 935 | 0.1813 |
+| 0.7337 | 936 | 0.1159 |
+| 0.7345 | 937 | 0.0791 |
+| 0.7353 | 938 | 0.1435 |
+| 0.7360 | 939 | 0.1288 |
+| 0.7368 | 940 | 0.1078 |
+| 0.7376 | 941 | 0.127 |
+| 0.7384 | 942 | 0.1211 |
+| 0.7392 | 943 | 0.1442 |
+| 0.7400 | 944 | 0.1668 |
+| 0.7407 | 945 | 0.1679 |
+| 0.7415 | 946 | 0.1168 |
+| 0.7423 | 947 | 0.1626 |
+| 0.7431 | 948 | 0.1538 |
+| 0.7439 | 949 | 0.0938 |
+| 0.7447 | 950 | 0.1657 |
+| 0.7454 | 951 | 0.1303 |
+| 0.7462 | 952 | 0.098 |
+| 0.7470 | 953 | 0.1014 |
+| 0.7478 | 954 | 0.1153 |
+| 0.7486 | 955 | 0.1192 |
+| 0.7494 | 956 | 0.1418 |
+| 0.7501 | 957 | 0.1206 |
+| 0.7509 | 958 | 0.109 |
+| 0.7517 | 959 | 0.1 |
+| 0.7525 | 960 | 0.115 |
+| 0.7533 | 961 | 0.1099 |
+| 0.7541 | 962 | 0.1252 |
+| 0.7549 | 963 | 0.0938 |
+| 0.7556 | 964 | 0.1704 |
+| 0.7564 | 965 | 0.1313 |
+| 0.7572 | 966 | 0.1342 |
+| 0.7580 | 967 | 0.1648 |
+| 0.7588 | 968 | 0.107 |
+| 0.7596 | 969 | 0.1177 |
+| 0.7603 | 970 | 0.1528 |
+| 0.7611 | 971 | 0.1577 |
+| 0.7619 | 972 | 0.1109 |
+| 0.7627 | 973 | 0.1336 |
+| 0.7635 | 974 | 0.1544 |
+| 0.7643 | 975 | 0.1304 |
+| 0.7650 | 976 | 0.1083 |
+| 0.7658 | 977 | 0.1017 |
+| 0.7666 | 978 | 0.1492 |
+| 0.7674 | 979 | 0.0846 |
+| 0.7682 | 980 | 0.1179 |
+| 0.7690 | 981 | 0.1634 |
+| 0.7697 | 982 | 0.0893 |
+| 0.7705 | 983 | 0.1357 |
+| 0.7713 | 984 | 0.1757 |
+| 0.7721 | 985 | 0.1112 |
+| 0.7729 | 986 | 0.1258 |
+| 0.7737 | 987 | 0.123 |
+| 0.7744 | 988 | 0.1354 |
+| 0.7752 | 989 | 0.0855 |
+| 0.7760 | 990 | 0.1167 |
+| 0.7768 | 991 | 0.1131 |
+| 0.7776 | 992 | 0.1222 |
+| 0.7784 | 993 | 0.1447 |
+| 0.7791 | 994 | 0.1122 |
+| 0.7799 | 995 | 0.1508 |
+| 0.7807 | 996 | 0.1484 |
+| 0.7815 | 997 | 0.0985 |
+| 0.7823 | 998 | 0.1686 |
+| 0.7831 | 999 | 0.1509 |
+| 0.7839 | 1000 | 0.1356 |
+| 0.7846 | 1001 | 0.1114 |
+| 0.7854 | 1002 | 0.1098 |
+| 0.7862 | 1003 | 0.1643 |
+| 0.7870 | 1004 | 0.1784 |
+| 0.7878 | 1005 | 0.1038 |
+| 0.7886 | 1006 | 0.1362 |
+| 0.7893 | 1007 | 0.1289 |
+| 0.7901 | 1008 | 0.1188 |
+| 0.7909 | 1009 | 0.1065 |
+| 0.7917 | 1010 | 0.1195 |
+| 0.7925 | 1011 | 0.1142 |
+| 0.7933 | 1012 | 0.0801 |
+| 0.7940 | 1013 | 0.1427 |
+| 0.7948 | 1014 | 0.2034 |
+| 0.7956 | 1015 | 0.1508 |
+| 0.7964 | 1016 | 0.0888 |
+| 0.7972 | 1017 | 0.0847 |
+| 0.7980 | 1018 | 0.1007 |
+| 0.7987 | 1019 | 0.1122 |
+| 0.7995 | 1020 | 0.1215 |
+| 0.8003 | 1021 | 0.1529 |
+| 0.8011 | 1022 | 0.1095 |
+| 0.8019 | 1023 | 0.1364 |
+| 0.8027 | 1024 | 0.0978 |
+| 0.8034 | 1025 | 0.1606 |
+| 0.8042 | 1026 | 0.1131 |
+| 0.8050 | 1027 | 0.0861 |
+| 0.8058 | 1028 | 0.1523 |
+| 0.8066 | 1029 | 0.1444 |
+| 0.8074 | 1030 | 0.1255 |
+| 0.8082 | 1031 | 0.1418 |
+| 0.8089 | 1032 | 0.1007 |
+| 0.8097 | 1033 | 0.1042 |
+| 0.8105 | 1034 | 0.1423 |
+| 0.8113 | 1035 | 0.1137 |
+| 0.8121 | 1036 | 0.1314 |
+| 0.8129 | 1037 | 0.1572 |
+| 0.8136 | 1038 | 0.1188 |
+| 0.8144 | 1039 | 0.0916 |
+| 0.8152 | 1040 | 0.1043 |
+| 0.8160 | 1041 | 0.1333 |
+| 0.8168 | 1042 | 0.1299 |
+| 0.8176 | 1043 | 0.1404 |
+| 0.8183 | 1044 | 0.1209 |
+| 0.8191 | 1045 | 0.0973 |
+| 0.8199 | 1046 | 0.1359 |
+| 0.8207 | 1047 | 0.1194 |
+| 0.8215 | 1048 | 0.2011 |
+| 0.8223 | 1049 | 0.1306 |
+| 0.8230 | 1050 | 0.1073 |
+| 0.8238 | 1051 | 0.1154 |
+| 0.8246 | 1052 | 0.1224 |
+| 0.8254 | 1053 | 0.1045 |
+| 0.8262 | 1054 | 0.1067 |
+| 0.8270 | 1055 | 0.1086 |
+| 0.8277 | 1056 | 0.0923 |
+| 0.8285 | 1057 | 0.1228 |
+| 0.8293 | 1058 | 0.1474 |
+| 0.8301 | 1059 | 0.0949 |
+| 0.8309 | 1060 | 0.1259 |
+| 0.8317 | 1061 | 0.1152 |
+| 0.8325 | 1062 | 0.0937 |
+| 0.8332 | 1063 | 0.1602 |
+| 0.8340 | 1064 | 0.1165 |
+| 0.8348 | 1065 | 0.1036 |
+| 0.8356 | 1066 | 0.1665 |
+| 0.8364 | 1067 | 0.1163 |
+| 0.8372 | 1068 | 0.1124 |
+| 0.8379 | 1069 | 0.1093 |
+| 0.8387 | 1070 | 0.1015 |
+| 0.8395 | 1071 | 0.1602 |
+| 0.8403 | 1072 | 0.0913 |
+| 0.8411 | 1073 | 0.1327 |
+| 0.8419 | 1074 | 0.1149 |
+| 0.8426 | 1075 | 0.1137 |
+| 0.8434 | 1076 | 0.1197 |
+| 0.8442 | 1077 | 0.1335 |
+| 0.8450 | 1078 | 0.1366 |
+| 0.8458 | 1079 | 0.1265 |
+| 0.8466 | 1080 | 0.0921 |
+| 0.8473 | 1081 | 0.1339 |
+| 0.8481 | 1082 | 0.1155 |
+| 0.8489 | 1083 | 0.103 |
+| 0.8497 | 1084 | 0.1302 |
+| 0.8505 | 1085 | 0.1311 |
+| 0.8513 | 1086 | 0.1275 |
+| 0.8520 | 1087 | 0.1585 |
+| 0.8528 | 1088 | 0.0961 |
+| 0.8536 | 1089 | 0.1222 |
+| 0.8544 | 1090 | 0.0887 |
+| 0.8552 | 1091 | 0.1599 |
+| 0.8560 | 1092 | 0.0909 |
+| 0.8568 | 1093 | 0.1566 |
+| 0.8575 | 1094 | 0.1201 |
+| 0.8583 | 1095 | 0.0786 |
+| 0.8591 | 1096 | 0.1383 |
+| 0.8599 | 1097 | 0.1593 |
+| 0.8607 | 1098 | 0.1582 |
+| 0.8615 | 1099 | 0.1474 |
+| 0.8622 | 1100 | 0.0924 |
+| 0.8630 | 1101 | 0.1379 |
+| 0.8638 | 1102 | 0.1324 |
+| 0.8646 | 1103 | 0.1139 |
+| 0.8654 | 1104 | 0.0941 |
+| 0.8662 | 1105 | 0.1107 |
+| 0.8669 | 1106 | 0.1183 |
+| 0.8677 | 1107 | 0.1024 |
+| 0.8685 | 1108 | 0.1346 |
+| 0.8693 | 1109 | 0.131 |
+| 0.8701 | 1110 | 0.1244 |
+| 0.8709 | 1111 | 0.1423 |
+| 0.8716 | 1112 | 0.1604 |
+| 0.8724 | 1113 | 0.146 |
+| 0.8732 | 1114 | 0.1398 |
+| 0.8740 | 1115 | 0.1393 |
+| 0.8748 | 1116 | 0.1643 |
+| 0.8756 | 1117 | 0.1006 |
+| 0.8763 | 1118 | 0.0956 |
+| 0.8771 | 1119 | 0.1304 |
+| 0.8779 | 1120 | 0.1151 |
+| 0.8787 | 1121 | 0.161 |
+| 0.8795 | 1122 | 0.0871 |
+| 0.8803 | 1123 | 0.1028 |
+| 0.8811 | 1124 | 0.1715 |
+| 0.8818 | 1125 | 0.1674 |
+| 0.8826 | 1126 | 0.1073 |
+| 0.8834 | 1127 | 0.0867 |
+| 0.8842 | 1128 | 0.1117 |
+| 0.8850 | 1129 | 0.1333 |
+| 0.8858 | 1130 | 0.126 |
+| 0.8865 | 1131 | 0.0853 |
+| 0.8873 | 1132 | 0.1152 |
+| 0.8881 | 1133 | 0.1467 |
+| 0.8889 | 1134 | 0.1643 |
+| 0.8897 | 1135 | 0.1117 |
+| 0.8905 | 1136 | 0.0909 |
+| 0.8912 | 1137 | 0.1645 |
+| 0.8920 | 1138 | 0.1359 |
+| 0.8928 | 1139 | 0.1204 |
+| 0.8936 | 1140 | 0.1574 |
+| 0.8944 | 1141 | 0.1187 |
+| 0.8952 | 1142 | 0.1588 |
+| 0.8959 | 1143 | 0.1419 |
+| 0.8967 | 1144 | 0.1109 |
+| 0.8975 | 1145 | 0.1048 |
+| 0.8983 | 1146 | 0.1232 |
+| 0.8991 | 1147 | 0.1159 |
+| 0.8999 | 1148 | 0.1442 |
+| 0.9006 | 1149 | 0.1345 |
+| 0.9014 | 1150 | 0.0893 |
+| 0.9022 | 1151 | 0.1033 |
+| 0.9030 | 1152 | 0.1133 |
+| 0.9038 | 1153 | 0.2009 |
+| 0.9046 | 1154 | 0.1669 |
+| 0.9053 | 1155 | 0.1095 |
+| 0.9061 | 1156 | 0.1099 |
+| 0.9069 | 1157 | 0.0893 |
+| 0.9077 | 1158 | 0.137 |
+| 0.9085 | 1159 | 0.1346 |
+| 0.9093 | 1160 | 0.1135 |
+| 0.9101 | 1161 | 0.1003 |
+| 0.9108 | 1162 | 0.1224 |
+| 0.9116 | 1163 | 0.098 |
+| 0.9124 | 1164 | 0.1353 |
+| 0.9132 | 1165 | 0.1481 |
+| 0.9140 | 1166 | 0.1168 |
+| 0.9148 | 1167 | 0.0794 |
+| 0.9155 | 1168 | 0.0979 |
+| 0.9163 | 1169 | 0.1093 |
+| 0.9171 | 1170 | 0.1022 |
+| 0.9179 | 1171 | 0.1498 |
+| 0.9187 | 1172 | 0.1596 |
+| 0.9195 | 1173 | 0.1657 |
+| 0.9202 | 1174 | 0.1195 |
+| 0.9210 | 1175 | 0.1278 |
+| 0.9218 | 1176 | 0.1307 |
+| 0.9226 | 1177 | 0.1071 |
+| 0.9234 | 1178 | 0.0969 |
+| 0.9242 | 1179 | 0.1192 |
+| 0.9249 | 1180 | 0.1166 |
+| 0.9257 | 1181 | 0.1221 |
+| 0.9265 | 1182 | 0.1179 |
+| 0.9273 | 1183 | 0.1414 |
+| 0.9281 | 1184 | 0.1247 |
+| 0.9289 | 1185 | 0.1148 |
+| 0.9296 | 1186 | 0.1211 |
+| 0.9304 | 1187 | 0.1373 |
+| 0.9312 | 1188 | 0.1105 |
+| 0.9320 | 1189 | 0.0911 |
+| 0.9328 | 1190 | 0.1205 |
+| 0.9336 | 1191 | 0.1479 |
+| 0.9344 | 1192 | 0.115 |
+| 0.9351 | 1193 | 0.0951 |
+| 0.9359 | 1194 | 0.1501 |
+| 0.9367 | 1195 | 0.1069 |
+| 0.9375 | 1196 | 0.1091 |
+| 0.9383 | 1197 | 0.0988 |
+| 0.9391 | 1198 | 0.1278 |
+| 0.9398 | 1199 | 0.1221 |
+| 0.9406 | 1200 | 0.1418 |
+| 0.9414 | 1201 | 0.1354 |
+| 0.9422 | 1202 | 0.1435 |
+| 0.9430 | 1203 | 0.101 |
+| 0.9438 | 1204 | 0.1119 |
+| 0.9445 | 1205 | 0.1566 |
+| 0.9453 | 1206 | 0.1238 |
+| 0.9461 | 1207 | 0.1008 |
+| 0.9469 | 1208 | 0.1126 |
+| 0.9477 | 1209 | 0.0897 |
+| 0.9485 | 1210 | 0.1486 |
+| 0.9492 | 1211 | 0.0976 |
+| 0.9500 | 1212 | 0.124 |
+| 0.9508 | 1213 | 0.1034 |
+| 0.9516 | 1214 | 0.1229 |
1556
+ | 0.9524 | 1215 | 0.1301 |
1557
+ | 0.9532 | 1216 | 0.1363 |
1558
+ | 0.9539 | 1217 | 0.1161 |
1559
+ | 0.9547 | 1218 | 0.1199 |
1560
+ | 0.9555 | 1219 | 0.0815 |
1561
+ | 0.9563 | 1220 | 0.1034 |
1562
+ | 0.9571 | 1221 | 0.1554 |
1563
+ | 0.9579 | 1222 | 0.1266 |
1564
+ | 0.9587 | 1223 | 0.1153 |
1565
+ | 0.9594 | 1224 | 0.1129 |
1566
+ | 0.9602 | 1225 | 0.1228 |
1567
+ | 0.9610 | 1226 | 0.1268 |
1568
+ | 0.9618 | 1227 | 0.1515 |
1569
+ | 0.9626 | 1228 | 0.0885 |
1570
+ | 0.9634 | 1229 | 0.1142 |
1571
+ | 0.9641 | 1230 | 0.187 |
1572
+ | 0.9649 | 1231 | 0.0836 |
1573
+ | 0.9657 | 1232 | 0.0967 |
1574
+ | 0.9665 | 1233 | 0.1516 |
1575
+ | 0.9673 | 1234 | 0.0581 |
1576
+ | 0.9681 | 1235 | 0.0847 |
1577
+ | 0.9688 | 1236 | 0.1105 |
1578
+ | 0.9696 | 1237 | 0.0958 |
1579
+ | 0.9704 | 1238 | 0.1238 |
1580
+ | 0.9712 | 1239 | 0.1076 |
1581
+ | 0.9720 | 1240 | 0.1137 |
1582
+ | 0.9728 | 1241 | 0.1236 |
1583
+ | 0.9735 | 1242 | 0.129 |
1584
+ | 0.9743 | 1243 | 0.1113 |
1585
+ | 0.9751 | 1244 | 0.1466 |
1586
+ | 0.9759 | 1245 | 0.1593 |
1587
+ | 0.9767 | 1246 | 0.1151 |
1588
+ | 0.9775 | 1247 | 0.153 |
1589
+ | 0.9782 | 1248 | 0.1564 |
1590
+ | 0.9790 | 1249 | 0.1208 |
1591
+ | 0.9798 | 1250 | 0.0925 |
1592
+ | 0.9806 | 1251 | 0.1146 |
1593
+ | 0.9814 | 1252 | 0.1043 |
1594
+ | 0.9822 | 1253 | 0.0926 |
1595
+ | 0.9830 | 1254 | 0.1442 |
1596
+ | 0.9837 | 1255 | 0.134 |
1597
+ | 0.9845 | 1256 | 0.0841 |
1598
+ | 0.9853 | 1257 | 0.1256 |
1599
+ | 0.9861 | 1258 | 0.12 |
1600
+ | 0.9869 | 1259 | 0.0815 |
1601
+ | 0.9877 | 1260 | 0.1298 |
1602
+ | 0.9884 | 1261 | 0.1569 |
1603
+ | 0.9892 | 1262 | 0.1296 |
1604
+ | 0.9900 | 1263 | 0.1418 |
1605
+ | 0.9908 | 1264 | 0.1204 |
1606
+ | 0.9916 | 1265 | 0.1207 |
1607
+ | 0.9924 | 1266 | 0.1116 |
1608
+ | 0.9931 | 1267 | 0.0807 |
1609
+ | 0.9939 | 1268 | 0.1082 |
1610
+ | 0.9947 | 1269 | 0.1213 |
1611
+ | 0.9955 | 1270 | 0.1156 |
1612
+ | 0.9963 | 1271 | 0.1517 |
1613
+ | 0.9971 | 1272 | 0.1238 |
1614
+ | 0.9978 | 1273 | 0.1313 |
1615
+ | 0.9986 | 1274 | 0.131 |
1616
+ | 0.9994 | 1275 | 0.1584 |
1617
+
1618
+ </details>
1619
+
1620
+ ### Framework Versions
1621
+ - Python: 3.10.12
1622
+ - Sentence Transformers: 3.2.1
1623
+ - Transformers: 4.44.2
1624
+ - PyTorch: 2.3.1+cu121
1625
+ - Accelerate: 1.1.1
1626
+ - Datasets: 2.21.0
1627
+ - Tokenizers: 0.19.1
1628
+
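+ The model ships custom `configuration.py`/`modeling.py` (wired up via `auto_map` in `config.json` below), so loading requires `trust_remote_code=True`. A minimal usage sketch under the framework versions above; the Hub repo id is an assumption, substitute the actual one:
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ # Assumed repo id -- replace with the actual model id on the Hub.
+ model = SentenceTransformer("seongil-dn/gte-base-250k-answerableHN", trust_remote_code=True)
+ embeddings = model.encode(["example query", "example passage"])
+ print(embeddings.shape)  # (2, 768), matching hidden_size=768 in config.json
+ ```
+ 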
1629
+ ## Citation
1630
+
1631
+ ### BibTeX
1632
+
1633
+ #### Sentence Transformers
1634
+ ```bibtex
1635
+ @inproceedings{reimers-2019-sentence-bert,
1636
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
1637
+ author = "Reimers, Nils and Gurevych, Iryna",
1638
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
1639
+ month = "11",
1640
+ year = "2019",
1641
+ publisher = "Association for Computational Linguistics",
1642
+ url = "https://arxiv.org/abs/1908.10084",
1643
+ }
1644
+ ```
1645
+
1646
+ #### MultipleNegativesRankingLoss
1647
+ ```bibtex
1648
+ @misc{henderson2017efficient,
1649
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
1650
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
1651
+ year={2017},
1652
+ eprint={1705.00652},
1653
+ archivePrefix={arXiv},
1654
+ primaryClass={cs.CL}
1655
+ }
1656
+ ```
1657
+
1658
+ <!--
1659
+ ## Glossary
1660
+
1661
+ *Clearly define terms in order to be accessible across audiences.*
1662
+ -->
1663
+
1664
+ <!--
1665
+ ## Model Card Authors
1666
+
1667
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
1668
+ -->
1669
+
1670
+ <!--
1671
+ ## Model Card Contact
1672
+
1673
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
1674
+ -->
config.json ADDED
@@ -0,0 +1,50 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "/root/models/gte-base-250k-answerableHN/checkpoint-1275",
3
+ "architectures": [
4
+ "NewModel"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration.NewConfig",
9
+ "AutoModel": "modeling.NewModel",
10
+ "AutoModelForMaskedLM": "Alibaba-NLP/new-impl--modeling.NewForMaskedLM",
11
+ "AutoModelForMultipleChoice": "Alibaba-NLP/new-impl--modeling.NewForMultipleChoice",
12
+ "AutoModelForQuestionAnswering": "Alibaba-NLP/new-impl--modeling.NewForQuestionAnswering",
13
+ "AutoModelForSequenceClassification": "Alibaba-NLP/new-impl--modeling.NewForSequenceClassification",
14
+ "AutoModelForTokenClassification": "Alibaba-NLP/new-impl--modeling.NewForTokenClassification"
15
+ },
16
+ "classifier_dropout": 0.0,
17
+ "hidden_act": "gelu",
18
+ "hidden_dropout_prob": 0.1,
19
+ "hidden_size": 768,
20
+ "id2label": {
21
+ "0": "LABEL_0"
22
+ },
23
+ "initializer_range": 0.02,
24
+ "intermediate_size": 3072,
25
+ "label2id": {
26
+ "LABEL_0": 0
27
+ },
28
+ "layer_norm_eps": 1e-12,
29
+ "layer_norm_type": "layer_norm",
30
+ "logn_attention_clip1": false,
31
+ "logn_attention_scale": false,
32
+ "max_position_embeddings": 8192,
33
+ "model_type": "new",
34
+ "num_attention_heads": 12,
35
+ "num_hidden_layers": 12,
36
+ "pack_qkv": true,
37
+ "pad_token_id": 1,
38
+ "position_embedding_type": "rope",
39
+ "rope_scaling": {
40
+ "factor": 8.0,
41
+ "type": "ntk"
42
+ },
43
+ "rope_theta": 20000,
44
+ "torch_dtype": "float32",
45
+ "transformers_version": "4.44.2",
46
+ "type_vocab_size": 1,
47
+ "unpad_inputs": false,
48
+ "use_memory_efficient_attention": false,
49
+ "vocab_size": 250048
50
+ }
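
A note on the configuration above: with `"rope_scaling": {"type": "ntk", "factor": 8.0}`, the bundled `NTKScalingRotaryEmbedding` (see `modeling.py` below) builds its rotary cos/sin cache for `max_position_embeddings * factor` positions. A minimal sketch, assuming the bundled `configuration.py` is importable from the working directory:

```python
from configuration import NewConfig  # shipped with this repo

config = NewConfig(
    vocab_size=250048,
    max_position_embeddings=8192,
    rope_theta=20000,
    rope_scaling={"type": "ntk", "factor": 8.0},
)
# Fixed NTK scaling extends the rotary cache to 8192 * 8 = 65536 positions.
print(int(config.max_position_embeddings * config.rope_scaling["factor"]))  # 65536
```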
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "3.2.1",
4
+ "transformers": "4.44.2",
5
+ "pytorch": "2.3.1+cu121"
6
+ },
7
+ "prompts": {},
8
+ "default_prompt_name": null,
9
+ "similarity_fn_name": null
10
+ }
configuration.py ADDED
@@ -0,0 +1,145 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ NEW model configuration"""
17
+ from transformers.configuration_utils import PretrainedConfig
18
+ from transformers.utils import logging
19
+
20
+ logger = logging.get_logger(__name__)
21
+
22
+
23
+ class NewConfig(PretrainedConfig):
24
+ r"""
25
+ This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
26
+ instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
27
+ configuration with the defaults will yield a similar configuration to that of the NEW
28
+ [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.
29
+
30
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
31
+ documentation from [`PretrainedConfig`] for more information.
32
+
33
+
34
+ Args:
35
+ vocab_size (`int`, *optional*, defaults to 30528):
36
+ Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
37
+ `input_ids` passed when calling [`NewModel`] or [`TFNewModel`].
38
+ hidden_size (`int`, *optional*, defaults to 768):
39
+ Dimensionality of the encoder layers and the pooler layer.
40
+ num_hidden_layers (`int`, *optional*, defaults to 12):
41
+ Number of hidden layers in the Transformer encoder.
42
+ num_attention_heads (`int`, *optional*, defaults to 12):
43
+ Number of attention heads for each attention layer in the Transformer encoder.
44
+ intermediate_size (`int`, *optional*, defaults to 3072):
45
+ Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
46
+ hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
47
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
48
+ `"relu"`, `"silu"` and `"gelu_new"` are supported.
49
+ hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
50
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
51
+ attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
52
+ The dropout ratio for the attention probabilities.
53
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
54
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
55
+ just in case (e.g., 512 or 1024 or 2048).
56
+ type_vocab_size (`int`, *optional*, defaults to 1):
57
+ The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
58
+ initializer_range (`float`, *optional*, defaults to 0.02):
59
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
60
+ layer_norm_eps (`float`, *optional*, defaults to 1e-12):
61
+ The epsilon used by the layer normalization layers.
62
+ position_embedding_type (`str`, *optional*, defaults to `"rope"`):
63
+ Type of position embedding. Choose one of `"absolute"`, `"rope"`.
64
+ rope_theta (`float`, *optional*, defaults to 10000.0):
65
+ The base period of the RoPE embeddings.
66
+ rope_scaling (`Dict`, *optional*):
67
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
68
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
69
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
70
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
71
+ these scaling strategies behave:
72
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
73
+ experimental feature, subject to breaking API changes in future versions.
74
+ classifier_dropout (`float`, *optional*):
75
+ The dropout ratio for the classification head.
76
+
77
+ Examples:
78
+
79
+ ```python
80
+ >>> from transformers import NewConfig, NewModel
81
+
82
+ >>> # Initializing a NEW izhx/new-base-en style configuration
83
+ >>> configuration = NewConfig()
84
+
85
+ >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
86
+ >>> model = NewModel(configuration)
87
+
88
+ >>> # Accessing the model configuration
89
+ >>> configuration = model.config
90
+ ```"""
91
+
92
+ model_type = "new"
93
+
94
+ def __init__(
95
+ self,
96
+ vocab_size=30528,
97
+ hidden_size=768,
98
+ num_hidden_layers=12,
99
+ num_attention_heads=12,
100
+ intermediate_size=3072,
101
+ hidden_act="gelu",
102
+ hidden_dropout_prob=0.1,
103
+ attention_probs_dropout_prob=0.0,
104
+ max_position_embeddings=2048,
105
+ type_vocab_size=1,
106
+ initializer_range=0.02,
107
+ layer_norm_type='layer_norm',
108
+ layer_norm_eps=1e-12,
109
+ # pad_token_id=0,
110
+ position_embedding_type="rope",
111
+ rope_theta=10000.0,
112
+ rope_scaling=None,
113
+ classifier_dropout=None,
114
+ pack_qkv=True,
115
+ unpad_inputs=False,
116
+ use_memory_efficient_attention=False,
117
+ logn_attention_scale=False,
118
+ logn_attention_clip1=False,
119
+ **kwargs,
120
+ ):
121
+ super().__init__(**kwargs)
122
+
123
+ self.vocab_size = vocab_size
124
+ self.hidden_size = hidden_size
125
+ self.num_hidden_layers = num_hidden_layers
126
+ self.num_attention_heads = num_attention_heads
127
+ self.hidden_act = hidden_act
128
+ self.intermediate_size = intermediate_size
129
+ self.hidden_dropout_prob = hidden_dropout_prob
130
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
131
+ self.max_position_embeddings = max_position_embeddings
132
+ self.type_vocab_size = type_vocab_size
133
+ self.initializer_range = initializer_range
134
+ self.layer_norm_type = layer_norm_type
135
+ self.layer_norm_eps = layer_norm_eps
136
+ self.position_embedding_type = position_embedding_type
137
+ self.rope_theta = rope_theta
138
+ self.rope_scaling = rope_scaling
139
+ self.classifier_dropout = classifier_dropout
140
+
141
+ self.pack_qkv = pack_qkv
142
+ self.unpad_inputs = unpad_inputs
143
+ self.use_memory_efficient_attention = use_memory_efficient_attention
144
+ self.logn_attention_scale = logn_attention_scale
145
+ self.logn_attention_clip1 = logn_attention_clip1
model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:89fbfee7d253e4980ae8ec279072e5ca562eb385ee29914ded1e270b873c7642
3
+ size 1221487872
modeling.py ADDED
@@ -0,0 +1,1418 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """PyTorch NEW model."""
17
+
18
+ import math
19
+ from dataclasses import dataclass
20
+ from typing import List, Optional, Tuple, Union
21
+
22
+ import torch
23
+ import torch.utils.checkpoint
24
+ from torch import nn
25
+
26
+ from transformers.activations import ACT2FN
27
+ from transformers.modeling_outputs import (
28
+ BaseModelOutput,
29
+ BaseModelOutputWithPooling,
30
+ MaskedLMOutput,
31
+ MultipleChoiceModelOutput,
32
+ QuestionAnsweringModelOutput,
33
+ SequenceClassifierOutput,
34
+ ModelOutput,
35
+ )
36
+ from transformers.modeling_utils import PreTrainedModel
37
+ from transformers.utils import logging
38
+
39
+ try:
40
+ import xformers.ops as xops
41
+ except ImportError:
42
+ xops = None
43
+
44
+ from .configuration import NewConfig
45
+
46
+
47
+ logger = logging.get_logger(__name__)
48
+
49
+
50
+ # Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
51
+ # Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
52
+ class IndexFirstAxis(torch.autograd.Function):
53
+ @staticmethod
54
+ def forward(ctx, input, indices):
55
+ ctx.save_for_backward(indices)
56
+ assert input.ndim >= 2
57
+ ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
58
+ second_dim = other_shape.numel()
59
+ # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
60
+ # return input[indices]
61
+ # return torch.gather(
62
+ # rearrange(input, "b ... -> b (...)"), 0, repeat(indices, "z -> z d", d=second_dim)
63
+ # ).reshape(-1, *other_shape)
64
+ return torch.gather(
65
+ input.view(ctx.first_axis_dim, second_dim),
66
+ 0,
67
+ indices.unsqueeze(-1).expand(indices.size(0), second_dim)
68
+ ).reshape(-1, *other_shape)
69
+
70
+ @staticmethod
71
+ def backward(ctx, grad_output):
72
+ (indices,) = ctx.saved_tensors
73
+ assert grad_output.ndim >= 2
74
+ other_shape = grad_output.shape[1:]
75
+ # grad_output = rearrange(grad_output, "b ... -> b (...)")
76
+ grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
77
+ grad_input = torch.zeros(
78
+ [ctx.first_axis_dim, grad_output.shape[1]],
79
+ device=grad_output.device,
80
+ dtype=grad_output.dtype,
81
+ )
82
+ # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
83
+ # grad_input[indices] = grad_output
84
+ # grad_input.scatter_(0, repeat(indices, "z -> z d", d=grad_output.shape[1]), grad_output)
85
+ grad_input.scatter_(
86
+ 0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
87
+ )
88
+ return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
89
+
90
+
91
+ index_first_axis = IndexFirstAxis.apply
92
+
93
+
94
+ def unpad_input(hidden_states, attention_mask=None, indices=None):
95
+ """
96
+ Arguments:
97
+ hidden_states: (batch, seqlen, ...)
98
+ attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
99
+ indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
100
+ Return:
101
+ hidden_states: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
102
+ """
103
+ if indices is None:
104
+ assert attention_mask is not None
105
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
106
+
107
+ # TD [2022-03-04] We don't want to index with a bool mask, because Pytorch will expand the
108
+ # bool mask, then call nonzero to get the indices, then index with those. The indices are @dim
109
+ # times larger than they need to be, wasting memory. It's faster and more memory-efficient to
110
+ # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
111
+ # so we write custom forward and backward to make it a bit faster.
112
+ hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
113
+ return index_first_axis(hidden_states, indices)
114
+
115
+
116
+ class IndexPutFirstAxis(torch.autograd.Function):
117
+ @staticmethod
118
+ def forward(
119
+ ctx,
120
+ values: torch.Tensor,
121
+ indices: torch.Tensor,
122
+ first_axis_dim
123
+ ) -> torch.Tensor:
124
+ ctx.save_for_backward(indices)
125
+ assert indices.ndim == 1
126
+ assert values.ndim >= 2
127
+ output = torch.zeros(
128
+ first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
129
+ )
130
+ output[indices] = values
131
+ return output
132
+
133
+ @staticmethod
134
+ def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
135
+ indices, = ctx.saved_tensors
136
+ grad_values = grad_output[indices]
137
+ return grad_values, None, None
138
+
139
+
140
+ index_put_first_axis = IndexPutFirstAxis.apply
141
+
142
+
143
+ def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
144
+ """Add padding to sequences.
145
+
146
+ Arguments:
147
+ inputs: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
148
+ indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
149
+ batch: int batch_size
150
+ seqlen: int max sequence length
151
+
152
+ Returns:
153
+ inputs: (batch, seqlen, ...)
154
+ """
155
+ output = index_put_first_axis(inputs, indices, batch * seqlen)
156
+ return output.view(batch, seqlen, *inputs.shape[1:])
157
+
158
+
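+ # Illustrative round-trip of the two helpers above (tensor shapes are example assumptions):
+ #   mask = torch.tensor([[1, 1, 0], [1, 1, 1]])                 # (batch=2, seqlen=3)
+ #   hidden = torch.randn(2, 3, 4)                               # (batch, seqlen, dim)
+ #   indices = torch.nonzero(mask.flatten(), as_tuple=False).flatten()
+ #   packed = unpad_input(hidden, attention_mask=mask)           # (5, 4): valid tokens only
+ #   restored = pad_input(packed, indices, batch=2, seqlen=3)    # (2, 3, 4), zeros at pads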
159
+ def rotate_half(x):
160
+ """Rotates half the hidden dims of the input."""
161
+ x1 = x[..., : x.shape[-1] // 2]
162
+ x2 = x[..., x.shape[-1] // 2 :]
163
+ return torch.cat((-x2, x1), dim=-1)
164
+
165
+
166
+ def apply_rotary_pos_emb(q, k, cos, sin):
167
+ """Applies Rotary Position Embedding to the query and key tensors.
168
+
169
+ Args:
170
+ q (`torch.Tensor`): The query tensor.
171
+ k (`torch.Tensor`): The key tensor.
172
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
173
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
174
+ Returns:
175
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
176
+ """
177
+ cos, sin = cos.to(q.dtype), sin.to(q.dtype)
178
+ q_embed = (q * cos) + (rotate_half(q) * sin)
179
+ k_embed = (k * cos) + (rotate_half(k) * sin)
180
+ return q_embed, k_embed
181
+
182
+
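+ # In effect, each pair (x_i, x_{i + d/2}) at position m is rotated by the angle m * theta_i:
+ #   x_i         -> x_i * cos(m * theta_i) - x_{i + d/2} * sin(m * theta_i)
+ #   x_{i + d/2} -> x_{i + d/2} * cos(m * theta_i) + x_i * sin(m * theta_i)
+ # so dot products between rotated queries and keys depend only on relative position.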
183
+ class RotaryEmbedding(torch.nn.Module):
184
+ def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
185
+ super().__init__()
186
+
187
+ self.dim = dim
188
+ self.max_position_embeddings = max_position_embeddings
189
+ self.base = base
190
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
191
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
192
+
193
+ # Build here to make `torch.jit.trace` work.
194
+ self._set_cos_sin_cache(
195
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
196
+ )
197
+
198
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
199
+ self.max_seq_len_cached = seq_len
200
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
201
+
202
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
203
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
204
+ emb = torch.cat((freqs, freqs), dim=-1)
205
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
206
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
207
+
208
+ def forward(self, x, seq_len=None):
209
+ # x: [bs, num_attention_heads, seq_len, head_size]
210
+ if seq_len > self.max_seq_len_cached:
211
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
212
+
213
+ return (
214
+ self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
215
+ self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
216
+ )
217
+
218
+
219
+ class NTKScalingRotaryEmbedding(RotaryEmbedding):
220
+ """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
221
+
222
+ def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
223
+ self.scaling_factor = scaling_factor
224
+ self.mixed_b = mixed_b
225
+ super().__init__(dim, max_position_embeddings, base, device)
226
+ max_position_embeddings = max_position_embeddings * self.scaling_factor
227
+ self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
228
+
229
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
230
+ self.max_seq_len_cached = seq_len
231
+
232
+ if seq_len > self.max_position_embeddings:
233
+ base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
234
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
235
+
236
+ if self.mixed_b is None:
237
+ inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim) # (6)
238
+ else:
239
+ a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b # (13)
240
+ lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp() # (12)
241
+ inv_freq = inv_freq / lambda_1_m # (10)
242
+
243
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
244
+
245
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
246
+
247
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
248
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
249
+ emb = torch.cat((freqs, freqs), dim=-1)
250
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
251
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
252
+
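+ # The inline markers (6), (10), (12), (13) reference equations in
+ # https://kexue.fm/archives/9706: fixed NTK (mixed_b=None) rescales the base by the
+ # factor and then divides every inverse frequency by scaling_factor ** (2 / dim), while
+ # mixed NTK applies a per-dimension factor lambda so lower frequencies are stretched more.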
253
+
254
+ class RMSNorm(nn.Module):
255
+ def __init__(self, hidden_size, eps=1e-6):
256
+ """
257
+ RMSNorm is equivalent to T5LayerNorm
258
+ """
259
+ super().__init__()
260
+ self.weight = nn.Parameter(torch.ones(hidden_size))
261
+ self.variance_epsilon = eps
262
+
263
+ def forward(self, hidden_states):
264
+ input_dtype = hidden_states.dtype
265
+ hidden_states = hidden_states.to(torch.float32)
266
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
267
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
268
+ return self.weight * hidden_states.to(input_dtype)
269
+
270
+
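+ # RMSNorm in one line: y = weight * x / sqrt(mean(x**2) + eps). Unlike nn.LayerNorm it
+ # subtracts no mean and adds no bias, only a learned per-channel scale.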
271
+ LAYER_NORM = {
272
+ 'layer_norm': nn.LayerNorm,
273
+ 'rms_norm': RMSNorm
274
+ }
275
+
276
+
277
+ class NewEmbeddings(nn.Module):
278
+ """
279
+ Embedding and Unpadding.
280
+ """
281
+
282
+ def __init__(self, config: NewConfig):
283
+ super().__init__()
284
+ self.padding_idx = config.pad_token_id
285
+ self.word_embeddings = nn.Embedding(
286
+ config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
287
+ )
288
+
289
+ self.position_embedding_type = config.position_embedding_type
290
+ if self.position_embedding_type == 'absolute':
291
+ self.position_embeddings = nn.Embedding(
292
+ config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
293
+ )
294
+ elif self.position_embedding_type == 'rope':
295
+ self._init_rope(config)
296
+ else:
297
+ raise ValueError(f"Unknown position embedding type: {self.position_embedding_type}")
298
+
299
+ self.type_vocab_size = config.type_vocab_size
300
+ if self.type_vocab_size > 0:
301
+ self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
302
+
303
+ # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
304
+ # any TensorFlow checkpoint file
305
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
306
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
307
+ # position_ids is contiguous in memory and excluded when serialized
308
+ self.register_buffer(
309
+ "position_ids", torch.arange(config.max_position_embeddings), persistent=False
310
+ )
311
+
312
+ def _init_rope(self, config):
313
+ kwargs = dict(
314
+ dim=int(config.hidden_size / config.num_attention_heads),
315
+ max_position_embeddings=config.max_position_embeddings,
316
+ base=config.rope_theta
317
+ )
318
+ if config.rope_scaling is None:
319
+ self.rotary_emb = RotaryEmbedding(**kwargs)
320
+ else:
321
+ kwargs.update(scaling_factor=config.rope_scaling["factor"])
322
+ scaling_type = config.rope_scaling["type"]
323
+ if scaling_type == 'ntk':
324
+ kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
325
+ self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
326
+ # elif scaling_type == "linear":
327
+ # self.rotary_emb = LinearScalingRotaryEmbedding(**kwargs)
328
+ # elif scaling_type == "dynamic":
329
+ # self.rotary_emb = DynamicNTKScalingRotaryEmbedding(**kwargs)
330
+ else:
331
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
332
+
333
+ def forward(
334
+ self,
335
+ unpad_inputs: bool,
336
+ input_ids: Optional[torch.Tensor] = None,
337
+ attention_mask: Optional[torch.Tensor] = None,
338
+ length: Optional[List[int]] = None,
339
+ token_type_ids: Optional[torch.Tensor] = None,
340
+ position_ids: Optional[torch.Tensor] = None,
341
+ inputs_embeds: Optional[torch.Tensor] = None,
342
+ ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
343
+ """
344
+ """
345
+ if inputs_embeds is None:
346
+ device, input_shape = input_ids.device, input_ids.shape
347
+ else:
348
+ device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
349
+ batch_size, seq_length = input_shape
350
+
351
+ # Set attention_mask if it's None
352
+ if attention_mask is None:
353
+ attention_mask = torch.ones(input_shape, device=device)
354
+ if length is not None:
355
+ for i, l in enumerate(length):
356
+ attention_mask[i, l:] = 0
357
+
358
+ # Set attention_mask_bool for unpadding
359
+ if unpad_inputs:
360
+ attention_mask_bool = attention_mask.bool()
361
+ if length is None:
362
+ length = attention_mask.sum(-1).tolist()
363
+
364
+ # Get word embeddings
365
+ if inputs_embeds is None:
366
+ if unpad_inputs:
367
+ input_ids = input_ids[attention_mask_bool].unsqueeze(0)
368
+ inputs_embeds = self.word_embeddings(input_ids)
369
+ else:
370
+ if unpad_inputs:
371
+ inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
372
+ embeddings = inputs_embeds
373
+
374
+ # Set and unpad position_ids
375
+ if position_ids is None:
376
+ if seq_length > self.position_ids.size(0):
377
+ self.register_buffer(
378
+ "position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
379
+ )
380
+ if unpad_inputs:
381
+ # [1, cumsum_seq_len]
382
+ position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
383
+ else:
384
+ # [bs, seq_len]
385
+ position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
386
+ elif unpad_inputs:
387
+ position_ids = position_ids[attention_mask_bool].unsqueeze(0) # [1, cumsum_seq_len]
388
+
389
+ # Compute rotary embedding
390
+ if self.position_embedding_type == 'rope':
391
+ rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
392
+ rope_cos = rope_cos[position_ids].unsqueeze(2) # [bs, seq_len, 1, dim]
393
+ rope_sin = rope_sin[position_ids].unsqueeze(2) # [bs, seq_len, 1, dim]
394
+ rope_embeds = rope_cos, rope_sin
395
+ else:
396
+ rope_embeds = None
397
+
398
+ if self.type_vocab_size > 0:
399
+ if token_type_ids is None:
400
+ token_type_ids = position_ids.mul(0)
401
+ else:
402
+ if self.type_vocab_size < 2:
403
+ token_type_ids.mul_(0)
404
+ if unpad_inputs:
405
+ token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
406
+
407
+ token_type_embeddings = self.token_type_embeddings(token_type_ids)
408
+ embeddings = embeddings + token_type_embeddings
409
+
410
+ # BERT position
411
+ if self.position_embedding_type == "absolute":
412
+ position_embeddings = self.position_embeddings(position_ids)
413
+ embeddings = embeddings + position_embeddings
414
+
415
+ embeddings = self.LayerNorm(embeddings)
416
+ embeddings = self.dropout(embeddings)
417
+
418
+ return embeddings, attention_mask, rope_embeds, length
419
+
420
+
421
+ class NewAttention(nn.Module):
422
+ def __init__(self, config: NewConfig, pack_qkv=None, use_memory_efficient_attention=None):
423
+ super().__init__()
424
+ self.config = config
425
+ if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
426
+ raise ValueError(
427
+ f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
428
+ f"heads ({config.num_attention_heads})"
429
+ )
430
+
431
+ self.hidden_size = config.hidden_size
432
+ self.num_attention_heads = config.num_attention_heads
433
+ self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
434
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
435
+
436
+ if pack_qkv is None:
437
+ pack_qkv = config.pack_qkv
438
+ self.pack_qkv = pack_qkv
439
+
440
+ if self.pack_qkv:
441
+ self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
442
+ else:
443
+ self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
444
+ self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
445
+ self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
446
+
447
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
448
+ self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
449
+
450
+ if use_memory_efficient_attention is None:
451
+ use_memory_efficient_attention = self.config.use_memory_efficient_attention
452
+ self.use_memory_efficient_attention = use_memory_efficient_attention
453
+ self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
454
+ if self.use_memory_efficient_attention:
455
+ assert self.memory_efficient_attention is not None, 'please install xformers'
456
+
457
+ def forward(
458
+ self,
459
+ hidden_states: torch.Tensor,
460
+ attention_bias: torch.FloatTensor,
461
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
462
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
463
+ attention_scale: Optional[torch.FloatTensor] = None,
464
+ head_mask: Optional[torch.FloatTensor] = None,
465
+ output_attentions: Optional[bool] = False,
466
+ qkv_inputs: Optional[Tuple] = None, # For RetroMAE
467
+ ) -> Tuple[torch.Tensor, ...]:
468
+ shape_hd = (self.num_attention_heads, self.attention_head_size)
469
+ # qkv
470
+ if self.pack_qkv and qkv_inputs is None:
471
+ qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
472
+ else:
473
+ if qkv_inputs is None:
474
+ qkv_inputs = (hidden_states, hidden_states, hidden_states)
475
+ qkv_pack = [
476
+ getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
477
+ ]
478
+ query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
479
+
480
+ if self.config.position_embedding_type == 'rope':
481
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
482
+
483
+ dtype = query_states.dtype
484
+
485
+ if self.config.logn_attention_scale and attention_scale is not None:
486
+ # https://kexue.fm/archives/8823
487
+ query_states = query_states * attention_scale.to(dtype)
488
+
489
+ if padding_inputs is not None:
490
+ query_states = pad_input(query_states.squeeze(), *padding_inputs)
491
+ key_states = pad_input(key_states.squeeze(), *padding_inputs)
492
+ value_states = pad_input(value_states.squeeze(), *padding_inputs)
493
+
494
+ if self.use_memory_efficient_attention:
495
+ assert self.memory_efficient_attention is not None, "xformers is not loaded"
496
+ assert output_attentions is False, "memory_efficient_attention does not output attentions"
497
+ assert head_mask is None, "head_mask is not supported yet"
498
+ attention_probs = None
499
+ if torch.is_tensor(attention_bias):
500
+ attention_bias = attention_bias.to(dtype)
501
+ context_layer = self.memory_efficient_attention(
502
+ query_states,
503
+ key_states,
504
+ value_states,
505
+ attn_bias=attention_bias,
506
+ p=self.dropout.p
507
+ )
508
+ else:
509
+ if output_attentions and isinstance(self, NewSdpaAttention):
510
+ raise RuntimeError("SDPA does not output attentions")
511
+ context_layer, attention_probs = self._attention(
512
+ query_states, key_states, value_states, attention_bias, head_mask
513
+ )
514
+
515
+ if padding_inputs is not None:
516
+ context_layer = unpad_input(context_layer, indices=padding_inputs[0])
517
+
518
+ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
519
+ context_layer = context_layer.view(new_context_layer_shape)
520
+
521
+ # output proj
522
+ attn_output = self.o_proj(context_layer)
523
+
524
+ # add attentions if we output them
525
+ outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
526
+ return outputs
527
+
528
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
529
+ """
530
+ Args:
531
+ q/k/v: (B, L, n_head, head_dim),
532
+ Returns:
533
+ attn_output: (B, L, n_head, head_dim)
534
+ """
535
+ query_states = query_states.transpose(1, 2)
536
+ key_states = key_states.transpose(1, 2)
537
+ value_states = value_states.transpose(1, 2)
538
+ # Take the dot product between "query" and "key" to get the raw attention scores.
539
+ attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
540
+
541
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
542
+ if attention_bias is not None:
543
+ # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
544
+ attention_scores = attention_scores + attention_bias
545
+
546
+ # Normalize the attention scores to probabilities.
547
+ attention_probs = nn.functional.softmax(attention_scores, dim=-1)
548
+
549
+ # This is actually dropping out entire tokens to attend to, which might
550
+ # seem a bit unusual, but is taken from the original Transformer paper.
551
+ if self.dropout.p > 0:
552
+ attention_probs = self.dropout(attention_probs)
553
+
554
+ # Mask heads if we want to
555
+ if head_mask is not None:
556
+ attention_probs = attention_probs * head_mask
557
+
558
+ context_layer = torch.matmul(attention_probs, value_states)
559
+
560
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
561
+ return context_layer, attention_probs
562
+
563
+
564
+ class NewSdpaAttention(NewAttention):
565
+ """
566
+ New attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
567
+ `NewAttention` as the weights of the module stay untouched. The only changes are in the forward pass to adapt to the
568
+ SDPA API.
569
+ """
570
+ def __init__(self, config: NewConfig, **kwargs):
571
+ super().__init__(config, **kwargs)
572
+ # torch.backends.cuda.enable_mem_efficient_sdp(False)
573
+ # logger.warning(
574
+ # "Disable memory efficient attention kernel for `NewSdpaAttention`, you can set "
575
+ # "`use_memory_efficient_attention=True` if it expected to use."
576
+ # )
577
+
578
+ def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
579
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
580
+ query_states.transpose(1, 2),
581
+ key_states.transpose(1, 2),
582
+ value_states.transpose(1, 2),
583
+ attn_mask=attention_bias,
584
+ dropout_p=self.dropout.p if self.training else 0.0,
585
+ )
586
+ attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
587
+ return attn_output, None
588
+
589
+
590
+ NEW_ATTENTION_CLASSES = {
591
+ "eager": NewAttention,
592
+ # "flash_attention_2": , # TODO
593
+ "sdpa": NewSdpaAttention,
594
+ }
595
+
596
+
597
+ class NewGatedMLP(nn.Module):
598
+ """
599
+ GLU Variants Improve Transformer.
600
+ """
601
+
602
+ def __init__(self, config: NewConfig):
603
+ super().__init__()
604
+ self.intermediate_size = config.intermediate_size
605
+ self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
606
+ self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
607
+ self.act_fn = ACT2FN[config.hidden_act]
608
+ if config.hidden_dropout_prob > 0:
609
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
610
+ else:
611
+ self.hidden_dropout = None
612
+
613
+ def forward(self, hidden_states):
614
+ up_gate = self.up_gate_proj(hidden_states)
615
+ up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
616
+ gate = self.act_fn(gate)
617
+ gated_states = gate * up_states
618
+ if self.hidden_dropout is not None:
619
+ gated_states = self.hidden_dropout(gated_states)
620
+ down_states = self.down_proj(gated_states)
621
+ return down_states
622
+
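+ # GLU-variant MLP in one line: y = W_down(act(gate) * up), where `up` and `gate` are
+ # the two halves of the fused up_gate_proj output (cf. "GLU Variants Improve Transformer").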
623
+
624
+ class NewLayer(nn.Module):
625
+ def __init__(
626
+ self,
627
+ config: NewConfig,
628
+ pack_qkv=None,
629
+ use_memory_efficient_attention=None,
630
+ attn_implementation=None
631
+ ):
632
+ super().__init__()
633
+ if attn_implementation is None:
634
+ attn_implementation = config._attn_implementation
635
+ if use_memory_efficient_attention is None:
636
+ use_memory_efficient_attention = config.use_memory_efficient_attention
637
+ if use_memory_efficient_attention:
638
+ if attn_implementation != 'eager':
639
+ logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
640
+ attn_implementation = 'eager' # Since it will be SDPA by default for torch>=2.1.1
641
+ self.attention = NEW_ATTENTION_CLASSES[attn_implementation](
642
+ config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
643
+ )
644
+ self.mlp = NewGatedMLP(config)
645
+
646
+ ln_class = LAYER_NORM[config.layer_norm_type]
647
+ self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
648
+ self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
649
+
650
+ if config.hidden_dropout_prob > 0:
651
+ self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
652
+ else:
653
+ self.hidden_dropout = None
654
+
655
+ def forward(
656
+ self,
657
+ hidden_states: torch.Tensor,
658
+ attention_bias: torch.FloatTensor,
659
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
660
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
661
+ attention_scale: Optional[torch.FloatTensor] = None,
662
+ subset_indices: Optional[torch.LongTensor] = None,
663
+ head_mask: Optional[torch.FloatTensor] = None,
664
+ output_attentions: Optional[bool] = False,
665
+ qkv_inputs: Optional[Tuple] = None, # For RetroMAE
666
+ ) -> Tuple[torch.Tensor, ...]:
667
+ # Multi head self attention
668
+ residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
669
+ attention_outputs = self.attention(
670
+ hidden_states,
671
+ attention_bias,
672
+ rope_embeds,
673
+ padding_inputs,
674
+ attention_scale,
675
+ head_mask,
676
+ output_attentions=output_attentions,
677
+ qkv_inputs=qkv_inputs,
678
+ )
679
+ hidden_states = attention_outputs[0]
680
+ if self.hidden_dropout is not None:
681
+ hidden_states = self.hidden_dropout(hidden_states)
682
+ hidden_states = residual + hidden_states
683
+
684
+ # In pretraining, after the attention of last layer, we only need the masked tokens.
685
+ if subset_indices is not None:
686
+ hidden_states = hidden_states[subset_indices]
687
+
688
+ hidden_states = self.attn_ln(hidden_states)
689
+
690
+ # Fully Connected
691
+ residual = hidden_states
692
+ hidden_states = self.mlp(hidden_states)
693
+ if self.hidden_dropout is not None:
694
+ hidden_states = self.hidden_dropout(hidden_states)
695
+ hidden_states = residual + hidden_states
696
+ hidden_states = self.mlp_ln(hidden_states)
697
+
698
+ # add self attentions if we output attention weights
699
+ outputs = (hidden_states,) + attention_outputs[1:]
700
+ return outputs
701
+
702
+
703
+ class NewEncoder(nn.Module):
704
+ def __init__(self, config):
705
+ super().__init__()
706
+ self.config = config
707
+ self.layer = nn.ModuleList([NewLayer(config) for _ in range(config.num_hidden_layers)])
708
+ self.gradient_checkpointing = False
709
+
710
+ def forward(
711
+ self,
712
+ hidden_states: torch.Tensor,
713
+ attention_bias: Optional[torch.FloatTensor] = None,
714
+ rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
715
+ padding_inputs: Optional[Tuple] = None, # indices, batch, seqlen
716
+ attention_scale: Optional[torch.FloatTensor] = None,
717
+ subset_indices: Optional[torch.LongTensor] = None,
718
+ head_mask: Optional[torch.FloatTensor] = None,
719
+ output_attentions: Optional[bool] = False,
720
+ output_hidden_states: Optional[bool] = False,
721
+ return_dict: Optional[bool] = True,
722
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
723
+ all_hidden_states = () if output_hidden_states else None
724
+ all_self_attentions = () if output_attentions else None
725
+
726
+ for i, layer_module in enumerate(self.layer):
727
+ if output_hidden_states:
728
+ all_hidden_states = all_hidden_states + (hidden_states,)
729
+
730
+ if i >= len(self.layer) - 1:
731
+ layer_subset_indices = subset_indices
732
+ else:
733
+ layer_subset_indices = None
734
+
735
+ layer_head_mask = head_mask[i] if head_mask is not None else None
736
+
737
+ if self.gradient_checkpointing and self.training:
738
+ layer_outputs = self._gradient_checkpointing_func(
739
+ layer_module.__call__,
740
+ hidden_states,
741
+ attention_bias,
742
+ rope_embeds,
743
+ padding_inputs,
744
+ attention_scale,
745
+ layer_subset_indices,
746
+ layer_head_mask,
747
+ )
748
+ else:
749
+ layer_outputs = layer_module(
750
+ hidden_states,
751
+ attention_bias,
752
+ rope_embeds,
753
+ padding_inputs,
754
+ attention_scale,
755
+ layer_subset_indices,
756
+ layer_head_mask,
757
+ output_attentions,
758
+ )
759
+
760
+ hidden_states = layer_outputs[0]
761
+ if output_attentions:
762
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
763
+
764
+ if output_hidden_states:
765
+ all_hidden_states = all_hidden_states + (hidden_states,)
766
+
767
+ if not return_dict:
768
+ return tuple(
769
+ v
770
+ for v in [
771
+ hidden_states,
772
+ all_hidden_states,
773
+ all_self_attentions,
774
+ ]
775
+ if v is not None
776
+ )
777
+ return BaseModelOutput(
778
+ last_hidden_state=hidden_states,
779
+ hidden_states=all_hidden_states,
780
+ attentions=all_self_attentions,
781
+ )
782
+
783
+
784
+ # Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->New
785
+ class NewPooler(nn.Module):
786
+ def __init__(self, config):
787
+ super().__init__()
788
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
789
+ self.activation = nn.Tanh()
790
+
791
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
792
+ # We "pool" the model by simply taking the hidden state corresponding
793
+ # to the first token.
794
+ first_token_tensor = hidden_states[:, 0]
795
+ pooled_output = self.dense(first_token_tensor)
796
+ pooled_output = self.activation(pooled_output)
797
+ return pooled_output
798
+
799
+
800
+ class NewPreTrainedModel(PreTrainedModel):
801
+ """
802
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
803
+ models.
804
+ """
805
+
806
+ config_class = NewConfig
807
+ base_model_prefix = "new"
808
+ supports_gradient_checkpointing = True
809
+ _supports_sdpa = True
810
+
811
+ def _init_weights(self, module):
812
+ """Initialize the weights"""
813
+ if isinstance(module, nn.Linear):
814
+ # Slightly different from the TF version which uses truncated_normal for initialization
815
+ # cf https://github.com/pytorch/pytorch/pull/5617
816
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
817
+ if module.bias is not None:
818
+ module.bias.data.zero_()
819
+ elif isinstance(module, nn.Embedding):
820
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
821
+ if module.padding_idx is not None:
822
+ module.weight.data[module.padding_idx].zero_()
823
+ elif isinstance(module, nn.LayerNorm):
824
+ module.bias.data.zero_()
825
+ module.weight.data.fill_(1.0)
826
+
827
+
828
+ class NewModel(NewPreTrainedModel):
829
+ """
830
+ The bare New Model transformer outputting raw hidden-states without any specific head on top.
831
+ """
832
+
833
+ def __init__(self, config: NewConfig, add_pooling_layer=False):
834
+ super().__init__(config)
835
+ self.config = config
836
+
837
+ self.embeddings = NewEmbeddings(config)
838
+ self.encoder = NewEncoder(config)
839
+
840
+ self.pooler = NewPooler(config) if add_pooling_layer else None
841
+
842
+ # Initialize weights and apply final processing
843
+ self.post_init()
844
+
845
+ def get_input_embeddings(self):
846
+ return self.embeddings.word_embeddings
847
+
848
+ def set_input_embeddings(self, value):
849
+ self.embeddings.word_embeddings = value
850
+
851
+ def forward(
852
+ self,
853
+ input_ids: Optional[torch.Tensor] = None,
854
+ attention_mask: Optional[torch.Tensor] = None,
855
+ length: Optional[List[int]] = None,
856
+ subset_indices: Optional[torch.LongTensor] = None,
857
+ token_type_ids: Optional[torch.Tensor] = None,
858
+ position_ids: Optional[torch.Tensor] = None,
859
+ head_mask: Optional[torch.Tensor] = None,
860
+ inputs_embeds: Optional[torch.Tensor] = None,
861
+ output_attentions: Optional[bool] = None,
862
+ output_hidden_states: Optional[bool] = None,
863
+ return_dict: Optional[bool] = None,
864
+ unpad_inputs: Optional[bool] = None,
865
+ ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
866
+ r"""
867
+ length (`list` of length `batch_size`, *optional*):
868
+ If `None`, the padded `last_hidden_state` is returned.
869
+ subset_indices (`torch.LongTensor`, *optional*):
870
+ Indices of tokens to keep after the last encoder layer (used in pretraining to return only the masked tokens).
871
+ unpad_inputs (`bool`, *optional*):
872
+ Whether to strip padding tokens before the encoder and re-pad the output; defaults to `config.unpad_inputs`.
873
+ """
874
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
875
+ output_hidden_states = (
876
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
877
+ )
878
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
879
+ unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
880
+ output_padded = length is None
881
+
882
+ if input_ids is not None and inputs_embeds is not None:
883
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
884
+ elif input_ids is not None:
885
+ self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
886
+ input_shape = input_ids.size()
887
+ elif inputs_embeds is not None:
888
+ input_shape = inputs_embeds.size()[:-1]
889
+ else:
890
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
891
+
892
+ # TODO: not used
893
+ # # Prepare head mask if needed
894
+ # # 1.0 in head_mask indicate we keep the head
895
+ # # attention_probs has shape bsz x n_heads x N x N
896
+ # # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
897
+ # # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
898
+ # head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
899
+
900
+ # Get embeddings, may unpad them
901
+ (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
902
+ unpad_inputs,
903
+ input_ids=input_ids,
904
+ attention_mask=attention_mask,
905
+ length=length,
906
+ token_type_ids=token_type_ids,
907
+ position_ids=position_ids,
908
+ inputs_embeds=inputs_embeds
909
+ )
910
+
911
+ batch_size, seq_length = input_shape
912
+ if unpad_inputs and self.config.use_memory_efficient_attention:
913
+ attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
914
+ else:
915
+ # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
916
+ # ourselves in which case we just need to make it broadcastable to all heads.
917
+ attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
918
+ if self.config.use_memory_efficient_attention:
919
+ # xformers' memory-efficient attention expects a bias of shape (bs, n_heads, seq_len, seq_len), e.g. (48, 12, 512, 512), so expand the broadcastable [bs, 1, 1, seq_len] mask.
920
+ attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
921
+
922
+ padding_inputs = None
923
+ if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
924
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
925
+ if not self.config.use_memory_efficient_attention:
926
+ padding_inputs = (indices, *input_shape)
927
+
928
+ attention_scale = None
929
+ if self.config.logn_attention_scale:
930
+ logger.warning_once("TODO: logn_attention_scale")
931
+ # # attention scale log_512(input_len)
932
+ # attention_scale = attention_mask.sum(1).log() / torch.tensor(self.config.max_position_embeddings).log()
933
+ # # inference-time logn scale need clip 1
934
+ # if self.config.logn_attention_clip1:
935
+ # attention_scale.clip_(1)
936
+ # attention_scale = attention_scale[:, None, None, None]
937
+ # else:
938
+ # attention_scale = None
939
+
940
+ encoder_outputs = self.encoder(
941
+ embedding_output,
942
+ attention_bias=attention_bias,
943
+ rope_embeds=rope_embeds,
944
+ padding_inputs=padding_inputs,
945
+ attention_scale=attention_scale,
946
+ subset_indices=subset_indices,
947
+ head_mask=head_mask,
948
+ output_attentions=output_attentions,
949
+ output_hidden_states=output_hidden_states,
950
+ return_dict=return_dict,
951
+ )
952
+ sequence_output = encoder_outputs[0]
953
+ if unpad_inputs and output_padded:
954
+ sequence_output = pad_input(
955
+ sequence_output.squeeze(), indices, batch_size, seq_length
956
+ )
957
+
958
+ pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
959
+
960
+ if not return_dict:
961
+ return (sequence_output, pooled_output) + encoder_outputs[1:]
962
+
963
+ return BaseModelOutputWithPooling(
964
+ last_hidden_state=sequence_output,
965
+ pooler_output=pooled_output,
966
+ hidden_states=encoder_outputs.hidden_states,
967
+ attentions=encoder_outputs.attentions,
968
+ )
969
+
970
+
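A minimal sketch of how this `forward` is typically driven (the repo id below is a placeholder, and `trust_remote_code=True` is assumed because `NewModel` ships as repository code):

import torch
from transformers import AutoModel, AutoTokenizer

repo = "some-org/some-new-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)

batch = tokenizer(
    ["a short sentence", "a somewhat longer second sentence"],
    padding=True,
    return_tensors="pt",
)
with torch.no_grad():
    # length is None here, so output_padded is True and a padded
    # last_hidden_state comes back even if the encoder unpads internally.
    out = model(**batch)
print(out.last_hidden_state.shape)  # (2, seq_len, hidden_size)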
+ class NewLMPredictionHead(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+         self.transform_act_fn = ACT2FN[config.hidden_act]
+         self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+
+         # The output weights are the same as the input embeddings, but there is
+         # an output-only bias for each token.
+         self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
+
+     def forward(self, hidden_states):
+         hidden_states = self.dense(hidden_states)
+         hidden_states = self.transform_act_fn(hidden_states)
+         hidden_states = self.norm(hidden_states)
+         hidden_states = self.decoder(hidden_states)
+         return hidden_states
+
+
+ class NewForMaskedLM(NewPreTrainedModel):
+     _tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
+
+     def __init__(self, config: NewConfig):
+         super().__init__(config)
+         self.new = NewModel(config, add_pooling_layer=False)
+         self.lm_head = NewLMPredictionHead(config)
+         self.loss_fct = nn.CrossEntropyLoss()
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_output_embeddings(self):
+         return self.lm_head.decoder
+
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head.decoder = new_embeddings
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
+         r"""
+         labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Labels for computing the masked language modeling loss. Indices should be in
+             `[-100, 0, ..., config.vocab_size]` (see the `input_ids` docstring). Tokens with
+             indices set to `-100` are ignored (masked); the loss is only computed for tokens
+             with labels in `[0, ..., config.vocab_size]`.
+         """
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         if labels is None or not self.new.config.unpad_inputs:
+             length = None
+             subset_indices = None
+         else:
+             length = attention_mask.sum(-1).tolist()
+             labels = labels[attention_mask.bool()].unsqueeze(0)
+             subset_indices = labels > -100
+
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             length=length,
+             subset_indices=subset_indices,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+
+         sequence_output = outputs[0]
+         prediction_scores = self.lm_head(sequence_output)
+
+         masked_lm_loss = None
+         if labels is not None:
+             if subset_indices is None:
+                 mask = attention_mask.bool()
+                 prediction_scores = prediction_scores[mask]
+                 labels = labels[mask]
+             else:
+                 labels = labels[subset_indices]
+             masked_lm_loss = self.loss_fct(prediction_scores, labels)
+
+         if not return_dict:
+             output = (prediction_scores,) + outputs[2:]
+             return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
+
+         return MaskedLMOutput(
+             loss=masked_lm_loss,
+             logits=prediction_scores,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
+
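When `config.unpad_inputs` is set, the branch above flattens the labels and passes `subset_indices` so the head only scores masked positions. A hedged sketch of the loss path (placeholder repo id; position 4 is an arbitrary token to mask):

from transformers import AutoModelForMaskedLM, AutoTokenizer

repo = "some-org/some-new-model"  # placeholder
tok = AutoTokenizer.from_pretrained(repo)
mlm = AutoModelForMaskedLM.from_pretrained(repo, trust_remote_code=True)

enc = tok("Paris is the capital of France.", return_tensors="pt")
labels = enc.input_ids.clone()
masked_ids = enc.input_ids.clone()
masked_ids[0, 4] = tok.mask_token_id            # mask one arbitrary position
labels[masked_ids != tok.mask_token_id] = -100  # loss only on the masked token
out = mlm(input_ids=masked_ids, attention_mask=enc.attention_mask, labels=labels)
print(out.loss)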
1079
+ class NewForSequenceClassification(NewPreTrainedModel):
1080
+ def __init__(self, config):
1081
+ super().__init__(config)
1082
+ self.num_labels = config.num_labels
1083
+ self.config = config
1084
+
1085
+ self.new = NewModel(config, add_pooling_layer=True)
1086
+ classifier_dropout = (
1087
+ config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
1088
+ )
1089
+ self.dropout = nn.Dropout(classifier_dropout)
1090
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels)
1091
+
1092
+ # Initialize weights and apply final processing
1093
+ self.post_init()
1094
+
1095
+ def forward(
1096
+ self,
1097
+ input_ids: Optional[torch.Tensor] = None,
1098
+ attention_mask: Optional[torch.Tensor] = None,
1099
+ token_type_ids: Optional[torch.Tensor] = None,
1100
+ position_ids: Optional[torch.Tensor] = None,
1101
+ head_mask: Optional[torch.Tensor] = None,
1102
+ inputs_embeds: Optional[torch.Tensor] = None,
1103
+ labels: Optional[torch.Tensor] = None,
1104
+ output_attentions: Optional[bool] = None,
1105
+ output_hidden_states: Optional[bool] = None,
1106
+ return_dict: Optional[bool] = None,
1107
+ unpad_inputs: Optional[bool] = None,
1108
+ ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
1109
+ r"""
1110
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1111
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1112
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1113
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1114
+ """
1115
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1116
+
1117
+ outputs = self.new(
1118
+ input_ids,
1119
+ attention_mask=attention_mask,
1120
+ token_type_ids=token_type_ids,
1121
+ position_ids=position_ids,
1122
+ head_mask=head_mask,
1123
+ inputs_embeds=inputs_embeds,
1124
+ output_attentions=output_attentions,
1125
+ output_hidden_states=output_hidden_states,
1126
+ return_dict=return_dict,
1127
+ unpad_inputs=unpad_inputs,
1128
+ )
1129
+
1130
+ pooled_output = outputs[1]
1131
+
1132
+ pooled_output = self.dropout(pooled_output)
1133
+ logits = self.classifier(pooled_output)
1134
+
1135
+ loss = None
1136
+ if labels is not None:
1137
+ if self.config.problem_type is None:
1138
+ if self.num_labels == 1:
1139
+ self.config.problem_type = "regression"
1140
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1141
+ self.config.problem_type = "single_label_classification"
1142
+ else:
1143
+ self.config.problem_type = "multi_label_classification"
1144
+
1145
+ if self.config.problem_type == "regression":
1146
+ loss_fct = nn.MSELoss()
1147
+ if self.num_labels == 1:
1148
+ loss = loss_fct(logits.squeeze(), labels.squeeze())
1149
+ else:
1150
+ loss = loss_fct(logits, labels)
1151
+ elif self.config.problem_type == "single_label_classification":
1152
+ loss_fct = nn.CrossEntropyLoss()
1153
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
1154
+ elif self.config.problem_type == "multi_label_classification":
1155
+ loss_fct = nn.BCEWithLogitsLoss()
1156
+ loss = loss_fct(logits, labels)
1157
+
1158
+ if not return_dict:
1159
+ output = (logits,) + outputs[2:]
1160
+ return ((loss,) + output) if loss is not None else output
1161
+
1162
+ return SequenceClassifierOutput(
1163
+ loss=loss,
1164
+ logits=logits,
1165
+ hidden_states=outputs.hidden_states,
1166
+ attentions=outputs.attentions,
1167
+ )
1168
+
1169
+
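The `problem_type` dispatch above follows the standard Hugging Face convention; a self-contained sketch of the three label shapes it expects:

import torch
import torch.nn as nn

logits = torch.randn(2, 3)  # (batch_size, num_labels)

# single_label_classification: integer class ids of shape (batch_size,)
ce = nn.CrossEntropyLoss()(logits, torch.tensor([0, 2]))

# multi_label_classification: float multi-hot targets of shape (batch_size, num_labels)
bce = nn.BCEWithLogitsLoss()(logits, torch.tensor([[1., 0., 1.], [0., 1., 0.]]))

# regression (num_labels == 1): float targets, squeezed before MSE
reg_logits = torch.randn(2, 1)
mse = nn.MSELoss()(reg_logits.squeeze(), torch.tensor([0.5, -0.2]))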
+ class NewForMultipleChoice(NewPreTrainedModel):
+     def __init__(self, config):
+         super().__init__(config)
+
+         self.new = NewModel(config, add_pooling_layer=True)
+         classifier_dropout = (
+             config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+         )
+         self.dropout = nn.Dropout(classifier_dropout)
+         self.classifier = nn.Linear(config.hidden_size, 1)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
+         r"""
+         labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+             Labels for computing the multiple choice classification loss. Indices should be in
+             `[0, ..., num_choices - 1]` where `num_choices` is the size of the second dimension
+             of the input tensors. (See `input_ids` above)
+         """
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+         num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
+
+         input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
+         attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
+         token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
+         position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
+         inputs_embeds = (
+             inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
+             if inputs_embeds is not None
+             else None
+         )
+
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+
+         pooled_output = outputs[1]
+
+         pooled_output = self.dropout(pooled_output)
+         logits = self.classifier(pooled_output)
+         reshaped_logits = logits.view(-1, num_choices)
+
+         loss = None
+         if labels is not None:
+             loss_fct = nn.CrossEntropyLoss()
+             loss = loss_fct(reshaped_logits, labels)
+
+         if not return_dict:
+             output = (reshaped_logits,) + outputs[2:]
+             return ((loss,) + output) if loss is not None else output
+
+         return MultipleChoiceModelOutput(
+             loss=loss,
+             logits=reshaped_logits,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
+
+ @dataclass
+ class NewTokenClassifierOutput(ModelOutput):
+     loss: Optional[torch.FloatTensor] = None
+     logits: torch.FloatTensor = None
+     last_hidden_state: torch.FloatTensor = None
+     hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+     attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+
+
+ class NewForTokenClassification(NewPreTrainedModel):
+     def __init__(self, config):
+         super().__init__(config)
+         self.num_labels = config.num_labels
+
+         self.new = NewModel(config, add_pooling_layer=False)
+         classifier_dropout = (
+             config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+         )
+         self.dropout = nn.Dropout(classifier_dropout)
+         self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], NewTokenClassifierOutput]:
+         r"""
+         labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Labels for computing the token classification loss. Indices should be in
+             `[0, ..., config.num_labels - 1]`.
+         """
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+
+         sequence_output = outputs[0]
+
+         sequence_output = self.dropout(sequence_output)
+         logits = self.classifier(sequence_output)
+
+         loss = None
+         if labels is not None:
+             loss_fct = nn.CrossEntropyLoss()
+             loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+
+         if not return_dict:
+             output = (logits,) + outputs[2:]
+             return ((loss,) + output) if loss is not None else output
+
+         return NewTokenClassifierOutput(
+             loss=loss,
+             logits=logits,
+             last_hidden_state=sequence_output,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
+
+ class NewForQuestionAnswering(NewPreTrainedModel):
+     def __init__(self, config):
+         super().__init__(config)
+         self.num_labels = config.num_labels
+
+         self.new = NewModel(config, add_pooling_layer=False)
+         self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         start_positions: Optional[torch.Tensor] = None,
+         end_positions: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
+         r"""
+         start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+             Labels for the position (index) of the start of the labelled span, used to compute
+             the token classification loss. Positions are clamped to the length of the sequence
+             (`sequence_length`); positions outside of the sequence are not taken into account
+             for computing the loss.
+         end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+             Labels for the position (index) of the end of the labelled span, used to compute
+             the token classification loss. Positions are clamped to the length of the sequence
+             (`sequence_length`); positions outside of the sequence are not taken into account
+             for computing the loss.
+         """
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+
+         sequence_output = outputs[0]
+
+         logits = self.qa_outputs(sequence_output)
+         start_logits, end_logits = logits.split(1, dim=-1)
+         start_logits = start_logits.squeeze(-1).contiguous()
+         end_logits = end_logits.squeeze(-1).contiguous()
+
+         total_loss = None
+         if start_positions is not None and end_positions is not None:
+             # If we are on multi-GPU, split adds a dimension
+             if len(start_positions.size()) > 1:
+                 start_positions = start_positions.squeeze(-1)
+             if len(end_positions.size()) > 1:
+                 end_positions = end_positions.squeeze(-1)
+             # Sometimes the start/end positions are outside our model inputs; we ignore these terms
+             ignored_index = start_logits.size(1)
+             start_positions = start_positions.clamp(0, ignored_index)
+             end_positions = end_positions.clamp(0, ignored_index)
+
+             loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
+             start_loss = loss_fct(start_logits, start_positions)
+             end_loss = loss_fct(end_logits, end_positions)
+             total_loss = (start_loss + end_loss) / 2
+
+         if not return_dict:
+             output = (start_logits, end_logits) + outputs[2:]
+             return ((total_loss,) + output) if total_loss is not None else output
+
+         return QuestionAnsweringModelOutput(
+             loss=total_loss,
+             start_logits=start_logits,
+             end_logits=end_logits,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
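All of the task heads above wrap the same `NewModel` trunk under the attribute `new`, so they load through the usual `Auto*` entry points when the checkpoint's `config.json` registers them in `auto_map`. A hedged sketch (placeholder repo id; assumes that registration exists):

from transformers import AutoModelForSequenceClassification

clf = AutoModelForSequenceClassification.from_pretrained(
    "some-org/some-new-model",  # placeholder
    num_labels=3,
    trust_remote_code=True,
)
print(type(clf).__name__)  # NewForSequenceClassification, if auto_map is set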
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
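modules.json declares the sentence-transformers pipeline: the Transformer encoder at the repo root, CLS pooling from 1_Pooling, and L2 normalization from 2_Normalize. A rough manual equivalent (placeholder repo id; `trust_remote_code` forwarded through `model_args`/`tokenizer_args`):

from sentence_transformers import SentenceTransformer, models

word = models.Transformer(
    "some-org/some-new-model",  # placeholder
    max_seq_length=1024,
    model_args={"trust_remote_code": True},
    tokenizer_args={"trust_remote_code": True},
)
pool = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="cls")
norm = models.Normalize()
model = SentenceTransformer(modules=[word, pool, norm])
emb = model.encode(["an example sentence"])  # unit-norm 768-d vectors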
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 1024,
+   "do_lower_case": false
+ }
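sentence-transformers reads this file at load time to cap input length; the cap can be inspected or tightened afterwards (placeholder repo id; `trust_remote_code=True` assumed for the custom model code):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("some-org/some-new-model", trust_remote_code=True)
print(model.max_seq_length)  # 1024, from sentence_bert_config.json
model.max_seq_length = 512   # optionally lower the cap for faster inference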
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
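These are the standard XLM-RoBERTa special tokens; `<s>` doubles as the CLS token and `</s>` as SEP, so encoded sequences are wrapped accordingly. A quick check (placeholder repo id):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/some-new-model")  # placeholder
ids = tok("hello")["input_ids"]
print(tok.convert_ids_to_tokens(ids))  # ['<s>', ..., '</s>']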
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e802fe5337779428818439760a1e6161ed36ceed72d4ebcbda9c139a2108fc99
+ size 17082988
tokenizer_config.json ADDED
@@ -0,0 +1,61 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "max_length": 1024,
+   "model_max_length": 1024,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }
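With `tokenizer_class` set to `XLMRobertaTokenizer` and `model_max_length` at 1024, inputs longer than the window are truncated from the right when truncation is requested. A sketch (placeholder repo id):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/some-new-model")  # placeholder
long_text = "word " * 5000
enc = tok(long_text, truncation=True)  # capped at model_max_length (1024)
print(len(enc["input_ids"]))           # 1024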