First model version

Files changed (16)

README.md +742 -1
all_results.json +15 -0
config.json +34 -0
configuration_RW.py +79 -0
eval_results.json +10 -0
generation_config.json +6 -0
merges.txt +0 -0
modelling_RW.py +1100 -0
pytorch_model.bin +3 -0
special_tokens_map.json +24 -0
tokenizer.json +0 -0
tokenizer_config.json +33 -0
train_results.json +8 -0
trainer_state.json +0 -0
training_args.bin +3 -0
vocab.json +0 -0

README.md CHANGED

@@ -1,3 +1,744 @@
 ---
-license: apache-2.0
+tags:
+- generated_from_trainer
+datasets:
+- /bscdata/data/open_data_26B_tokens_balanced_es_ca/open_data_26B_tokens_balanced_es_ca.py
+metrics:
+- accuracy
+model-index:
+- name: falcon_7b_balanced_tokenizer_fp16_CPT_open_data_26B_tokens_balanced_es_ca
+  results:
+  - task:
+      name: Causal Language Modeling
+      type: text-generation
+    dataset:
+      name: /bscdata/data/open_data_26B_tokens_balanced_es_ca/open_data_26B_tokens_balanced_es_ca.py
+        default
+      type: /bscdata/data/open_data_26B_tokens_balanced_es_ca/open_data_26B_tokens_balanced_es_ca.py
+      config: default
+      split: validation
+      args: default
+    metrics:
+    - name: Accuracy
+      type: accuracy
+      value: 0.5258444783433934
 ---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# falcon_7b_balanced_tokenizer_fp16_CPT_open_data_26B_tokens_balanced_es_ca
+This model is a fine-tuned version of [/bscdata/models/falcon_7b_balanced_tokenizer_fp16/](https://huggingface.co//bscdata/models/falcon_7b_balanced_tokenizer_fp16/) on the /bscdata/data/open_data_26B_tokens_balanced_es_ca/open_data_26B_tokens_balanced_es_ca.py default dataset.
+It achieves the following results on the evaluation set:
+- Loss: 2.1504
+- Accuracy: 0.5258
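If the reported evaluation loss is the mean per-token cross-entropy in nats (an assumption; the card does not state the unit or the reduction), it can be converted to perplexity by exponentiation. A minimal sketch, where `perplexity_from_loss` is a hypothetical helper name and not something defined in this repository:

```python
import math


def perplexity_from_loss(mean_nll_nats: float) -> float:
    """Convert a mean per-token negative log-likelihood (in nats) to perplexity."""
    return math.exp(mean_nll_nats)


# "Loss" value reported for the evaluation set in the card above.
eval_loss = 2.1504
eval_ppl = perplexity_from_loss(eval_loss)
print(f"perplexity ~ {eval_ppl:.2f}")
```

If the loss were instead measured in bits, or summed rather than averaged over tokens, this conversion would not apply as written.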
+## Model description
+More information needed
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 5e-05
+- train_batch_size: 1
+- eval_batch_size: 1
+- seed: 42
+- distributed_type: multi-GPU
+- num_devices: 8
+- total_train_batch_size: 8
+- total_eval_batch_size: 8
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- num_epochs: 1.0
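The batch-size entries in the list above are consistent with one another: the total batch size equals the per-device batch size times the number of devices, assuming no gradient accumulation (the card lists none). A small sketch of that arithmetic follows. `effective_batch_size` and its `grad_accum_steps` parameter are hypothetical names used for illustration, not part of the Trainer API:

```python
def effective_batch_size(per_device: int, num_devices: int, grad_accum_steps: int = 1) -> int:
    """Number of sequences processed per optimizer step across all devices."""
    return per_device * num_devices * grad_accum_steps


# Values taken from the hyperparameter list: batch size 1 per device, 8 devices.
total_train = effective_batch_size(per_device=1, num_devices=8)
total_eval = effective_batch_size(per_device=1, num_devices=8)
print(total_train, total_eval)
```

Both values match the `total_train_batch_size` and `total_eval_batch_size` entries reported in the card.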
+### Training results
+| Training Loss | Epoch | Step    | Accuracy | Validation Loss |
+|:-------------:|:-----:|:-------:|:--------:|:---------------:|
+| 5.3279        | 0.0   | 5000    | 0.3133   | 3.9941          |
+| 3.5754        | 0.0   | 10000   | 0.3824   | 3.3105          |
+| 3.6102        | 0.0   | 15000   | 0.3977   | 3.1660          |
+| 3.0639        | 0.01  | 20000   | 0.4134   | 3.0215          |
+| 2.9477        | 0.01  | 25000   | 0.4252   | 2.9199          |
+| 2.8589        | 0.01  | 30000   | 0.4315   | 2.8672          |
+| 2.8063        | 0.01  | 35000   | 0.4388   | 2.8027          |
+| 2.7646        | 0.01  | 40000   | 0.4419   | 2.7715          |
+| 2.7306        | 0.01  | 45000   | 0.4467   | 2.7363          |
+| 2.7106        | 0.01  | 50000   | 0.4493   | 2.7129          |
+| 2.6829        | 0.02  | 55000   | 0.4522   | 2.6895          |
+| 2.6703        | 0.02  | 60000   | 0.4537   | 2.6758          |
+| 2.6522        | 0.02  | 65000   | 0.4560   | 2.6602          |
+| 2.6377        | 0.02  | 70000   | 0.4574   | 2.6484          |
+| 2.6241        | 0.02  | 75000   | 0.4587   | 2.6348          |
+| 2.6159        | 0.02  | 80000   | 0.4604   | 2.625           |
+| 2.5959        | 0.03  | 85000   | 0.4613   | 2.6133          |
+| 2.5877        | 0.03  | 90000   | 0.4624   | 2.6035          |
+| 2.5832        | 0.03  | 95000   | 0.4632   | 2.5996          |
+| 2.5726        | 0.03  | 100000  | 0.4648   | 2.5859          |
+| 2.5723        | 0.03  | 105000  | 0.4655   | 2.5801          |
+| 2.5584        | 0.03  | 110000  | 0.4641   | 2.5938          |
+| 2.5541        | 0.03  | 115000  | 0.4673   | 2.5664          |
+| 2.541         | 0.04  | 120000  | 0.4684   | 2.5586          |
+| 2.5359        | 0.04  | 125000  | 0.4674   | 2.5645          |
+| 2.5298        | 0.04  | 130000  | 0.4699   | 2.5449          |
+| 2.5258        | 0.04  | 135000  | 0.4703   | 2.5410          |
+| 2.5207        | 0.04  | 140000  | 0.4709   | 2.5371          |
+| 2.5167        | 0.04  | 145000  | 0.4719   | 2.5312          |
+| 2.5101        | 0.04  | 150000  | 0.4702   | 2.5449          |
+| 2.5058        | 0.05  | 155000  | 0.4730   | 2.5215          |
+| 2.5021        | 0.05  | 160000  | 0.4734   | 2.5195          |
+| 2.8135        | 0.05  | 165000  | 0.4317   | 2.8320          |
+| 2.7932        | 0.05  | 170000  | 0.4730   | 2.5215          |
+| 2.4914        | 0.05  | 175000  | 0.4752   | 2.5059          |
+| 2.487         | 0.05  | 180000  | 0.4754   | 2.5039          |
+| 2.4829        | 0.06  | 185000  | 0.4751   | 2.5039          |
+| 2.4778        | 0.06  | 190000  | 0.4763   | 2.4961          |
+| 2.4779        | 0.06  | 195000  | 0.4770   | 2.4922          |
+| 2.4685        | 0.06  | 200000  | 0.4766   | 2.4941          |
+| 2.4661        | 0.06  | 205000  | 0.4776   | 2.4844          |
+| 2.4579        | 0.06  | 210000  | 0.4783   | 2.4805          |
+| 2.4589        | 0.06  | 215000  | 0.4788   | 2.4785          |
+| 2.4571        | 0.07  | 220000  | 0.4793   | 2.4746          |
+| 2.4504        | 0.07  | 225000  | 0.4797   | 2.4727          |
+| 2.4538        | 0.07  | 230000  | 0.4800   | 2.4688          |
+| 2.4481        | 0.07  | 235000  | 0.4806   | 2.4668          |
+| 2.4454        | 0.07  | 240000  | 0.4810   | 2.4609          |
+| 2.44          | 0.07  | 245000  | 0.4811   | 2.4590          |
+| 2.4392        | 0.07  | 250000  | 0.4811   | 2.4590          |
+| 2.431         | 0.08  | 255000  | 0.4813   | 2.4570          |
+| 2.4377        | 0.08  | 260000  | 0.4823   | 2.4512          |
+| 2.4299        | 0.08  | 265000  | 0.4826   | 2.4473          |
+| 2.4283        | 0.08  | 270000  | 0.4828   | 2.4473          |
+| 2.4256        | 0.08  | 275000  | 0.4833   | 2.4434          |
+| 2.4198        | 0.08  | 280000  | 0.4838   | 2.4414          |
+| 2.4174        | 0.09  | 285000  | 0.4840   | 2.4414          |
+| 2.4151        | 0.09  | 290000  | 0.4844   | 2.4355          |
+| 2.4191        | 0.09  | 295000  | 0.4847   | 2.4336          |
+| 2.4071        | 0.09  | 300000  | 0.4848   | 2.4316          |
+| 2.4126        | 0.09  | 305000  | 0.4855   | 2.4277          |
+| 2.4053        | 0.09  | 310000  | 0.4851   | 2.4297          |
+| 2.4071        | 0.09  | 315000  | 0.4858   | 2.4258          |
+| 2.4027        | 0.1   | 320000  | 0.4866   | 2.4219          |
+| 2.4013        | 0.1   | 325000  | 0.4867   | 2.4180          |
+| 2.4032        | 0.1   | 330000  | 0.4866   | 2.4180          |
+| 2.3919        | 0.1   | 335000  | 0.4871   | 2.4160          |
+| 2.3936        | 0.1   | 340000  | 0.4873   | 2.4141          |
+| 2.3905        | 0.1   | 345000  | 0.4878   | 2.4102          |
+| 2.3889        | 0.1   | 350000  | 0.4881   | 2.4102          |
+| 2.3866        | 0.11  | 355000  | 0.4884   | 2.4082          |
+| 2.3823        | 0.11  | 360000  | 0.4888   | 2.4062          |
+| 2.3828        | 0.11  | 365000  | 0.4888   | 2.4023          |
+| 2.3795        | 0.11  | 370000  | 0.4889   | 2.4004          |
+| 2.3812        | 0.11  | 375000  | 0.4868   | 2.4160          |
+| 2.3789        | 0.11  | 380000  | 0.4896   | 2.3965          |
+| 2.372         | 0.12  | 385000  | 0.4895   | 2.3965          |
+| 2.3732        | 0.12  | 390000  | 0.4899   | 2.3965          |
+| 2.3725        | 0.12  | 395000  | 0.4903   | 2.3926          |
+| 2.3716        | 0.12  | 400000  | 0.4904   | 2.3906          |
+| 2.3709        | 0.12  | 405000  | 0.4904   | 2.3906          |
+| 2.3619        | 0.12  | 410000  | 0.4906   | 2.3887          |
+| 2.367         | 0.12  | 415000  | 0.4912   | 2.3867          |
+| 2.3639        | 0.13  | 420000  | 0.4912   | 2.3848          |
+| 2.3621        | 0.13  | 425000  | 0.4919   | 2.3828          |
+| 2.3578        | 0.13  | 430000  | 0.4920   | 2.3809          |
+| 2.3608        | 0.13  | 435000  | 0.4922   | 2.3789          |
+| 2.3541        | 0.13  | 440000  | 0.4923   | 2.3770          |
+| 2.3556        | 0.13  | 445000  | 0.4926   | 2.3770          |
+| 2.3562        | 0.13  | 450000  | 0.4928   | 2.3770          |
+| 2.3641        | 0.14  | 455000  | 0.4910   | 2.3867          |
+| 2.3641        | 0.14  | 460000  | 0.4911   | 2.3867          |
+| 2.3646        | 0.14  | 465000  | 0.4911   | 2.3867          |
+| 2.3629        | 0.14  | 470000  | 0.4911   | 2.3848          |
+| 2.3659        | 0.14  | 475000  | 0.4914   | 2.3828          |
+| 2.3651        | 0.14  | 480000  | 0.4916   | 2.3828          |
+| 2.3608        | 0.15  | 485000  | 0.4918   | 2.3809          |
+| 2.3612        | 0.15  | 490000  | 0.4920   | 2.3809          |
+| 2.3569        | 0.15  | 495000  | 0.4922   | 2.3789          |
+| 2.3557        | 0.15  | 500000  | 0.4923   | 2.3789          |
+| 2.3541        | 0.15  | 505000  | 0.4922   | 2.3770          |
+| 2.351         | 0.15  | 510000  | 0.4927   | 2.375           |
+| 2.3504        | 0.15  | 515000  | 0.4926   | 2.375           |
+| 2.3479        | 0.16  | 520000  | 0.4929   | 2.3730          |
+| 2.3451        | 0.16  | 525000  | 0.4929   | 2.3711          |
+| 2.3505        | 0.16  | 530000  | 0.4934   | 2.3691          |
+| 2.3457        | 0.16  | 535000  | 0.4934   | 2.3691          |
+| 2.3479        | 0.16  | 540000  | 0.4937   | 2.3691          |
+| 2.3421        | 0.16  | 545000  | 0.4936   | 2.3672          |
+| 2.3433        | 0.16  | 550000  | 0.4937   | 2.3672          |
+| 2.3425        | 0.17  | 555000  | 0.4939   | 2.3652          |
+| 2.3403        | 0.17  | 560000  | 0.4942   | 2.3633          |
+| 2.3417        | 0.17  | 565000  | 0.4944   | 2.3613          |
+| 2.3382        | 0.17  | 570000  | 0.4947   | 2.3613          |
+| 2.3354        | 0.17  | 575000  | 0.4949   | 2.3594          |
+| 2.3366        | 0.17  | 580000  | 0.4947   | 2.3594          |
+| 2.3373        | 0.18  | 585000  | 0.4945   | 2.3594          |
+| 2.3365        | 0.18  | 590000  | 0.4949   | 2.3594          |
+| 2.3318        | 0.18  | 595000  | 0.4953   | 2.3555          |
+| 2.3278        | 0.18  | 600000  | 0.4958   | 2.3535          |
+| 2.3277        | 0.18  | 605000  | 0.4959   | 2.3516          |
+| 2.326         | 0.18  | 610000  | 0.4961   | 2.3516          |
+| 2.3273        | 0.18  | 615000  | 0.4961   | 2.3516          |
+| 2.3284        | 0.19  | 620000  | 0.4965   | 2.3496          |
+| 2.3276        | 0.19  | 625000  | 0.4966   | 2.3477          |
+| 2.3228        | 0.19  | 630000  | 0.4967   | 2.3457          |
+| 2.3219        | 0.19  | 635000  | 0.4968   | 2.3457          |
+| 2.326         | 0.19  | 640000  | 0.4970   | 2.3438          |
+| 2.3191        | 0.19  | 645000  | 0.4972   | 2.3418          |
+| 2.3167        | 0.19  | 650000  | 0.4973   | 2.3438          |
+| 2.3172        | 0.2   | 655000  | 0.4974   | 2.3418          |
+| 2.3194        | 0.2   | 660000  | 0.4977   | 2.3379          |
+| 2.3204        | 0.2   | 665000  | 0.4976   | 2.3398          |
+| 2.309         | 0.2   | 670000  | 0.4980   | 2.3359          |
+| 2.3147        | 0.2   | 675000  | 0.4981   | 2.3379          |
+| 2.3122        | 0.2   | 680000  | 0.4980   | 2.3359          |
+| 2.3096        | 0.21  | 685000  | 0.4984   | 2.3340          |
+| 2.3093        | 0.21  | 690000  | 0.4986   | 2.3340          |
+| 2.3048        | 0.21  | 695000  | 0.4985   | 2.3320          |
+| 2.3111        | 0.21  | 700000  | 0.4988   | 2.3301          |
+| 2.3074        | 0.21  | 705000  | 0.4989   | 2.3301          |
+| 2.3082        | 0.21  | 710000  | 0.4992   | 2.3301          |
+| 2.3093        | 0.21  | 715000  | 0.4994   | 2.3281          |
+| 2.3011        | 0.22  | 720000  | 0.4995   | 2.3281          |
+| 2.2998        | 0.22  | 725000  | 0.4995   | 2.3262          |
+| 2.3012        | 0.22  | 730000  | 0.4996   | 2.3262          |
+| 2.3002        | 0.22  | 735000  | 0.4997   | 2.3242          |
+| 2.2994        | 0.22  | 740000  | 0.5000   | 2.3242          |
+| 2.299         | 0.22  | 745000  | 0.5001   | 2.3223          |
+| 2.2969        | 0.22  | 750000  | 0.5003   | 2.3223          |
+| 2.2934        | 0.23  | 755000  | 0.5004   | 2.3203          |
+| 2.2988        | 0.23  | 760000  | 0.5005   | 2.3184          |
+| 2.2911        | 0.23  | 765000  | 0.5007   | 2.3184          |
+| 2.2929        | 0.23  | 770000  | 0.5008   | 2.3184          |
+| 2.2926        | 0.23  | 775000  | 0.5009   | 2.3164          |
+| 2.292         | 0.23  | 780000  | 0.5012   | 2.3164          |
+| 2.2932        | 0.24  | 785000  | 0.5014   | 2.3145          |
+| 2.2903        | 0.24  | 790000  | 0.5014   | 2.3145          |
+| 2.2886        | 0.24  | 795000  | 0.5015   | 2.3125          |
+| 2.2924        | 0.24  | 800000  | 0.5015   | 2.3125          |
+| 2.2891        | 0.24  | 805000  | 0.5019   | 2.3105          |
+| 2.2862        | 0.24  | 810000  | 0.5020   | 2.3086          |
+| 2.2858        | 0.24  | 815000  | 0.5022   | 2.3086          |
+| 2.2841        | 0.25  | 820000  | 0.5023   | 2.3066          |
+| 2.2843        | 0.25  | 825000  | 0.5022   | 2.3086          |
+| 2.2832        | 0.25  | 830000  | 0.5025   | 2.3066          |
+| 2.2846        | 0.25  | 835000  | 0.5026   | 2.3066          |
+| 2.2784        | 0.25  | 840000  | 0.5027   | 2.3047          |
+| 2.277         | 0.25  | 845000  | 0.5028   | 2.3027          |
+| 2.276         | 0.25  | 850000  | 0.5026   | 2.3066          |
+| 2.2802        | 0.26  | 855000  | 0.5031   | 2.3027          |
+| 2.2781        | 0.26  | 860000  | 0.5032   | 2.3008          |
+| 2.2749        | 0.26  | 865000  | 0.5038   | 2.2988          |
+| 2.2729        | 0.26  | 870000  | 0.5037   | 2.2969          |
+| 2.2708        | 0.26  | 875000  | 0.5039   | 2.2969          |
+| 2.2754        | 0.26  | 880000  | 0.5039   | 2.2969          |
+| 2.2761        | 0.27  | 885000  | 0.5041   | 2.2949          |
+| 2.2742        | 0.27  | 890000  | 0.5041   | 2.2949          |
+| 2.2734        | 0.27  | 895000  | 0.5041   | 2.2949          |
+| 2.2682        | 0.27  | 900000  | 0.5044   | 2.2930          |
+| 2.2667        | 0.27  | 905000  | 0.5045   | 2.2930          |
+| 2.2676        | 0.27  | 910000  | 0.5046   | 2.2930          |
+| 2.2707        | 0.27  | 915000  | 0.5047   | 2.2910          |
+| 2.265         | 0.28  | 920000  | 0.5048   | 2.2910          |
+| 2.2676        | 0.28  | 925000  | 0.5046   | 2.2910          |
+| 2.2662        | 0.28  | 930000  | 0.5052   | 2.2891          |
+| 2.2706        | 0.28  | 935000  | 0.5051   | 2.2891          |
+| 2.2657        | 0.28  | 940000  | 0.5049   | 2.2891          |
+| 2.2672        | 0.28  | 945000  | 0.5050   | 2.2871          |
+| 2.2716        | 0.28  | 950000  | 0.5037   | 2.2969          |
+| 2.2702        | 0.29  | 955000  | 0.5037   | 2.2988          |
+| 2.2708        | 0.29  | 960000  | 0.5035   | 2.2988          |
+| 2.2738        | 0.29  | 965000  | 0.5035   | 2.2988          |
+| 2.2737        | 0.29  | 970000  | 0.5036   | 2.2988          |
+| 2.2763        | 0.29  | 975000  | 0.4987   | 2.3301          |
+| 2.2738        | 0.29  | 980000  | 0.5035   | 2.2969          |
+| 2.2737        | 0.3   | 985000  | 0.5036   | 2.2969          |
+| 2.2748        | 0.3   | 990000  | 0.5036   | 2.2969          |
+| 2.2724        | 0.3   | 995000  | 0.5038   | 2.2969          |
+| 2.2744        | 0.3   | 1000000 | 0.5033   | 2.2988          |
+| 2.2694        | 0.3   | 1005000 | 0.5033   | 2.2988          |
+| 2.2684        | 0.3   | 1010000 | 0.5039   | 2.2949          |
+| 2.2731        | 0.3   | 1015000 | 0.5040   | 2.2949          |
+| 2.2714        | 0.31  | 1020000 | 0.5042   | 2.2949          |
+| 2.2687        | 0.31  | 1025000 | 0.5045   | 2.2930          |
+| 2.2673        | 0.31  | 1030000 | 0.5046   | 2.2930          |
+| 2.2677        | 0.31  | 1035000 | 0.5044   | 2.2930          |
+| 2.265         | 0.31  | 1040000 | 0.5047   | 2.2910          |
+| 2.2659        | 0.31  | 1045000 | 0.5045   | 2.2910          |
+| 2.2633        | 0.31  | 1050000 | 0.5042   | 2.2949          |
+| 2.2689        | 0.32  | 1055000 | 0.5050   | 2.2891          |
+| 2.2617        | 0.32  | 1060000 | 0.5049   | 2.2891          |
+| 2.2613        | 0.32  | 1065000 | 0.5052   | 2.2871          |
+| 2.2649        | 0.32  | 1070000 | 0.5047   | 2.2891          |
+| 2.2587        | 0.32  | 1075000 | 0.5053   | 2.2871          |
+| 2.2641        | 0.32  | 1080000 | 0.5054   | 2.2852          |
+| 2.2634        | 0.33  | 1085000 | 0.5057   | 2.2852          |
+| 2.2597        | 0.33  | 1090000 | 0.5057   | 2.2832          |
+| 2.2572        | 0.33  | 1095000 | 0.5060   | 2.2832          |
+| 2.2566        | 0.33  | 1100000 | 0.5056   | 2.2832          |
+| 2.2576        | 0.33  | 1105000 | 0.5056   | 2.2832          |
+| 2.2612        | 0.33  | 1110000 | 0.5057   | 2.2832          |
+| 2.2585        | 0.33  | 1115000 | 0.5059   | 2.2812          |
+| 2.2528        | 0.34  | 1120000 | 0.5060   | 2.2812          |
+| 2.2599        | 0.34  | 1125000 | 0.5060   | 2.2812          |
+| 2.2556        | 0.34  | 1130000 | 0.5066   | 2.2773          |
+| 2.2519        | 0.34  | 1135000 | 0.5064   | 2.2793          |
+| 2.2567        | 0.34  | 1140000 | 0.5068   | 2.2773          |
+| 2.2516        | 0.34  | 1145000 | 0.5069   | 2.2754          |
+| 2.2533        | 0.34  | 1150000 | 0.5068   | 2.2754          |
+| 2.2532        | 0.35  | 1155000 | 0.5070   | 2.2754          |
+| 2.2572        | 0.35  | 1160000 | 0.5064   | 2.2793          |
+| 2.2514        | 0.35  | 1165000 | 0.5072   | 2.2734          |
+| 2.2471        | 0.35  | 1170000 | 0.5073   | 2.2734          |
+| 2.2524        | 0.35  | 1175000 | 0.5076   | 2.2715          |
+| 2.247         | 0.35  | 1180000 | 0.5073   | 2.2715          |
+| 2.2491        | 0.35  | 1185000 | 0.5077   | 2.2715          |
+| 2.2481        | 0.36  | 1190000 | 0.5078   | 2.2695          |
+| 2.2465        | 0.36  | 1195000 | 0.5069   | 2.2734          |
+| 2.2494        | 0.36  | 1200000 | 0.5067   | 2.2793          |
+| 2.2541        | 0.36  | 1205000 | 0.5069   | 2.2754          |
+| 2.25          | 0.36  | 1210000 | 0.5067   | 2.2754          |
+| 2.25          | 0.36  | 1215000 | 0.5064   | 2.2793          |
+| 2.2508        | 0.37  | 1220000 | 0.5070   | 2.2734          |
+| 2.2496        | 0.37  | 1225000 | 0.5070   | 2.2734          |
+| 2.2499        | 0.37  | 1230000 | 0.5073   | 2.2734          |
+| 2.2467        | 0.37  | 1235000 | 0.5076   | 2.2715          |
+| 2.2497        | 0.37  | 1240000 | 0.5073   | 2.2715          |
+| 2.2463        | 0.37  | 1245000 | 0.5073   | 2.2715          |
+| 2.2479        | 0.37  | 1250000 | 0.5078   | 2.2695          |
+| 2.2445        | 0.38  | 1255000 | 0.5079   | 2.2695          |
+| 2.247         | 0.38  | 1260000 | 0.5078   | 2.2695          |
+| 2.2443        | 0.38  | 1265000 | 0.5079   | 2.2676          |
+| 2.243         | 0.38  | 1270000 | 0.5081   | 2.2676          |
+| 2.2454        | 0.38  | 1275000 | 0.5077   | 2.2715          |
+| 2.2451        | 0.38  | 1280000 | 0.5081   | 2.2695          |
+| 2.2455        | 0.38  | 1285000 | 0.5084   | 2.2656          |
+| 2.241         | 0.39  | 1290000 | 0.5083   | 2.2676          |
+| 2.243         | 0.39  | 1295000 | 0.5086   | 2.2637          |
+| 2.2408        | 0.39  | 1300000 | 0.5084   | 2.2637          |
+| 2.2508        | 0.39  | 1305000 | 0.5063   | 2.2793          |
+| 2.252         | 0.39  | 1310000 | 0.5047   | 2.2910          |
+| 2.7482        | 0.39  | 1315000 | 0.4506   | 2.6465          |
+| 2.4189        | 0.4   | 1320000 | 0.5070   | 2.2754          |
+| 2.2446        | 0.4   | 1325000 | 0.5081   | 2.2676          |
+| 2.2416        | 0.4   | 1330000 | 0.5087   | 2.2637          |
+| 2.2421        | 0.4   | 1335000 | 0.5088   | 2.2617          |
+| 2.2367        | 0.4   | 1340000 | 0.5092   | 2.2617          |
+| 2.2355        | 0.4   | 1345000 | 0.5091   | 2.2598          |
+| 2.2379        | 0.4   | 1350000 | 0.5094   | 2.2598          |
+| 2.2365        | 0.41  | 1355000 | 0.5094   | 2.2598          |
+| 2.2379        | 0.41  | 1360000 | 0.5091   | 2.2578          |
+| 2.235         | 0.41  | 1365000 | 0.5095   | 2.2578          |
+| 2.236         | 0.41  | 1370000 | 0.5093   | 2.2578          |
+| 2.2344        | 0.41  | 1375000 | 0.5095   | 2.2578          |
+| 2.2348        | 0.41  | 1380000 | 0.5096   | 2.2559          |
+| 2.2306        | 0.41  | 1385000 | 0.5097   | 2.2559          |
+| 2.2293        | 0.42  | 1390000 | 0.5098   | 2.2559          |
+| 2.2311        | 0.42  | 1395000 | 0.5101   | 2.2539          |
+| 2.231         | 0.42  | 1400000 | 0.5101   | 2.2539          |
+| 2.2272        | 0.42  | 1405000 | 0.5102   | 2.2520          |
+| 2.2264        | 0.42  | 1410000 | 0.5102   | 2.2539          |
+| 2.2295        | 0.42  | 1415000 | 0.5104   | 2.2520          |
+| 2.2281        | 0.43  | 1420000 | 0.5104   | 2.2520          |
+| 2.2234        | 0.43  | 1425000 | 0.5107   | 2.25            |
+| 2.2293        | 0.43  | 1430000 | 0.5107   | 2.25            |
+| 2.2256        | 0.43  | 1435000 | 0.5109   | 2.25            |
+| 2.2247        | 0.43  | 1440000 | 0.5108   | 2.25            |
+| 2.222         | 0.43  | 1445000 | 0.5108   | 2.25            |
+| 2.2228        | 0.43  | 1450000 | 0.5106   | 2.2480          |
+| 2.2241        | 0.44  | 1455000 | 0.5111   | 2.2480          |
+| 2.2219        | 0.44  | 1460000 | 0.5111   | 2.2461          |
+| 2.2219        | 0.44  | 1465000 | 0.5113   | 2.2461          |
+| 2.2215        | 0.44  | 1470000 | 0.5113   | 2.2461          |
+| 2.2193        | 0.44  | 1475000 | 0.5116   | 2.2441          |
+| 2.2183        | 0.44  | 1480000 | 0.5115   | 2.2441          |
+| 2.2177        | 0.44  | 1485000 | 0.5116   | 2.2441          |
+| 2.2211        | 0.45  | 1490000 | 0.5116   | 2.2422          |
+| 2.2183        | 0.45  | 1495000 | 0.5118   | 2.2422          |
+| 2.2182        | 0.45  | 1500000 | 0.5120   | 2.2402          |
+| 2.2148        | 0.45  | 1505000 | 0.5122   | 2.2402          |
+| 2.2217        | 0.45  | 1510000 | 0.5123   | 2.2402          |
+| 2.2117        | 0.45  | 1515000 | 0.5124   | 2.2383          |
+| 2.2152        | 0.46  | 1520000 | 0.5123   | 2.2383          |
+| 2.2148        | 0.46  | 1525000 | 0.5125   | 2.2383          |
+| 2.2151        | 0.46  | 1530000 | 0.5127   | 2.2363          |
+| 2.2129        | 0.46  | 1535000 | 0.5127   | 2.2363          |
+| 2.2145        | 0.46  | 1540000 | 0.5128   | 2.2363          |
+| 2.2099        | 0.46  | 1545000 | 0.5129   | 2.2363          |
+| 2.2125        | 0.46  | 1550000 | 0.5132   | 2.2344          |
+| 2.2101        | 0.47  | 1555000 | 0.5131   | 2.2344          |
+| 2.211         | 0.47  | 1560000 | 0.5132   | 2.2344          |
+| 2.2086        | 0.47  | 1565000 | 0.5132   | 2.2344          |
+| 2.2137        | 0.47  | 1570000 | 0.5132   | 2.2324          |
+| 2.2122        | 0.47  | 1575000 | 0.5134   | 2.2324          |
+| 2.2053        | 0.47  | 1580000 | 0.5134   | 2.2324          |
+| 2.208         | 0.47  | 1585000 | 0.5134   | 2.2305          |
+| 2.2081        | 0.48  | 1590000 | 0.5136   | 2.2305          |
+| 2.2077        | 0.48  | 1595000 | 0.5138   | 2.2305          |
+| 2.2061        | 0.48  | 1600000 | 0.5136   | 2.2305          |
+| 2.2055        | 0.48  | 1605000 | 0.5139   | 2.2285          |
+| 2.2065        | 0.48  | 1610000 | 0.5139   | 2.2285          |
+| 2.2054        | 0.48  | 1615000 | 0.5139   | 2.2285          |
+| 2.2035        | 0.49  | 1620000 | 0.5140   | 2.2285          |
+| 2.2021        | 0.49  | 1625000 | 0.5140   | 2.2285          |
+| 2.2036        | 0.49  | 1630000 | 0.5138   | 2.2285          |
+| 2.204         | 0.49  | 1635000 | 0.5140   | 2.2266          |
+| 2.2042        | 0.49  | 1640000 | 0.5141   | 2.2266          |
+| 2.2024        | 0.49  | 1645000 | 0.5142   | 2.2266          |
+| 2.2023        | 0.49  | 1650000 | 0.5144   | 2.2266          |
+| 2.1976        | 0.5   | 1655000 | 0.5146   | 2.2246          |
+| 2.2028        | 0.5   | 1660000 | 0.5147   | 2.2246          |
+| 2.1971        | 0.5   | 1665000 | 0.5146   | 2.2246          |
+| 2.1978        | 0.5   | 1670000 | 0.5146   | 2.2246          |
+| 2.1955        | 0.5   | 1675000 | 0.5148   | 2.2227          |
+| 2.1967        | 0.5   | 1680000 | 0.5147   | 2.2227          |
+| 2.1975        | 0.5   | 1685000 | 0.5152   | 2.2227          |
+| 2.1972        | 0.51  | 1690000 | 0.5149   | 2.2207          |
+| 2.1967        | 0.51  | 1695000 | 0.5151   | 2.2207          |
+| 2.194         | 0.51  | 1700000 | 0.5151   | 2.2207          |
+| 2.2009        | 0.51  | 1705000 | 0.5139   | 2.2285          |
+| 2.2085        | 0.51  | 1710000 | 0.5136   | 2.2305          |
+| 2.2077        | 0.51  | 1715000 | 0.5137   | 2.2305          |
+| 2.205         | 0.52  | 1720000 | 0.5134   | 2.2305          |
+| 2.2063        | 0.52  | 1725000 | 0.5134   | 2.2305          |
+| 2.2076        | 0.52  | 1730000 | 0.5135   | 2.2305          |
+| 2.2036        | 0.52  | 1735000 | 0.5133   | 2.2305          |
+| 2.2064        | 0.52  | 1740000 | 0.5138   | 2.2305          |
+| 2.2053        | 0.52  | 1745000 | 0.5137   | 2.2305          |
+| 2.2048        | 0.52  | 1750000 | 0.5139   | 2.2305          |
+| 2.2075        | 0.53  | 1755000 | 0.5138   | 2.2305          |
+| 2.2041        | 0.53  | 1760000 | 0.5136   | 2.2285          |
+| 2.2057        | 0.53  | 1765000 | 0.5139   | 2.2285          |
+| 2.2054        | 0.53  | 1770000 | 0.5139   | 2.2285          |
+| 2.2085        | 0.53  | 1775000 | 0.5139   | 2.2285          |
+| 2.2051        | 0.53  | 1780000 | 0.5141   | 2.2266          |
+| 2.2023        | 0.53  | 1785000 | 0.5139   | 2.2266          |
+| 2.205         | 0.54  | 1790000 | 0.5141   | 2.2266          |
+| 2.2009        | 0.54  | 1795000 | 0.5141   | 2.2266          |
+| 2.1998        | 0.54  | 1800000 | 0.5143   | 2.2266          |
+| 2.2009        | 0.54  | 1805000 | 0.5144   | 2.2246          |
+| 2.2027        | 0.54  | 1810000 | 0.5143   | 2.2266          |
+| 2.2007        | 0.54  | 1815000 | 0.5146   | 2.2246          |
+| 2.1978        | 0.55  | 1820000 | 0.5145   | 2.2246          |
+| 2.1999        | 0.55  | 1825000 | 0.5146   | 2.2227          |
+| 2.1978        | 0.55  | 1830000 | 0.5148   | 2.2227          |
+| 2.1989        | 0.55  | 1835000 | 0.5147   | 2.2227          |
+| 2.1989        | 0.55  | 1840000 | 0.5148   | 2.2227          |
+| 2.1982        | 0.55  | 1845000 | 0.5150   | 2.2207          |
+| 2.1974        | 0.55  | 1850000 | 0.5151   | 2.2207          |
+| 2.1972        | 0.56  | 1855000 | 0.5151   | 2.2207          |
+| 2.1966        | 0.56  | 1860000 | 0.5151   | 2.2207          |
+| 2.198         | 0.56  | 1865000 | 0.5150   | 2.2207          |
+| 2.1978        | 0.56  | 1870000 | 0.5152   | 2.2207          |
+| 2.1938        | 0.56  | 1875000 | 0.5152   | 2.2207          |
+| 2.1908        | 0.56  | 1880000 | 0.5152   | 2.2188          |
+| 2.1899        | 0.56  | 1885000 | 0.5152   | 2.2188          |
+| 2.1938        | 0.57  | 1890000 | 0.5152   | 2.2188          |
+| 2.1909        | 0.57  | 1895000 | 0.5154   | 2.2188          |
+| 2.1921        | 0.57  | 1900000 | 0.5155   | 2.2188          |
+| 2.1926        | 0.57  | 1905000 | 0.5156   | 2.2168          |
+| 2.194         | 0.57  | 1910000 | 0.5154   | 2.2168          |
+| 2.1942        | 0.57  | 1915000 | 0.5152   | 2.2188          |
+| 2.1947        | 0.58  | 1920000 | 0.5151   | 2.2188          |
+| 2.1941        | 0.58  | 1925000 | 0.5151   | 2.2207          |
+| 2.1984        | 0.58  | 1930000 | 0.5152   | 2.2207          |
+| 2.1929        | 0.58  | 1935000 | 0.5151   | 2.2207          |
+| 2.1921        | 0.58  | 1940000 | 0.5154   | 2.2188          |
+| 2.1932        | 0.58  | 1945000 | 0.5153   | 2.2188          |
+| 2.1959        | 0.58  | 1950000 | 0.5154   | 2.2188          |
+| 2.1927        | 0.59  | 1955000 | 0.5154   | 2.2188          |
+| 2.1949        | 0.59  | 1960000 | 0.5155   | 2.2188          |
+| 2.1918        | 0.59  | 1965000 | 0.5154   | 2.2168          |
+| 2.1957        | 0.59  | 1970000 | 0.5155   | 2.2168          |
+| 2.1884        | 0.59  | 1975000 | 0.5157   | 2.2168          |
+| 2.1942        | 0.59  | 1980000 | 0.5156   | 2.2148          |
+| 2.1938        | 0.59  | 1985000 | 0.5156   | 2.2168          |
+| 2.1935        | 0.6   | 1990000 | 0.5160   | 2.2148          |
+| 2.1902        | 0.6   | 1995000 | 0.5157   | 2.2148          |
+| 2.188         | 0.6   | 2000000 | 0.5158   | 2.2148          |
+| 2.1862        | 0.6   | 2005000 | 0.5159   | 2.2129          |
+| 2.1886        | 0.6   | 2010000 | 0.5161   | 2.2129          |
+| 2.1811        | 0.6   | 2015000 | 0.5161   | 2.2129          |
+| 2.19          | 0.61  | 2020000 | 0.5160   | 2.2129          |
+| 2.1895        | 0.61  | 2025000 | 0.5165   | 2.2129          |
+| 2.1904        | 0.61  | 2030000 | 0.5161   | 2.2129          |
+| 2.1854        | 0.61  | 2035000 | 0.5165   | 2.2129          |
+| 2.1883        | 0.61  | 2040000 | 0.5165   | 2.2109          |
+| 2.1859        | 0.61  | 2045000 | 0.5165   | 2.2109          |
+| 2.1849        | 0.61  | 2050000 | 0.5168   | 2.2090          |
+| 2.1844        | 0.62  | 2055000 | 0.5167   | 2.2109          |
+| 2.1866        | 0.62  | 2060000 | 0.5167   | 2.2090          |
+| 2.1865        | 0.62  | 2065000 | 0.5168   | 2.2090          |
+| 2.1846        | 0.62  | 2070000 | 0.5171   | 2.2070          |
+| 2.1821        | 0.62  | 2075000 | 0.5170   | 2.2070          |
+| 2.184         | 0.62  | 2080000 | 0.5170   | 2.2070          |
+| 2.1847        | 0.62  | 2085000 | 0.5173   | 2.2051          |
+| 2.1836        | 0.63  | 2090000 | 0.5174   | 2.2051          |
+| 2.1791        | 0.63  | 2095000 | 0.5174   | 2.2051          |
+| 2.1812        | 0.63  | 2100000 | 0.5173   | 2.2051          |
+| 2.1835        | 0.63  | 2105000 | 0.5176   | 2.2051          |
+| 2.1806        | 0.63  | 2110000 | 0.5176   | 2.2051          |
+| 2.1832        | 0.63  | 2115000 | 0.5175   | 2.2051          |
+| 2.1766        | 0.64  | 2120000 | 0.5178   | 2.2031          |
+| 2.1775        | 0.64  | 2125000 | 0.5178   | 2.2031          |
+| 2.1801        | 0.64  | 2130000 | 0.5177   | 2.2031          |
+| 2.1789        | 0.64  | 2135000 | 0.5178   | 2.2031          |
+| 2.1794        | 0.64  | 2140000 | 0.5178   | 2.2031          |
+| 2.1799        | 0.64  | 2145000 | 0.5179   | 2.2012          |
+| 2.1746        | 0.64  | 2150000 | 0.5180   | 2.2012          |
+| 2.1766        | 0.65  | 2155000 | 0.5179   | 2.2012          |
+| 2.1754        | 0.65  | 2160000 | 0.5177   | 2.2012          |
+| 2.1764        | 0.65  | 2165000 | 0.5177   | 2.2012          |
+| 2.1745        | 0.65  | 2170000 | 0.5183   | 2.1992          |
+| 2.1735        | 0.65  | 2175000 | 0.5180   | 2.1992          |
+| 2.1778        | 0.65  | 2180000 | 0.5181   | 2.1992          |
+| 2.1717        | 0.65  | 2185000 | 0.5183   | 2.1992          |
+| 2.1752        | 0.66  | 2190000 | 0.5185   | 2.1973          |
+| 2.1747        | 0.66  | 2195000 | 0.5185   | 2.1973          |
+| 2.1754        | 0.66  | 2200000 | 0.5186   | 2.1973          |
+| 2.1728        | 0.66  | 2205000 | 0.5188   | 2.1973          |
+| 2.1684        | 0.66  | 2210000 | 0.5186   | 2.1973          |
+| 2.1722        | 0.66  | 2215000 | 0.5188   | 2.1953          |
+| 2.1692        | 0.67  | 2220000 | 0.5190   | 2.1953          |
+| 2.176         | 0.67  | 2225000 | 0.5191   | 2.1953          |
+| 2.1697        | 0.67  | 2230000 | 0.5190   | 2.1953          |
+| 2.1731        | 0.67  | 2235000 | 0.5191   | 2.1953          |
+| 2.173         | 0.67  | 2240000 | 0.5191   | 2.1934          |
+| 2.1714        | 0.67  | 2245000 | 0.5193   | 2.1934          |
+| 2.1719        | 0.67  | 2250000 | 0.5192   | 2.1934          |
+| 2.1667        | 0.68  | 2255000 | 0.5190   | 2.1934          |
+| 2.1653        | 0.68  | 2260000 | 0.5192   | 2.1934          |
+| 2.1656        | 0.68  | 2265000 | 0.5193   | 2.1914          |
+| 2.1695        | 0.68  | 2270000 | 0.5194   | 2.1914          |
+| 2.17          | 0.68  | 2275000 | 0.5196   | 2.1914          |
+| 2.1628        | 0.68  | 2280000 | 0.5197   | 2.1914          |
+| 2.1648        | 0.68  | 2285000 | 0.5196   | 2.1895          |
+| 2.1647        | 0.69  | 2290000 | 0.5199   | 2.1895          |
+| 2.1648        | 0.69  | 2295000 | 0.5198   | 2.1895          |
+| 2.168         | 0.69  | 2300000 | 0.5197   | 2.1895          |
+| 2.1607        | 0.69  | 2305000 | 0.5198   | 2.1895          |
+| 2.1674        | 0.69  | 2310000 | 0.5200   | 2.1875          |
+| 2.1656        | 0.69  | 2315000 | 0.5200   | 2.1875          |
+| 2.1637        | 0.7   | 2320000 | 0.5202   | 2.1875          |
+| 2.1649        | 0.7   | 2325000 | 0.5201   | 2.1875          |
+| 2.1625        | 0.7   | 2330000 | 0.5201   | 2.1875          |
+| 2.1627        | 0.7   | 2335000 | 0.5203   | 2.1875          |
+| 2.1598        | 0.7   | 2340000 | 0.5203   | 2.1855          |
+| 2.1638        | 0.7   | 2345000 | 0.5201   | 2.1875          |
+| 2.1588        | 0.7   | 2350000 | 0.5205   | 2.1855          |
+| 2.1633        | 0.71  | 2355000 | 0.5205   | 2.1855          |
+| 2.1621        | 0.71  | 2360000 | 0.5205   | 2.1855          |
+| 2.165         | 0.71  | 2365000 | 0.5207   | 2.1836          |
+| 2.159         | 0.71  | 2370000 | 0.5206   | 2.1836          |
+| 2.1573        | 0.71  | 2375000 | 0.5207   | 2.1836          |
+| 2.1556        | 0.71  | 2380000 | 0.5208   | 2.1836          |
+| 2.1562        | 0.71  | 2385000 | 0.5210   | 2.1836          |
+| 2.1572        | 0.72  | 2390000 | 0.5209   | 2.1836          |
+| 2.1577        | 0.72  | 2395000 | 0.5209   | 2.1816          |
+| 2.1529        | 0.72  | 2400000 | 0.5210   | 2.1816          |
+| 2.1636        | 0.72  | 2405000 | 0.5211   | 2.1816          |
+| 2.1521        | 0.72  | 2410000 | 0.5213   | 2.1816          |
+| 2.1574        | 0.72  | 2415000 | 0.5214   | 2.1816          |
+| 2.1546        | 0.72  | 2420000 | 0.5213   | 2.1797          |
+| 2.1572        | 0.73  | 2425000 | 0.5212   | 2.1797          |
+| 2.1544        | 0.73  | 2430000 | 0.5212   | 2.1797          |
+| 2.15          | 0.73  | 2435000 | 0.5213   | 2.1797          |
+| 2.1537        | 0.73  | 2440000 | 0.5217   | 2.1777          |
+| 2.1552        | 0.73  | 2445000 | 0.5216   | 2.1777          |
+| 2.1522        | 0.73  | 2450000 | 0.5215   | 2.1777          |
+| 2.1487        | 0.74  | 2455000 | 0.5215   | 2.1777          |
+| 2.1582        | 0.74  | 2460000 | 0.5215   | 2.1777          |
+| 2.1582        | 0.74  | 2465000 | 0.5218   | 2.1777          |
+| 2.1529        | 0.74  | 2470000 | 0.5218   | 2.1777          |
+| 2.1549        | 0.74  | 2475000 | 0.5219   | 2.1758          |
+| 2.1525        | 0.74  | 2480000 | 0.5219   | 2.1758          |
+| 2.1478        | 0.74  | 2485000 | 0.5221   | 2.1758          |
+| 2.1524        | 0.75  | 2490000 | 0.5220   | 2.1758          |
+| 2.1477        | 0.75  | 2495000 | 0.5220   | 2.1738          |
+| 2.1524        | 0.75  | 2500000 | 0.5222   | 2.1738          |
+| 2.147         | 0.75  | 2505000 | 0.5222   | 2.1738          |
+| 2.1481        | 0.75  | 2510000 | 0.5223   | 2.1738          |
+| 2.1494        | 0.75  | 2515000 | 0.5223   | 2.1738          |
+| 2.1484        | 0.75  | 2520000 | 0.5223   | 2.1738          |
+| 2.1474        | 0.76  | 2525000 | 0.5223   | 2.1738          |
+| 2.1487        | 0.76  | 2530000 | 0.5223   | 2.1738          |
+| 2.1465        | 0.76  | 2535000 | 0.5225   | 2.1719          |
+| 2.1456        | 0.76  | 2540000 | 0.5226   | 2.1719          |
+| 2.1482        | 0.76  | 2545000 | 0.5224   | 2.1719          |
+| 2.1451        | 0.76  | 2550000 | 0.5226   | 2.1719          |
+| 2.143         | 0.77  | 2555000 | 0.5226   | 2.1719          |
+| 2.1463        | 0.77  | 2560000 | 0.5225   | 2.1719          |
+| 2.1466        | 0.77  | 2565000 | 0.5228   | 2.1699          |
+| 2.1423        | 0.77  | 2570000 | 0.5229   | 2.1699          |
+| 2.1423        | 0.77  | 2575000 | 0.5231   | 2.1699          |
+| 2.1444        | 0.77  | 2580000 | 0.5230   | 2.1699          |
+| 2.1402        | 0.77  | 2585000 | 0.5230   | 2.1680          |
+| 2.1376        | 0.78  | 2590000 | 0.5231   | 2.1680          |
+| 2.1395        | 0.78  | 2595000 | 0.5232   | 2.1680          |
+| 2.1399        | 0.78  | 2600000 | 0.5233   | 2.1680          |
+| 2.1379        | 0.78  | 2605000 | 0.5231   | 2.1680          |
+| 2.1411        | 0.78  | 2610000 | 0.5234   | 2.1660          |
+| 2.1421        | 0.78  | 2615000 | 0.5232   | 2.1660          |
+| 2.1412        | 0.78  | 2620000 | 0.5237   | 2.1660          |
+| 2.1381        | 0.79  | 2625000 | 0.5236   | 2.1660          |
+| 2.142         | 0.79  | 2630000 | 0.5236   | 2.1660          |
+| 2.1394        | 0.79  | 2635000 | 0.5236   | 2.1641          |
+| 2.1384        | 0.79  | 2640000 | 0.5234   | 2.1641          |
+| 2.138         | 0.79  | 2645000 | 0.5236   | 2.1641          |
+| 2.1346        | 0.79  | 2650000 | 0.5239   | 2.1641          |
+| 2.1376        | 0.8   | 2655000 | 0.5239   | 2.1641          |
+| 2.1409        | 0.8   | 2660000 | 0.5240   | 2.1641          |
+| 2.1343        | 0.8   | 2665000 | 0.5240   | 2.1641          |
+| 2.1363        | 0.8   | 2670000 | 0.5240   | 2.1621          |
+| 2.1343        | 0.8   | 2675000 | 0.5242   | 2.1621          |
+| 2.1381        | 0.8   | 2680000 | 0.5243   | 2.1621          |
+| 2.1355        | 0.8   | 2685000 | 0.5241   | 2.1621          |
+| 2.1394        | 0.81  | 2690000 | 0.5242   | 2.1602          |
+| 2.1359        | 0.81  | 2695000 | 0.5245   | 2.1602          |
+| 2.1365        | 0.81  | 2700000 | 0.5244   | 2.1602          |
+| 2.131         | 0.81  | 2705000 | 0.5244   | 2.1602          |
+| 2.1337        | 0.81  | 2710000 | 0.5244   | 2.1602          |
+| 2.1307        | 0.81  | 2715000 | 0.5246   | 2.1582          |
+| 2.1333        | 0.81  | 2720000 | 0.5247   | 2.1582          |
+| 2.1354        | 0.82  | 2725000 | 0.5246   | 2.1582          |
+| 2.1372        | 0.82  | 2730000 | 0.5248   | 2.1582          |
+| 2.1323        | 0.82  | 2735000 | 0.5248   | 2.1582          |
+| 2.1315        | 0.82  | 2740000 | 0.5249   | 2.1562          |
+| 2.1341        | 0.82  | 2745000 | 0.5249   | 2.1562          |
+| 2.132         | 0.82  | 2750000 | 0.5250   | 2.1562          |
+| 2.1322        | 0.83  | 2755000 | 0.5252   | 2.1562          |
+| 2.1298        | 0.83  | 2760000 | 0.5252   | 2.1562          |
+| 2.1285        | 0.83  | 2765000 | 0.5252   | 2.1543          |
+| 2.1299        | 0.83  | 2770000 | 0.5252   | 2.1562          |
+| 2.1304        | 0.83  | 2775000 | 0.5253   | 2.1543          |
+| 2.1288        | 0.83  | 2780000 | 0.5254   | 2.1543          |
+| 2.1295        | 0.83  | 2785000 | 0.5253   | 2.1543          |
+| 2.129         | 0.84  | 2790000 | 0.5255   | 2.1543          |
+| 2.1285        | 0.84  | 2795000 | 0.5254   | 2.1543          |
+| 2.1292        | 0.84  | 2800000 | 0.5253   | 2.1543          |
+| 2.1278        | 0.84  | 2805000 | 0.5256   | 2.1523          |
+| 2.1239        | 0.84  | 2810000 | 0.5255   | 2.1523          |
+| 2.1241        | 0.84  | 2815000 | 0.5259   | 2.1523          |
+| 2.1232        | 0.84  | 2820000 | 0.5257   | 2.1523          |
+| 2.1241        | 0.85  | 2825000 | 0.5257   | 2.1504          |
+| 2.1236        | 0.85  | 2830000 | 0.5259   | 2.1504          |
+| 2.1272        | 0.85  | 2835000 | 0.5259   | 2.1504          |
+| 2.1271        | 0.85  | 2840000 | 0.5261   | 2.1504          |
+| 2.1249        | 0.85  | 2845000 | 0.5262   | 2.1484          |
+| 2.1245        | 0.85  | 2850000 | 0.5260   | 2.1484          |
+| 2.1222        | 0.86  | 2855000 | 0.5261   | 2.1484          |
+| 2.125         | 0.86  | 2860000 | 0.5263   | 2.1484          |
+| 2.1261        | 0.86  | 2865000 | 0.5261   | 2.1484          |
+| 2.1247        | 0.86  | 2870000 | 0.5262   | 2.1484          |
+| 2.1225        | 0.86  | 2875000 | 0.5263   | 2.1484          |
+| 2.122         | 0.86  | 2880000 | 0.5261   | 2.1484          |
+| 2.1237        | 0.86  | 2885000 | 0.5261   | 2.1465          |
+| 2.1219        | 0.87  | 2890000 | 0.5262   | 2.1465          |
+| 2.1248        | 0.87  | 2895000 | 0.5262   | 2.1465          |
+| 2.1191        | 0.87  | 2900000 | 0.5264   | 2.1465          |
+| 2.1181        | 0.87  | 2905000 | 0.5264   | 2.1465          |
+| 2.1176        | 0.87  | 2910000 | 0.5263   | 2.1465          |
+| 2.1191        | 0.87  | 2915000 | 0.5267   | 2.1465          |
+| 2.1206        | 0.87  | 2920000 | 0.5268   | 2.1445          |
+| 2.1148        | 0.88  | 2925000 | 0.5267   | 2.1445          |
+| 2.1188        | 0.88  | 2930000 | 0.5270   | 2.1445          |
+| 2.1118        | 0.88  | 2935000 | 0.5270   | 2.1445          |
+| 2.1283        | 0.88  | 2940000 | 0.5244   | 2.1582          |
+| 2.1336        | 0.88  | 2945000 | 0.5240   | 2.1621          |
+| 2.1311        | 0.88  | 2950000 | 0.5237   | 2.1621          |
+| 2.1377        | 0.89  | 2955000 | 0.5236   | 2.1641          |
+| 2.136         | 0.89  | 2960000 | 0.5236   | 2.1641          |
+| 2.1394        | 0.89  | 2965000 | 0.5233   | 2.1641          |
+| 2.1405        | 0.89  | 2970000 | 0.5233   | 2.1660          |
+| 2.1391        | 0.89  | 2975000 | 0.5236   | 2.1660          |
+| 2.1353        | 0.89  | 2980000 | 0.5234   | 2.1660          |
+| 2.1392        | 0.89  | 2985000 | 0.5234   | 2.1660          |
+| 2.1384        | 0.9   | 2990000 | 0.5235   | 2.1660          |
+| 2.1373        | 0.9   | 2995000 | 0.5233   | 2.1660          |
+| 2.1346        | 0.9   | 3000000 | 0.5234   | 2.1660          |
+| 2.1368        | 0.9   | 3005000 | 0.5235   | 2.1660          |
+| 2.1383        | 0.9   | 3010000 | 0.5233   | 2.1660          |
+| 2.1447        | 0.9   | 3015000 | 0.5233   | 2.1660          |
+| 2.1392        | 0.9   | 3020000 | 0.5234   | 2.1660          |
+| 2.1359        | 0.91  | 3025000 | 0.5233   | 2.1660          |
+| 2.1408        | 0.91  | 3030000 | 0.5233   | 2.1660          |
+| 2.1437        | 0.91  | 3035000 | 0.5233   | 2.1660          |
+| 2.1354        | 0.91  | 3040000 | 0.5233   | 2.1660          |
+| 2.1371        | 0.91  | 3045000 | 0.5235   | 2.1660          |
+| 2.1399        | 0.91  | 3050000 | 0.5234   | 2.1660          |
+| 2.1387        | 0.92  | 3055000 | 0.5234   | 2.1660          |
+| 2.1406        | 0.92  | 3060000 | 0.5232   | 2.1660          |
+| 2.1387        | 0.92  | 3065000 | 0.5235   | 2.1660          |
+| 2.1413        | 0.92  | 3070000 | 0.5235   | 2.1660          |
+| 2.1371        | 0.92  | 3075000 | 0.5235   | 2.1641          |
+| 2.138         | 0.92  | 3080000 | 0.5235   | 2.1641          |
+| 2.1385        | 0.92  | 3085000 | 0.5236   | 2.1641          |
+| 2.135         | 0.93  | 3090000 | 0.5234   | 2.1660          |
+| 2.1401        | 0.93  | 3095000 | 0.5236   | 2.1641          |
+| 2.1374        | 0.93  | 3100000 | 0.5236   | 2.1641          |
+| 2.1358        | 0.93  | 3105000 | 0.5237   | 2.1641          |
+| 2.1344        | 0.93  | 3110000 | 0.5239   | 2.1621          |
+| 2.1368        | 0.93  | 3115000 | 0.5239   | 2.1621          |
+| 2.1345        | 0.93  | 3120000 | 0.5237   | 2.1621          |
+| 2.1358        | 0.94  | 3125000 | 0.5239   | 2.1621          |
+| 2.1395        | 0.94  | 3130000 | 0.5239   | 2.1621          |
+| 2.1359        | 0.94  | 3135000 | 0.5243   | 2.1621          |
+| 2.1373        | 0.94  | 3140000 | 0.5242   | 2.1602          |
+| 2.1357        | 0.94  | 3145000 | 0.5243   | 2.1602          |
+| 2.1354        | 0.94  | 3150000 | 0.5244   | 2.1602          |
+| 2.1323        | 0.95  | 3155000 | 0.5244   | 2.1602          |
+| 2.133         | 0.95  | 3160000 | 0.5242   | 2.1602          |
+| 2.1315        | 0.95  | 3165000 | 0.5244   | 2.1602          |
+| 2.1363        | 0.95  | 3170000 | 0.5243   | 2.1602          |
+| 2.1349        | 0.95  | 3175000 | 0.5245   | 2.1602          |
+| 2.1336        | 0.95  | 3180000 | 0.5244   | 2.1602          |
+| 2.1364        | 0.95  | 3185000 | 0.5244   | 2.1582          |
+| 2.133         | 0.96  | 3190000 | 0.5243   | 2.1582          |
+| 2.1349        | 0.96  | 3195000 | 0.5245   | 2.1582          |
+| 2.134         | 0.96  | 3200000 | 0.5246   | 2.1582          |
+| 2.1308        | 0.96  | 3205000 | 0.5249   | 2.1562          |
+| 2.1302        | 0.96  | 3210000 | 0.5247   | 2.1562          |
+| 2.1302        | 0.96  | 3215000 | 0.5247   | 2.1562          |
+| 2.1331        | 0.96  | 3220000 | 0.5248   | 2.1562          |
+| 2.1273        | 0.97  | 3225000 | 0.5247   | 2.1562          |
+| 2.1286        | 0.97  | 3230000 | 0.5250   | 2.1562          |
+| 2.1282        | 0.97  | 3235000 | 0.5250   | 2.1543          |
+| 2.1309        | 0.97  | 3240000 | 0.5251   | 2.1543          |
+| 2.1295        | 0.97  | 3245000 | 0.5254   | 2.1543          |
+| 2.1275        | 0.97  | 3250000 | 0.5254   | 2.1543          |
+| 2.133         | 0.98  | 3255000 | 0.5254   | 2.1543          |
+| 2.1301        | 0.98  | 3260000 | 0.5251   | 2.1543          |
+| 2.1314        | 0.98  | 3265000 | 0.5253   | 2.1523          |
+| 2.1258        | 0.98  | 3270000 | 0.5255   | 2.1523          |
+| 2.1286        | 0.98  | 3275000 | 0.5254   | 2.1523          |
+| 2.1267        | 0.98  | 3280000 | 0.5254   | 2.1523          |
+| 2.13          | 0.98  | 3285000 | 0.5254   | 2.1523          |
+| 2.1284        | 0.99  | 3290000 | 0.5255   | 2.1523          |
+| 2.1295        | 0.99  | 3295000 | 0.5254   | 2.1523          |
+| 2.1241        | 0.99  | 3300000 | 0.5256   | 2.1523          |
+| 2.1297        | 0.99  | 3305000 | 0.5258   | 2.1523          |
+| 2.126         | 0.99  | 3310000 | 0.5256   | 2.1504          |
+| 2.1263        | 0.99  | 3315000 | 0.5256   | 2.1504          |
+| 2.1273        | 0.99  | 3320000 | 0.5256   | 2.1504          |
+| 2.1214        | 1.0   | 3325000 | 0.5255   | 2.1504          |
+| 2.1275        | 1.0   | 3330000 | 0.5256   | 2.1504          |
+| 2.1227        | 1.0   | 3335000 | 0.5258   | 2.1504          |
+### Framework versions
+- Transformers 4.30.2
+- Pytorch 2.0.0
+- Datasets 2.13.1
+- Tokenizers 0.13.3

all_results.json ADDED Viewed

	@@ -0,0 +1,15 @@

+{
+    "epoch": 1.0,
+    "eval_accuracy": 0.5258444783433934,
+    "eval_loss": 2.150390625,
+    "eval_runtime": 41.4599,
+    "eval_samples": 3568,
+    "eval_samples_per_second": 86.059,
+    "eval_steps_per_second": 10.757,
+    "perplexity": 8.5882125125473,
+    "train_loss": 0.2576859601399347,
+    "train_runtime": 150388.7494,
+    "train_samples": 26705019,
+    "train_samples_per_second": 177.573,
+    "train_steps_per_second": 22.197
+}
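
The `perplexity` value in this file appears to be the exponential of `eval_loss`, which is the usual convention when the loss is a mean per-token cross-entropy in nats. A minimal Python sketch that cross-checks this, together with the reported training throughput, is shown below. The variable names are illustrative and the numbers are copied from the JSON above.

```python
import math

# Values copied from all_results.json.
eval_loss = 2.150390625
reported_perplexity = 8.5882125125473
train_samples = 26705019
train_runtime = 150388.7494  # seconds

# Perplexity as exp(mean cross-entropy); assumes the loss is measured in nats.
perplexity = math.exp(eval_loss)

# Throughput cross-check: samples processed per second of training runtime.
samples_per_second = train_samples / train_runtime

print(f"perplexity      = {perplexity:.4f}")
print(f"train samples/s = {samples_per_second:.3f}")
```

Both derived values agree with the reported fields to the printed precision, which suggests the file is internally consistent on these points.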

config.json ADDED Viewed

	@@ -0,0 +1,34 @@

+{
+  "_name_or_path": "/bscdata/models/falcon_7b_balanced_tokenizer_fp16/",
+  "alibi": false,
+  "apply_residual_connection_post_layernorm": false,
+  "architectures": [
+    "RWForCausalLM"
+  ],
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_RW.RWConfig",
+    "AutoModel": "modelling_RW.RWModel",
+    "AutoModelForCausalLM": "modelling_RW.RWForCausalLM",
+    "AutoModelForQuestionAnswering": "modelling_RW.RWForQuestionAnswering",
+    "AutoModelForSequenceClassification": "modelling_RW.RWForSequenceClassification",
+    "AutoModelForTokenClassification": "modelling_RW.RWForTokenClassification"
+  },
+  "bias": false,
+  "bos_token_id": 50256,
+  "eos_token_id": 50256,
+  "hidden_dropout": 0.0,
+  "hidden_size": 4544,
+  "initializer_range": 0.02,
+  "layer_norm_epsilon": 1e-05,
+  "model_type": "RefinedWebModel",
+  "multi_query": true,
+  "n_head": 71,
+  "n_layer": 32,
+  "pad_token_id": 50256,
+  "parallel_attn": true,
+  "torch_dtype": "float16",
+  "transformers_version": "4.30.2",
+  "use_cache": true,
+  "vocab_size": 50257
+}
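
The attention shapes implied by this config can be worked out by hand. The sketch below assumes the formulas used by `RWConfig.head_dim` and by the `query_key_value` projection in `modelling_RW.py` from this same commit: with `multi_query` enabled, the fused projection emits one query per head plus a single shared key and a single shared value. The variable names are illustrative.

```python
# Values copied from config.json.
hidden_size = 4544
n_head = 71
multi_query = True

# Per-head width, as in RWConfig.head_dim.
head_dim = hidden_size // n_head
assert head_dim * n_head == hidden_size, "hidden_size must be divisible by n_head"

# Output width of the fused QKV projection.
if multi_query:
    # n_head query heads + 1 shared key head + 1 shared value head
    qkv_out_features = hidden_size + 2 * head_dim
else:
    # separate query, key and value for every head
    qkv_out_features = 3 * hidden_size

# Number of key/value heads kept in the cache per layer.
num_kv = 1 if multi_query else n_head

print(head_dim, qkv_out_features, num_kv)
```

Under these assumptions the key/value cache holds one head per layer instead of 71, which is the main memory saving of multi-query attention.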

configuration_RW.py ADDED Viewed

	@@ -0,0 +1,79 @@

+# coding=utf-8
+# Copyright 2022 the Big Science Workshop and HuggingFace Inc. team.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Bloom configuration"""
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+class RWConfig(PretrainedConfig):
+    model_type = "RefinedWebModel"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    attribute_map = {
+        "num_hidden_layers": "n_layer",
+        "num_attention_heads": "n_head",
+    }
+    def __init__(
+        self,
+        vocab_size=250880,
+        hidden_size=64,
+        n_layer=2,
+        n_head=8,
+        layer_norm_epsilon=1e-5,
+        initializer_range=0.02,
+        use_cache=True,
+        bos_token_id=1,
+        eos_token_id=2,
+        apply_residual_connection_post_layernorm=False,
+        hidden_dropout=0.0,
+        attention_dropout=0.0,
+        multi_query=False,
+        alibi=False,
+        bias=False,
+        parallel_attn=False,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        # Backward compatibility with n_embed kwarg
+        n_embed = kwargs.pop("n_embed", None)
+        self.hidden_size = hidden_size if n_embed is None else n_embed
+        self.n_layer = n_layer
+        self.n_head = n_head
+        self.layer_norm_epsilon = layer_norm_epsilon
+        self.initializer_range = initializer_range
+        self.use_cache = use_cache
+        self.apply_residual_connection_post_layernorm = apply_residual_connection_post_layernorm
+        self.hidden_dropout = hidden_dropout
+        self.attention_dropout = attention_dropout
+        self.bos_token_id = bos_token_id
+        self.eos_token_id = eos_token_id
+        self.multi_query = multi_query
+        self.alibi = alibi
+        self.bias = bias
+        self.parallel_attn = parallel_attn
+        super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
+    @property
+    def head_dim(self):
+        return self.hidden_size // self.n_head
+    @property
+    def rotary(self):
+        return not self.alibi

eval_results.json ADDED Viewed

	@@ -0,0 +1,10 @@

+{
+    "epoch": 1.0,
+    "eval_accuracy": 0.5258444783433934,
+    "eval_loss": 2.150390625,
+    "eval_runtime": 41.4599,
+    "eval_samples": 3568,
+    "eval_samples_per_second": 86.059,
+    "eval_steps_per_second": 10.757,
+    "perplexity": 8.5882125125473
+}

generation_config.json ADDED Viewed

	@@ -0,0 +1,6 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "transformers_version": "4.30.2"
+}
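
Note that the token ids here (`bos_token_id: 1`, `eos_token_id: 2`) differ from those in `config.json` (`50256` for both), and match the defaults in `RWConfig.__init__` instead. Which file takes precedence at generation time depends on the Transformers version, so the sketch below is only a consistency check on the two dictionaries, not a diagnosis. The helper name `token_id_mismatches` is hypothetical.

```python
# Token ids copied from config.json and generation_config.json in this commit.
model_config = {"bos_token_id": 50256, "eos_token_id": 50256, "pad_token_id": 50256}
generation_config = {"bos_token_id": 1, "eos_token_id": 2}


def token_id_mismatches(model_cfg, gen_cfg):
    """Return {key: (model_value, generation_value)} for shared keys whose values differ."""
    return {
        key: (model_cfg[key], gen_cfg[key])
        for key in gen_cfg
        if key in model_cfg and model_cfg[key] != gen_cfg[key]
    }


mismatches = token_id_mismatches(model_config, generation_config)
for key, (model_value, gen_value) in mismatches.items():
    print(f"{key}: config.json={model_value}, generation_config.json={gen_value}")
```

If generation does not stop where expected, passing `eos_token_id` explicitly to `generate` is one way to rule this mismatch out.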

merges.txt ADDED Viewed

The diff for this file is too large to render. See raw diff

modelling_RW.py ADDED Viewed

	@@ -0,0 +1,1100 @@

+# port of models described in RW
+# We use the bloom model as a starting point for these models.
+# Please refer to the bloom models for usage instructions.
+import math
+import warnings
+from typing import Optional, Tuple, Union
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, LayerNorm, MSELoss
+from torch.nn import functional as F
+from transformers.modeling_outputs import (
+    BaseModelOutputWithPastAndCrossAttentions,
+    CausalLMOutputWithCrossAttentions,
+    QuestionAnsweringModelOutput,
+    SequenceClassifierOutputWithPast,
+    TokenClassifierOutput,
+)
+from transformers.modeling_utils import PreTrainedModel
+from transformers.utils import logging
+from .configuration_RW import RWConfig
+logger = logging.get_logger(__name__)
+# NOTE(Hesslow): Unfortunately we did not fuse matmul and bias during training, this means that there's one additional quantization to bfloat16 between the operations.
+# In order not to degrade the quality of our HF-port, we keep these characteristics in the final model.
+class Linear(nn.Linear):
+    def forward(self, input: torch.Tensor) -> torch.Tensor:
+        ret = input @ self.weight.T
+        if self.bias is None:
+            return ret
+        else:
+            return ret + self.bias
+from einops import rearrange
+# rotary pos emb helpers (torch.jit.script does not seem to support staticmethod...)
+def rotate_half(x):
+    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=x1.ndim - 1)  # dim=-1 triggers a bug in torch < 1.8.0
+class RotaryEmbedding(torch.nn.Module):
+    """Implementation of RotaryEmbedding from GPT-NeoX.
+    This implementation is designed to operate on queries and keys that are compatible with
+    [batch_size, n_heads_per_partition, seq_len, head_dim] (e.g. MinGPTAttention format).
+    """
+    def __init__(
+        self,
+        head_dim: int,
+        base=10000,
+    ):
+        super().__init__()
+        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+        self.head_dim = head_dim
+        self.seq_len_cached = None
+        self.batch_size_cached = None
+        self.cos_cached: torch.Tensor | None = None
+        self.sin_cached: torch.Tensor | None = None
+    def cos_sin(
+        self,
+        seq_len: int,
+        device="cuda",
+        dtype=torch.bfloat16,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        if seq_len != self.seq_len_cached:
+            self.seq_len_cached = seq_len
+            t = torch.arange(seq_len, device=device).type_as(self.inv_freq)
+            freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+            emb = torch.cat((freqs, freqs), dim=-1).to(device)
+            if dtype in [torch.float16, torch.bfloat16]:
+                emb = emb.float()
+            self.cos_cached = emb.cos()[None, :, :]
+            self.sin_cached = emb.sin()[None, :, :]
+            self.cos_cached = self.cos_cached.type(dtype)
+            self.sin_cached = self.sin_cached.type(dtype)
+        return self.cos_cached, self.sin_cached
+    def forward(self, q, k):
+        batch, seq_len, head_dim = q.shape
+        cos, sin = self.cos_sin(seq_len, q.device, q.dtype)
+        return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
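
The rotation applied by `RotaryEmbedding.forward` can be illustrated without PyTorch. The following is a pure-Python sketch for a single head vector, assuming the same pairing as `rotate_half` (element `i` is paired with element `i + head_dim // 2`). The helpers `apply_rotary` and `dot` are illustrative and not part of the file. It checks the property that makes rotary embeddings useful: the query-key score depends only on the relative offset between positions.

```python
import math


def rotate_half(x):
    """List analogue of rotate_half: [x1, x2] -> [-x2, x1] on the two halves."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rotary(x, position, base=10000):
    """Rotate one head vector (even length) for a given token position."""
    head_dim = len(x)
    inv_freq = [1.0 / (base ** (i / head_dim)) for i in range(0, head_dim, 2)]
    angles = [position * f for f in inv_freq]
    angles = angles + angles  # mirrors torch.cat((freqs, freqs), dim=-1)
    rotated = rotate_half(x)
    return [xi * math.cos(a) + ri * math.sin(a) for xi, ri, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]

# Same relative offset (3) at two different absolute positions.
score_near = dot(apply_rotary(q, 5), apply_rotary(k, 2))
score_far = dot(apply_rotary(q, 105), apply_rotary(k, 102))

# The rotation also leaves the vector norm unchanged.
norm_before = dot(q, q)
norm_after = dot(apply_rotary(q, 5), apply_rotary(q, 5))

print(score_near, score_far, norm_before, norm_after)
```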
+def _make_causal_mask(
+    input_ids_shape: torch.Size, device: torch.device, past_key_values_length: int
+) -> torch.BoolTensor:
+    batch_size, target_length = input_ids_shape
+    mask = torch.empty((target_length, target_length + past_key_values_length), dtype=torch.bool, device=device)
+    # ONNX doesn't support `torch.Tensor.triu` properly, thus we use this workaround
+    seq_ids = torch.arange(target_length, device=device)
+    mask[:, past_key_values_length:] = seq_ids[:, None] < seq_ids[None, :]
+    if past_key_values_length > 0:
+        mask[:, :past_key_values_length] = False
+    expanded_mask = mask[None, None, :, :].expand(batch_size, 1, target_length, target_length + past_key_values_length)
+    return expanded_mask
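
To make the mask layout concrete, here is a list-of-lists sketch of what `_make_causal_mask` produces for a single batch element, assuming the convention used in this file that `True` marks a position to be masked out. The function name `make_causal_mask` is illustrative.

```python
def make_causal_mask(target_length, past_key_values_length=0):
    """Row i is the query position; columns are cached positions followed by current ones."""
    mask = []
    for i in range(target_length):
        past = [False] * past_key_values_length          # cached tokens are always visible
        current = [i < j for j in range(target_length)]  # hide strictly future tokens
        mask.append(past + current)
    return mask


# Three new tokens attending over two cached tokens plus themselves.
for row in make_causal_mask(3, past_key_values_length=2):
    print(["x" if masked else "." for masked in row])
```

Each row gains one more visible column than the row above it, and the cached columns on the left are never masked.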
+def _expand_mask(mask: torch.Tensor, tgt_length: int) -> torch.BoolTensor:
+    batch_size, src_length = mask.shape
+    tgt_length = tgt_length if tgt_length is not None else src_length
+    expanded_mask = ~(mask[:, None, None, :].to(torch.bool))
+    return expanded_mask.expand(batch_size, 1, tgt_length, src_length)
+def build_alibi_tensor(attention_mask: torch.Tensor, num_heads: int, dtype: torch.dtype) -> torch.Tensor:
+    batch_size, seq_length = attention_mask.shape
+    closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
+    base = torch.tensor(
+        2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3))), device=attention_mask.device, dtype=torch.float32
+    )
+    powers = torch.arange(1, 1 + closest_power_of_2, device=attention_mask.device, dtype=torch.int32)
+    slopes = torch.pow(base, powers)
+    if closest_power_of_2 != num_heads:
+        extra_base = torch.tensor(
+            2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3))), device=attention_mask.device, dtype=torch.float32
+        )
+        num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2)
+        extra_powers = torch.arange(1, 1 + 2 * num_remaining_heads, 2, device=attention_mask.device, dtype=torch.int32)
+        slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)], dim=0)
+    # Note: alibi will be added to the attention bias that will be applied to the query, key product of attention
+    # => therefore alibi will have to be of shape (batch_size, num_heads, query_length, key_length)
+    # => here we set (batch_size=1, num_heads=num_heads, query_length=1, key_length=max_length)
+    # => the query_length dimension will then be broadcasted correctly
+    # This is more or less identical to T5's relative position bias:
+    # https://github.com/huggingface/transformers/blob/f681437203baa7671de3174b0fa583c349d9d5e1/src/transformers/models/t5/modeling_t5.py#L527
+    arange_tensor = ((attention_mask.cumsum(dim=-1) - 1) * attention_mask)[:, None, :]
+    alibi = slopes[..., None].bfloat16() * arange_tensor
+    return alibi.reshape(batch_size * num_heads, 1, seq_length).to(dtype)
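
The per-head slopes computed in `build_alibi_tensor` follow a simple geometric pattern. The sketch below reproduces only the slope computation in plain Python, mirroring the formulas above. The name `alibi_slopes` is illustrative. Since `config.json` in this commit sets `alibi` to `false`, this path is presumably not exercised by this checkpoint.

```python
import math


def alibi_slopes(num_heads):
    """Per-head ALiBi slopes, following the formulas in build_alibi_tensor."""
    closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
    base = 2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3)))
    slopes = [base ** power for power in range(1, closest_power_of_2 + 1)]
    if closest_power_of_2 != num_heads:
        # Remaining heads take odd powers of a base derived from the next power of two.
        extra_base = 2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3)))
        num_remaining = min(closest_power_of_2, num_heads - closest_power_of_2)
        slopes += [extra_base ** power for power in range(1, 1 + 2 * num_remaining, 2)]
    return slopes


print(alibi_slopes(8))   # halves at every head: 1/2, 1/4, ..., 1/256
print(alibi_slopes(12))  # the 8 slopes above plus 4 interleaved extras
```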
+def dropout_add(x: torch.Tensor, residual: torch.Tensor, prob: float, training: bool) -> torch.Tensor:
+    out = F.dropout(x, p=prob, training=training)
+    out = residual + out
+    return out
+class Attention(nn.Module):
+    def __init__(self, config: RWConfig):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.n_head
+        self.head_dim = self.hidden_size // self.num_heads
+        self.split_size = self.hidden_size
+        self.hidden_dropout = config.hidden_dropout
+        if self.head_dim * self.num_heads != self.hidden_size:
+            raise ValueError(
+                f"`hidden_size` must be divisible by num_heads (got `hidden_size`: {self.hidden_size} and `num_heads`:"
+                f" {self.num_heads})."
+            )
+        self.maybe_rotary = RotaryEmbedding(config.head_dim) if config.rotary else lambda q, k: (q, k)
+        # Layer-wise attention scaling
+        self.inv_norm_factor = 1.0 / math.sqrt(self.head_dim)
+        self.beta = self.inv_norm_factor
+        self.query_key_value = Linear(
+            self.hidden_size,
+            3 * self.hidden_size if not config.multi_query else (self.hidden_size + 2 * self.head_dim),
+            bias=config.bias,
+        )
+        self.multi_query = config.multi_query
+        self.dense = Linear(self.hidden_size, self.hidden_size, bias=config.bias)
+        self.attention_dropout = nn.Dropout(config.attention_dropout)
+        self.num_kv = config.n_head if not self.multi_query else 1
+    def _split_heads(self, fused_qkv: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        """
+        Split the last dimension into (num_heads, head_dim) without making any copies, results share same memory
+        storage as `fused_qkv`
+        Args:
+            fused_qkv (`torch.tensor`, *required*): [batch_size, seq_length, num_heads * 3 * head_dim]
+        Returns:
+            query: [batch_size, seq_length, num_heads, head_dim] key: [batch_size, seq_length, num_heads, head_dim]
+            value: [batch_size, seq_length, num_heads, head_dim]
+        """
+        if not self.multi_query:
+            batch_size, seq_length, three_times_hidden_size = fused_qkv.shape
+            fused_qkv = fused_qkv.view(batch_size, seq_length, self.num_heads, 3, self.head_dim)
+            return fused_qkv[..., 0, :], fused_qkv[..., 1, :], fused_qkv[..., 2, :]
+        else:
+            batch_size, seq_length, three_times_hidden_size = fused_qkv.shape
+            fused_qkv = fused_qkv.view(batch_size, seq_length, self.num_heads + 2, self.head_dim)
+            return fused_qkv[..., :-2, :], fused_qkv[..., [-2], :], fused_qkv[..., [-1], :]
+    def _merge_heads(self, x: torch.Tensor) -> torch.Tensor:
+        """
+        Merge heads together over the last dimension
+        Args:
+            x: (`torch.tensor`, *required*): [batch_size * num_heads, seq_length, head_dim]
+        Returns:
+            torch.tensor: [batch_size, seq_length, num_heads * head_dim]
+        """
+        # What we want to achieve is:
+        # batch_size * num_heads, seq_length, head_dim -> batch_size, seq_length, num_heads * head_dim
+        batch_size_and_num_heads, seq_length, _ = x.shape
+        batch_size = batch_size_and_num_heads // self.num_heads
+        # First view to decompose the batch size
+        # batch_size * num_heads, seq_length, head_dim -> batch_size, num_heads, seq_length, head_dim
+        x = x.view(batch_size, self.num_heads, seq_length, self.head_dim)
+        # batch_size, num_heads, seq_length, head_dim -> batch_size, seq_length, num_heads, head_dim
+        x = x.permute(0, 2, 1, 3)
+        # batch_size, seq_length, num_heads, head_dim -> batch_size, seq_length, num_heads * head_dim
+        return x.reshape(batch_size, seq_length, self.num_heads * self.head_dim)
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        alibi: torch.Tensor,
+        attention_mask: torch.Tensor,
+        layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        use_cache: bool = False,
+        output_attentions: bool = False,
+    ):
+        fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
+        # 3 x [batch_size, seq_length, num_heads, head_dim]
+        (query_layer, key_layer, value_layer) = self._split_heads(fused_qkv)
+        batch_size, q_length, _, _ = query_layer.shape
+        query_layer = query_layer.transpose(1, 2).reshape(batch_size * self.num_heads, q_length, self.head_dim)
+        key_layer = key_layer.transpose(1, 2).reshape(
+            batch_size * self.num_kv,
+            q_length,
+            self.head_dim,
+        )
+        value_layer = value_layer.transpose(1, 2).reshape(batch_size * self.num_kv, q_length, self.head_dim)
+        query_layer, key_layer = self.maybe_rotary(query_layer, key_layer)
+        if layer_past is not None:
+            past_key, past_value = layer_past
+            # concatenate along seq_length dimension (dim=1):
+            #  - key: [batch_size * self.num_kv, kv_length, head_dim]
+            #  - value: [batch_size * self.num_kv, kv_length, head_dim]
+            key_layer = torch.cat((past_key, key_layer), dim=1)
+            value_layer = torch.cat((past_value, value_layer), dim=1)
+        _, kv_length, _ = key_layer.shape
+        if use_cache is True:
+            present = (key_layer, value_layer)
+        else:
+            present = None
+        if alibi is None:
+            query_layer_ = query_layer.reshape(batch_size, self.num_heads, -1, self.head_dim)
+            key_layer_ = key_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)
+            value_layer_ = value_layer.reshape(batch_size, self.num_kv, -1, self.head_dim)
+            attn_output = F.scaled_dot_product_attention(
+                query_layer_, key_layer_, value_layer_, None, 0.0, is_causal=True
+            )
+            x = attn_output.view(batch_size, self.num_heads, q_length, self.head_dim)
+            x = x.permute(0, 2, 1, 3)
+            attn_output = x.reshape(batch_size, q_length, self.num_heads * self.head_dim)
+            output_tensor = self.dense(attn_output)
+            outputs = (output_tensor, present)
+            assert not output_attentions  # not supported.
+            return outputs
+        else:
+            attention_mask_float = (attention_mask * 1.0).masked_fill(attention_mask, -1e9).to(torch.bfloat16)
+            matmul_result = query_layer @ key_layer.transpose(-1, -2)
+            # change view to [batch_size, num_heads, q_length, kv_length]
+            attention_scores = matmul_result.view(batch_size, self.num_heads, q_length, kv_length)
+            # cast attention scores to fp32, compute scaled softmax and cast back to initial dtype - [batch_size, num_heads, q_length, kv_length]
+            input_dtype = attention_scores.dtype
+            # `float16` has a minimum value of -65504.0, whereas `bfloat16` and `float32` have a minimum value of `-3.4e+38`
+            if input_dtype == torch.float16 or input_dtype == torch.bfloat16:
+                attention_scores = attention_scores.to(torch.float32)
+            # attn_weights = torch.masked_fill(attention_scores, attention_mask, torch.finfo(attention_scores.dtype).min)
+            attention_probs = F.softmax(
+                (attention_scores + alibi.view(batch_size, self.num_heads, 1, -1)) * self.inv_norm_factor + attention_mask_float,
+                dim=-1,
+                dtype=hidden_states.dtype,
+            )
+            # [batch_size, num_heads, q_length, kv_length]
+            attention_probs = self.attention_dropout(attention_probs)
+            if head_mask is not None:
+                attention_probs = attention_probs * head_mask
+            # change view [batch_size x num_heads, q_length, kv_length]
+            attention_probs_reshaped = attention_probs.view(batch_size * self.num_heads, q_length, kv_length)
+            # matmul: [batch_size * num_heads, q_length, head_dim]
+            context_layer = attention_probs_reshaped @ value_layer
+            # merge heads: [batch_size * num_heads, q_length, head_dim] -> [batch_size, q_length, num_heads * head_dim]
+            context_layer = self._merge_heads(context_layer)
+            output_tensor = self.dense(context_layer)
+            outputs = (output_tensor, present)
+            if output_attentions:
+                outputs += (attention_probs,)
+            return outputs
+class MLP(nn.Module):
+    def __init__(self, config: RWConfig):
+        super().__init__()
+        hidden_size = config.hidden_size
+        self.dense_h_to_4h = Linear(hidden_size, 4 * hidden_size, bias=config.bias)
+        self.act = nn.GELU()
+        self.dense_4h_to_h = Linear(4 * hidden_size, hidden_size, bias=config.bias)
+        self.hidden_dropout = config.hidden_dropout
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = self.act(self.dense_h_to_4h(x))
+        x = self.dense_4h_to_h(x)
+        return x
+class DecoderLayer(nn.Module):
+    def __init__(self, config: RWConfig):
+        super().__init__()
+        hidden_size = config.hidden_size
+        self.input_layernorm = LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+        self.num_heads = config.n_head
+        self.self_attention = Attention(config)
+        if not config.parallel_attn:
+            # unused if parallel attn
+            self.post_attention_layernorm = LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+        self.mlp = MLP(config)
+        self.apply_residual_connection_post_layernorm = config.apply_residual_connection_post_layernorm
+        self.hidden_dropout = config.hidden_dropout
+        self.config = config
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        alibi: torch.Tensor,
+        attention_mask: torch.Tensor,
+        layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        use_cache: bool = False,
+        output_attentions: bool = False,
+    ):
+        layernorm_output = self.input_layernorm(hidden_states)
+        residual = hidden_states
+        # Self attention.
+        attn_outputs = self.self_attention(
+            layernorm_output,
+            layer_past=layer_past,
+            attention_mask=attention_mask,
+            alibi=alibi,
+            head_mask=head_mask,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+        )
+        attention_output = attn_outputs[0]
+        if not self.config.parallel_attn:
+            residual = dropout_add(attention_output, residual, self.config.attention_dropout, training=self.training)
+            layernorm_output = self.post_attention_layernorm(residual)
+        outputs = attn_outputs[1:]
+        # MLP.
+        mlp_output = self.mlp(layernorm_output)
+        if self.config.parallel_attn:
+            mlp_output += attention_output
+        output = dropout_add(mlp_output, residual, self.config.hidden_dropout, training=self.training)
+        if use_cache:
+            outputs = (output,) + outputs
+        else:
+            outputs = (output,) + outputs[1:]
+        return outputs  # hidden_states, present, attentions
+class RWPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+    _keys_to_ignore_on_load_missing = [r"h.*.self_attention.scale_mask_softmax.causal_mask", r"lm_head.weight"]
+    config_class = RWConfig
+    base_model_prefix = "transformer"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["DecoderLayer"]
+    def __init__(self, *inputs, **kwargs):
+        super().__init__(*inputs, **kwargs)
+    def _init_weights(self, module: nn.Module):
+        """Initialize the weights."""
+        if isinstance(module, nn.Linear) or isinstance(module, Linear):
+            # Slightly different from the TF version which uses truncated_normal for initialization
+            # cf https://github.com/pytorch/pytorch/pull/5617
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+    def _set_gradient_checkpointing(self, module: nn.Module, value: bool = False):
+        if isinstance(module, RWModel):
+            module.gradient_checkpointing = value
+    @staticmethod
+    def _convert_to_standard_cache(
+        past_key_value: Tuple[Tuple[torch.Tensor, torch.Tensor]], batch_size: int
+    ) -> Tuple[Tuple[torch.Tensor, torch.Tensor]]:
+        """
+        Standardizes the format of the cache so as to match most implementations, i.e. to tuple(tuple([batch_size,
+        num_heads, ...]))
+        """
+        # Both cached tensors are stored as [batch_size * num_heads, kv_length, head_dim]: `Attention.forward`
+        # concatenates keys and values along dim=1. Here `num_heads` is the number of cached key/value heads.
+        batch_size_times_num_heads, kv_length, head_dim = past_key_value[0][0].shape
+        num_heads = batch_size_times_num_heads // batch_size
+        # key: [batch_size * num_heads, kv_length, head_dim] -> [batch_size, num_heads, kv_length, head_dim]
+        # value: [batch_size * num_heads, kv_length, head_dim] -> [batch_size, num_heads, kv_length, head_dim]
+        return tuple(
+            (
+                layer_past[0].view(batch_size, num_heads, kv_length, head_dim),
+                layer_past[1].view(batch_size, num_heads, kv_length, head_dim),
+            )
+            for layer_past in past_key_value
+        )
+    @staticmethod
+    def _convert_to_rw_cache(
+        past_key_value: Tuple[Tuple[torch.Tensor, torch.Tensor]]
+    ) -> Tuple[Tuple[torch.Tensor, torch.Tensor]]:
+        batch_size, num_heads, kv_length, head_dim = past_key_value[0][0].shape
+        batch_size_times_num_heads = batch_size * num_heads
+        # key: [batch_size, num_heads, kv_length, head_dim] -> [batch_size * num_heads, kv_length, head_dim]
+        # value: [batch_size, num_heads, kv_length, head_dim] -> [batch_size * num_heads, kv_length, head_dim]
+        return tuple(
+            (
+                layer_past[0].view(batch_size_times_num_heads, kv_length, head_dim),
+                layer_past[1].view(batch_size_times_num_heads, kv_length, head_dim),
+            )
+            for layer_past in past_key_value
+        )
+class RWModel(RWPreTrainedModel):
+    def __init__(self, config: RWConfig):
+        super().__init__(config)
+        self.embed_dim = config.hidden_size
+        self.num_heads = config.n_head
+        self.alibi = config.alibi
+        # Embedding + LN Embedding
+        self.word_embeddings = nn.Embedding(config.vocab_size, self.embed_dim)
+        # Transformer blocks
+        self.h = nn.ModuleList([DecoderLayer(config) for _ in range(config.num_hidden_layers)])
+        # Final Layer Norm
+        self.ln_f = LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)
+        self.gradient_checkpointing = False
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_input_embeddings(self):
+        return self.word_embeddings
+    def _prepare_attn_mask(
+        self, attention_mask: torch.Tensor, input_shape: Tuple[int, int], past_key_values_length: int
+    ) -> torch.BoolTensor:
+        # create causal mask
+        # [batch_size, seq_length] -> [batch_size, 1, tgt_length, src_length]
+        combined_attention_mask = None
+        device = attention_mask.device
+        _, src_length = input_shape
+        if src_length > 1:
+            combined_attention_mask = _make_causal_mask(
+                input_shape, device=device, past_key_values_length=past_key_values_length
+            )
+        # [batch_size, seq_length] -> [batch_size, 1, tgt_length, src_length]
+        expanded_attn_mask = _expand_mask(attention_mask, tgt_length=src_length)
+        combined_attention_mask = (
+            expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask
+        )
+        return combined_attention_mask
+    def set_input_embeddings(self, new_embeddings: torch.Tensor):
+        self.word_embeddings = new_embeddings
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **deprecated_arguments,
+    ) -> Union[Tuple[torch.Tensor, ...], BaseModelOutputWithPastAndCrossAttentions]:
+        if deprecated_arguments.pop("position_ids", False) is not False:
+            # `position_ids` could have been a `torch.Tensor` or `None`, so defaulting the pop to `False` lets us
+            # detect users who explicitly passed `None`
+            warnings.warn(
+                "`position_ids` have no functionality in RW models and will be removed in v5.0.0. You can safely"
+                " omit `position_ids`.",
+                FutureWarning,
+            )
+        if len(deprecated_arguments) > 0:
+            raise ValueError(f"Got unexpected arguments: {deprecated_arguments}")
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        if input_ids is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
+        elif input_ids is not None:
+            batch_size, seq_length = input_ids.shape
+        elif inputs_embeds is not None:
+            batch_size, seq_length, _ = inputs_embeds.shape
+        else:
+            raise ValueError("You have to specify either input_ids or inputs_embeds")
+        if past_key_values is None:
+            past_key_values = tuple([None] * len(self.h))
+        # Prepare head mask if needed
+        # 1.0 in head_mask indicate we keep the head
+        # attention_probs has shape batch_size x num_heads x N x N
+        # head_mask has shape n_layer x batch x num_heads x N x N
+        head_mask = self.get_head_mask(head_mask, self.config.n_layer)
+        if inputs_embeds is None:
+            inputs_embeds = self.word_embeddings(input_ids)
+        hidden_states = inputs_embeds
+        presents = () if use_cache else None
+        all_self_attentions = () if output_attentions else None
+        all_hidden_states = () if output_hidden_states else None
+        # Compute alibi tensor: check build_alibi_tensor documentation
+        seq_length_with_past = seq_length
+        past_key_values_length = 0
+        if past_key_values[0] is not None:
+            # cached keys are [batch_size * num_kv, kv_length, head_dim], so the sequence axis is dim 1
+            past_key_values_length = past_key_values[0][0].shape[1]
+            seq_length_with_past = seq_length_with_past + past_key_values_length
+        if attention_mask is None:
+            attention_mask = torch.ones((batch_size, seq_length_with_past), device=hidden_states.device)
+        else:
+            attention_mask = attention_mask.to(hidden_states.device)
+        if self.alibi:
+            alibi = build_alibi_tensor(attention_mask, self.num_heads, dtype=hidden_states.dtype)
+        else:
+            alibi = None
+        causal_mask = self._prepare_attn_mask(
+            attention_mask,
+            input_shape=(batch_size, seq_length),
+            past_key_values_length=past_key_values_length,
+        )
+        for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)):
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+            if self.gradient_checkpointing and self.training:
+                if use_cache:
+                    logger.warning(
+                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                    )
+                    use_cache = False
+                def create_custom_forward(module):
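+                    def custom_forward(hidden_states, alibi, attention_mask, head_mask):
+                        # layer_past stays None; head_mask is passed by keyword so it does not bind to `layer_past`
+                        return module(
+                            hidden_states,
+                            alibi,
+                            attention_mask,
+                            head_mask=head_mask,
+                            use_cache=use_cache,
+                            output_attentions=output_attentions,
+                        )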
+                    return custom_forward
+                outputs = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(block),
+                    hidden_states,
+                    alibi,
+                    causal_mask,
+                    head_mask[i],
+                )
+            else:
+                outputs = block(
+                    hidden_states,
+                    layer_past=layer_past,
+                    attention_mask=causal_mask,
+                    head_mask=head_mask[i],
+                    use_cache=use_cache,
+                    output_attentions=output_attentions,
+                    alibi=alibi,
+                )
+            hidden_states = outputs[0]
+            if use_cache is True:
+                presents = presents + (outputs[1],)
+            if output_attentions:
+                all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],)
+        # Add last hidden state
+        hidden_states = self.ln_f(hidden_states)
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+        if not return_dict:
+            return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)
+        return BaseModelOutputWithPastAndCrossAttentions(
+            last_hidden_state=hidden_states,
+            past_key_values=presents,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attentions,
+        )
+class RWForCausalLM(RWPreTrainedModel):
+    _keys_to_ignore_on_load_missing = [r"h.*.self_attention.scale_mask_softmax.causal_mask", r"lm_head.weight"]
+    def __init__(self, config: RWConfig):
+        super().__init__(config)
+        self.transformer = RWModel(config)
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def get_output_embeddings(self):
+        return self.lm_head
+    def set_output_embeddings(self, new_embeddings: torch.Tensor):
+        self.lm_head = new_embeddings
+    def prepare_inputs_for_generation(
+        self,
+        input_ids: torch.LongTensor,
+        past: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        **kwargs,
+    ) -> dict:
+        # only last token for input_ids if past is not None
+        if past:
+            input_ids = input_ids[:, -1].unsqueeze(-1)
+            # the cache may be in the standard format (e.g. in contrastive search); convert to our format if needed
+            if past[0][0].shape[0] == input_ids.shape[0]:
+                past = self._convert_to_rw_cache(past)
+        return {
+            "input_ids": input_ids,
+            "past_key_values": past,
+            "use_cache": kwargs.get("use_cache"),
+            "attention_mask": attention_mask,
+        }
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **deprecated_arguments,
+    ) -> Union[Tuple[torch.Tensor], CausalLMOutputWithCrossAttentions]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
+            `labels = input_ids`. Indices are selected in `[-100, 0, ..., config.vocab_size - 1]`. All labels set to
+            `-100` are ignored (masked); the loss is only computed for labels in `[0, ..., config.vocab_size - 1]`.
+        """
+        if deprecated_arguments.pop("position_ids", False) is not False:
+            # `position_ids` could have been a `torch.Tensor` or `None`, so defaulting the pop to `False` lets us
+            # detect users who explicitly passed `None`
+            warnings.warn(
+                "`position_ids` have no functionality in RW models and will be removed in v5.0.0. You can safely"
+                " omit `position_ids`.",
+                FutureWarning,
+            )
+        if len(deprecated_arguments) > 0:
+            raise ValueError(f"Got unexpected arguments: {deprecated_arguments}")
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        transformer_outputs = self.transformer(
+            input_ids,
+            past_key_values=past_key_values,
+            attention_mask=attention_mask,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        hidden_states = transformer_outputs[0]
+        lm_logits = self.lm_head(hidden_states)
+        loss = None
+        if labels is not None:
+            # Shift so that tokens < n predict n
+            shift_logits = lm_logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            batch_size, seq_length, vocab_size = shift_logits.shape
+            # Flatten the tokens
+            loss_fct = CrossEntropyLoss()
+            loss = loss_fct(
+                shift_logits.view(batch_size * seq_length, vocab_size), shift_labels.view(batch_size * seq_length)
+            )
+        if not return_dict:
+            output = (lm_logits,) + transformer_outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+        return CausalLMOutputWithCrossAttentions(
+            loss=loss,
+            logits=lm_logits,
+            past_key_values=transformer_outputs.past_key_values,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
+    def _reorder_cache(
+        self, past: Tuple[Tuple[torch.Tensor, torch.Tensor], ...], beam_idx: torch.LongTensor
+    ) -> Tuple[Tuple[torch.Tensor, torch.Tensor], ...]:
+        """
+        This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or
+        [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct
+        beam_idx at every generation step.
+        Output shares the same memory storage as `past`.
+        """
+        standardized_past = self._convert_to_standard_cache(past, batch_size=len(beam_idx))
+        # Get a copy of `beam_idx` on all the devices where we need those indices.
+        device_to_beam_idx = {
+            past_state.device: beam_idx.to(past_state.device) for layer_past in past for past_state in layer_past
+        }
+        reordered_past = tuple(
+            (
+                layer_past[0].index_select(0, device_to_beam_idx[layer_past[0].device]),
+                layer_past[1].index_select(0, device_to_beam_idx[layer_past[0].device]),
+            )
+            for layer_past in standardized_past
+        )
+        return self._convert_to_rw_cache(reordered_past)
+class RWForSequenceClassification(RWPreTrainedModel):
+    _keys_to_ignore_on_load_missing = [r"h.*.self_attention.scale_mask_softmax.causal_mask", r"lm_head.weight"]
+    def __init__(self, config: RWConfig):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.transformer = RWModel(config)
+        self.score = nn.Linear(config.hidden_size, config.num_labels, bias=False)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **deprecated_arguments,
+    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutputWithPast]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        if deprecated_arguments.pop("position_ids", False) is not False:
+            # `position_ids` could have been a `torch.Tensor` or `None`, so defaulting the pop to `False` lets us
+            # detect users who explicitly passed `None`
+            warnings.warn(
+                "`position_ids` have no functionality in RW models and will be removed in v5.0.0. You can safely"
+                " omit `position_ids`.",
+                FutureWarning,
+            )
+        if len(deprecated_arguments) > 0:
+            raise ValueError(f"Got unexpected arguments: {deprecated_arguments}")
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        transformer_outputs = self.transformer(
+            input_ids,
+            past_key_values=past_key_values,
+            attention_mask=attention_mask,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        hidden_states = transformer_outputs[0]
+        logits = self.score(hidden_states)
+        if input_ids is not None:
+            batch_size = input_ids.shape[0]
+        else:
+            batch_size = inputs_embeds.shape[0]
+        if self.config.pad_token_id is None and batch_size != 1:
+            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
+        if self.config.pad_token_id is None:
+            sequence_lengths = -1
+        else:
+            if input_ids is not None:
+                sequence_lengths = torch.ne(input_ids, self.config.pad_token_id).sum(dim=-1) - 1
+            else:
+                sequence_lengths = -1
+                logger.warning(
+                    f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
+                    "unexpected if using padding tokens in conjunction with `inputs_embeds.`"
+                )
+        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
+        loss = None
+        if labels is not None:
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(pooled_logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(pooled_logits, labels)
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(pooled_logits, labels)
+        if not return_dict:
+            output = (pooled_logits,) + transformer_outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+        return SequenceClassifierOutputWithPast(
+            loss=loss,
+            logits=pooled_logits,
+            past_key_values=transformer_outputs.past_key_values,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
+class RWForTokenClassification(RWPreTrainedModel):
+    _keys_to_ignore_on_load_missing = [r"h.*.self_attention.scale_mask_softmax.causal_mask", r"lm_head.weight"]
+    def __init__(self, config: RWConfig):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.transformer = RWModel(config)
+        if hasattr(config, "classifier_dropout") and config.classifier_dropout is not None:
+            classifier_dropout = config.classifier_dropout
+        elif hasattr(config, "hidden_dropout") and config.hidden_dropout is not None:
+            classifier_dropout = config.hidden_dropout
+        else:
+            classifier_dropout = 0.1
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **deprecated_arguments,
+    ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels -
+            1]`. A classification loss (Cross-Entropy) is computed over all token positions.
+        """
+        if deprecated_arguments.pop("position_ids", False) is not False:
+            # `position_ids` could have been a `torch.Tensor` or `None`, so defaulting the pop to `False` lets us
+            # detect users who explicitly passed `None`
+            warnings.warn(
+                "`position_ids` have no functionality in RW models and will be removed in v5.0.0. You can safely"
+                " omit `position_ids`.",
+                FutureWarning,
+            )
+        if len(deprecated_arguments) > 0:
+            raise ValueError(f"Got unexpected arguments: {deprecated_arguments}")
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        transformer_outputs = self.transformer(
+            input_ids,
+            past_key_values=past_key_values,
+            attention_mask=attention_mask,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        hidden_states = transformer_outputs[0]
+        hidden_states = self.dropout(hidden_states)
+        logits = self.classifier(hidden_states)
+        loss = None
+        if labels is not None:
+            batch_size, seq_length = labels.shape
+            loss_fct = CrossEntropyLoss()
+            loss = loss_fct(logits.view(batch_size * seq_length, self.num_labels), labels.view(batch_size * seq_length))
+        if not return_dict:
+            output = (logits,) + transformer_outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+        return TokenClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
+class RWForQuestionAnswering(RWPreTrainedModel):
+    _keys_to_ignore_on_load_missing = [r"h.*.self_attention.scale_mask_softmax.causal_mask", r"lm_head.weight"]
+    def __init__(self, config):
+        super().__init__(config)
+        self.transformer = RWModel(config)
+        self.qa_outputs = nn.Linear(config.hidden_size, 2)
+        # Initialize weights and apply final processing
+        self.post_init()
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.FloatTensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        head_mask: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        start_positions: Optional[torch.LongTensor] = None,
+        end_positions: Optional[torch.LongTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, QuestionAnsweringModelOutput]:
+        r"""
+        start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for the position (index) of the start of the labelled span, used to compute the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`). Positions outside the sequence
+            are not taken into account when computing the loss.
+        end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for the position (index) of the end of the labelled span, used to compute the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`). Positions outside the sequence
+            are not taken into account when computing the loss.
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        outputs = self.transformer(
+            input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        sequence_output = outputs[0]
+        logits = self.qa_outputs(sequence_output)
+        start_logits, end_logits = logits.split(1, dim=-1)
+        start_logits = start_logits.squeeze(-1).contiguous()
+        end_logits = end_logits.squeeze(-1).contiguous()
+        total_loss = None
+        if start_positions is not None and end_positions is not None:
+            # On multi-GPU, splitting the batch adds a dimension; squeeze it away
+            if len(start_positions.size()) > 1:
+                start_positions = start_positions.squeeze(-1)
+            if len(end_positions.size()) > 1:
+                end_positions = end_positions.squeeze(-1)
+            # Sometimes the start/end positions fall outside our model inputs; we ignore these terms
+            ignored_index = start_logits.size(1)
+            start_positions = start_positions.clamp(0, ignored_index)
+            end_positions = end_positions.clamp(0, ignored_index)
+            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
+            start_loss = loss_fct(start_logits, start_positions)
+            end_loss = loss_fct(end_logits, end_positions)
+            total_loss = (start_loss + end_loss) / 2
+        if not return_dict:
+            output = (start_logits, end_logits) + outputs[2:]
+            return ((total_loss,) + output) if total_loss is not None else output
+        return QuestionAnsweringModelOutput(
+            loss=total_loss,
+            start_logits=start_logits,
+            end_logits=end_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
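
The span loss in `RWForQuestionAnswering.forward` clamps each label to `[0, seq_len]` and then uses `seq_len` itself as the ignored index, so any label at or beyond the sequence end drops out of the average. The sketch below reproduces that arithmetic in plain Python for illustration only. The helper names `cross_entropy` and `span_loss` are hypothetical and not part of `modelling_RW.py`; the real code uses `torch.nn.CrossEntropyLoss`.

```python
import math

def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over rows whose target is not ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # out-of-range label: contributes nothing to the mean
        log_z = math.log(sum(math.exp(x) for x in row))
        losses.append(log_z - row[target])
    # With no valid rows there is nothing to average; this sketch returns NaN.
    return sum(losses) / len(losses) if losses else float("nan")

def span_loss(start_logits, end_logits, start_positions, end_positions):
    """Average of start and end losses, mirroring the clamp-then-ignore logic."""
    ignored_index = len(start_logits[0])  # equals seq_len, one past the last valid index
    clamp = lambda p: max(0, min(p, ignored_index))
    starts = [clamp(p) for p in start_positions]
    ends = [clamp(p) for p in end_positions]
    start_loss = cross_entropy(start_logits, starts, ignored_index)
    end_loss = cross_entropy(end_logits, ends, ignored_index)
    return (start_loss + end_loss) / 2

# Two examples of length 4 with uniform logits; the second start label (10)
# lies outside the sequence, is clamped to 4, and is therefore ignored.
uniform = [[0.0] * 4, [0.0] * 4]
loss = span_loss(uniform, uniform, [1, 10], [2, 2])
```

With uniform logits every valid position costs `log(4)`, so the ignored label leaves the result unchanged rather than inflating it.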

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bc6957902d07d7a210613c1caa9f18f809699ee4cd0c31de0de5be119efeee39
+size 13709266611
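
The three lines above are a Git LFS pointer, not the weights themselves: each line is a key followed by a space and a value. As a minimal sketch (the function name `parse_lfs_pointer` is hypothetical), the pointer can be parsed with the standard library to read off the declared size:

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }

pointer = parse_lfs_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:bc6957902d07d7a210613c1caa9f18f809699ee4cd0c31de0de5be119efeee39
size 13709266611
""")
size_gib = pointer["size"] / 2**30  # bytes to GiB
```

The `size` field describes the object stored in LFS, so it is the figure to check against available disk space before downloading.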

special_tokens_map.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<|endoftext|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "bos_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "clean_up_tokenization_spaces": true,
+  "eos_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "errors": "replace",
+  "model_max_length": 2048,
+  "pad_token": null,
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
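
The two tokenizer files in this commit do not fully agree: `special_tokens_map.json` sets `pad_token` to `"<|endoftext|>"`, while `tokenizer_config.json` sets it to `null`. The sketch below shows one way to surface such differences. The helper names are hypothetical, and the dictionaries are abbreviated to the fields being compared. Which file takes precedence at load time depends on the tokenizer loading code and is not asserted here.

```python
def token_text(entry):
    """Return the token string from a plain string or an AddedToken-style dict."""
    if isinstance(entry, dict):
        return entry.get("content")
    return entry

def special_token_mismatches(special_tokens_map, tokenizer_config,
                             keys=("bos_token", "eos_token", "pad_token", "unk_token")):
    """Map each key whose token text differs to the (map, config) pair of values."""
    mismatches = {}
    for key in keys:
        in_map = token_text(special_tokens_map.get(key))
        in_config = token_text(tokenizer_config.get(key))
        if in_map != in_config:
            mismatches[key] = (in_map, in_config)
    return mismatches

# Abbreviated copies of the two files as they appear in this diff.
special_tokens_map = {
    "bos_token": {"content": "<|endoftext|>"},
    "eos_token": {"content": "<|endoftext|>"},
    "pad_token": "<|endoftext|>",
    "unk_token": {"content": "<|endoftext|>"},
}
tokenizer_config = {
    "bos_token": {"__type": "AddedToken", "content": "<|endoftext|>"},
    "eos_token": {"__type": "AddedToken", "content": "<|endoftext|>"},
    "pad_token": None,  # JSON null
    "unk_token": {"__type": "AddedToken", "content": "<|endoftext|>"},
}
mismatches = special_token_mismatches(special_tokens_map, tokenizer_config)
```

Here only `pad_token` differs, which is worth confirming on the loaded tokenizer before batching padded inputs.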

train_results.json ADDED

	@@ -0,0 +1,8 @@

+{
+    "epoch": 1.0,
+    "train_loss": 0.2576859601399347,
+    "train_runtime": 150388.7494,
+    "train_samples": 26705019,
+    "train_samples_per_second": 177.573,
+    "train_steps_per_second": 22.197
+}
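
The throughput figures above can be cross-checked against each other. The sketch below assumes the usual reading of these field names (runtime in seconds, rates averaged over the whole run); the diff itself does not define them, so treat the derived quantities as estimates.

```python
train_results = {
    "epoch": 1.0,
    "train_loss": 0.2576859601399347,
    "train_runtime": 150388.7494,
    "train_samples": 26705019,
    "train_samples_per_second": 177.573,
    "train_steps_per_second": 22.197,
}

# Wall-clock duration, assuming train_runtime is in seconds.
runtime_hours = train_results["train_runtime"] / 3600

# Samples implied by rate x runtime; should land close to train_samples,
# up to rounding of the reported rate.
implied_samples = (
    train_results["train_samples_per_second"] * train_results["train_runtime"]
)
relative_gap = abs(implied_samples - train_results["train_samples"]) / train_results["train_samples"]

# Ratio of the two rates: samples consumed per reported step.
samples_per_step = (
    train_results["train_samples_per_second"] / train_results["train_steps_per_second"]
)
```

Under these assumptions the run lasted a little under 42 hours and consumed roughly 8 samples per reported step.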

trainer_state.json ADDED

The diff for this file is too large to render. See raw diff

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3e3730b770a7a69a97a94adeeead1c692defea9bd7b7bf89372f99be11a0b762
+size 4987

vocab.json ADDED

The diff for this file is too large to render. See raw diff