---
license: apache-2.0
base_model: pszemraj/random-mega-ar-small-4096
tags:
- generated_from_trainer
metrics:
- accuracy
datasets:
- EleutherAI/wikitext_document_level
language:
- en
pipeline_tag: text-generation
inference:
  parameters:
    max_new_tokens: 96
    do_sample: true
    repetition_penalty: 1.05
    guidance_scale: 1.02
    eta_cutoff: 0.001
---

# mega-ar-small-4096: MWE

## mega-ar on wikitext-103-raw-v1 (document level)

This model is a fine-tuned version of [pszemraj/random-mega-ar-small-4096](https://huggingface.co/pszemraj/random-mega-ar-small-4096) on the `EleutherAI/wikitext_document_level` dataset (`wikitext-103-raw-v1`). The model has roughly 65M parameters.

It achieves the following results on the evaluation set:
- Loss: 4.0338
- Accuracy: 0.3243

## Training procedure

This was tuned in `bf16`, while the authors recommend tuning in `fp32`; an `fp32` run will be tried later.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 4
- eval_batch_size: 1
- seed: 80085
- gradient_accumulation_steps: 32
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-07
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 3.0

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:--------:|
| 7.3662        | 0.11  | 100  | 7.2782          | 0.0935   |
| 6.3064        | 0.22  | 200  | 6.2066          | 0.1634   |
| 5.8203        | 0.33  | 300  | 5.7299          | 0.1931   |
| 5.55          | 0.44  | 400  | 5.4173          | 0.2117   |
| 5.3194        | 0.55  | 500  | 5.1937          | 0.2278   |
| 5.1678        | 0.66  | 600  | 5.0206          | 0.2406   |
| 5.0375        | 0.77  | 700  | 4.8891          | 0.2508   |
| 4.9194        | 0.88  | 800  | 4.7592          | 0.2605   |
| 4.8272        | 0.99  | 900  | 4.6653          | 0.2681   |
| 4.7571        | 1.1   | 1000 | 4.5817          | 0.2754   |
| 4.6345        | 1.21  | 1100 | 4.5066          | 0.2820   |
| 4.6218        | 1.32  | 1200 | 4.4472          | 0.2867   |
| 4.5585        | 1.43  | 1300 | 4.3827          | 0.2923   |
| 4.5047        | 1.54  | 1400 | 4.3328          | 0.2963   |
| 4.4726        | 1.65  | 1500 | 4.2860          | 0.3002   |
| 4.4094        | 1.76  | 1600 | 4.2452          | 0.3038   |
| 4.3705        | 1.87  | 1700 | 4.2168          | 0.3062   |
| 4.3739        | 1.98  | 1800 | 4.1852          | 0.3095   |
| 4.2836        | 2.09  | 1900 | 4.1599          | 0.3112   |
| 4.302         | 2.2   | 2000 | 4.1307          | 0.3149   |
| 4.2847        | 2.31  | 2100 | 4.1113          | 0.3165   |
| 4.2348        | 2.42  | 2200 | 4.0925          | 0.3184   |
| 4.2837        | 2.53  | 2300 | 4.0743          | 0.3207   |
| 4.2058        | 2.64  | 2400 | 4.0612          | 0.3217   |
| 4.22          | 2.75  | 2500 | 4.0494          | 0.3224   |
| 4.1827        | 2.86  | 2600 | 4.0397          | 0.3237   |
| 4.1967        | 2.97  | 2700 | 4.0338          | 0.3243   |

### Framework versions

- Transformers 4.32.1
- Pytorch 2.1.0.dev20230727+cu118
- Datasets 2.13.1
- Tokenizers 0.13.3
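
### Training configuration sketch

As a rough, hypothetical reconstruction: the hyperparameters listed above map onto Hugging Face `TrainingArguments` roughly as below. The output directory and the surrounding `Trainer` script are assumptions, not the actual training code.

```python
# Hedged sketch of the training configuration; only the values listed in the
# card are grounded, everything else (output dir, single-GPU setup) is assumed.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mega-ar-small-4096-wikitext103",  # hypothetical
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,  # 4 x 32 = effective batch size 128 on one GPU
    num_train_epochs=3.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-7,
    seed=80085,
    bf16=True,  # the run used bf16; the MEGA authors recommend fp32
)
```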
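
## Usage

A minimal generation sketch using the same parameters as the inference widget in the card header. The repository id below is an assumption based on this card's title (substitute the actual checkpoint path); the snippet is untested against this exact checkpoint. MEGA is part of Transformers (v4.31+), so `trust_remote_code` should not be needed.

```python
# Minimal sketch, assuming the repo id matches this card's title.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pszemraj/mega-ar-small-4096"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The history of the region begins"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=96,    # values mirror the inference widget above
    do_sample=True,
    repetition_penalty=1.05,
    guidance_scale=1.02,  # classifier-free guidance (Transformers >= 4.30)
    eta_cutoff=0.001,     # eta sampling cutoff
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```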