markussagen committed on
Commit
adb08df
1 Parent(s): 7aaa600

initial commit

README.md ADDED
@@ -0,0 +1,69 @@
## XLM-R Longformer Model
XLM-R Longformer is an XLM-R model that has been extended to allow sequence lengths of up to 4096 tokens instead of the regular 512. The model was pre-trained from the XLM-RoBERTa checkpoint using the Longformer [pre-training scheme](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) on the English WikiText-103 corpus.
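
The conversion notebook linked above initialises the longer model by copying XLM-R's 512 learned position embeddings out to 4096 positions and swapping in Longformer's sliding-window attention before continued pre-training. A rough, hypothetical sketch of the embedding step (attribute names follow the `transformers` XLM-R implementation and may differ between versions):

```python
import torch
from transformers import AutoModelForMaskedLM

# Sketch only: tile XLM-R's learned position embeddings out to 4096 positions.
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
old_weight = model.roberta.embeddings.position_embeddings.weight.detach()  # (514, 768)

max_pos = 4096 + 2                   # RoBERTa-style models reserve 2 extra positions
new_weight = old_weight.new_zeros(max_pos, old_weight.size(1))
new_weight[:2] = old_weight[:2]      # keep the special/padding positions

k, step = 2, old_weight.size(0) - 2  # copy the 512 learned positions block by block
while k < max_pos:
    n = min(step, max_pos - k)
    new_weight[k:k + n] = old_weight[2:2 + n]
    k += n

model.roberta.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(
    new_weight, freeze=False, padding_idx=1
)
model.config.max_position_embeddings = max_pos
# The full recipe also replaces each layer's self-attention with Longformer's
# sliding-window attention and then continues masked-LM pre-training.
```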

The reason for this was to investigate methods for creating efficient Transformers for low-resource languages, such as Swedish, without the need to pre-train them on long-context datasets in each respective language. The trained model came as the result of a master's thesis project at [Peltarion](https://peltarion.com/) and was fine-tuned on multilingual question-answering tasks, with code available [here](https://github.com/MarkusSagen/Master-Thesis-Multilingual-Longformer#xlm-r).

Since both the XLM-R and Longformer models are large, it is recommended to run them with NVIDIA Apex (16-bit precision), a large GPU, and several gradient accumulation steps, for example as sketched below.
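
A minimal, hypothetical sketch of those settings with the `transformers` `Trainer` API (the values mirror the training script further down; adjust them to your hardware):

```python
from transformers import TrainingArguments

# Illustrative settings only: mixed precision plus gradient accumulation
# so that 4096-token sequences fit on a single large GPU.
training_args = TrainingArguments(
    output_dir="./output",
    fp16=True,                       # 16-bit mixed precision (Apex/AMP)
    per_device_train_batch_size=1,   # tiny per-step batch for long sequences
    gradient_accumulation_steps=64,  # effective batch size of 64
    learning_rate=3e-5,
    warmup_steps=500,
    weight_decay=0.01,
    max_steps=6000,
)
```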

## How to Use
The model can be fine-tuned on a downstream task as usual, for instance extractive question answering.

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MAX_SEQUENCE_LENGTH = 4096
MODEL_NAME_OR_PATH = "markussagen/xlm-roberta-longformer-base-4096"

# Tokenizer configured for the extended 4096-token context
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME_OR_PATH,
    max_length=MAX_SEQUENCE_LENGTH,
    padding="max_length",
    truncation=True,
)

# Model with a question-answering head (randomly initialised until fine-tuned)
model = AutoModelForQuestionAnswering.from_pretrained(
    MODEL_NAME_OR_PATH,
    max_length=MAX_SEQUENCE_LENGTH,
)
```
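
After fine-tuning, inference follows the standard extractive-QA pattern. A hypothetical example (question and context are placeholders, and the QA head only produces meaningful answers once fine-tuned):

```python
question = "Which corpus was the model pre-trained on?"
context = "The model was pre-trained on the English WikiText-103 corpus."

inputs = tokenizer(question, context, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start/end tokens and decode the answer span
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.decode(inputs["input_ids"][0][start:end])
print(answer)
```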

## Training Procedure
The model was trained on the WikiText-103 corpus, using a **48GB** GPU with the following training script and parameters. Pre-training ran for 6000 iterations and took ~5 days. See the full [training script](https://github.com/MarkusSagen/Master-Thesis-Multilingual-Longformer/blob/main/scripts/finetune_qa_models.py) and [Github repo](https://github.com/MarkusSagen/Master-Thesis-Multilingual-Longformer) for more information.

```sh
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip

export DATA_DIR=./wikitext-103-raw

scripts/run_long_lm.py \
    --model_name_or_path xlm-roberta-base \
    --model_name xlm-roberta-to-longformer \
    --output_dir ./output \
    --logging_dir ./logs \
    --val_file_path $DATA_DIR/wiki.valid.raw \
    --train_file_path $DATA_DIR/wiki.train.raw \
    --seed 42 \
    --max_pos 4096 \
    --adam_epsilon 1e-8 \
    --warmup_steps 500 \
    --learning_rate 3e-5 \
    --weight_decay 0.01 \
    --max_steps 6000 \
    --evaluate_during_training \
    --logging_steps 50 \
    --eval_steps 50 \
    --save_steps 6000 \
    --max_grad_norm 1.0 \
    --per_device_eval_batch_size 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --overwrite_output_dir \
    --fp16 \
    --do_train \
    --do_eval
```

config.json ADDED
@@ -0,0 +1,38 @@
{
  "_name_or_path": "/workspace/models/xlm-roberta-base-4096-seed-42-fastest-lm-complete",
  "architectures": [
    "LongModelForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_window": [
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 4098,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 250002
}
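
For reference, the long-context settings above can be inspected straight from the published configuration; a small illustrative snippet:

```python
from transformers import AutoConfig

# Load the configuration shipped with this repo and check the long-context settings.
config = AutoConfig.from_pretrained("markussagen/xlm-roberta-longformer-base-4096")
print(config.max_position_embeddings)  # 4098 = 4096 tokens + 2 RoBERTa offset positions
print(config.attention_window)         # [512, ...] -- one local-attention window per layer
```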
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6424e018ecc504ad17de8a41f749f442696b4df17e1d893c6d3befdd9754627a
size 1124321824
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": "<mask>"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": "<mask>", "model_max_length": 4096, "name_or_path": "xlm-roberta-base"}
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5981796ba7fa7515e0a01f5f9877710858d1d660e6e4669ccda4356a5224e278
size 1903