ali safaya committed on
Commit
bb5cad0
1 Parent(s): 19757fd

transfer models from org to user account

README.md CHANGED
@@ -1,3 +1,79 @@
- ---
- license: mit
- ---
+ ---
+ language: ar
+ datasets:
+ - oscar
+ - wikipedia
+ tags:
+ - ar
+ - masked-lm
+ ---
+
+ # Arabic-ALBERT Large
+
+ Arabic edition of ALBERT Large pretrained language model
+
+ _If you use any of these models in your work, please cite this work as:_
+
+ ```bibtex
+ @software{ali_safaya_2020_4718724,
+   author    = {Ali Safaya},
+   title     = {Arabic-ALBERT},
+   month     = aug,
+   year      = 2020,
+   publisher = {Zenodo},
+   version   = {1.0.0},
+   doi       = {10.5281/zenodo.4718724},
+   url       = {https://doi.org/10.5281/zenodo.4718724}
+ }
+ ```
+
+ ## Pretraining data
+
+ The models were pretrained on ~4.4 billion words:
+
+ - Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus), filtered from [Common Crawl](http://commoncrawl.org/)
+ - A recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
+
+ __Notes on training data:__
+
+ - Our final version of the corpus contains some non-Arabic words inline, which we did not remove from the sentences since that would affect tasks such as NER.
+ - Although non-Arabic characters were lowercased as a preprocessing step, there are no separate cased and uncased versions of the model, since Arabic characters have no upper or lower case.
+ - The corpus and vocabulary are not restricted to Modern Standard Arabic; they also contain some dialectal Arabic.
+
+ ## Pretraining details
+
+ - These models were trained using Google's ALBERT [repository](https://github.com/google-research/albert) on a single TPU v3-8, provided for free by [TFRC](https://www.tensorflow.org/tfrc).
+ - Our pretraining procedure follows the training settings of BERT with some changes: we trained for 7M steps with a batch size of 64, instead of 125K steps with a batch size of 4096 (see the rough comparison below).
+
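+ As a rough, back-of-the-envelope comparison of the two settings: 7M steps × 64 sequences per step = 448M sequences, versus 125K steps × 4096 sequences per step = 512M sequences, so the total number of training sequences seen is broadly comparable.
+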
+ ## Models
+
+ |  | albert-base | albert-large | albert-xlarge |
+ |:---:|:---:|:---:|:---:|
+ | Hidden layers | 12 | 24 | 24 |
+ | Attention heads | 12 | 16 | 32 |
+ | Hidden size | 768 | 1024 | 2048 |
+
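+ For the large model, these values can be checked against its published configuration; a minimal sketch (the printed values should match the table above):
+
+ ```python
+ from transformers import AlbertConfig
+
+ # load the configuration of the large model from the Hub
+ config = AlbertConfig.from_pretrained("kuisailab/albert-large-arabic")
+
+ print(config.num_hidden_layers)    # 24
+ print(config.num_attention_heads)  # 16
+ print(config.hidden_size)          # 1024
+ ```
+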
+ ## Results
+
+ For further details on the models' performance or any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/).
+
+ ## How to use
+
+ You can use these models by installing `torch` or `tensorflow` and the Hugging Face `transformers` library, and then load them directly like this:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ # load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-large-arabic")
+
+ # load the masked language model
+ model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-large-arabic")
+ ```
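+
+ For masked-word prediction, a minimal sketch using the `fill-mask` pipeline (the Arabic example sentence is only illustrative):
+
+ ```python
+ from transformers import pipeline
+
+ # build a fill-mask pipeline on top of the same checkpoint
+ fill_mask = pipeline("fill-mask", model="kuisailab/albert-large-arabic")
+
+ # the mask token for this model is "[MASK]" (see special_tokens_map.json)
+ text = f"عاصمة السعودية هي {fill_mask.tokenizer.mask_token}."
+
+ for prediction in fill_mask(text):
+     print(prediction["token_str"], prediction["score"])
+ ```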
+
+ ## Acknowledgement
+
+ Thanks to Google for providing a free TPU for the training process, and to Hugging Face for hosting these models on their servers 😊
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "architectures": [
+     "AlbertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0,
+   "bos_token_id": 2,
+   "classifier_dropout_prob": 0.1,
+   "down_scale_factor": 1,
+   "embedding_size": 128,
+   "eos_token_id": 3,
+   "gap_size": 0,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "inner_group_num": 1,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "albert",
+   "net_structure_type": 0,
+   "num_attention_heads": 16,
+   "num_hidden_groups": 1,
+   "num_hidden_layers": 24,
+   "num_memory_blocks": 0,
+   "pad_token_id": 0,
+   "type_vocab_size": 2,
+   "vocab_size": 30000
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8048eb2a232e00d0cc70baeda1f6b77f7744f13799f51da253ab7f8fd89cc6ed
+ size 71388838
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "<unk>", "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]", "mask_token": "[MASK]"}
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:40f25b5aae5c42a4089292e6616f37bca7b5a4f08608678b16ba5a55c9f050d7
+ size 860481
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2bdb5f5bbd0f270130d4f287cbaded5f09f051bf7e6d7ec93ebcc4728c375d02
+ size 87181048
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"special_tokens_map_file": null, "full_tokenizer_file": null}