AMR-KELEG/Sentence-ALDi

Text Classification

Transformers

PyTorch

bert

AMR-KELEG committed on Feb 18

Commit df4b780 (1 parent: 44fafaa)

Update README.md

Files changed (1)

README.md +46 -1

README.md CHANGED

@@ -18,6 +18,29 @@ Sentence-ALDi (random seed: 50) | https://huggingface.co/AMR-KELEG/Sentence-ALDi
 Token-DI (random seed: 30) | https://huggingface.co/AMR-KELEG/ALDi-Token-DI-30
 Token-DI (random seed: 50) | https://huggingface.co/AMR-KELEG/ALDi-Token-DI-50
 ### Model Description
 <!-- Provide a longer summary of what this model is. -->
@@ -28,4 +51,26 @@ Token-DI (random seed: 50) | https://huggingface.co/AMR-KELEG/ALDi-Token-DI-50
 <!--- **License:** [More Information Needed] -->
 - **Finetuned from model :** [MarBERT](https://huggingface.co/UBC-NLP/MARBERT)
-More information coming soon!

 Token-DI (random seed: 30) | https://huggingface.co/AMR-KELEG/ALDi-Token-DI-30
 Token-DI (random seed: 50) | https://huggingface.co/AMR-KELEG/ALDi-Token-DI-50
+### Usage
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+model_name = "AMR-KELEG/Sentence-ALDi"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+
+def compute_score(sentence):
+    # Tokenize the sentence and clip the first logit to the [0, 1] range.
+    inputs = tokenizer(sentence, return_tensors="pt")
+    outputs = model(**inputs)
+    logits = outputs.logits
+    return min(max(0, logits[0][0].item()), 1)
+
+
+if __name__ == "__main__":
+    s1 = "الطقس جيد اليوم"
+    s2 = "الجو حلو النهاردة"
+    print(s1, round(compute_score(s1), 3))  # 0
+    print(s2, round(compute_score(s2), 3))  # 0.951
+```
 ### Model Description
 <!-- Provide a longer summary of what this model is. -->
 <!--- **License:** [More Information Needed] -->
 - **Finetuned from model :** [MarBERT](https://huggingface.co/UBC-NLP/MARBERT)
+### Citation
+If you find the model useful, please cite the following paper:
+```
+@inproceedings{keleg-etal-2023-aldi,
+    title = "{ALD}i: Quantifying the {A}rabic Level of Dialectness of Text",
+    author = "Keleg, Amr  and
+      Goldwater, Sharon  and
+      Magdy, Walid",
+    editor = "Bouamor, Houda  and
+      Pino, Juan  and
+      Bali, Kalika",
+    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
+    month = dec,
+    year = "2023",
+    address = "Singapore",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2023.emnlp-main.655",
+    doi = "10.18653/v1/2023.emnlp-main.655",
+    pages = "10597--10611",
+    abstract = "Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this variation, previous work in Arabic NLP has focused on Dialect Identification (DI) on the sentence or the token level. However, DI treats the task as binary, whereas we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi), a continuous linguistic variable. We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17{\%} from news articles and 83{\%} from user comments on those articles) which are manually labeled with their level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora (including dialects and genres not included in AOC-ALDi), providing a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers{'} stylistic choices in different situations, a useful property for sociolinguistic analyses.",
+}
+```
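
The `compute_score` function in the Usage snippet clips the model's raw output to the [0, 1] range before returning it. The following is a minimal sketch of that clipping step applied to several sentences at once. It assumes the model call is hidden behind a hypothetical `raw_score_fn` callable (a name introduced here for illustration, not part of the README or of `transformers`), and the numbers in the demo are stand-in values, not real model outputs.

```python
from typing import Callable, Iterable, List, Tuple


def clip_to_unit_interval(raw_score: float) -> float:
    """Clamp a raw score to [0, 1], mirroring min(max(0, x), 1) in the Usage snippet."""
    return min(max(0.0, raw_score), 1.0)


def score_sentences(
    sentences: Iterable[str],
    raw_score_fn: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Pair each sentence with its clipped score, rounded to 3 decimals.

    raw_score_fn is a hypothetical hook: in practice it could wrap the
    tokenizer and model call and return logits[0][0].item().
    """
    return [
        (sentence, round(clip_to_unit_interval(raw_score_fn(sentence)), 3))
        for sentence in sentences
    ]


if __name__ == "__main__":
    # Stand-in raw scores for illustration only; NOT real model outputs.
    fake_raw_scores = {"sentence A": -0.07, "sentence B": 0.4, "sentence C": 1.2}
    for sentence, score in score_sentences(fake_raw_scores, fake_raw_scores.__getitem__):
        print(sentence, score)
```

Keeping the clipping separate from the model call makes it possible to check the post-processing without downloading the model weights.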